UPDATE: With Zabbix 5.0 LTS recently released, I have updated this post from the Zabbix 5.0 LTS Released section onwards.
Intro
Zabbix, just Zabbix, is an industry-standard, enterprise-class monitoring, reporting and alerting system for networks, infrastructure, servers and environments. It supports distributed monitoring of up to hundreds of thousands of nodes with millions of items across multiple sites (assuming you have the compute, storage and database power to drive it, anyway). It is completely open source and free to use under the GPL v2 licence, with commercial support available from the Zabbix company (Zabbix LLC). There are no software limits on how many hosts or items you can monitor with Zabbix; you’re only limited by your hardware, compute power and man-hours.
Methodology and Usage
As an IT admin looking after a half-dozen sites over a 200 km range (plus cloud services), it’s important to have ears everywhere. We use Zabbix for most of our in-house monitoring, data logging and alerting.
We have a single large full VM running on our primary server cluster (on Proxmox), plus a Zabbix proxy LXC container at each site that has its own micro-server, also running Proxmox. Zabbix proxies collect data from all SNMP and Zabbix agent capable devices at remote sites and send their data back to the primary server over a connection encrypted with a pre-shared key (PSK). Our larger sites are part of a private WAN (so encryption internally is not necessary), but this encryption is useful for remote or obscure sites and vehicles running on 4G links, and for our one NBN-connected site.
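As a rough illustration of the PSK setup, the sketch below generates a 256-bit key and writes the proxy-side TLS settings. `TLSConnect`, `TLSAccept`, `TLSPSKIdentity` and `TLSPSKFile` are real `zabbix_proxy.conf` parameters; the directory, file names and the identity string `site-a-proxy` are assumptions made up for the example, not our actual values.

```shell
#!/bin/sh
# Sketch only: generate a PSK and the matching proxy TLS settings.
set -eu

# Assumption: a scratch directory; on a real proxy this would be /etc/zabbix.
CONF_DIR="${CONF_DIR:-./zabbix-demo}"
PSK_FILE="$CONF_DIR/proxy.psk"
mkdir -p "$CONF_DIR"

# Files created from here on are readable by the owner only.
umask 077

# 32 random bytes written as 64 hex digits (a 256-bit key).
head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$PSK_FILE"

# TLSConnect covers an active proxy connecting out to the server,
# TLSAccept covers a passive proxy accepting connections from it.
# "site-a-proxy" is a hypothetical identity string.
cat > "$CONF_DIR/zabbix_proxy_tls.conf" <<EOF
TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=site-a-proxy
TLSPSKFile=$PSK_FILE
EOF
```

The same identity and key then need to be entered against the proxy on the server side (the proxy's encryption settings in the frontend), otherwise the connection will be refused.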
Zabbix proxies receive their configuration from the primary server, listing the hosts the proxy is to query, the items to query and the intervals between checks. The proxy then queries its hosts on behalf of the primary server and reports live data back to it. This does sometimes create a delay between configuration and first data, as by default the proxy only checks for config updates hourly, but “zabbix_proxy -R config_cache_reload” can be run on the proxy to force an update.
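The hourly check corresponds to the `ConfigFrequency` parameter in `zabbix_proxy.conf` on Zabbix 5.0 (default 3600 seconds). A minimal sketch for checking what a proxy is set to and then forcing a reload follows; the helper name `config_frequency` and the default config path are assumptions for the example.

```shell
#!/bin/sh
# Sketch only: report the proxy's config refresh interval, then force a reload.

# Hypothetical helper: print the last uncommented ConfigFrequency value
# found in the given file, or 3600 (the 5.0 default) if none is set.
config_frequency() {
    value=$(sed -n 's/^[[:space:]]*ConfigFrequency[[:space:]]*=[[:space:]]*//p' "$1" 2>/dev/null | tail -n 1)
    echo "${value:-3600}"
}

# Assumption: the usual package location of the proxy config file.
PROXY_CONF="${PROXY_CONF:-/etc/zabbix/zabbix_proxy.conf}"
echo "Proxy checks for new config every $(config_frequency "$PROXY_CONF") seconds"

# Only attempt the reload where the proxy binary actually exists.
if command -v zabbix_proxy >/dev/null 2>&1; then
    zabbix_proxy -R config_cache_reload
fi
```

Lowering `ConfigFrequency` shortens the delay at the cost of more frequent config syncs; the runtime reload is the lighter option when you only need it occasionally.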
The Zabbix proxy nodes also offer offline data caching (we currently have it configured for 1 week). If the link between the primary server and the proxy is dropped, the proxy will continue to poll all its hosts using the latest config and cache the data in a local database until the connection is re-established. If the outage lasts longer than the maximum cache period, it will discard data older than that period. It’s a very neat feature for short outages if you’re trying to find an RFO, and awesome for longer outages as your graphs won’t have gaps. When the connection is re-established, data for downed sites will take a few minutes to flush from the proxy to the main server, and graphs will literally fill in the blanks as the data is received and filed by the main server.
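The cache period is controlled by the `ProxyOfflineBuffer` parameter in `zabbix_proxy.conf`, which is expressed in hours (default 1, maximum 720). A small sketch of turning "1 week" into the config line; the function name `offline_buffer_line` is made up for the example.

```shell
#!/bin/sh
# Sketch only: convert a retention period in days into a ProxyOfflineBuffer line.

# Hypothetical helper: takes whole days, prints the config line,
# and rejects values outside the 1-720 hour range the parameter accepts.
offline_buffer_line() {
    days=$1
    hours=$((days * 24))
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 720 ]; then
        echo "ProxyOfflineBuffer must be 1-720 hours, got $hours" >&2
        return 1
    fi
    echo "ProxyOfflineBuffer=$hours"
}

offline_buffer_line 7   # prints ProxyOfflineBuffer=168
```

Keep in mind that a longer buffer means a larger local proxy database during an outage, so size the container's disk accordingly.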
Zabbix 5.0 LTS Released
Zabbix 5 has been released. We have yet to upgrade our internal server from 4.4 but I have upgraded my home instance.
For the home upgrade I chose to both upgrade the host OS from CentOS 7 to 8 and move the instance from a full VM (which had already been imported and converted from a VMware VM) to a Proxmox LXC container.
Steps:
1. Snapshot backup the full VM.
2. Build a new CentOS 8 container and install the small number of packages required.
3. Copy the nginx config across and start all services, getting the new installation to the point where it is complaining about no database being configured.
4. Shut down the Zabbix service on the old server and export/import the database across.
5. Start the Zabbix service on the new box, which will complete a database upgrade.
6. Check hosts, latest data and proxy nodes to confirm all are online.
All in all, relatively simple.
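For anyone wanting to follow along, the database move in the steps above might look roughly like the sketch below. It is not a record of the exact commands I ran: the database name `zabbix`, the dump path, the host name `new-zabbix` and the `zabbix-server` unit name are assumptions, and credentials are assumed to come from `~/.my.cnf`. By default it only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch only: export the Zabbix database on the old box, import on the new one.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command line in dry-run mode, otherwise run it through sh.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

# Run on the old server: stop Zabbix so nothing writes during the dump.
migrate_old_side() {
    run "systemctl stop zabbix-server"
    run "mysqldump --single-transaction zabbix | gzip > /tmp/zabbix.sql.gz"
    run "scp /tmp/zabbix.sql.gz new-zabbix:/tmp/"   # hypothetical host name
}

# Run on the new container: assumes an empty database named zabbix exists.
migrate_new_side() {
    run "zcat /tmp/zabbix.sql.gz | mysql zabbix"
    run "systemctl start zabbix-server"   # schema upgrade happens on first start
}
```

After starting the service on the new box, the server log is the place to watch the database upgrade progress before checking hosts and proxies.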
I first attempted an in-place upgrade from Zabbix 4.4 to 5 on the original CentOS 7 full VM, but had trouble with the php-fpm upgrade (Zabbix 5.0 needs PHP 7.2 or newer) and the in-place MariaDB upgrade from 5.x.something to 11. I ended up blowing the broken full VM away in Proxmox and restoring from the rollback backup, losing about an hour of data collection. The new CentOS 8 build and database migration ended up being easier.