Monthly Archives: May 2017

Icinga2 for distributed system monitoring

This is a short introduction to distributed system monitoring using Icinga2, an open source monitoring solution. Besides Linux, it also runs on Windows, although Windows support is somewhat limited.

Icinga typically monitors things using so-called monitoring plugins. These are just executable programs that return an exit code and write some output to stdout, wrapped in some Icinga-specific configuration. Yes, every check results in a command invocation that starts a process. Worse, in many cases there’s even further overhead: what’s run is actually a shell script (or a bat file), which in turn runs the executable. Heavy and arcane as this may sound nowadays, apparently it is usually not a problem, as long as the commands don’t hang for too long. So timeouts can be important. There are also so-called passive checks: instead of Icinga running a check, an outside system submits the result of some check to Icinga.
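To make the plugin contract concrete, here is a minimal Python sketch of what running a check boils down to: start a process, enforce a timeout, and read the exit code plus the first line of stdout. The exit code meanings (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual monitoring-plugins convention. Everything else is an assumption made for illustration: `run_check` and `FAKE_PLUGIN` are hypothetical names, the "plugin" is an inline stand-in, and reporting a timeout as UNKNOWN is this sketch's choice, not necessarily what Icinga does internally.

```python
import subprocess
import sys

# Conventional plugin exit codes and the states they stand for.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, timeout_seconds=10.0):
    """Run a plugin command and return (state, first line of its stdout)."""
    try:
        proc = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        # Assumption of this sketch: a hung check is reported as UNKNOWN.
        return "UNKNOWN", f"check timed out after {timeout_seconds} s"
    except OSError as exc:
        return "UNKNOWN", f"could not start plugin: {exc}"

    # Any exit code outside 0..3 is treated as UNKNOWN here.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.strip().splitlines()
    message = lines[0] if lines else "(no output)"
    return state, message


# Hypothetical stand-in for a real plugin: prints a status line, exits with 1.
FAKE_PLUGIN = [
    sys.executable,
    "-c",
    "import sys; print('DISK WARNING - 85% used'); sys.exit(1)",
]

print(run_check(FAKE_PLUGIN))  # prints ('WARNING', 'DISK WARNING - 85% used')
```

The process start is the overhead mentioned above, and the `timeout` argument is what keeps one hanging command from blocking the checker.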

There are lots of ready-made monitoring plugins available. If you’re nevertheless sure you need to write your own from scratch, see the monitoring-plugins docs for guidance (the old Icinga1 docs provide a shorter explanation). There is also at least one very necessary check command missing: a built-in HTTP check for use on the Microsoft Windows platform. Finding and implementing that will be a topic of a future post.

Icinga2 can be deployed in a distributed manner, for example as two differently configured Icinga2 instances, a master and a slave, that connect over a network. In such a case, the master always has the monitoring configuration, i.e. the definitions of hosts and services to monitor, how to monitor them, and what to do depending on the outcome. Icinga has its own rather extensive configuration language for defining the monitoring configuration.

There are two alternative options for a master-slave deployment:

  1. The master schedules the checks, but does not run them. Instead, each time there is a scheduled check coming up, it sends a command to the slave telling it to perform the check and pass back the results.
  2. The master distributes the monitoring configuration to the slave, which handles the scheduling and runs the checks on its own, while passing the results back to the master.
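The difference between the two options is mostly about who holds the configuration and the schedule. The toy model below, in Python, tries to show just that. It is not how Icinga is implemented: the `Master` and `Slave` classes, their method names and the check names are all made up for this sketch, and a "check" is simply a function call that always reports OK.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    # Stays empty unless the master syncs its configuration (option 2).
    config: list = field(default_factory=list)

    def execute(self, check_name):
        # Stand-in for running a real plugin.
        return (check_name, "OK")

    def run_own_schedule(self):
        # Option 2: the slave decides on its own what to run, from its synced config.
        return [self.execute(name) for name in self.config]


@dataclass
class Master:
    config: list
    results: list = field(default_factory=list)

    def run_via_commands(self, slave):
        # Option 1: the master keeps config and schedule and sends one
        # "run this check" command per check; the slave only executes.
        for name in self.config:
            self.results.append(slave.execute(name))

    def sync_config(self, slave):
        # Option 2: hand the slave a copy of the configuration.
        slave.config = list(self.config)

    def receive(self, results):
        self.results.extend(results)


# Option 1: the slave never learns the configuration.
master, slave = Master(config=["ping", "http"]), Slave()
master.run_via_commands(slave)
print(master.results, slave.config)

# Option 2: the slave gets the configuration and reports back on its own.
master, slave = Master(config=["ping", "http"]), Slave()
master.sync_config(slave)
master.receive(slave.run_own_schedule())
print(master.results, slave.config)
```

In both runs the master ends up with the same results; what differs is that only in the second one does the slave know what to check without being told each time.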

Icinga provides built-in support for the two instances to connect securely. Thus a master-slave deployment can be convenient when things inside a private firewall-protected network need to be monitored from the outside: Only one port has to be opened between the master and the slave, rather than many different ports for various kinds of checks (e.g. ping, HTTP etc).

The distributed configuration can also provide some tolerance of disconnects: if the second option (out of the two listed above) is used and the network connection between the master and the slave is lost, the slave will keep monitoring things; after all, it has received all the configuration it needs from the master. After the connection comes up again, the slave submits a so-called replay log to the master, which the master uses to update itself, i.e. to add the check results it missed while it and the slave were disconnected from each other.
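The replay log idea can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions: the class and method names are invented, check results are plain tuples with an integer tick as the timestamp, and the log lives in memory. The real mechanism in Icinga2 is more involved.

```python
class Master:
    def __init__(self):
        self.results = []


class Slave:
    def __init__(self):
        self.connected = True
        self.replay_log = []  # results the master has not seen yet

    def report(self, master, result):
        if self.connected:
            master.results.append(result)
        else:
            # Keep checking while offline; remember the result for later.
            self.replay_log.append(result)

    def reconnect(self, master):
        # Hand over everything the master missed, then start afresh.
        self.connected = True
        master.results.extend(self.replay_log)
        self.replay_log.clear()


master, slave = Master(), Slave()
slave.report(master, (1, "OK"))
slave.connected = False          # the network goes down
slave.report(master, (2, "OK"))
slave.report(master, (3, "CRITICAL"))
print(master.results)            # prints [(1, 'OK')]
slave.reconnect(master)          # the network comes back
slave.report(master, (4, "OK"))
print(master.results)
```

After the reconnect, the master's history contains ticks 1 to 4 in order, including the CRITICAL result that happened while the link was down.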

Distributed monitoring with Icinga2 is a large and complex topic; for more information, it’s best to read the official Icinga docs and then check the forums and Google for specific questions. While the Icinga2 docs are extensive, their style tends toward that of a reference, and good tutorials can be hard to find on some topics. So getting things going can be daunting, especially in larger or otherwise more complex scenarios. Simple things are fairly easy to configure, but the configuration language can also be very arduous; it can be difficult to get things right. Thankfully, nowadays Icinga provides fairly adequate and understandable error messages. The forums are helpful for some things, but if your question shows you haven’t carefully read and tried to understand the docs before asking, be prepared to be scolded by the main developer and politely instructed to go RTFM and come back after that.