Network monitoring with Nagios and OpenBSD

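So now it's time to tell Nagios what to keep tabs on. To do so, we must supply it with information about the time periods during which checks and notifications may take place, the commands to run to perform checks and send notifications, the people to notify, the hosts to monitor and the services to check on those hosts.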

All this information is represented by objects. Each object is declared with a "define" statement whose body, enclosed in curly braces, contains a variable number of newline-separated directives in keyword/value form. Keywords are separated from their values by whitespace; multiple values can be given as a comma-separated list, and directives may be indented within the statement.

To recap, the basic syntax of an object declaration can be represented as follows:

define object {
    keyword-1     value-1
    keyword-2     value-2,value-3,...
    [...]
    keyword-n     value-n
}
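As an illustration of these rules only, here is a minimal Python sketch that reads a single object declaration into a dictionary. It is not how Nagios itself parses its files: the function name parse_object is hypothetical, it treats ";" as the start of a comment anywhere on a line and "#" only at the start of a line (matching how comments are used in this chapter), and it splits every value on commas, even though Nagios only does that for list-valued directives.

```python
def parse_object(text):
    """Parse one 'define <type> { ... }' block into (type, {keyword: [values]}).

    Simplified sketch: one object per string, values always split on commas,
    multi-word timeperiod keys such as 'january 1' are not supported.
    """
    header, brace, rest = text.partition("{")
    words = header.split()
    if not brace or len(words) != 2 or words[0] != "define":
        raise ValueError("expected 'define <type> {'")
    body, closing, _ = rest.rpartition("}")
    if not closing:
        raise ValueError("missing closing brace")
    directives = {}
    for raw in body.splitlines():
        line = raw.split(";", 1)[0].strip()   # this sketch lets ';' start a comment anywhere
        if not line or line.startswith("#"):  # skip blank lines and '#' comment lines
            continue
        parts = line.split(None, 1)           # keyword, whitespace, then value(s)
        value = parts[1] if len(parts) > 1 else ""
        directives[parts[0]] = [v.strip() for v in value.split(",")]
    return words[1], directives

kind, fields = parse_object("""define timeperiod {
    timeperiod_name    nonworkhours
    monday             00:00-09:00,18:00-24:00   ; split at midnight
}""")
print(kind, fields)
```

The example call returns the object type plus a dictionary in which the two comma-separated Monday intervals appear as separate list items.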

Object definitions can be split into any number of files: just remember to list them all in the main configuration file by using the cfg_file and/or cfg_dir directives.
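The cfg_dir directive is the more convenient of the two when definitions are spread across many files: Nagios processes every file with a .cfg extension in that directory and, recursively, in its subdirectories. The following Python sketch mimics that selection rule so you can check which files would be picked up; config_files and demo are hypothetical helpers, and the temporary tree merely imitates the layout used in this chapter.

```python
import os
import tempfile

def config_files(cfg_dirs):
    """List the object files that the given cfg_dir entries would cover.

    Follows the documented rule: every file ending in '.cfg', searched
    recursively. Sorting only makes the output predictable.
    """
    found = []
    for top in cfg_dirs:
        for dirpath, dirnames, filenames in os.walk(top):
            dirnames.sort()  # sorting in place also fixes the walk order
            found.extend(os.path.join(dirpath, name)
                         for name in sorted(filenames) if name.endswith(".cfg"))
    return found

def demo():
    """Build a throwaway tree shaped like /var/www/etc/nagios/ and scan it."""
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, "hosts"))
        for rel in ("timeperiods.cfg", "README.txt",
                    os.path.join("hosts", "servers.cfg")):
            with open(os.path.join(top, rel), "w"):
                pass
        return [os.path.relpath(path, top) for path in config_files([top])]

print(demo())
```

README.txt is skipped because it lacks the .cfg extension, while hosts/servers.cfg is found in the subdirectory.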

3.1 Timeperiod definition

The timeperiod statement allows you to specify, for each day of the week, one or more time slots in which to run certain checks and/or notify certain people. Time intervals cannot cross midnight, and days to be excluded are simply omitted.

In the following example, all the timeperiod definitions are grouped together in a file named timeperiods.cfg stored in the /var/www/etc/nagios/ directory.

/var/www/etc/nagios/timeperiods.cfg

# A simple timeperiod including bank holidays. The 'timeperiod_name' and
# 'alias' directives are mandatory.
define timeperiod {
    timeperiod_name    bankholidays
    alias              Bank Holidays
    january 1          00:00-24:00	; New Year's Day
    january 6          00:00-24:00	; Epiphany
    may 1              00:00-24:00	; Labour Day
    december 25        00:00-24:00	; Christmas
    december 26        00:00-24:00	; Boxing Day
}

# Timeperiod for normal work hours. Note that weekend days are simply omitted.
# The 'exclude' keyword allows you to subtract a timeperiod from another.
define timeperiod {
    timeperiod_name    workhours
    alias              Work Hours
    monday             09:00-18:00
    tuesday            09:00-18:00
    wednesday          09:00-18:00
    thursday           09:00-18:00
    friday             09:00-18:00
    exclude            bankholidays
}

# The following timeperiod includes all time outside normal work hours. The
# time slot between 6 p.m. and 9 a.m. must be split into two intervals, to avoid
# crossing midnight.
define timeperiod {
    timeperiod_name    nonworkhours
    alias              Non-Work Hours
    sunday             00:00-24:00
    monday             00:00-09:00,18:00-24:00
    tuesday            00:00-09:00,18:00-24:00
    wednesday          00:00-09:00,18:00-24:00
    thursday           00:00-09:00,18:00-24:00
    friday             00:00-09:00,18:00-24:00
    saturday           00:00-24:00
}

# Most checks will probably run on a continuous basis
define timeperiod {
    timeperiod_name    always
    alias              Every Hour Every Day
    sunday             00:00-24:00
    monday             00:00-24:00
    tuesday            00:00-24:00
    wednesday          00:00-24:00
    thursday           00:00-24:00
    friday             00:00-24:00
    saturday           00:00-24:00
}

# The right timeperiod when you don't want to bother with notifications (e.g.
# during testing)
define timeperiod {
    timeperiod_name    never
    alias              No Time is a Good Time
}

# Some exceptions to the normal weekly schedule (see documentation for more examples)
define timeperiod {
    timeperiod_name    exceptions
    alias              Some random dates
    2008-12-15         00:00-24:00        ; December 15th, 2008
    friday 3           00:00-24:00        ; 3rd Friday of every month
    february -1        00:00-24:00        ; Last day in February of every year
    march 20 - june 21 00:00-24:00        ; Spring
    day 1 - 15         00:00-24:00        ; First half of every month
    2008-01-01 / 7     00:00-24:00        ; Every 7 days from Jan 1st, 2008
}
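To see how weekday ranges, midnight splitting and the exclude keyword combine, the following Python sketch evaluates a simplified version of the workhours and bankholidays timeperiods above. It is an approximation, not Nagios' scheduler: in_period and parse_ranges are hypothetical names, only weekday and "month day" entries are supported, and treating range end times as exclusive is an assumption of the sketch. Following the precedence rules described in the Nagios documentation, a calendar-date entry overrides the weekday entry for the same day.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Hypothetical in-memory copies of two timeperiods from timeperiods.cfg
PERIODS = {
    "bankholidays": {"january 1": "00:00-24:00", "january 6": "00:00-24:00",
                     "may 1": "00:00-24:00", "december 25": "00:00-24:00",
                     "december 26": "00:00-24:00"},
    "workhours": {"monday": "09:00-18:00", "tuesday": "09:00-18:00",
                  "wednesday": "09:00-18:00", "thursday": "09:00-18:00",
                  "friday": "09:00-18:00", "exclude": "bankholidays"},
}

def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def parse_ranges(spec):
    """'00:00-09:00,18:00-24:00' -> [(0, 540), (1080, 1440)] in minutes."""
    ranges = []
    for chunk in spec.split(","):
        start, end = chunk.strip().split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges

def in_period(name, when, periods=PERIODS):
    """Return True if the datetime 'when' falls inside timeperiod 'name'."""
    period = periods[name]
    # An excluded timeperiod removes its time slots from this one
    for excluded in period.get("exclude", "").split(","):
        if excluded and in_period(excluded, when, periods):
            return False
    # A "month day" entry takes precedence over the weekday entry
    date_key = f"{MONTHS[when.month - 1]} {when.day}"
    spec = period.get(date_key, period.get(WEEKDAYS[when.weekday()]))
    if spec is None:
        return False  # days that are not listed are simply excluded
    minute = when.hour * 60 + when.minute
    return any(start <= minute < end for start, end in parse_ranges(spec))

print(in_period("workhours", datetime(2008, 1, 2, 10, 0)))  # True: a Wednesday morning
print(in_period("workhours", datetime(2008, 1, 1, 10, 0)))  # False: New Year's Day
```

January 1st, 2008 was a Tuesday, so it would normally fall within workhours; the exclude directive is what removes it.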

3.2 Command definition

The next step is to tell Nagios how to perform the various checks and send out notifications, by defining multiple command objects containing the actual commands for Nagios to run.

Command definitions are pairs of short names and command lines (both mandatory) and can contain macros. As we mentioned before, macros are variables, enclosed in "$" signs, that will be expanded to the appropriate value immediately prior to the execution of a command; macros allow you to keep command definitions generic and straightforward. A simple example will make this clear.

Suppose you want to monitor a web server with IP address "1.2.3.4"; you could then define a command such as the following:

define command {
    command_name    check-http
    command_line    /usr/local/libexec/nagios/check_http -I 1.2.3.4
}

This definition is correct and will certainly do the job. But what if you later decide to add a new web server? Would you find it convenient to define a new (almost identical) command, with only the IP address changed? It is far more efficient to take advantage of macros by writing a single generic command such as:

define command {
    command_name    check-http
    command_line    $USER1$/check_http -I $HOSTADDRESS$
}

and let Nagios expand the built-in $HOSTADDRESS$ macro to the appropriate IP address, taken from the host definition (see below). As you'll remember from the previous chapter, the $USER1$ macro holds the path to the plugins directory.
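Conceptually, macro expansion is plain text substitution performed just before the command runs. The Python sketch below mimics it for the generic check-http command; expand_macros is a hypothetical name, the macro values are the ones from the example above, and leaving unknown macros untouched is a choice of the sketch rather than documented Nagios behaviour.

```python
import re

# A macro is an upper-case name enclosed in '$' signs, e.g. $HOSTADDRESS$
MACRO = re.compile(r"\$([A-Z0-9_]+)\$")

def expand_macros(command_line, macros):
    """Replace each $NAME$ with macros['NAME']; unknown macros are kept as-is."""
    return MACRO.sub(lambda m: macros.get(m.group(1), m.group(0)), command_line)

# $USER1$ as set in the previous chapter, $HOSTADDRESS$ from the host definition
macros = {"USER1": "/usr/local/libexec/nagios", "HOSTADDRESS": "1.2.3.4"}
print(expand_macros("$USER1$/check_http -I $HOSTADDRESS$", macros))
# prints /usr/local/libexec/nagios/check_http -I 1.2.3.4
```

The expanded string is exactly the hard-coded command line from the first example, which is why a single generic definition can replace one command per server.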

Now let's complicate things a bit! What if you want Nagios to check the availability of a particular URL on each web server? This URL may differ from server to server, so what we need now is a command definition that is still generic and yet server-specific! Though this may sound contradictory, once again Nagios solves this problem with macros: in fact, the $ARGn$ macros (where n is a number between 1 and 32 inclusive) act as placeholders for service-specific arguments that will be specified later within service definitions (see below for further details). Therefore, the above command definition would turn into:

define command {
    command_name    check-http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -u $ARG1$
}

In addition to the ones we have just seen, Nagios provides several other useful macros. Please refer to the documentation for a detailed list of all available macros and their validity context. Below is a sample set of command definitions.

/var/www/etc/nagios/commands.cfg

################################################################################
# Notification commands                                                        #
# There are no standard notification plugins; hence notification commands are  #
# usually custom scripts or mere command lines.                                #
################################################################################
define command {
    command_name    host-notify-by-email
    command_line    $USER1$/host_notify_by_email.sh $CONTACTEMAIL$
}

define command {
    command_name    notify-by-email
    command_line    $USER1$/notify_by_email.sh $CONTACTEMAIL$
}

define command {
    command_name    host-notify-by-SMS
    command_line    /usr/local/bin/sendsms $ADDRESS1$ "Nagios: Host $HOSTNAME$ ($HOSTADDRESS$) is in state: $HOSTSTATE$"
}

define command {
    command_name    notify-by-SMS
    command_line    /usr/local/bin/sendsms $ADDRESS1$ "Nagios: Service $SERVICEDESC$ on $HOSTALIAS$ is in state: $SERVICESTATE$"
}

################################################################################
# Check commands                                                               #
# The official Nagios plugins should handle most of your needs for host and    #
# service checks. However, should they not, we will later discuss how to write #
# custom plugins.                                                              #
################################################################################
define command {
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}

define command {
    command_name    check-ssh
    command_line    $USER1$/check_ssh $HOSTADDRESS$
}

define command {
    command_name    check-http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -u $ARG1$
}

define command {
    command_name    check-smtp
    command_line    $USER1$/check_smtp -H $HOSTADDRESS$
}

define command {
    command_name    check-imap
    command_line    $USER1$/check_imap -H $HOSTADDRESS$
}

define command {
    command_name    check-dns
    command_line    $USER1$/check_dns -s $HOSTADDRESS$ -H $ARG1$ -a $ARG2$
}

define command {
    command_name    check-mysql
    command_line    $USER1$/check_mysql -H $HOSTADDRESS$ -u $USER2$ -p $USER3$
}

[...]

3.3 Contact definition

Contact objects allow you to specify the people who should be notified automatically when alert conditions are met. Contacts are first defined individually and then grouped together in contactgroup objects for easier management.

For the first time, the following definitions refer to previously defined objects: the values of the host_notification_period and service_notification_period directives must be the names of timeperiod objects, and the values of the host_notification_commands and service_notification_commands directives must be the names of command objects.

/var/www/etc/nagios/contacts.cfg

define contact {
# Short name to identify the contact
    contact_name                    john
# Longer name or description
    alias                           John Doe

# Enable notifications for this contact
    host_notifications_enabled      1
    service_notifications_enabled   1

# Timeperiods during which the contact can be notified about host and service
# problems or recoveries
    host_notification_period        always
    service_notification_period     always

# Host states for which notifications can be sent out to this contact
# (d=down, u=unreachable, r=recovery, f=flapping, n=none)
    host_notification_options       d,u,r

# Service states for which notifications can be sent out to this contact
# (w=warning, c=critical, u=unknown, r=recovery, f=flapping, n=none)
    service_notification_options    w,u,c,r

# Command(s) used to notify the contact about host and service problems
# or recoveries
    host_notification_commands      host-notify-by-email,host-notify-by-SMS
    service_notification_commands   notify-by-email,notify-by-SMS

# Email address for the contact
    email                           jdoe@kernel-panic.it

# Nagios provides 6 address directives (named address1 through address6) to
# specify additional "addresses" for the contact (e.g. a mobile phone number
# for SMS notifications)
    address1                        xxx-xxx-xxxx

# Allow this contact to submit external commands to Nagios from the CGIs
    can_submit_commands             1
}

# The following contact is split in two, to allow for different notification
# options depending on the timeperiod
define contact {
    contact_name                    danix@work
    alias                           Daniele Mazzocchio
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        workhours
    service_notification_period     workhours
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
    email                           danix@kernel-panic.it 
    can_submit_commands             1
}

define contact {
    contact_name                    danix@home
    alias                           Daniele Mazzocchio
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        nonworkhours
    service_notification_period     nonworkhours
    host_notification_options       d,u
    service_notification_options    c
    host_notification_commands      host-notify-by-email,host-notify-by-SMS
    service_notification_commands   notify-by-email,notify-by-SMS
    email                           danix@kernel-panic.it
    address1                        xxx-xxx-xxxx
    can_submit_commands             1
}

[...]

# All administrator contacts are grouped together in the "Admins"
# contactgroup
define contactgroup {
    contactgroup_name               Admins
    alias                           Nagios Administrators
    members                         danix@work,danix@home,john
}

[...]
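Splitting one person into two contacts pays off when notification options differ by time of day. The Python sketch below shows which of the three contacts in the Admins group would receive a service notification on a weekday evening, when workhours is inactive and nonworkhours is active. It is a simplification: contacts_to_notify is a hypothetical helper, whether each timeperiod is active is hard-coded rather than computed, and Nagios' other notification filters are ignored.

```python
# Simplified view of the three contacts above: their
# service_notification_options plus whether their notification
# timeperiod is active (assumed here: a weekday evening).
CONTACTS = [
    {"name": "john",       "options": "w,u,c,r", "period_active": True},   # always
    {"name": "danix@work", "options": "w,u,c,r", "period_active": False},  # workhours
    {"name": "danix@home", "options": "c",       "period_active": True},   # nonworkhours
]

STATE_LETTER = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c", "RECOVERY": "r"}

def contacts_to_notify(state, contacts=CONTACTS):
    """Names of the contacts who would get a service notification for 'state'."""
    letter = STATE_LETTER[state]
    return [c["name"] for c in contacts
            if c["period_active"] and letter in c["options"].split(",")]

print(contacts_to_notify("WARNING"))   # only john
print(contacts_to_notify("CRITICAL"))  # john and danix@home
```

In the evening, a warning only reaches john, while a critical problem also reaches danix@home, which is precisely the effect of the narrower options on the "home" half of the contact.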

3.4 Host definition

We have finally come to one of the most important aspects of Nagios configuration: the definition of the hosts (servers, workstations, devices, etc.) that we want to monitor. This will lead us to introduce one of its most powerful features: object inheritance. Note that, although we are only introducing it now, object inheritance applies to all Nagios objects; it is in the definition of hosts and services, however, that you can get the most out of it.

In fact, configuring a host requires setting quite a few parameters, and the values of these parameters will normally be the same for most hosts. Without object inheritance, you would waste a lot of time typing the same parameters over and over again, eventually ending up with cluttered, bloated and almost unmanageable configuration files.

But luckily, Nagios is smart enough to save you a lot of typing by allowing you to define special template objects, whose properties can be "inherited" by other objects without having to rewrite them. Below is a brief example of how a template is created:

define host {
    name                            generic-host-template  ; Template name

    check_command                   check-host-alive
    check_period                    always
    max_check_attempts              5
    notification_options            d,u,r

    register                        0                      ; Don't register it!
}

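As you can see, a template definition looks almost identical to a normal object definition. The only differences are that the template is identified by the name directive (instead of host_name, in this case) and that the register directive is set to 0, so that Nagios doesn't treat the template as an actual host.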

To create an actual host object from a template, you simply need to specify the template name as the value of the use directive and make sure that all mandatory directives are either inherited or explicitly set:

define host {
    host_name                       hostname
    use                             generic-host-template
    alias                           alias
    address                         x.x.x.x
}

Well, now let's move from theory to practice and define two host templates for our servers. Note that the second one inherits from the first; this is possible because Nagios allows multiple levels of template objects.

/var/www/etc/nagios/generic-hosts.cfg

# The following is a template for all hosts in the LAN
define host {
# Template name
    name                            generic-lan-host

# Command to use to check the state of the host
    check_command                   check-host-alive

# Contact groups to notify about problems (or recoveries) with this host
    contact_groups                  Admins

# Enable active checks
    active_checks_enabled           1
# Time period during which active checks of this host can be made
    check_period                    always
# Number of times that Nagios will repeat a check returning a non-OK state
    max_check_attempts              3

# Enable the event handler
    event_handler_enabled           1

# Enable the processing of performance data
    process_perf_data               1

# Enable retention of host status information across program restarts
    retain_status_information       1
# Enable retention of host non-status information across program restarts
    retain_nonstatus_information    1

# Enable notifications
    notifications_enabled           1
# Time interval (in minutes) between consecutive notifications about the
# server being _still_ down or unreachable
    notification_interval           120
# Time period during which notifications about this host can be sent out
    notification_period             always
# Host states for which notifications should be sent out (d=down,
# u=unreachable, r=recovery, f=flapping, n=none)
    notification_options            d,u,r

# Don't register this definition: it's only a template, not an actual host
    register                        0
}

# DMZ hosts inherit all attributes from the generic-lan-host by means of the
# 'use' directive. The only difference is that Nagios has to go through the
# internal (CARP) firewalls to reach the DMZ servers, thus requiring the
# additional 'parents' directive.
define host {
    name                            generic-dmz-host

# The 'use' directive specifies the name of a template object that you want
# this host to inherit properties from
    use                             generic-lan-host

# This directive specifies the hosts that lie between the monitoring host
# and the remote host (see the Nagios documentation on host reachability)
    parents                         fw-int

# This too is a template
    register                        0
}
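The way these templates combine can be sketched in Python as a recursive merge: template values are inherited first, then the object's own directives override them. resolve is a hypothetical helper and only a handful of directives are copied over; the rules it follows (local values win, name and register are not inherited, and when several templates are listed in use the first one takes precedence) are those described in the Nagios documentation on object inheritance.

```python
# Hypothetical in-memory copies of the two templates above (a few directives only)
TEMPLATES = {
    "generic-lan-host": {"name": "generic-lan-host", "check_command": "check-host-alive",
                         "contact_groups": "Admins", "max_check_attempts": "3",
                         "notification_options": "d,u,r", "register": "0"},
    "generic-dmz-host": {"name": "generic-dmz-host", "use": "generic-lan-host",
                         "parents": "fw-int", "register": "0"},
}

# 'name' and 'register' are never inherited; skipping 'use' is a tidiness choice here
NOT_INHERITED = {"name", "register", "use"}

def resolve(obj, templates=TEMPLATES):
    """Return the directives of 'obj' after applying template inheritance."""
    merged = {}
    uses = [u.strip() for u in obj.get("use", "").split(",") if u.strip()]
    # With 'use a,b' the first template wins, so apply them right to left
    for name in reversed(uses):
        inherited = resolve(templates[name], templates)
        merged.update({k: v for k, v in inherited.items() if k not in NOT_INHERITED})
    merged.update(obj)  # local directives always override inherited ones
    return merged

mail = {"use": "generic-dmz-host", "host_name": "mail", "address": "172.16.240.150"}
print(resolve(mail))
```

The resolved mail host picks up parents from generic-dmz-host and check_command from generic-lan-host, two levels up, while register is not carried over, so the host itself is registered.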

Now we can take advantage of our templates to define the actual hosts in a few lines.

/var/www/etc/nagios/hosts/servers.cfg

# Configuration for host dns1.lan.kernel-panic.it
define host {
    use                             generic-lan-host
    host_name                       dns1
    alias                           LAN primary master name server
    address                         172.16.0.161

# Extended information (completely optional)
    notes                           This is the internal primary master name server (Bind 9.4.2-P2)
# URL with more information about this host
    notes_url                       http://www.kernel-panic.it/openbsd/dns/
# Image associated with this host in the status CGI; images must be placed in
# /var/www/nagios/images/logos/
    icon_image                      dns.png
# String used in the 'alt' tag of the icon_image
    icon_image_alt                  [dns]
# Image associated with this host in the statusmap CGI
    statusmap_image                 dns.gd2
}

# Configuration for host mail.kernel-panic.it
define host {
    use                             generic-dmz-host
    host_name                       mail
    alias                           Mail server
    address                         172.16.240.150
    notes                           This is the Postfix mail server (with IMAP(S) and web access)
    notes_url                       http://www.kernel-panic.it/openbsd/mail/
    icon_image                      mail.png
    icon_image_alt                  [Mail]
    statusmap_image                 mail.gd2
}

# Configuration for host proxy.kernel-panic.it
define host {
    use                             generic-dmz-host
    host_name                       proxy
    alias                           Proxy server
    notes                           This is the Squid proxy server
    notes_url                       http://www.kernel-panic.it/openbsd/proxy/
    icon_image                      proxy.png
    icon_image_alt                  [Proxy]
    statusmap_image                 proxy.gd2
}

[...]

/var/www/etc/nagios/hosts/firewalls.cfg

# Configuration for host fw-int.kernel-panic.it
define host {
    use                             generic-lan-host
    host_name                       fw-int
    alias                           Internal firewalls' CARP address
    address                         172.16.0.202
    notes                           Virtual CARP address of the internal firewalls
    notes_url                       http://www.kernel-panic.it/openbsd/carp/
    icon_image                      fw.png
    icon_image_alt                  [FW]
    statusmap_image                 fw.gd2
}

# Configuration for host mickey.kernel-panic.it
define host {
    use                             generic-lan-host
    host_name                       mickey
    alias                           Internal Firewall #1
    address                         172.16.0.200
    notes                           Internal firewall (first node of a two-node CARP cluster)
    notes_url                       http://www.kernel-panic.it/openbsd/carp/
    icon_image                      fw.png
    icon_image_alt                  [FW]
    statusmap_image                 fw.gd2
}

[...]

Hosts can optionally be grouped together with the hostgroup statement, which has no effect on monitoring, but simply allows you to display the hosts in groups in the CGIs.

/var/www/etc/nagios/hosts/hostgroups.cfg

# Domain Name Servers
define hostgroup {
    hostgroup_name                  DNS
    alias                           Domain Name Servers
    members                         dns1,dns2,dns3,dns4
    notes                           Our internal Domain Name Servers, running Bind 9.4.2-P2
}

# Firewalls
define hostgroup {
    hostgroup_name                  firewalls
    alias                           CARP Firewalls
    members                         mickey,minnie,donald,daisy,fw-int,fw-ext
    notes                           Our CARP-enabled firewalls (both virtual and physical addresses)
}

# Web servers
define hostgroup {
    hostgroup_name                  WWW
    alias                           Web Servers
    members                         www1,www2
    notes                           Our corporate web servers, running Apache 1.3
}

3.5 Service definition

Configuring the services to monitor is much like configuring hosts: object inheritance can save you a lot of typing and you can group services together with the optional servicegroup statement. Below is the definition of our service template:

/var/www/etc/nagios/generic-services.cfg

define service {
# Template name
    name                            generic-service

# Services are normally not volatile
    is_volatile                     0

# Contact groups to notify about problems (or recoveries) with this service
    contact_groups                  Admins

# Enable active checks
    active_checks_enabled           1
# Time period during which active checks of this service can be made
    check_period                    always
# Time interval (in minutes) between "regular" checks, i.e. checks that
# occur when the service is in an OK state or when the service is in a non-OK
# state, but has already been re-checked max_check_attempts number of times
    normal_check_interval           5
# Time interval (in minutes) between non-regular checks
    retry_check_interval            1
# Number of times that Nagios will repeat a check returning a non-OK state
    max_check_attempts              3
# Enable service check parallelization for better performance
    parallelize_check               1
# Enable passive checks
    passive_checks_enabled          1

# Enable the event handler
    event_handler_enabled           1

# Enable the processing of performance data
    process_perf_data               1

# Enable retention of service status information across program restarts
    retain_status_information       1
# Enable retention of service non-status information across program restarts
    retain_nonstatus_information    1

# Enable notifications
    notifications_enabled           1
# Time interval (in minutes) between consecutive notifications about the
# service being _still_ in non-OK state
    notification_interval           120
# Time period during which notifications about this service can be sent out
    notification_period             always
# Service states for which notifications should be sent out (c=critical,
# w=warning, u=unknown, r=recovery, f=flapping, n=none)
    notification_options            w,u,c,r

    register                        0
}
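The interplay between normal_check_interval, retry_check_interval and max_check_attempts is easier to follow on a timeline. The Python sketch below replays a sequence of check results with the values from this template: a failing service is re-checked every minute and becomes a HARD problem on the third consecutive failure, two minutes after it was first detected. It is an idealized model (simulate is a hypothetical name, and checks are assumed to run exactly on schedule and take no time); keep in mind that notifications are only sent out for HARD states.

```python
def simulate(results, normal_interval=5, retry_interval=1, max_attempts=3):
    """Return (minute, result, state_type) for each check in 'results'.

    'results' is a list of booleans (True = OK).
    """
    minute, attempt, in_soft_problem = 0, 1, False
    timeline = []
    for ok in results:
        if ok:
            # Recovering from a SOFT problem is itself a SOFT state change
            timeline.append((minute, "OK", "SOFT" if in_soft_problem else "HARD"))
            attempt, in_soft_problem = 1, False
            minute += normal_interval
        elif attempt < max_attempts:
            # Problem not confirmed yet: re-check at retry_check_interval
            timeline.append((minute, "PROBLEM", "SOFT"))
            attempt, in_soft_problem = attempt + 1, True
            minute += retry_interval
        else:
            # max_check_attempts reached: the problem is HARD and checks
            # go back to normal_check_interval
            timeline.append((minute, "PROBLEM", "HARD"))
            in_soft_problem = False
            minute += normal_interval
    return timeline

for entry in simulate([True, False, False, False, False, True]):
    print(entry)
```

With these values the problem detected at minute 5 becomes HARD at minute 7; from then on the service is checked every five minutes again until it recovers.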

Now, before moving on to the service definitions, we should complete our discussion of passing service-specific arguments to commands by means of the $ARGn$ macros. As you'll remember, these macros act as placeholders: $ARGn$ expands to the n-th argument passed to the command in the service definition. For instance, a command definition such as the following expects to be passed two arguments:

define command {
    command_name                    some-command
    command_line                    $USER1$/check_something $ARG1$ $ARG2$
}

Therefore, to configure a service check to use the above command, we need to set the check_command directive to a string containing the command's short name followed by the arguments, separated by "!" characters. E.g.:

define service {
    service_description             some-service
    check_command                   some-command!arg-1!arg-2
    [...]
}
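Putting both mechanisms together, the Python sketch below splits a check_command value on "!", turns the pieces into $ARG1$, $ARG2$, ... and expands the command line they refer to. build_check and the macro values are illustrative assumptions (the address is that of the mail host defined earlier), and corner cases such as escaped "!" characters are not handled.

```python
import re

MACRO = re.compile(r"\$([A-Z0-9_]+)\$")

# Command lines from commands.cfg, keyed by command_name
COMMANDS = {"check-http": "$USER1$/check_http -I $HOSTADDRESS$ -u $ARG1$"}

def build_check(check_command, host_macros, commands=COMMANDS):
    """Split 'name!arg1!arg2' and expand the command line it refers to."""
    name, *args = check_command.split("!")
    macros = dict(host_macros)
    # $ARG1$ is the first value after the command name, $ARG2$ the second, ...
    macros.update({f"ARG{i}": arg for i, arg in enumerate(args, start=1)})
    return MACRO.sub(lambda m: macros.get(m.group(1), m.group(0)), commands[name])

host_macros = {"USER1": "/usr/local/libexec/nagios", "HOSTADDRESS": "172.16.240.150"}
print(build_check("check-http!/webmail/index.html", host_macros))
# prints /usr/local/libexec/nagios/check_http -I 172.16.240.150 -u /webmail/index.html
```

The same command definition serves every web server: only the part after "!" and the host address change from one service definition to the next.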

/var/www/etc/nagios/services/services.cfg

# Secure Shell service
define service {
    use                             generic-service
    service_description             SSH
# Short name(s) of the host(s) that run this service. If a service runs on all
# hosts, you may use the '*' wildcard character
    host_name                       *
    check_command                   check-ssh
# This directive is a possible alternative to using the members directive in
# servicegroup definitions
    servicegroups                   ssh-services
# Extended information
    notes                           Availability of the SSH daemon
    notes_url                       http://www.openssh.org/
    icon_image                      ssh.png
    icon_image_alt                  [SSH]
}

# Web service
define service {
    use                             generic-service
    service_description             WWW
    host_name                       www1,www2
    check_command                   check-http!/index.html
    notes                           Availability of the corporate web sites
    notes_url                       http://www.apache.org/
    icon_image                      www.png
    icon_image_alt                  [WWW]
}

define service {
    use                             generic-service
    service_description             WWW
    host_name                       mail
    check_command                   check-http!/webmail/index.html
    notes                           Availability of the web access to the mail server
    notes_url                       http://www.squirrelmail.org/
    icon_image                      www.png
    icon_image_alt                  [WWW]
}

[...]

Just like hosts, services can be grouped together with the servicegroup statement:

/var/www/etc/nagios/services/servicegroups.cfg

define servicegroup {
    servicegroup_name               www-services
    alias                           Web Services
# The 'members' directive requires a comma-separated list of host and
# service pairs, e.g. 'host1,service1,host2,service2,...'
    members                         www1,WWW,www2,WWW,mail,WWW
}

define servicegroup {
    servicegroup_name               dns-services
    alias                           Domain Name Service
    members                         dns1,DNS,dns2,DNS,dns3,DNS,dns4,DNS
}

# The members of the following servicegroup are specified with the
# 'servicegroups' directive in the 'SSH' service definition
define servicegroup {
    servicegroup_name               ssh-services
    alias                           Secure Shell Service
}

[...]
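Since members is a flat list, the host/service pairing is purely positional. The Python sketch below shows how such a list pairs up, and how a group can also collect members from services that name it in their servicegroups directive, as ssh-services does. member_pairs and group_members are hypothetical helpers, and expanding the "*" wildcard to a given list of hosts is a simplification.

```python
def member_pairs(members):
    """'host1,svc1,host2,svc2' -> [('host1', 'svc1'), ('host2', 'svc2')]."""
    items = [item.strip() for item in members.split(",")]
    if len(items) % 2:
        raise ValueError("members must be a list of host,service pairs")
    # Even positions are hosts, odd positions are the services on them
    return list(zip(items[0::2], items[1::2]))

def group_members(group_name, members, services, all_hosts):
    """Explicit members plus services whose 'servicegroups' names the group."""
    pairs = member_pairs(members) if members else []
    for svc in services:
        groups = [g.strip() for g in svc.get("servicegroups", "").split(",")]
        if group_name in groups:
            # Simplification: '*' stands for every host in all_hosts
            hosts = all_hosts if svc["host_name"] == "*" else svc["host_name"].split(",")
            pairs.extend((host, svc["service_description"]) for host in hosts)
    return pairs

print(member_pairs("www1,WWW,www2,WWW,mail,WWW"))
ssh = {"service_description": "SSH", "host_name": "*", "servicegroups": "ssh-services"}
print(group_members("ssh-services", "", [ssh], ["dns1", "mail", "proxy"]))
```

An odd number of items means a host or a service is missing somewhere in the list, which is why the sketch rejects it outright.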

Well, the bulk of the work is over now: the last step is configuring the web interface and then we will finally be able to set our Nagios server to work!