Nagios directory structure

Assume the Nagios root directory is:
ROOT = /opt/nagios

The overall layout of the Nagios configuration is as follows:

1. resource.cfg defines some macros:
$ cat $ROOT/etc/resource.cfg
$USER1$=/usr/local/nagios/libexec

2. commands.cfg defines the commands:
$ cat $ROOT/etc/objects/commands.cfg
define command{
        command_name    xxx
        command_line       xxx
        }

3. contacts.cfg defines the contacts:
$ cat $ROOT/etc/objects/contacts.cfg
define contact{
        name                                        critical-daytime
        host_notifications_enabled      1  
        …
        register                                     0   
       }  

define contact{
        contact_name                    jaseywang-critical-daytime
         …
        email                                  jaseywang # gmail.com
        }  

define contactgroup{
        contactgroup_name       jaseywang-daytime
        alias                               jaseywang-daytime
        members                       jaseywang-critical-daytime
        }

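In the first definition above, register 0 marks the entry as a template that other contacts can inherit from; it is not a real, usable contact.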

4. timeperiods.cfg defines the time periods:
$ cat $ROOT/etc/objects/timeperiods.cfg
define timeperiod{
        timeperiod_name 24x7
        alias              24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday         00:00-24:00
        tuesday         00:00-24:00
        wednesday   00:00-24:00
        thursday        00:00-24:00
        friday             00:00-24:00
        saturday        00:00-24:00
        }  

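contactgroup(analytics-daytime) <- contact(wangyuxi-critical-daytime) <- contact(critical-daytime) <- timeperiod(daytime: 10:00-23:00)
A contactgroup is made up of many contacts. A contact defines how to reach someone (email address and so on), while the contact template it inherits from defines general options such as the time periods for host/service notifications and the commands used to send them. The time periods themselves are defined by timeperiod.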
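
The chain above ends in a timeperiod, which decides when a contact may be notified. As a rough illustration only (this is not Nagios code), the Python sketch below checks whether a given moment falls inside a period written in the same HH:MM-HH:MM notation. The names DAYTIME, to_minutes and in_timeperiod are hypothetical, and the sketch assumes each weekday holds one or more comma-separated ranges.

```python
from datetime import datetime

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")

# Hypothetical "daytime" period: 10:00-23:00 on every day of the week.
DAYTIME = {day: "10:00-23:00" for day in WEEKDAYS}


def to_minutes(hhmm):
    """Convert "HH:MM" to minutes since midnight ("24:00" becomes 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(period, when):
    """Return True if `when` falls inside a range listed for its weekday."""
    ranges = period.get(WEEKDAYS[when.weekday()], "")
    now = when.hour * 60 + when.minute
    for item in ranges.split(","):
        if not item.strip():
            continue  # no range defined for this day
        start, end = item.strip().split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


# 2024-01-01 is a Monday.
print(in_timeperiod(DAYTIME, datetime(2024, 1, 1, 9, 59)))  # False
print(in_timeperiod(DAYTIME, datetime(2024, 1, 1, 10, 0)))  # True
```

The end of a range is treated as exclusive here, so 00:00-24:00 covers the whole day without needing a special case.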

5. host.cfg and service.cfg define the host and service templates respectively:
$ cat $ROOT/etc/objects/host.cfg
define host{
        name                            generic-host
        …
        register                        0   
        }  

define host{
    name                    xxx
    use                       generic-host
    contact_groups    xxx
    register                0   
    }  

define hostescalation{
    name                         xxx
    hostgroup_name       xxx
    first_notification         3
    last_notification         0
    notification_interval   30
    contact_groups          xxx
    }

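The definitions above set up generic host templates: generic-host is the base template, and xxx inherits from it via use. hostescalation defines the timing options for escalated alerts: here the escalation applies from the third notification onward (last_notification 0 means it has no upper bound), and notifications are re-sent every 30 minutes.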
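
To make the use / register 0 mechanism more concrete, here is a minimal Python sketch of how a chain of templates could be flattened into one effective definition. It is a simplified model, not the Nagios implementation: it assumes a single parent per object, and it assumes that name, use and register are not passed down to children. The host names, addresses and values in HOSTS are made up for illustration.

```python
# Hypothetical definitions; all names and values are placeholders.
HOSTS = [
    {"name": "generic-host", "register": "0",
     "max_check_attempts": "3", "check_interval": "2"},
    {"name": "web-template", "use": "generic-host",
     "contact_groups": "admins", "register": "0"},
    {"host_name": "web01", "use": "web-template",
     "address": "192.0.2.10", "check_interval": "5"},
]

# Assumption of this sketch: these directives stay with the object itself.
NOT_INHERITED = {"name", "use", "register"}


def resolve(obj, templates, seen=()):
    """Merge directives along the `use` chain; local values override inherited ones."""
    merged = {}
    parent_name = obj.get("use")
    if parent_name:
        if parent_name in seen:
            raise ValueError("circular template chain at " + parent_name)
        inherited = resolve(templates[parent_name], templates,
                            seen + (parent_name,))
        merged.update({key: value for key, value in inherited.items()
                       if key not in NOT_INHERITED})
    merged.update(obj)
    return merged


def real_objects(definitions):
    """Return fully resolved objects, skipping templates (register 0)."""
    templates = {d["name"]: d for d in definitions if "name" in d}
    return [resolve(d, templates) for d in definitions
            if d.get("register", "1") != "0"]


for host in real_objects(HOSTS):
    print(host["host_name"], host["check_interval"], host["max_check_attempts"])
```

Only web01 comes out as a real host. It keeps its own check_interval of 5 and picks up max_check_attempts and contact_groups from the two templates above it.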

6. Services are defined in much the same way as hosts:
$ cat $ROOT/etc/objects/service.cfg
define service{
        name                            generic-service
        register                         0  
        …
        }  

define service{
        name                          xxx
        use                             generic-service
        contact_groups          xxx
        register                      0
    }

7. You can also define hostgroup, servicegroup and so on, which group similar hosts and services together:
$ cat $ROOT/etc/objects/hostgroups.cfg
define hostgroup{
        hostgroup_name  network
        alias                     network servers
        members         xxxxxxxxxxxxxxxxxxxxxxxxxx, \
                                yyyyyyyyyyyyyyyyyyyyyyyyyy
        }  

$ cat $ROOT/etc/objects/servicegroup.cfg
define servicegroup{
    servicegroup_name   xxx
    alias                           yyy
    members             xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, \
                                yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
    }  

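For Nagios, the smallest unit is the host/service. By default a host is checked for liveness with ping. Relationships between hosts can be expressed through Host dependency definitions, which is useful when dealing with the down or unreachable states of a host. A service depends on only a handful of directives: check_command, check_period, notification_period, contacts and contact_groups. Once these are defined, the service is ready.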

Some notes on alert timing. A host has the following related parameters; services have similar ones.

1. check_command is used to check whether the host is alive:
check_command:     This directive is used to specify the short name of the command that should be used to check if the host is up or down. Typically, this command would try and ping the host to see if it is "alive". The command must return a status of OK (0) or Nagios will assume the host is down. If you leave this argument blank, the host will not be actively checked. Thus, Nagios will likely always assume the host is up (it may show up as being in a "PENDING" state in the web interface). This is useful if you are monitoring printers or other devices that are frequently turned off. The maximum amount of time that the notification command can run is controlled by the host_check_timeout option.

2. Interval between regular checks (2 min):
check_interval:     This directive is used to define the number of "time units" between regularly scheduled checks of the host. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

3. Interval between re-checks while the host is in a non-UP state. Once max_check_attempts checks have been made without the state changing, checks go back to using check_interval (1 min):
retry_interval:     This directive is used to define the number of "time units" to wait before scheduling a re-check of the hosts. Hosts are rescheduled at the retry interval when they have changed to a non-UP state. Once the host has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

4. Number of times the check command is run again when the host returns a non-OK state. If set to 1, there is no retry and an alert is generated directly (3):
max_check_attempts:     This directive is used to define the number of times that Nagios will retry the host check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the host check. Note: If you do not want to check the status of the host, you must still set this to a minimum value of 1. To bypass the host check, just leave the check_command option blank.

5. Interval between repeated notifications once the host is down (5 min):
notification_interval:     This directive is used to define the number of "time units" to wait before re-notifying a contact that this service is still down or unreachable. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this host – only one problem notification will be sent out.

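A complete cycle looks like this:
1. While the state is OK, the host is checked every check_interval.
2. If the host enters a non-OK state but max_check_attempts has not yet been reached, it is in a SOFT non-OK state, and checks are run every retry_interval.
3. Once the non-OK state has lasted for max_check_attempts checks, the host enters a HARD non-OK state, and from then on checks are run every check_interval again.
4. If first_notification_delay is set, it changes when the first notification is sent; a value of 0 means the notification is sent as soon as the problem is detected.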
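
The first three steps of the cycle can be sketched as a small Python simulation. This is a simplified model of the steps described above, not the actual Nagios scheduler. The function name simulate is hypothetical, and the default values are the ones noted in this post (check_interval 2, retry_interval 1, max_check_attempts 3). Notifications and first_notification_delay are deliberately left out.

```python
def simulate(results, check_interval=2, retry_interval=1, max_check_attempts=3):
    """Return a list of (minute, state) pairs for a sequence of check results.

    `results` is an iterable of booleans, True meaning the check returned OK.
    """
    timeline = []
    minute = 0
    failures = 0  # consecutive non-OK results so far
    for ok in results:
        if ok:
            failures = 0
            state, delay = "OK", check_interval
        else:
            failures += 1
            if failures < max_check_attempts:
                # Still retrying: SOFT state, re-check at the retry interval.
                state, delay = "SOFT", retry_interval
            else:
                # Attempts exhausted: HARD state, back to the normal interval.
                state, delay = "HARD", check_interval
        timeline.append((minute, state))
        minute += delay
    return timeline


for minute, state in simulate([True, False, False, False, False, True]):
    print(minute, state)
```

With the defaults, the run above yields OK at minute 0, SOFT at minutes 2 and 3, HARD at minutes 4 and 6, and OK again at minute 8. Passing max_check_attempts=1 makes the first failure HARD immediately, which matches the "no retry" behaviour described for that setting.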