HP Insight Control Software for Linux User Manual

Service: Environment

Status Information: Node sensor status

A warning or critical message indicates that one or more monitored sensors reported that a threshold
was exceeded.

Correct the condition.

Service: Load Average

Status Information: Node Load Ave: x/y/z QueLen: n

A warning or critical message indicates that load average thresholds for the specific managed system
were exceeded.

Thresholds can be set on a per-managed system, per-class, or per-system basis in the nagios_vars.ini
file. These values are specific to the site and depend on site load.

If the load average thresholds are reasonable, monitor for excessive activity on the managed system.
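As a rough way to check for excessive activity, the load averages can be read directly on the managed system. The following is a minimal sketch, assuming a Linux system that exposes /proc/loadavg. The function name load_exceeds and any threshold passed to it are illustrative only and are not part of the product or of nagios_vars.ini.

```shell
# Hypothetical helper (not part of the product): succeeds (exit status 0)
# when the 1-minute load average meets or exceeds the given threshold.
# Usage: load_exceeds THRESHOLD [LOADAVG_FILE]
load_exceeds() {
    threshold=$1
    file=${2:-/proc/loadavg}
    # The first field of /proc/loadavg is the 1-minute load average.
    awk -v t="$threshold" '
        NR == 1 { found = ($1 + 0 >= t + 0) }
        END     { exit !found }
    ' "$file"
}
```

For example, `load_exceeds 8 && echo "load is high"` prints a message when the 1-minute load average is 8 or more; the value 8 is an arbitrary example and should be replaced by a figure that suits the site.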

Service: Nagios Monitor

Status Information: Nagios status information

Typically reports the status of Nagios, the number of Nagios services located, and the last time the
Nagios status log was updated.

A warning or critical message indicates that one or more of the Nagios monitor processes either failed
or reported error conditions that can degrade monitoring.

Ensure that the managed system can communicate with the CMS.

Service: Nodeinfo

Status Information: Node process status total/user/zombie, uptime

Displays the total number of processes, the number of user processes, the number of zombie processes,
and the uptime for the Nagios host.

A warning or critical message indicates that thresholds for the specific managed system were exceeded.

Thresholds can be set on a per-managed system, per-class, or per-system basis in the nagios_vars.ini
file. These values are specific to the site and depend on site load.

If thresholds are reasonable, monitor for excessive activity on the managed system.
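To look at process counts by hand on a managed system, the output of ps can be summarized. The following is a minimal sketch, assuming the procps version of ps found on Linux. The function name summarize_states is hypothetical, and the sketch reports only total and zombie counts because how the product classifies "user" processes is not stated here.

```shell
# Hypothetical helper (not part of the product): reads process state codes,
# one per line, on standard input and prints "total/zombie".
# State codes that begin with "Z" mark zombie (defunct) processes.
summarize_states() {
    awk '
        NF         { total++ }
        $1 ~ /^Z/  { zombie++ }
        END        { printf "%d/%d\n", total, zombie }
    '
}
```

A typical invocation is `ps -eo stat= | summarize_states`, where `stat=` prints the state column without a header line.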

Service: Supermon Metrics Monitor

Status Information: Supermon node metrics retrieval status

Reports the status of the Supermon service and the number of systems from which it collected metrics
data.

A warning or critical message indicates that one or more systems were not accessible during metrics
collection or that the Nagios service_check_timeout interval expired.

These messages can occur if metrics collection cannot be completed in a reasonable time; examine the
/opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.

The default works best for configurations with fewer than 256 managed systems.

Increase the value of the service_check_timeout parameter to solve the problem for configurations
with more managed systems.

Also, run the following command to verify that the supermond service is running on the CMS:

# /etc/init.d/supermond status

Loss or time-outs of this service can cause per-managed system warnings for nodeinfo, load average,
and system free space.

A non-timeout warning or critical message indicates some monitored managed systems are not
responding; this is normal if the managed systems are down or otherwise disabled.
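The current service_check_timeout value mentioned above can be read from the configuration file before changing it. The following is a minimal sketch, assuming the file uses plain name=value lines with the name starting in the first column. The function name nagios_param is hypothetical and is not a command supplied with the product.

```shell
# Hypothetical helper (not part of the product): prints the value of a
# "name=value" directive from a Nagios main configuration file.
# Exits with status 1 if the directive is not found.
# Usage: nagios_param NAME [CONFIG_FILE]
nagios_param() {
    name=$1
    file=${2:-/opt/hptc/nagios/etc/nagios.cfg}
    awk -F= -v n="$name" '
        /^[ \t]*#/ { next }                      # skip comment lines
        $1 == n    { print $2; found = 1; exit } # first match wins
        END        { exit !found }
    ' "$file"
}
```

For example, `nagios_param service_check_timeout` prints the configured interval. After editing the value, Nagios typically needs to be restarted or reloaded before the change takes effect.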

Service: Syslog Alert Monitor

Status Information: Status of consolidated.log syslog monitoring

25.14 Nagios Troubleshooting
