HP Insight Control Software for Linux User Manual

A warning or critical message indicates that thresholds for the specific managed system were exceeded.

Thresholds can be set on a per-managed system, per-class or per-system basis in the nagios_vars.ini
file. These values are specific to the site and depend on site load.

If thresholds are reasonable, monitor for excessive activity on the managed system.

Service: Supermon Metrics Monitor

Status Information: Supermon node metrics retrieval status

Reports the status of the Supermon service and the number of systems from which it collected metrics
data.

A warning or critical message indicates that one or more systems were not accessible during metrics
collection or that the Nagios service_check_timeout interval expired.

These messages can occur if metrics collection cannot be completed in a reasonable time; examine the
/opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.

The default works best for configurations with fewer than 256 managed systems.

Increase the value of the service_check_timeout parameter to solve the problem for configurations
with more managed systems.
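
The current value can be read from the configuration file before it is changed. The following is a minimal sketch, not part of the product: the function name get_service_check_timeout is hypothetical, and the sketch assumes the parameter appears in the usual Nagios "name=value" form.

```shell
# Hypothetical helper (not shipped with the product): print the numeric
# service_check_timeout value found in a Nagios main configuration file.
# Assumes the directive is written as "service_check_timeout=<seconds>".
get_service_check_timeout() {
    cfg="${1:-/opt/hptc/nagios/etc/nagios.cfg}"
    # Skip commented lines (they do not start with the directive name),
    # extract the digits, and keep the last definition if there are several.
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1
}

# Usage on the CMS (reads the default path):
#   get_service_check_timeout
```

Nagios typically reads this file at startup, so a changed value is not expected to take effect until the Nagios service is restarted.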

Also, run the following command to verify that the supermond service is running on the CMS:

# /etc/init.d/supermond status

Loss or time-outs of this service can cause per-managed system warnings for nodeinfo, load average
and system free space.

A non-timeout warning or critical message indicates some monitored managed systems are not
responding; this is normal if the managed systems are down or otherwise disabled.

Service: Syslog Alert Monitor

Status Information: Status of consolidated.log syslog monitoring

Reports the number of new records processed in the /hptc_cluster/adm/logs/consolidated.log
file.

A warning or critical message occurs when there is insufficient time to process a huge volume of messages
before the Nagios service_check_timeout period expires.

Nagios examines the recent incoming consolidated log messages and issues a warning or critical
message if the incoming rate since the last interval exceeds a configured number of records. The default
values are 2 for warnings and 20 for critical. See /opt/hptc/nagios/libexec/check_syslogalerts
for details.
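
The threshold comparison described above can be illustrated as follows. This is a sketch only and does not reproduce the actual logic of check_syslogalerts: the function name classify_syslog_rate is hypothetical, and it assumes that "exceeds" means strictly greater than the configured number of records.

```shell
# Hypothetical illustration (not the plug-in's code): map a count of new
# records to a status name, using the documented defaults of 2 (warning)
# and 20 (critical) unless other thresholds are supplied.
classify_syslog_rate() {
    count="$1"
    warn="${2:-2}"
    crit="${3:-20}"
    if [ "$count" -gt "$crit" ]; then
        echo "CRITICAL"
    elif [ "$count" -gt "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Usage:
#   classify_syslog_rate 5          # default thresholds
#   classify_syslog_rate 50 10 40   # custom warning and critical thresholds
```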

No specific action is required unless the service times out. A time-out means that more syslog
messages were collected across the system than the plug-in can process in the
service_check_timeout period. See the /opt/hptc/nagios/etc/nagios.cfg file for the
value of the service_check_timeout parameter.

To solve the problem, run the following command on the system reporting the error:

# /opt/hptc/nagios/libexec/check_syslogalerts --domain node:nagios_monitor --nsca

Otherwise, wait for the nightly log to roll over.

Service: Syslog Alerts

Status Information: Node Syslog alerts information

Reports the number of alerts in a specified period of time and allows you to access the most recent log.

A warning or critical message indicates that one or more rules defined in the
/opt/hptc/nagios/etc/syslogAlertRules file match the specified system's consolidated log file.

Take the appropriate action based on the message.

Service: System Event Log

Status Information: Node Syslog alerts information

A warning or critical message indicates that one or more rules defined in the
/opt/hptc/nagios/etc/selRules file match the specified system's firmware System Event Log.

Take the appropriate action based on the System Event Log message.

Service: System Free Space

25.14 Nagios Troubleshooting 223