
Predictive Self-Healing Tools – FUJITSU SPARC ENTERPRISE M8000 User Manual


Chapter 2  Product Overview and Troubleshooting

You can find more detailed descriptions of Solaris OS Predictive Self-Healing at the
website below:

http://www.sun.com/bigadmin/features/articles/selfheal.html

Predictive self-healing is an architecture and methodology for automatically
diagnosing, reporting, and handling software and hardware fault conditions. This
new technology lessens the time required to debug a hardware or software problem
and provides the administrator and technical support with detailed data about each
fault.

2.6.1 Predictive Self-Healing Tools

In Solaris OS, the fault manager runs in the background. If a failure occurs, the
system software recognizes the error and attempts to determine which hardware
component is faulty. The software also takes steps to prevent that component from
being used until it has been replaced. Among other things, the software:

Receives telemetry information about problems detected by the system software

Diagnoses the problems

Initiates pro-active self-healing activities. For example, the fault manager can
disable faulty components.

A component that is isolated in this way is said to be degraded. Degraded is the
state of a FRU, group of FRUs, or part of a FRU that has been isolated because a
fault was detected. The isolation is usually done to prevent possibly faulty
components from affecting other system components. The part that is isolated is
not always the faulty part alone; a normal part may be degraded to isolate the
faulty part. If a function required for the operation of the system is degraded,
a system failure may result.

When possible, causes the faulty FRU to provide an LED indication of a fault in
addition to populating the system console messages with more details

TABLE 2-8 shows a typical message generated when a fault occurs. The message
appears on your console and is recorded in the /var/adm/messages file.

Note – The message in TABLE 2-8 indicates that the fault has already been
diagnosed. Any corrective action that the system can perform has already taken
place. If your server is still running, it continues to run.
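
Because fault messages are recorded in the /var/adm/messages file, they can also be
picked out of that file with a short script. The following Python sketch is
illustrative only and is not part of the manual. It assumes that each fault message
begins with a header line of the form "SUNW-MSG-ID: <id>, TYPE: <type>, VER:
<version>, SEVERITY: <severity>", which is the layout Solaris fault manager messages
commonly use; check this against the message shown in TABLE 2-8 before relying on
it. The names find_fault_messages and scan_log are hypothetical helpers, and the
sample line in the script is invented for demonstration.

```python
import re
from typing import Dict, Iterable, List

# Assumed header layout of a fault manager message (verify against TABLE 2-8):
#   SUNW-MSG-ID: <id>, TYPE: <type>, VER: <version>, SEVERITY: <severity>
HEADER = re.compile(
    r"SUNW-MSG-ID:\s*(?P<msg_id>[^,\s]+),\s*"
    r"TYPE:\s*(?P<type>[^,]+),\s*"
    r"VER:\s*(?P<ver>[^,]+),\s*"
    r"SEVERITY:\s*(?P<severity>\S+)"
)


def find_fault_messages(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per line that carries a SUNW-MSG-ID header."""
    faults = []
    for line in lines:
        match = HEADER.search(line)
        if match:
            # Strip stray whitespace from each captured field.
            faults.append({k: v.strip() for k, v in match.groupdict().items()})
    return faults


def scan_log(path: str = "/var/adm/messages") -> List[Dict[str, str]]:
    """Scan a log file (by default the one named in the text) for fault headers."""
    with open(path, errors="replace") as log:
        return find_fault_messages(log)


if __name__ == "__main__":
    # Invented sample line; the message ID is a placeholder, not a real ID.
    sample = [
        "host fmd: SUNW-MSG-ID: EXAMPLE-0000-XX, TYPE: Fault, VER: 1, SEVERITY: Major",
        "host unrelated: this line carries no fault header",
    ]
    for fault in find_fault_messages(sample):
        print(fault["msg_id"], fault["type"], fault["severity"])
```

On a running server, scan_log() would be called with no arguments by a user who has
read access to /var/adm/messages. The script only reports what has been logged; as
the note above states, the diagnosis and any corrective action have already taken
place by the time the message appears.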