
Processor run-time deconfiguration (cpu-gard) – IBM RS/6000 44P User Manual


It also uses the hardware error detection logic in the processor to capture run-time
recoverable and irrecoverable error indications. The firmware uses the error signatures
in the hardware to analyze and isolate the error to a specific processor.

The processors that are deconfigured remain off-line for subsequent reboots until the
faulty processor hardware is replaced.

This function allows users to manually deconfigure or re-enable a previously
deconfigured processor through the Service Processor menu. The user can also enable
or disable this function through the Service Processor.

Processor Run-Time Deconfiguration (CPU-Gard)

Processor run-time deconfiguration allows for the dynamic removal of CPUs from the
system configuration. The objective is to minimize system failures or data integrity
exposures due to a faulty processor. The processor to be removed is the one that has
experienced repeated run-time recoverable internal errors (over a predefined threshold).

The function uses the hardware error detection logic in the processor to capture
run-time recoverable error indications. The firmware uses the error signatures in the
hardware to analyze and isolate the error to a specific CPU. The firmware also
maintains error-threshold information.

When an internal recoverable error for a processor reaches a predefined threshold, the
firmware notifies the AIX operating system. The AIX operating system migrates all
software processes and interrupts to another processor and puts the faulty processor in
the stop state.
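
The threshold-and-migrate sequence described above can be illustrated with a minimal sketch. It is not the firmware or AIX implementation: the names `Cpu`, `record_recoverable_error`, and `ERROR_THRESHOLD` are hypothetical, the threshold value is invented (the manual does not state it), and the choice to keep the last online CPU running is an assumption made only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; the manual does not give the real value.
ERROR_THRESHOLD = 3


@dataclass
class Cpu:
    cpu_id: int
    recoverable_errors: int = 0
    online: bool = True
    tasks: list = field(default_factory=list)


def record_recoverable_error(cpus, cpu_id, threshold=ERROR_THRESHOLD):
    """Count one recoverable error against a CPU.

    Once the count reaches the threshold, move the CPU's work to another
    online CPU and mark the faulty CPU off-line (the "stop state").
    Returns True if this call took the CPU off-line.
    """
    faulty = cpus[cpu_id]
    faulty.recoverable_errors += 1
    if not faulty.online or faulty.recoverable_errors < threshold:
        return False

    # Candidate targets: every other CPU that is still online.
    targets = [c for c in cpus.values() if c.online and c.cpu_id != cpu_id]
    if not targets:
        # Assumption for this sketch: never remove the last online CPU.
        return False

    # Migrate the work to the least-loaded healthy CPU.
    target = min(targets, key=lambda c: len(c.tasks))
    target.tasks.extend(faulty.tasks)
    faulty.tasks.clear()
    faulty.online = False
    return True
```

In this sketch the error counter stands in for the error-threshold information that the firmware maintains, and the task list stands in for the processes and interrupts that the operating system migrates.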

CPUs that are deconfigured at run time remain off-line for subsequent reboots through
the CPU Boot Time Deconfiguration function, until the faulty CPU hardware is replaced.
The user can also enable or disable this function via the AIX system management
function.
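
As a rough illustration of how a run-time deconfiguration could carry over to later boots, the sketch below keeps a set of deconfiguration records and consults it at boot. The function name `cpus_to_configure_at_boot`, the record format, and the way a replacement is reported are all assumptions for the example; the manual does not describe how the firmware stores this information or detects replaced hardware.

```python
def cpus_to_configure_at_boot(installed, gard_records, replaced=()):
    """Decide which CPUs to bring online at boot.

    installed    -- CPU ids physically present
    gard_records -- ids deconfigured earlier (hypothetical persistent record)
    replaced     -- ids whose hardware has been replaced since the last boot

    Returns (configured CPU ids, updated records).
    """
    # A record is dropped only when the faulty hardware has been replaced.
    remaining = set(gard_records) - set(replaced)
    configured = [cpu for cpu in installed if cpu not in remaining]
    return configured, remaining
```

For example, with CPUs 0 through 3 installed and CPU 2 recorded as deconfigured, only 0, 1, and 3 are configured until CPU 2 is reported as replaced.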

Memory Boot-Time Deconfiguration (Memory Repeat-Gard)

Memory boot-time deconfiguration allows for the removal of a memory segment or
DIMM from the system configuration at boot time. The objective is to minimize system
failures or data integrity exposure due to faulty memory hardware. The hardware
resource(s) to be removed are the ones that experienced the following failures:

• A boot-time test failure.

• Run-time recoverable errors over threshold prior to the current boot phase.

• Run-time irrecoverable errors prior to the current boot phase.
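
The three criteria above amount to a simple per-resource check, sketched below. The `MemoryUnit` fields, the `should_deconfigure` function, and the `RECOVERABLE_THRESHOLD` value are hypothetical names and numbers chosen for illustration; the manual does not state the actual threshold or data layout.

```python
from dataclasses import dataclass

# Hypothetical threshold; the manual does not give the real value.
RECOVERABLE_THRESHOLD = 5


@dataclass
class MemoryUnit:
    name: str                               # memory segment or DIMM label
    post_failed: bool = False               # boot-time test failure
    prior_recoverable_errors: int = 0       # counted before the current boot
    prior_irrecoverable_error: bool = False # seen before the current boot


def should_deconfigure(unit, threshold=RECOVERABLE_THRESHOLD):
    """Return True if the unit meets any of the three removal criteria."""
    return (
        unit.post_failed
        or unit.prior_recoverable_errors > threshold
        or unit.prior_irrecoverable_error
    )
```

A unit whose recoverable-error count only equals the threshold is kept in this sketch, following the "over threshold" wording of the list.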

This function uses firmware Power-On Self-Test (POST) to discover and isolate memory
hardware failures during boot time. It also uses the hardware error detection logic in the
memory controller to capture run-time recoverable and irrecoverable error indications.
The firmware uses the error signatures in the hardware to analyze and isolate the error
to the specific memory segment or DIMM.


44P Series Model 170 User’s Guide