Next Steps – Kontron S4600 SEL Troubleshooting User Manual

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families

Memory Subsystem

Revision 1.1

Intel order number G90620-002

Byte  Field         Description
                    [5:4] – 10b = OEM code in Event Data 3
                    [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2  [7:2] – Reserved. Set to 0.
                    [1:0] – Rank on DIMM
                            0-3 = Rank number
16    Event Data 3  [7:5] – Socket ID
                            0-3 = CPU1-4
                    [4:3] – Channel
                            0-3 = Chan A-D for Socket
                    [2:0] – DIMM
                            0-2 = DIMM 1-3 on Channel
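The bit fields of Event Data 2 and Event Data 3 can be unpacked with a few shifts and masks. The following Python sketch is illustrative only: the function name `decode_dimm_location` and the `DimmLocation` type are hypothetical and are not part of any Intel or Kontron tool. It assumes the caller has already extracted bytes 15 and 16 from the hex SEL record.

```python
from typing import NamedTuple


class DimmLocation(NamedTuple):
    """Hypothetical container for a decoded DIMM location."""
    cpu: int       # 1-4, as in "CPU1-4"
    channel: str   # "A"-"D"
    dimm: int      # 1-3, DIMM position on the channel
    rank: int      # 0-3, rank on the DIMM


def decode_dimm_location(event_data2: int, event_data3: int) -> DimmLocation:
    """Decode Event Data 2 (byte 15) and Event Data 3 (byte 16)."""
    rank = event_data2 & 0x03                # ED2 bits [1:0]
    socket_id = (event_data3 >> 5) & 0x07    # ED3 bits [7:5]
    channel = (event_data3 >> 3) & 0x03      # ED3 bits [4:3]
    dimm = event_data3 & 0x07                # ED3 bits [2:0]

    # The table only defines socket values 0-3 and DIMM values 0-2.
    if socket_id > 3:
        raise ValueError(f"undefined Socket ID value: {socket_id}")
    if dimm > 2:
        raise ValueError(f"undefined DIMM value: {dimm}")

    return DimmLocation(
        cpu=socket_id + 1,
        channel="ABCD"[channel],
        dimm=dimm + 1,
        rank=rank,
    )


if __name__ == "__main__":
    # Example bytes chosen for illustration, not taken from a real log.
    loc = decode_dimm_location(0x01, 0x29)
    print(f"CPU{loc.cpu}, Channel {loc.channel}, DIMM {loc.dimm}, rank {loc.rank}")
```

For the example bytes 01h and 29h, the fields decode to CPU2, Channel B, DIMM 2, rank 1. Values outside the ranges listed in the table raise an error rather than being mapped to a guessed location.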

Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps

Event Trigger Offset: 01h, Uncorrectable ECC Error

Description:

An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash (unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error) and an MCE (Machine Check Exception Error).

While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact between socket and DIMM, or by bent pins in the processor socket.

Next Steps:

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.

4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.

5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

Event Trigger Offset: 00h, Correctable ECC Error threshold reached

Description:

There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems because the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.

Next Steps:

Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly failing. If this error occurs more than once:

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.

4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
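The guidance in Table 64 depends on which offset was logged and on whether the event has recurred for the same DIMM. The Python sketch below shows one way to tally events and pick a next step. It is an assumption-laden illustration, not vendor tooling: the names `offset_from_event_data1` and `suggest_actions`, the DIMM label strings, and the action wording are all hypothetical, and it assumes the byte carrying the Event Trigger Offset in bits [3:0] is Event Data 1.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Offsets from Table 64.
OFFSET_NAMES = {
    0x00: "Correctable ECC Error threshold reached",
    0x01: "Uncorrectable ECC Error",
}


def offset_from_event_data1(event_data1: int) -> int:
    """Return the Event Trigger Offset, bits [3:0] of the byte."""
    return event_data1 & 0x0F


def suggest_actions(events: Iterable[Tuple[int, str]]) -> Dict[str, str]:
    """Map each DIMM label to a suggested next step.

    `events` is an iterable of (offset, dimm_label) pairs that the
    caller has already decoded from the SEL (hypothetical input shape).
    """
    counts = Counter(events)
    actions: Dict[str, str] = {}
    for (offset, dimm), n in counts.items():
        if offset == 0x01:
            # Uncorrectable errors always take precedence for a DIMM.
            if n > 1:
                actions[dimm] = "replace DIMM"
            else:
                actions[dimm] = "reseat and inspect; consider replacing DIMM"
        elif offset == 0x00 and n > 1:
            # Table 64 asks for inspection only if the event recurs.
            actions.setdefault(dimm, "reseat and inspect DIMM and socket")
    return actions


if __name__ == "__main__":
    # Illustrative events, not taken from a real log.
    sample = [
        (0x01, "CPU1_A1"), (0x01, "CPU1_A1"),
        (0x00, "CPU2_B2"), (0x00, "CPU2_B2"),
        (0x00, "CPU3_C1"),
    ]
    for dimm, action in suggest_actions(sample).items():
        print(f"{dimm}: {action}")
```

A single correctable-threshold event produces no entry, matching the table's "if this error occurs more than once" condition. Physical inspection of the DIMM contacts and processor socket still has to be done by hand.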
