Next Steps – Kontron S4600 SEL Troubleshooting User Manual

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families
Memory Subsystem
Revision 1.1
Intel order number G90620-002
Byte  Field         Description
                    [5:4] – 10b = OEM code in Event Data 3
                    [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2  [7:2] – Reserved. Set to 0.
                    [1:0] – Rank on DIMM
                            0-3 = Rank number
16    Event Data 3  [7:5] – Socket ID
                            0-3 = CPU1-4
                    [4:3] – Channel
                            0-3 = Chan A-D for Socket
                    [2:0] – DIMM
                            0-2 = DIMM 1-3 on Channel
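
The bit fields above can be unpacked with a few shifts and masks. The following Python sketch is illustrative only and is not part of the Intel or Kontron tooling: the function name decode_ecc_event, the EccEvent type, and the argument names ed1, ed2, and ed3 (Event Data 1 to 3 as integers) are hypothetical. It decodes only the fields listed in this table and the two trigger offsets listed in Table 64.

```python
from typing import NamedTuple

# Trigger offsets as listed in Table 64; any other value is reported as "unknown".
TRIGGER_OFFSETS = {
    0x00: "Correctable ECC Error threshold reached",
    0x01: "Uncorrectable ECC Error",
}


class EccEvent(NamedTuple):
    """Decoded fields of an ECC error event (hypothetical helper type)."""
    offset: int
    offset_name: str
    rank: int
    socket: str
    channel: str
    dimm: int


def decode_ecc_event(ed1: int, ed2: int, ed3: int) -> EccEvent:
    """Decode Event Data 1-3 of a correctable/uncorrectable ECC error event."""
    for name, value in (("ed1", ed1), ("ed2", ed2), ("ed3", ed3)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must be a single byte, got {value!r}")

    offset = ed1 & 0x0F             # Event Data 1 [3:0]: event trigger offset
    rank = ed2 & 0x03               # Event Data 2 [1:0]: rank on DIMM (0-3)
    socket_id = (ed3 >> 5) & 0x07   # Event Data 3 [7:5]: 0-3 = CPU1-4
    channel_id = (ed3 >> 3) & 0x03  # Event Data 3 [4:3]: 0-3 = Chan A-D
    dimm_id = ed3 & 0x07            # Event Data 3 [2:0]: 0-2 = DIMM 1-3

    # The table only defines socket values 0-3 and DIMM values 0-2.
    if socket_id > 3 or dimm_id > 2:
        raise ValueError(f"Event Data 3 value {ed3:#04x} is outside the documented ranges")

    return EccEvent(
        offset=offset,
        offset_name=TRIGGER_OFFSETS.get(offset, "unknown"),
        rank=rank,
        socket=f"CPU{socket_id + 1}",
        channel="ABCD"[channel_id],
        dimm=dimm_id + 1,
    )
```

For example, with Event Data bytes 21h, 01h, and 2Ah, this sketch reports an Uncorrectable ECC Error on rank 1 of DIMM 3 on channel B of CPU2.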
Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset: 01h (Uncorrectable ECC Error)

Description:
An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash (unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error) and an MCE (Machine Check Exception).
While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact between socket and DIMM, or by bent pins in the processor socket.

Next Steps:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

Event Trigger Offset: 00h (Correctable ECC Error threshold reached)

Description:
There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems because the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.

Next Steps:
Even though this event doesn't immediately lead to problems, it can indicate that one of the DIMMs is slowly failing. If this error occurs more than once:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.