The document discusses CPU and memory architecture, error handling, and troubleshooting for Opteron processors. It covers cache and memory organization, error reporting banks, correctable and uncorrectable error types, machine check exceptions, memory addressing including interleaving and the memory hole, and provides examples of error messages from Linux and Solaris systems.
29. Another example of a sync flood error (not so friendly)
    1501 | 04/10/2007 | 04:18:02 | OEM #0x12 | | Asserted
    1601 | OEM record e0 | 00004800001111002000000000
    1701 | OEM record e0 | 10ab0000000810000006040012
    1801 | OEM record e0 | 10ab0000001111002011110020
    1901 | OEM record e0 | 1800000000f60000010005001b
    1a01 | OEM record e0 | 180000000000000000dffe0000
    1b01 | OEM record e0 | 1900000000f200002000020c0f
    1c01 | OEM record e0 | 1a00000000f200001000020c0f
    1d01 | OEM record e0 | 1b00000000f200003000020c0f
    1e01 | OEM record e0 | 80004800001111032000000000
31. Linux machine check exception example
    CPU 0: Machine Check Exception: 0000000000000004
    CPU 0: Machine Check Exception: 0000000000000004
    Bank 0: b600000000000185 at 0000000000000940
    Kernel panic: CPU context corrupt
    The above is from kernel: 2.4.21-27.0.1.ELsmp #1 SMP
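The bank status word in a machine check report can be read bit by bit. The following is a minimal sketch, assuming the architectural x86 layout of the upper flag bits of an MCi_STATUS register (bits 63 down to 57); the function name and table are illustrative and do not come from any kernel source.

```python
# Flag bits in the top byte of an x86 MCi_STATUS register (assumed layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # error overflow, a previous error was lost
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # address register is valid
    57: "PCC",    # processor context corrupt
}

def decode_status_flags(status):
    """Return the names of the flag bits that are set in a bank status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]

# Bank 0 status from the Linux example on this slide.
print(decode_status_flags(0xb600000000000185))
```

For the status on this slide the PCC flag is set, which agrees with the "CPU context corrupt" panic message.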
45. Memory hole address range without remapping
    Node address ranges are displayed at boot. Each node has 4 GB, but node 0 has “lost” memory (a full 4 GB address range would be 0000000000000000-00000000ffffffff). The memory hole lies between dfffffff and ffffffff, a size of 20000000.
    [root@va64-x4100f-gmp03 log]# pwd
    /var/log
    [root@va64-x4100f-gmp03 log]# grep -i Bootmem mess*
    Bootmem setup node 0 0000000000000000-00000000dfffffff
    Bootmem setup node 1 0000000100000000-00000001ffffffff
46. Address range with memory remapping around the hole (hoisting)
    In this case the memory is not lost. RAM addressing is remapped around the memory hole, so the limit of node 0 grows by 20000000, and both the base and the limit of node 1 grow by 20000000.
    Bootmem setup node 0 0000000000000000-000000011fffffff
    Bootmem setup node 1 0000000120000000-000000021fffffff
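The hoisting arithmetic can be checked with a short sketch. It assumes two nodes with 4 GB each and a hole that starts just above dfffffff and ends at the 4 GB boundary, as in the boot messages; the names HOLE_START, HOLE_SIZE and hoist are hypothetical.

```python
FOUR_GB = 0x1_0000_0000
HOLE_START = 0xE0000000            # first address inside the hole (assumed)
HOLE_SIZE = FOUR_GB - HOLE_START   # 0x20000000

def hoist(base, limit):
    """Shift any address at or above the hole upward by the hole size."""
    new_base = base + HOLE_SIZE if base >= HOLE_START else base
    new_limit = limit + HOLE_SIZE if limit >= HOLE_START else limit
    return new_base, new_limit

# Populated ranges before remapping: 4 GB per node.
nodes = [(0x0, 0xFFFFFFFF), (0x100000000, 0x1FFFFFFFF)]
for number, (base, limit) in enumerate(nodes):
    new_base, new_limit = hoist(base, limit)
    print("node %d %016x-%016x" % (number, new_base, new_limit))
```

The printed ranges can be compared with the two Bootmem lines shown for the remapped case.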
48. Red Hat 3 Update 2
    kernel: CPU 0: Silent Northbridge MCE
    kernel: Northbridge status 9443c100e3080a13
    kernel: ECC syndrome bits e307
    kernel: extended error chipkill ecc error
    kernel: link number 0
    kernel: dram scrub error
    kernel: corrected ecc error
    kernel: error address valid
    kernel: error enable
    kernel: previous error lost
    kernel: error address 00000000cf31f8f0
49. Later Red Hat 3 example
    kernel: CPU 3: Silent Northbridge MCE
    kernel: Northbridge status d4194000:9b080a13
    kernel: Error chipkill ecc error
    kernel: ECC error syndrome 9b32
    kernel: bus error local node response, request didn't time out
    kernel: generic read
    kernel: memory access, level generic
    kernel: link number 0
    kernel: corrected ecc error
    kernel: error overflow
    kernel: previous error lost
    kernel: NB error address 0000000ef28df0d8
50. Example of Red Hat 3 GART error
    CPU 3: Silent Northbridge MCE
    Northbridge status a60000010005001b
    processor context corrupt
    error address valid
    error uncorrected
    previous error lost
    GART TLB error generic level generic
    error address 000000007ffe40f0
    extended error gart error
    link number 0
    err cpu1
    processor context corrupt
    error address valid
    error uncorrected
    previous error lost
    error address 000000007ffe40f0
51. Example of EDAC output
    EDAC MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error
    EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label "": k8_edac
    EDAC MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error
    EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
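An EDAC "CE page" line carries enough information to rebuild the failing physical address and to locate the DIMM by row and channel. The sketch below is one way to pull those fields out; it assumes 4 KiB pages (a shift of 12) and the exact message layout shown on this slide, and parse_ce is a hypothetical helper, not part of EDAC.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages

CE_PATTERN = re.compile(
    r"CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain (\d+), "
    r"syndrome (0x[0-9a-f]+), row (\d+), channel (\d+)"
)

def parse_ce(line):
    """Extract address, syndrome, row and channel from an EDAC CE line."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None  # e.g. the "CE - no information available" lines
    page, offset, grain, syndrome, row, channel = match.groups()
    return {
        "address": (int(page, 16) << PAGE_SHIFT) | int(offset, 16),
        "grain": int(grain),
        "syndrome": int(syndrome, 16),
        "row": int(row),
        "channel": int(channel),
    }

line = ('EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, '
        'syndrome 0x3faf, row 3, channel 1, label "": k8_edac')
info = parse_ce(line)
print(hex(info["address"]), info["row"], info["channel"])
```

Row and channel are the fields that point at a physical DIMM slot; how they map to slot labels depends on the platform.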
52. Suse mcelog example, kernel 2.6.16.27
    MCE 1
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 2 4 northbridge TSC e169139a35188
    ADDR fa00f7f8
    Northbridge Chipkill ECC error
    Chipkill ECC syndrome = 4044
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node response, request didn't time out
    generic read mem transaction
    memory access, level generic'
    STATUS d422400040080a13 MCGSTATUS 0
53. Further Suse mcelog example
    MCE 31
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 3 1 instruction cache TSC 3e2dc434cdb5
    ADDR fa378ac0
    Instruction cache ECC error
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node origin, request didn't time out
    instruction fetch mem transaction
    memory access, level generic'
    STATUS d400400000000853 MCGSTATUS 0
54. ECC (non-chipkill) example
    CPU 2 4 northbridge TSC 3da2afa1102b
    ADDR f9076000
    Northbridge ECC error
    ECC syndrome = 31
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node response, request didn't time out
    generic read mem transaction
    memory access, level generic'
    STATUS d418c00000000a13 MCGSTATUS 0
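The syndrome and error type that mcelog prints can be recovered from the STATUS word itself. The sketch below assumes the following field positions for the Opteron northbridge status register: extended error code in bits 19:16, syndrome bits 7:0 in bits 54:47, and syndrome bits 15:8 in bits 31:24. Bits 46 and 62 are taken from the mcelog output. The extended error code names cover only the three values that appear in this document's examples, and decode_nb_status is a hypothetical name.

```python
# Extended error codes seen in this document's examples (not a full table).
EXT_ERROR_NAMES = {
    0x0: "ECC error",
    0x5: "GART error",
    0x8: "Chipkill ECC error",
}

def decode_nb_status(status):
    """Decode a few fields of a northbridge status word (assumed layout)."""
    ext_code = (status >> 16) & 0xF
    syndrome_low = (status >> 47) & 0xFF    # syndrome bits 7:0
    syndrome_high = (status >> 24) & 0xFF   # syndrome bits 15:8
    return {
        "extended_error": EXT_ERROR_NAMES.get(ext_code, "unknown (0x%x)" % ext_code),
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "corrected_ecc": bool((status >> 46) & 1),
        "overflow": bool((status >> 62) & 1),
    }

for status in (0xd422400040080a13, 0xd418c00000000a13):
    fields = decode_nb_status(status)
    print("%016x %s syndrome=%x" % (status, fields["extended_error"], fields["syndrome"]))
```

Applied to the two mcelog STATUS values, the decoded syndromes can be compared with the "syndrome =" lines in the corresponding reports.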
55. Confusing EDAC example: note that two MC numbers are reported.
    eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
    eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac
    eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set
    eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
    eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
57. fmd: [ID 441519 daemon.error]
    SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major
    EVENT-TIME: Sat Mar 10 00:52:13 MET 2007
    PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288 , HOSTNAME: siegert
    SOURCE: eft, REV: 1.16
    EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354
    DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-3K for more information.
    AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
    IMPACT: Total system memory capacity will be reduced as pages are retired.
    REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
58. # fmdump
    TIME UUID SUNW-MSG-ID
    Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
    # fmadm faulty
    STATE RESOURCE / UUID
    -------- ----------------------------------------------------------------------
    degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
             13441a52-c465-629b-ca9d-fc77b0e66354
    -------- ----------------------------------------------------------------------
    # fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354
    TIME UUID SUNW-MSG-ID
    Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
    100% fault.memory.dimm_ck
    Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
    Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
    FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
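The resource strings in the fmdump output name the faulty part as a path of name=value pairs, which is what identifies the DIMM to replace. The sketch below splits such a string into its components; it handles only the simple forms shown in this document (numeric values, ":///" separator), and parse_fmri is a hypothetical helper rather than a Solaris tool.

```python
def parse_fmri(fmri):
    """Split a resource string like hc:///a=0/b=1 into scheme and components."""
    scheme, _, path = fmri.partition(":///")
    components = {}
    for part in path.split("/"):
        name, _, value = part.partition("=")
        components[name] = int(value)
    return scheme, components

scheme, parts = parse_fmri("hc:///motherboard=0/chip=0/memory-controller=0/dimm=1")
print(scheme, parts)
```

Here the chip component selects the CPU socket and the dimm component selects the module attached to that socket's memory controller.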
59. Example of FMA detecting a CPU error
    Solaris handles the machine check exception, and the FMA information is available after the reboot.
60. SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
    EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45)
    i86pc, CSN: -, HOSTNAME:
    SOURCE: SunOS, REV: 5.10 Generic_118855-14
    DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
    [Thu Jan 4 21:43:21 2007]AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
    REC-ACTION: Save the error summary below in case telemetry cannot be saved
    [Thu Jan 4 21:43:21 2007]
    [Thu Jan 4 21:43:21 2007]ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ version=0 scheme= "hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1
    ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401
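The EVENT-TIME field in this panic-time message is printed in hexadecimal rather than as a date. A minimal sketch for converting it follows, assuming the two values are seconds since the Unix epoch and nanoseconds; parse_event_time is a hypothetical name, and the third value in parentheses is left undecoded.

```python
from datetime import datetime, timezone

def parse_event_time(field):
    """Convert a '0xSECONDS.0xNANOSECONDS' field to a UTC datetime."""
    sec_hex, nsec_hex = field.split(".")
    seconds = int(sec_hex, 16)
    nanoseconds = int(nsec_hex, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds

stamp, nsec = parse_event_time("0x459d66e9.0xbf18650")
print(stamp.isoformat(), nsec)
```

The result is in UTC; the bracketed console timestamps in the same message are one hour ahead, which fits a system clock set to MET.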
61. System now panics and then reboots
    panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception
    dumping to /dev/dsk/c0t0d0s1, offset 860356608,
62. SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, Severity Major
    EVENT-TIME: Fri Jan 5 10:11:10 MET 2007
    PLATFORM: Sun Fire X4200 Server, CSN: 0000000000 , HOSTNAME: z-app1.vpv.no1.asap-asp.net
    SOURCE: eft, REV: 1.16
    EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d
    DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-67 for more information.
    RESPONSE: An attempt will be made to remove this CPU from service.
    IMPACT: Performance of this system may be affected.
    REC-ACTION: Schedule a repair procedure to replace affected CPU. Use fmdump -v -u <EVENT_ID> to identify the module.
63. # fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d
    TIME UUID SUNW-MSG-ID
    Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67
    100% fault.cpu.amd.l2cachetag
    Problem in: hc:///motherboard=0/chip=1/cpu=0
    Affects: cpu:///cpuid=1
    FRU: hc:///motherboard=0/chip=1