CPU and Memory Events
Topics
CPU Architecture
 
Opteron Processor Overview
Dual Core Opteron
Cache and Memory
Cache Organisation
Cache Details
Translation Lookaside Buffer
Traditional Northbridge
Opteron Northbridge
Opteron server overview
Error Reporting Banks
Opteron Error Reporting Banks
Error Reporting Bank Registers
Role of registers
Decoding MCi Status Registers
Decoding MCi Status Registers (continued)
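As a rough illustration of what decoding an MCi status register involves, the sketch below tests the flag bits in the top byte of a 64-bit status value. The bit positions are the standard x86 machine-check layout (bit 63 VAL down to bit 57 PCC). The function name `decode_mci_status` and the short flag mnemonics are assumptions made for this example and do not come from the slides. The sample value is the Bank 0 status quoted in the Linux machine check exception example in this deck.

```python
# Sketch only: flag bits in the upper byte of an x86 MCi_STATUS register.
# Names below are illustrative, not taken from the slides.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # error overflow, an earlier error was lost
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return the set flag names and the low 16-bit error code."""
    flags = [name for bit, name in MCI_STATUS_FLAGS.items()
             if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


if __name__ == "__main__":
    decoded = decode_mci_status(0xB600000000000185)
    print(decoded["flags"], hex(decoded["error_code"]))
```

For the sample value, PCC is set, which is consistent with the "CPU context corrupt" panic shown with that status.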
CHIPKILL + SYNDROMES
Portion of chipkill syndrome table (128-bit memory word)
64-bit memory word
64-bit word ECC syndrome table
Error Types and handling
Correctable ECC errors
Handling Uncorrectable errors
Sync Flood
Sync Flood example SEL
    001 | 01/03/2007 | 21:43:00 | OEM #0x12 |  | Asserted
    2101 | OEM record e0 | 00000000040f0c0200400000f2
    2201 | OEM record e0 | 01000000040000000000000000
    2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0
    2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0
    2501 | 01/03/2007 | 21:43:18 | Memory p1.d1.fail | Predictive Failure Asserted
    2601 | 01/03/2007 | 20:43:12 | System Firmware Progress | Motherboard initialization | Asserted
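When a SEL dump like the one above runs to many records, a small script can pick out the memory events. The following is a minimal sketch that assumes the SEL text has one pipe-delimited record per line. The helper names `parse_sel` and `uncorrectable_dimms` are hypothetical and are not part of any Sun or IPMI tool.

```python
# Sketch only: split pipe-delimited SEL records and list the locations
# reported with an "Uncorrectable ECC" event. Helper names are hypothetical.
def parse_sel(text: str) -> list:
    """Return each SEL record as a list of stripped fields."""
    records = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2:          # skip blank or non-record lines
            records.append(fields)
    return records


def uncorrectable_dimms(records: list) -> list:
    """Return the last field (the location) of every Uncorrectable ECC record."""
    return [rec[-1] for rec in records if "Uncorrectable ECC" in rec]


if __name__ == "__main__":
    sample = """\
2201 | OEM record e0 | 01000000040000000000000000
2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0
2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0
"""
    print(uncorrectable_dimms(parse_sel(sample)))
```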
Another example of sync flood error - not so friendly
    1501 | 04/10/2007 | 04:18:02 | OEM #0x12 |  | Asserted
    1601 | OEM record e0 | 00004800001111002000000000
    1701 | OEM record e0 | 10ab0000000810000006040012
    1801 | OEM record e0 | 10ab0000001111002011110020
    1901 | OEM record e0 | 1800000000f60000010005001b
    1a01 | OEM record e0 | 180000000000000000dffe0000
    1b01 | OEM record e0 | 1900000000f200002000020c0f
    1c01 | OEM record e0 | 1a00000000f200001000020c0f
    1d01 | OEM record e0 | 1b00000000f200003000020c0f
    1e01 | OEM record e0 | 80004800001111032000000000
Machine check exception
Linux machine check exception example
    CPU 0: Machine Check Exception: 0000000000000004
    CPU 0: Machine Check Exception: 0000000000000004
    Bank 0: b600000000000185 at 0000000000000940
    Kernel panic: CPU context corrupt
The above is from kernel: 2.4.21-27.0.1.ELsmp #1 SMP
Solaris machine check exception example
    WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0ifying DMI Pool Data ....
    sched: #mc Machine check
    pid=0, pc=0xfffffffffb8233ea, sp=0xfffffe8000293ad8, eflags=0x216
    cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
    cr2: 8073c62 cr3: d3a7000 cr8: c
    rdi: ffffffff812dadf0 rsi: ffffffff815f4df0 rdx: 1000
    rcx: 42  r8: 1  r9: 1
    rax: fffffe8000293c80 rbx: ffffffff81282e00 rbp: fffffe8000293b10
    r10: 1 r11: 1 r12: 0
    r13: ffffffff81282e00 r14: ffffffff81283318 r15: fffffe800025db40
    fsb: ffffffff80000000 gsb: ffffffff81034000  ds: 43
     es: 43  fs: 0  gs: 1c3
    trp: 12 err: 0 rip: fffffffffb8233ea
     cs: 28 rfl: 216 rsp: fffffe8000293ad8
Memory Addressing and Interleaving
Example of a DIMM layout
Contiguous addressing versus Interleaving
http://systems-tsc/twiki/pub/Products/SunFireX4100FaqPts/OpteronMemInterlvNotes.pdf
Interleaving
Rev F DIMM Interleave Addresses
Example of addressing
Simplified addressing - no interleave
Simplified addressing - interleave
Memory/PCI Hole
Effect of memory hole on address ranges
Technique to discover memory ranges on CPU for Linux systems
 
Memory hole address range without remapping
Node address ranges are displayed at boot. Each node has 4GB, but node 0 has “lost” memory (a full 4GB address range would be 0000000000000000-00000000ffffffff). The memory hole lies between dfffffff and ffffffff, a span of 20000000.
    [root@va64-x4100f-gmp03 log]# pwd
    /var/log
    [root@va64-x4100f-gmp03 log]# grep -i Bootmem mess*
    Bootmem setup node 0 000000000000000-00000000dfffffff
    Bootmem setup node 1 0000000100000000-00000001ffffffff
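Once the Bootmem lines are known, an error address can be matched to a node mechanically. The sketch below assumes the message format shown in the grep output above. The names `parse_bootmem` and `node_for_address` are made up for this example, and pairing the ranges from this slide with an error address from a different log elsewhere in the deck is purely illustrative.

```python
# Sketch only: map a physical error address to a node using the
# "Bootmem setup node" lines from /var/log/messages. Names are hypothetical.
import re

BOOTMEM_RE = re.compile(
    r"Bootmem setup node (\d+) ([0-9a-f]+)-([0-9a-f]+)", re.IGNORECASE)


def parse_bootmem(text: str) -> dict:
    """Return {node: (base, limit)} parsed from Bootmem log lines."""
    return {int(m.group(1)): (int(m.group(2), 16), int(m.group(3), 16))
            for m in BOOTMEM_RE.finditer(text)}


def node_for_address(ranges: dict, addr: int):
    """Return the node whose range contains addr, or None (for example, the hole)."""
    for node, (base, limit) in sorted(ranges.items()):
        if base <= addr <= limit:
            return node
    return None


if __name__ == "__main__":
    log = """\
Bootmem setup node 0 000000000000000-00000000dfffffff
Bootmem setup node 1 0000000100000000-00000001ffffffff
"""
    ranges = parse_bootmem(log)
    print(node_for_address(ranges, 0xCF31F8F0))
```

An address that falls inside the memory hole matches no node, so the lookup returns `None` for it.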
Address range with memory remapping around hole (hoisting)
In this case we do not lose the memory. RAM addressing is remapped around the memory hole, so the limit of node 0 grows by 20000000, and both the base and the limit of node 1 grow by 20000000.
    Bootmem setup node 0 0000000000000000-000000011fffffff
    Bootmem setup node 1 0000000120000000-000000021fffffff
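The arithmetic behind the two Bootmem listings can be reproduced with a few lines of code. This is a simplified model, not the BIOS algorithm: it assumes the hole occupies the top of the first 4GB, that the node containing the hole ends exactly at 4GB, and that nodes are laid out contiguously. The function name `node_ranges` and its parameters are assumptions for the example.

```python
# Sketch only: simplified model of node address ranges with and without
# memory hoisting. Function and parameter names are illustrative.
FOUR_GB = 0x1_0000_0000


def node_ranges(node_sizes, hole_base, hoist):
    """Return [(base, limit), ...] per node for the given hole base."""
    hole_size = FOUR_GB - hole_base
    ranges, base, shift = [], 0, 0
    for size in node_sizes:
        start = base + shift
        end = base + size - 1 + shift
        if base <= hole_base < base + size:   # this node contains the hole
            if hoist:
                shift = hole_size             # later nodes move up by the hole size
                end += hole_size              # RAM behind the hole reappears above 4GB
            else:
                end = hole_base - 1           # RAM behind the hole is lost
        ranges.append((start, end))
        base += size
    return ranges


if __name__ == "__main__":
    for hoist in (False, True):
        for node, (lo, hi) in enumerate(
                node_ranges([FOUR_GB, FOUR_GB], 0xE0000000, hoist)):
            print(f"hoist={hoist} node {node} {lo:016x}-{hi:016x}")
```

With two 4GB nodes and a hole base of e0000000, the model gives the same ranges as both Bootmem listings.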
Some examples of error reporting
Red Hat 3 Update 2
    kernel: CPU 0: Silent Northbridge MCE
    kernel: Northbridge status 9443c100e3080a13
    kernel:  ECC syndrome bits e307
    kernel:  extended error chipkill ecc error
    kernel:  link number 0
    kernel:  dram scrub error
    kernel:  corrected ecc error
    kernel:  error address valid
    kernel:  error enable
    kernel:  previous error lost
    kernel:  error address 00000000cf31f8f0
Later Red Hat 3 example
    kernel: CPU 3: Silent Northbridge MCE
    kernel: Northbridge status d4194000:9b080a13
    kernel: Error chipkill ecc error
    kernel: ECC error syndrome 9b32
    kernel: bus error local node response, request didn't time out
    kernel: generic read
    kernel: memory access, level generic
    kernel: link number 0
    kernel: corrected ecc error
    kernel: error overflow
    kernel: previous error lost
    kernel: NB error address 0000000ef28df0d8
Example of Red Hat 3 GART error
    CPU 3: Silent Northbridge MCE
    Northbridge status a60000010005001b
    processor context corrupt
    error address valid
    error uncorrected
    previous error lost
    GART TLB error generic level generic
    error address 000000007ffe40f0
    extended error gart error
    link number 0
    err cpu1
    processor context corrupt
    error address valid
    error uncorrected
    previous error lost
    error address 000000007ffe40f0
Example of EDAC output
    EDAC MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error
    EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label "": k8_edac
    EDAC MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error
    EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
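The EDAC "CE page ... offset ..." line carries enough information to rebuild the physical address of the corrected error. The sketch below assumes 4 KiB pages, as on x86-64 Linux, so the address is the page number shifted left by 12 bits plus the offset. The regular expression follows the line format shown above, and the name `parse_edac_ce` is hypothetical.

```python
# Sketch only: pull page, offset, syndrome, row and channel out of an
# EDAC corrected-error line. Assumes 4 KiB pages; the name is hypothetical.
import re

CE_RE = re.compile(
    r"CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain (\d+), "
    r"syndrome (0x[0-9a-f]+), row (\d+), channel (\d+)")
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_ce(line: str):
    """Return a dict for a 'CE page' line, or None for any other line."""
    m = CE_RE.search(line)
    if m is None:
        return None
    page, offset = int(m.group(1), 16), int(m.group(2), 16)
    return {
        "address": (page << PAGE_SHIFT) | offset,
        "syndrome": int(m.group(4), 16),
        "row": int(m.group(5)),
        "channel": int(m.group(6)),
    }


if __name__ == "__main__":
    line = ('EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, '
            'syndrome 0x3faf, row 3, channel 1, label "": k8_edac')
    info = parse_edac_ce(line)
    print(hex(info["address"]), hex(info["syndrome"]), info["row"], info["channel"])
```

The row and channel fields are what narrow the error down to a DIMM, and how they map to a physical slot depends on the platform.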
Suse mcelog example kernel 2.6.16.27
    MCE 1
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 2 4 northbridge TSC e169139a35188
    ADDR fa00f7f8
    Northbridge Chipkill ECC error
    Chipkill ECC syndrome = 4044
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node response, request didn't time out
    generic read mem transaction
    memory access, level generic'
    STATUS d422400040080a13 MCGSTATUS 0
Further Suse mcelog example
    MCE 31
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 3 1 instruction cache TSC 3e2dc434cdb5
    ADDR fa378ac0
    Instruction cache ECC error
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node origin, request didn't time out
    instruction fetch mem transaction
    memory access, level generic'
    STATUS d400400000000853 MCGSTATUS 0
ECC (non-chipkill example)
    CPU 2 4 northbridge TSC 3da2afa1102b
    ADDR f9076000
    Northbridge ECC error
    ECC syndrome = 31
    bit46 = corrected ecc error
    bit62 = error overflow (multiple errors)
    bus error 'local node response, request didn't time out
    generic read mem transaction
    memory access, level generic'
    STATUS d418c00000000a13 MCGSTATUS 0
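The syndrome and error type that mcelog prints can be recovered by hand from the northbridge STATUS value. The sketch below uses the K8 northbridge (bank 4) field positions as I understand them: extended error code in bits 19:16, syndrome bits 7:0 in bits 54:47, the extra chipkill syndrome byte in bits 31:24, corrected ECC in bit 46 and uncorrected ECC in bit 45. Treat these positions as assumptions to check against the AMD BKDG for the processor revision in question. The function name and the three-entry error table are illustrative, and only the extended codes that appear in this deck are listed.

```python
# Sketch only: decode ECC-related fields of a K8 northbridge (bank 4)
# status value. Field positions are assumptions to verify against the BKDG.
NB_EXT_ERROR = {0x0: "ECC error", 0x5: "GART error", 0x8: "Chipkill ECC error"}


def decode_nb_status(status: int) -> dict:
    ext = (status >> 16) & 0xF
    syndrome = None
    if ext in (0x0, 0x8):                         # syndrome only applies to ECC errors
        syndrome = (status >> 47) & 0xFF          # syndrome bits 7:0
        if ext == 0x8:                            # chipkill adds syndrome bits 15:8
            syndrome |= ((status >> 24) & 0xFF) << 8
    return {
        "extended_error": NB_EXT_ERROR.get(ext, f"not decoded here (0x{ext:x})"),
        "syndrome": syndrome,
        "corrected_ecc": bool((status >> 46) & 1),
        "uncorrected_ecc": bool((status >> 45) & 1),
        "overflow": bool((status >> 62) & 1),
    }


if __name__ == "__main__":
    for value in (0xD422400040080A13, 0xD418C00000000A13):
        decoded = decode_nb_status(value)
        print(f"{value:016x}", decoded["extended_error"],
              f"syndrome={decoded['syndrome']:x}")
```

For the two STATUS values in the mcelog northbridge examples above, this yields syndromes 4044 and 31, matching what mcelog printed.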
Confusing EDAC example: note two MC numbers reporting.
    eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
    eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac
    eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set
    eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
    eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
FMA information examples
    # fmdump -v -u 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86
    TIME  UUID  SUNW-MSG-ID
    Feb 18 15:42:41.1662 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86 AMD-8000-3K
    100%  fault.memory.dimm_ck
    Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
    Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=3
    FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
    fmd: [ID 441519 daemon.error] SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major
    EVENT-TIME: Sat Mar 10 00:52:13 MET 2007
    PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288  , HOSTNAME: siegert
    SOURCE: eft, REV: 1.16
    EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354
    DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/AMD-8000-3K for more information.
    AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
    IMPACT: Total system memory capacity will be reduced as pages are retired.
    REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
    # fmdump
    TIME  UUID  SUNW-MSG-ID
    Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
    # fmadm faulty
    STATE RESOURCE / UUID
    -------- ----------------------------------------------------------------------
    degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
             13441a52-c465-629b-ca9d-fc77b0e66354
    -------- ----------------------------------------------------------------------
    # fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354
    TIME  UUID  SUNW-MSG-ID
    Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
    100%  fault.memory.dimm_ck
    Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
    Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
    FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
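The FRU line in the fmdump output identifies the part to replace, and its path can be split into components when many events need to be summarised. This is a minimal sketch that only handles the simple `name=number` paths shown above. The helper names `fru_from_fmdump` and `parse_fmri` are hypothetical and are not part of the Solaris FMA tools.

```python
# Sketch only: extract the FRU path from saved `fmdump -v -u` output and
# split it into components. Helper names are hypothetical.
import re


def fru_from_fmdump(text: str):
    """Return the first FRU path found in the text, or None."""
    m = re.search(r"FRU:\s*(\S+)", text)
    return m.group(1) if m else None


def parse_fmri(fmri: str):
    """Split 'hc:///a=0/b=1' into ('hc', {'a': 0, 'b': 1})."""
    scheme, _, path = fmri.partition(":///")
    parts = {}
    for component in path.split("/"):
        name, _, value = component.partition("=")
        parts[name] = int(value)
    return scheme, parts


if __name__ == "__main__":
    output = """\
100%  fault.memory.dimm_ck
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
"""
    print(parse_fmri(fru_from_fmdump(output)))
```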
Example of FMA detecting CPU error
Solaris handles the machine check exception, and FMA information is available on reboot.
    SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
    EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45)
    i86pc, CSN: -, HOSTNAME:
    SOURCE: SunOS, REV: 5.10 Generic_118855-14
    DESC: Errors have been detected that require a reboot to ensure system integrity.  See http://www.sun.com/msg/SUNOS-8000-0G for more information.
    [Thu Jan  4 21:43:21 2007]AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
    REC-ACTION: Save the error summary below in case telemetry cannot be saved
    [Thu Jan  4 21:43:21 2007]
    [Thu Jan  4 21:43:21 2007]ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ > > version=0 scheme= "hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1
    ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401
System now panics and then reboots
    panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception
    dumping to /dev/dsk/c0t0d0s1, offset 860356608,
    SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, Severity Major
    EVENT-TIME: Fri Jan  5 10:11:10 MET 2007
    PLATFORM: Sun Fire X4200 Server, CSN: 0000000000 , HOSTNAME: z-app1.vpv.no1.asap-asp.net
    SOURCE: eft, REV: 1.16
    EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d
    DESC: The number of errors associated with this CPU has exceeded acceptable levels.  Refer to http://sun.com/msg/AMD-8000-67 for more information.
    RESPONSE: An attempt will be made to remove this CPU from service.
    IMPACT: Performance of this system may be affected.
    REC-ACTION: Schedule a repair procedure to replace affected CPU.  Use fmdump -v -u <EVENT_ID> to identify the module.
    #>fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d
    TIME  UUID  SUNW-MSG-ID
    Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67
    100%  fault.cpu.amd.l2cachetag
    Problem in: hc:///motherboard=0/chip=1/cpu=0
    Affects: cpu:///cpuid=1
    FRU: hc:///motherboard=0/chip=1
Some programs and utilities
HERD
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
mcelog
mcat
Newisys decoder
X64 Memory Replacement Policy
X64 Memory Replacement Policy
02195
Three rules to change DIMMs - I can't count
Glossary of terms
 

  • 1. CPU and Memory Events [email_address]
  • 2.
  • 4.  
  • 9.
  • 10.
  • 12.
  • 13.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. Portion of chipkill syndrome table 128 bit memory word
  • 22.
  • 23. 64 bit word ECC syndrome table
  • 24. Error Types and handling
  • 25.
  • 26.
  • 27.
  • 28. 001 | 01/03/2007 | 21:43:00 | OEM #0x12 | | Asserted 2101 | OEM record e0 | 00000000040f0c0200400000f2 2201 | OEM record e0 | 01000000040000000000000000 2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0 2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0 2501 | 01/03/2007 | 21:43:18 | Memory p1.d1.fail | Predictive Failure Asserted 2601 | 01/03/2007 | 20:43:12 | System Firmware Progress | Motherboard initialization | Asserted Sync Flood example SEL
  • 29. Another example of sync flood error - not so friendly - 1501 | 04/10/2007 | 04:18:02 | OEM #0x12 | | Asserted 1601 | OEM record e0 | 00004800001111002000000000 1701 | OEM record e0 | 10ab0000000810000006040012 1801 | OEM record e0 | 10ab0000001111002011110020 1901 | OEM record e0 | 1800000000f60000010005001b 1a01 | OEM record e0 | 180000000000000000dffe0000 1b01 | OEM record e0 | 1900000000f200002000020c0f 1c01 | OEM record e0 | 1a00000000f200001000020c0f 1d01 | OEM record e0 | 1b00000000f200003000020c0f 1e01 | OEM record e0 | 80004800001111032000000000
  • 30.
  • 31. Linux machine check exception example CPU 0: Machine Check Exception: 0000000000000004 CPU 0: Machine Check Exception: 0000000000000004 Bank 0: b600000000000185 at 0000000000000940 Kernel panic: CPU context corrupt The above is from kernel: 2.4.21-27.0.1.ELsmp #1 SMP
  • 32. Machine check exception example Solaris WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0ifying DMI Pool Data .... sched: #mc Machine check pid=0, pc=0xfffffffffb8233ea, sp=0xfffffe8000293ad8, eflags=0x216 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse> cr2: 8073c62 cr3: d3a7000 cr8: c rdi: ffffffff812dadf0 rsi: ffffffff815f4df0 rdx: 1000 rcx: 42 r8: 1 r9: 1 rax: fffffe8000293c80 rbx: ffffffff81282e00 rbp: fffffe8000293b10 r10: 1 r11: 1 r12: 0 r13: ffffffff81282e00 r14: ffffffff81283318 r15: fffffe800025db40 fsb: ffffffff80000000 gsb: ffffffff81034000 ds: 43 es: 43 fs: 0 gs: 1c3 trp: 12 err: 0 rip: fffffffffb8233ea cs: 28 rfl: 216 rsp: fffffe8000293ad8
  • 33. Memory Addressing and Interleaving
  • 34. Example of a DIMM layout
  • 35.
  • 36.
  • 37. Rev F DIMM Interleave Addresses
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.  
  • 45. Memory Hole address range without remapping Node address range displayed at boot. Each Node has 4GB node 0 has “lost” memory (a 4G address range would be 000000000000000-00000000ffffffff) Memory hole exists between dfffffff and fffffff =20000000 [root@va64-x4100f-gmp03 log]# pwd /var/log [root@va64-x4100f-gmp03 log]# grep -i Bootmem mess* Bootmem setup node 0 000000000000000-00000000dfffffff Bootmem setup node 1 0000000100000000-00000001ffffffff
  • 46. Address range with memory remapping around hole (hoisting) In this case we do not lose the memory. RAM addressing is remapped around the memory hole so address range on Mode 0 grows by 20000000 base + limit of node 1 grows by 20000000 Bootmem setup node 0 0000000000000000-000000011fffffff Bootmem setup node 1 0000000120000000-000000021fffffff
  • 47. Some examples of error reporting
  • 48. Red Hat 3 Update 2 kernel: CPU 0: Silent Northbridge MCE kernel: Northbridge status 9443c100e3080a13 kernel: ECC syndrome bits e307 kernel: extended error chipkill ecc error kernel: link number 0 kernel: dram scrub error kernel: corrected ecc error kernel: error address valid kernel: error enable kernel: previous error lost kernel: error address 00000000cf31f8f0
  • 49. Later Red Hat 3 example kernel: CPU 3: Silent Northbridge MCE kernel: Northbridge status d4194000:9b080a13 kernel: Error chipkill ecc error kernel: ECC error syndrome 9b32 kernel: bus error local node response, request didn't time out kernel: generic read kernel: memory access, level generic kernel: link number 0 kernel: corrected ecc error kernel: error overflow kernel: previous error lost kernel: NB error address 0000000ef28df0d8
  • 50. Example of Red Hat 3 GART error CPU 3: Silent Northbridge MCE Northbridge status a60000010005001b processor context corrupt error address valid error uncorrected previous error lost GART TLB error generic level generic error address 000000007ffe40f0 extended error gart error link number 0 err cpu1 processor context corrupt error address valid error uncorrected previous error lost error address 000000007ffe40f0
  • 51. Example of EDAC output EDAC MC0: CE - no information available: k8_edac Error Overflow set EDAC k8 MC0: extended error code: ECC chipkill x4 error EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label &quot;&quot;: k8_edac EDAC MC0: CE - no information available: k8_edac Error Overflow set EDAC k8 MC0: extended error code: ECC chipkill x4 error EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
  • 52. Suse mcelog example (kernel 2.6.16.27): MCE 1 HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 2 4 northbridge TSC e169139a35188 ADDR fa00f7f8 Northbridge Chipkill ECC error Chipkill ECC syndrome = 4044 bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic' STATUS d422400040080a13 MCGSTATUS 0
  • 53. Further Suse mcelog example: MCE 31 HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 3 1 instruction cache TSC 3e2dc434cdb5 ADDR fa378ac0 Instruction cache ECC error bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node origin, request didn't time out instruction fetch mem transaction memory access, level generic' STATUS d400400000000853 MCGSTATUS 0
  • 54. ECC (non-chipkill) example: CPU 2 4 northbridge TSC 3da2afa1102b ADDR f9076000 Northbridge ECC error ECC syndrome = 31 bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic' STATUS d418c00000000a13 MCGSTATUS 0
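The STATUS values in the mcelog examples can be decoded by hand. The sketch below is a hedged illustration rather than mcelog's own code: it assumes the K8 northbridge status layout (syndrome bits 7:0 in status bits 54:47, syndrome bits 15:8 in bits 31:24, extended error code in bits 19:16, bus error code format `0000 1PPT RRRR IILL`). The function `decode_nb_status` and the lookup tables are hypothetical names, and the extended-code table lists only the three codes that appear in these slides.

```python
# Field tables for the bus error code (low 16 bits of the status register).
PP = ["local node origin", "local node response", "local node observed", "generic"]
RRRR = {0: "generic error", 1: "generic read", 2: "generic write", 3: "data read",
        4: "data write", 5: "instruction fetch", 6: "prefetch", 7: "evict", 8: "snoop"}
II = ["memory access", "reserved", "I/O", "generic"]
LL = ["level 0", "level 1", "level 2", "level generic"]

# Only the extended error codes seen in the examples in this deck.
EXT_CODE = {0x0: "ECC error", 0x5: "GART error", 0x8: "chipkill ECC error"}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_nb_status(status):
    """Decode a 64-bit northbridge MCi_STATUS value into a dict of fields."""
    code = status & 0xFFFF
    ext = (status >> 16) & 0xF
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    out = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),          # bit62 = error overflow
        "uncorrected": bit(status, 61),
        "enabled": bit(status, 60),
        "addr_valid": bit(status, 58),
        "context_corrupt": bit(status, 57),
        "corrected_ecc": bit(status, 46),     # bit46 = corrected ecc error
        "uncorrected_ecc": bit(status, 45),
        "syndrome": syndrome,
        "extended": EXT_CODE.get(ext, "unknown (0x%x)" % ext),
    }
    if (code & 0xF800) == 0x0800:             # bus error: 0000 1PPT RRRR IILL
        out["bus_error"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timeout": bit(code, 8),
            "transaction": RRRR.get((code >> 4) & 0xF, "reserved"),
            "mem_or_io": II[(code >> 2) & 0x3],
            "cache_level": LL[code & 0x3],
        }
    return out


chipkill = decode_nb_status(0xd422400040080a13)
plain = decode_nb_status(0xd418c00000000a13)
print(hex(chipkill["syndrome"]), chipkill["extended"], chipkill["bus_error"])
print(hex(plain["syndrome"]), plain["extended"])
```

For the two northbridge STATUS values above this reproduces the syndromes mcelog printed (4044 and 31) and the "local node response, generic read, memory access, level generic" bus error text. Other banks, such as the instruction cache in the second Suse example, use the upper status bits differently, so the syndrome and extended-code fields here apply to the northbridge bank only.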
  • 55. Confusing EDAC example: note that two MC numbers are reporting. eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
  • 57. fmd: [ID 441519 daemon.error] SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major EVENT-TIME: Sat Mar 10 00:52:13 MET 2007 PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288 , HOSTNAME: siegert SOURCE: eft, REV: 1.16 EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354 DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-3K for more information. AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported. IMPACT: Total system memory capacity will be reduced as pages are retired. REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
  • 58. # fmdump TIME UUID SUNW-MSG-ID Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K # fmadm faulty STATE RESOURCE / UUID -------- ---------------------------------------------------------------------- degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1 13441a52-c465-629b-ca9d-fc77b0e66354 -------- ---------------------------------------------------------------------- # fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354 TIME UUID SUNW-MSG-ID Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K 100% fault.memory.dimm_ck Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1 Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1 FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
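The resource names printed by `fmadm faulty` and `fmdump -v` are FMRI paths built from `name=instance` components. A small sketch of splitting one into its scheme and components is shown here; `parse_fmri` is a hypothetical helper written for illustration and handles only the simple `scheme:///a=0/b=1` form seen in this output.

```python
def parse_fmri(fmri):
    """Split an FMRI such as 'hc:///motherboard=0/chip=0' into (scheme, components)."""
    scheme, _, path = fmri.partition(":///")
    components = {}
    for part in path.split("/"):
        name, _, instance = part.partition("=")
        # Instances in the examples are small integers; keep anything else as text.
        components[name] = int(instance) if instance.isdigit() else instance
    return scheme, components


scheme, where = parse_fmri("hc:///motherboard=0/chip=0/memory-controller=0/dimm=1")
print(scheme, where)
print("Replace DIMM %d on CPU %d" % (where["dimm"], where["chip"]))
```

The `chip` component identifies the processor socket whose memory controller owns the DIMM, which is the pairing needed to locate the module physically.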
  • 59. Example of FMA detecting a CPU error: Solaris handles the machine check exception, and the FMA information is available on reboot.
  • 60. SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45) i86pc, CSN: -, HOSTNAME: SOURCE: SunOS, REV: 5.10 Generic_118855-14 DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information. [Thu Jan 4 21:43:21 2007]AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry REC-ACTION: Save the error summary below in case telemetry cannot be saved [Thu Jan 4 21:43:21 2007] [Thu Jan 4 21:43:21 2007]ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ version=0 scheme="hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1 ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401
  • 61. System now panics and then reboots: panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception dumping to /dev/dsk/c0t0d0s1, offset 860356608,
  • 62. SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, Severity Major EVENT-TIME: Fri Jan 5 10:11:10 MET 2007 PLATFORM: Sun Fire X4200 Server, CSN: 0000000000 , HOSTNAME: z-app1.vpv.no1.asap-asp.net SOURCE: eft, REV: 1.16 EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-67 for more information. RESPONSE: An attempt will be made to remove this CPU from service. IMPACT: Performance of this system may be affected. REC-ACTION: Schedule a repair procedure to replace affected CPU. Use fmdump -v -u <EVENT_ID> to identify the module.
  • 63. #>fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d TIME UUID SUNW-MSG-ID Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67 100% fault.cpu.amd.l2cachetag Problem in: hc:///motherboard=0/chip=1/cpu=0 Affects: cpu:///cpuid=1 FRU: hc:///motherboard=0/chip=1
  • 64. Some programs and utilities