Crash Dump Analysis 101
- 1. CRASH DUMP ANALYSIS 101
JOHN S. HOWARD
JOHN.HOWARD@NEXENTA.COM
1 © Copyright Nexenta 2012
- 2. AGENDA
• Terminology
• Core Dumps and Crash Dumps
• C Language Basics
• The Mechanism of a Panic
• mdb Overview
• Basic Crash Dump Analysis
- 3. PROCESS, THREAD, LWP
• Process
  • A program in execution
  • May comprise multiple threads or LWPs
• Thread
  • The smallest unit of scheduling
  • Threads in a process share its address space and resources
• Light Weight Process (LWP)
  • A many-to-1 mapping of user threads to a kernel thread
  • Provides user-level multitasking
- 4. INTERRUPTS AND TRAPS
• Interrupts are asynchronous messages notifying the kernel of external device events
• Some interrupts are handled as traps
• Traps are synchronous messages, essentially a software interrupt
• Bus errors are issued to a processor when referencing a location that can’t be resolved or located
- 5. HANGS, CRASHES, AND PANICS
• Hang
  • Potentially limited or no forensic information
  • System up, but unresponsive
• Crash
  • Potentially limited forensic information
  • System down or rebooted
• Panic
  • Maximum potential forensic information
  • System down or rebooted
- 6. FORENSIC INFORMATION SOURCES
• Console
• syslog, typically logged to /var/adm/messages
• Core file or crash dump
- 7. CORE FILE
• A dump of the contents of all memory allocated to the process
• Inert and static record of state
• Process core files are dumped to the working directory by default
• Core file properties are managed via coreadm
• Reading a core file requires the same libraries the process was running with
- 8. CRASH DUMP
• A dump of the contents of all memory allocated to the kernel
• Inert and static record of state
• Written to the pre-specified dump device or swap partition
• Written “backwards”
• Reading requires the same OS version
• Kernel core file facility managed via dumpadm
- 9. DUMPADM
• dumpadm with no options shows the current settings:
# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/myhost
  Savecore enabled: yes
• To force a crash dump of the live system:
# savecore -L
• Note that savecore -L does not quiesce the system, so memory contents keep changing while the dump is taken
• To force a crash dump and take the system down:
# uadmin 5 0
# reboot -dn
- 10. PANIC
• The kernel detected an inconsistency
• It protects the system by exiting
• Three major tasks are performed in a system panic:
  • Record information about the panic in memory (making it part of the crash dump)
  • Synchronize the file systems to preserve user file data
  • Generate the crash dump
- 11. C PROGRAMMING LANGUAGE DATATYPES
• Built-ins
  • int, float, char
• struct
  • A grouping of data
• union
  • Variant records
  • All constituent data items are overlaid
• typedef
  • A new name for an existing type
• Pointers
  • A reference to a memory location
- 12. C DATATYPES EXAMPLES
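int ap;             /* a built-in integer */
char buf[128];      /* an array of 128 chars */
int *user = &ap;    /* a pointer to an int (the slide assigns "sr", which is not declared here) */
typedef struct smb_mtype {
    char *mt_name;
    int   mt_namelen;
    int   mt_flags;
} smb_mtype_t;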
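The datatype slide describes a union as overlaid storage, but the examples only show a struct. The following is a minimal sketch of the difference; the type and function names (`sketch_pair`, `sketch_overlay`, `sketch_union_shares_storage`) are hypothetical and do not come from the slides or from illumos.

```c
#include <stddef.h>

/* A struct lays its members out one after another. */
struct sketch_pair {
    int a;
    int b;
};

/* A union overlays its members: a and b occupy the same bytes. */
union sketch_overlay {
    int a;
    int b;
};

/*
 * Store through one member and read through the other.
 * Returns 1 when both members see the same storage.
 */
static int
sketch_union_shares_storage(void)
{
    union sketch_overlay u;

    u.a = 42;
    return (u.b == 42);
}
```

In a debugger this shows up as every union member having offset 0, while struct members have increasing offsets.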
- 14. C FUNCTION EXAMPLES
Declaration
static void smb_tree_log(smb_request_t *, const char *,
    const char *, ...);
Definition
static void
smb_tree_log(smb_request_t *sr, const char *sharename,
    const char *fmt, ...)
{
    /* ... */
}
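The trailing `...` in the declaration above marks a variadic function. As a rough illustration of how such a function consumes its extra arguments, here is a user-space sketch; `sketch_log` and its buffer parameters are hypothetical and are not the body of `smb_tree_log()`, which the slides elide.

```c
#include <stdarg.h>
#include <stdio.h>

/* Declaration: the prototype lists parameter types only. */
static int sketch_log(char *, size_t, const char *, const char *, ...);

/*
 * Definition: writes "<sharename>: <formatted message>" into buf.
 * Returns 0 on success, -1 if the result was truncated or formatting failed.
 */
static int
sketch_log(char *buf, size_t buflen, const char *sharename,
    const char *fmt, ...)
{
    va_list ap;
    int prefix, body;

    prefix = snprintf(buf, buflen, "%s: ", sharename);
    if (prefix < 0 || (size_t)prefix >= buflen)
        return (-1);

    va_start(ap, fmt);      /* ap now refers to the arguments after fmt */
    body = vsnprintf(buf + prefix, buflen - (size_t)prefix, fmt, ap);
    va_end(ap);

    if (body < 0 || (size_t)body >= buflen - (size_t)prefix)
        return (-1);
    return (0);
}
```

A call such as `sketch_log(buf, sizeof (buf), "share1", "error %d", 5)` passes the format string and its arguments through unchanged, which is the same shape as the `fmt, ...` pair in the slide.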
- 15. PANIC()
• panic(), cmn_err()
  • Common entry points for vpanic()
  • Responsible for providing panic information
• die()
• vpanic()
  • Assembly language function for saving register state
• ASSERT(condition)
  • Halts execution of the kernel if condition is false
  • Evaluated and executed only when the DEBUG compilation symbol is defined
• VERIFY(condition)
  • Similar to ASSERT, but active even when DEBUG isn’t defined
  • Stack will contain assfail() near top
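The difference between ASSERT and VERIFY can be sketched in user space. This is an illustration under stated assumptions, not the kernel's definitions: the `SKETCH_` macros, `sketch_assfail()`, and the `sketch_failures` counter are hypothetical, and the handler only counts failures where the kernel would panic.

```c
#include <stdio.h>

/* Hypothetical counter so the sketch can report failures without halting. */
static int sketch_failures;

/* Stand-in for the failure handler; the real kernel would panic here. */
static int
sketch_assfail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "assertion failed: %s, file: %s, line: %d\n",
        expr, file, line);
    sketch_failures++;
    return (0);
}

/* Always checked, whether or not DEBUG is defined. */
#define SKETCH_VERIFY(ex) \
    ((void)((ex) || sketch_assfail(#ex, __FILE__, __LINE__)))

/* Checked only in DEBUG builds; otherwise the condition is not evaluated. */
#ifdef DEBUG
#define SKETCH_ASSERT(ex) SKETCH_VERIFY(ex)
#else
#define SKETCH_ASSERT(ex) ((void)0)
#endif
```

Because the non-DEBUG form discards its argument entirely, a condition with side effects behaves differently between DEBUG and non-DEBUG builds, which is one reason to prefer side-effect-free conditions.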
- 16. EXAMPLE 1: PANIC STRING
panic[cpu1]/thread=ffffff000e4e7c60:
BAD TRAP: type=e (#pf Page fault)
rp=ffffff000e4e77c0 addr=0 occurred in module
"unix" due to a NULL pointer dereference
- 17. EXAMPLE 1: STACK TRACE
ffffff000e4e76a0 unix:die+dd ()
ffffff000e4e77b0 unix:trap+177b ()
ffffff000e4e77c0 unix:cmntrap+e6 ()
ffffff000e4e78c0 unix:strcasecmp+16 ()
ffffff000e4e7a50 smbsrv:smb_tree_log+b3 ()
ffffff000e4e7a90 smbsrv:smb_tree_connect_core+14a ()
ffffff000e4e7ac0 smbsrv:smb_tree_connect+35 ()
ffffff000e4e7ae0 smbsrv:smb_com_tree_connect_andx+16 ()
ffffff000e4e7b80 smbsrv:smb_dispatch_request+4a9 ()
ffffff000e4e7bb0 smbsrv:smb_session_worker+6c ()
ffffff000e4e7c40 genunix:taskq_d_thread+b1 ()
ffffff000e4e7c50 unix:thread_start+8 ()
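Reading the trace from the top, the page fault at addr=0 occurred inside strcasecmp(), called from smb_tree_log(), which suggests that one of the strings passed in was NULL. The slides do not show which argument it was or how the bug was fixed, so the following is only a sketch of a defensive comparison; `sketch_name_matches` is a hypothetical helper, not illumos code.

```c
#include <stddef.h>
#include <strings.h>

/*
 * Case-insensitive comparison that tolerates NULL arguments.
 * Returns 1 when both strings are non-NULL and equal ignoring case,
 * and 0 otherwise, instead of dereferencing a NULL pointer.
 */
static int
sketch_name_matches(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return (0);
    return (strcasecmp(a, b) == 0);
}
```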
- 18. MDB – MODULAR DEBUGGER
• Extensible utility for low-level debugging and editing
• On a live kernel:
# mdb -k
# mdb -kw    (to edit; VERY DANGEROUS)
• On a core file:
mdb syseventd.core.125
• On a crash dump:
# mdb -k unix.3 vmcore.3
- 19. ANALYZE-CRASH.SH
• Extracts the crash dump from the dump device (savecore -vf filename) if necessary
• Scripted mdb commands for basic crash information:
  • Panic string and registers
  • dmesg buffer
  • Stack
  • Thread list
• Executed automatically by the NMC `support` command (NS 3.1.2 and later)
- 20. HAVE I SEEN THIS BEFORE?
• Footprints
• Known problem or new?
• Redmine
• Search illumos Hg issues
  https://www.illumos.org/issues/
• SunSolve is gone; however, “We Sun Solve” is rescuing the data from SunSolve.Sun.COM
  http://wesunsolve.net/bsearch
• illumos source browser
  http://src.illumos.org/source/
- 23. EXAMPLE 2: PANIC INFO
panic[cpu5]/thread=ffffff000fd72c60:
BAD TRAP: type=0 (#de Divide error) rp=ffffff000fd72a40 addr=ffffff02da92e900
sched:
#de Divide error
addr=0xffffff02da92e900
pid=0, pc=0xfffffffff7ad977b, sp=0xffffff000fd72b30, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: fffffd7fff2a60c8
cr3: 5000000
cr8: c
rdi: ffffff02d282e840 rsi: 0 rdx: 0
rcx: 64 r8: ffffff000fd72c60 r9: 0
rax: 0 rbx: 0 rbp: ffffff000fd72b90
r10: 0 r11: ffffff02f46e8264 r12: ffffff02da316338
r13: ffffff02da3163d0 r14: ffffff02d5061a50 r15: ffffff02da92e900
fsb: 0 gsb: ffffff02da9a1540 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: 0 err: 0 rip: fffffffff7ad977b
cs: 30 rfl: 10246 rsp: ffffff000fd72b30
ss: 38
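The `type=` value in a BAD TRAP message is the x86 exception vector number, so `type=0` pairs with `#de` and `type=e` with `#pf` in the two examples. A small lookup sketch follows; the table and function names are hypothetical, the strings for 0x0 and 0xe are copied from the panic messages in these slides, and the other three labels are illustrative wording for well-known x86 vectors rather than text the kernel is known to print.

```c
#include <stddef.h>

struct sketch_trap {
    unsigned int type;      /* x86 exception vector number */
    const char *name;       /* mnemonic and short description */
};

static const struct sketch_trap sketch_traps[] = {
    { 0x0, "#de Divide error" },        /* from Example 2 */
    { 0x6, "#ud Invalid opcode" },      /* illustrative label */
    { 0x8, "#df Double fault" },        /* illustrative label */
    { 0xd, "#gp General protection" },  /* illustrative label */
    { 0xe, "#pf Page fault" },          /* from Example 1 */
};

/* Return a label for a trap type, or "unknown" if it is not in the table. */
static const char *
sketch_trap_name(unsigned int type)
{
    size_t i;

    for (i = 0; i < sizeof (sketch_traps) / sizeof (sketch_traps[0]); i++) {
        if (sketch_traps[i].type == type)
            return (sketch_traps[i].name);
    }
    return ("unknown");
}
```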
- 24. EXAMPLE 2: STACK
ffffff000fd72920 unix:die+10f ()
ffffff000fd72a30 unix:trap+1555 ()
ffffff000fd72a40 unix:cmntrap+e6 ()
ffffff000fd72b90 cpudrv:cpudrv_monitor+1cb ()
ffffff000fd72c40 genunix:taskq_thread+285 ()
ffffff000fd72c50 unix:thread_start+8 ()
syncing file systems...
done
dumping to /dev/zvol/dsk/syspool/dump, offset 65536, content: kernel + curproc
STACK
---
ffffff000fd72b90 cpudrv_monitor+0x1cb(ffffff02da316338)
ffffff000fd72c40 taskq_thread+0x285(ffffff02da859140)
ffffff000fd72c50 thread_start+8()
- 25. EXAMPLE 2: THREAD LIST
ffffff000fd72c60 fffffffffbc2dbf0 0 0 60 0
PC: panicsys+0x9b TASKQ: cpudrv_cpudrv_monitor
stack pointer for thread ffffff000fd72c60: ffffff000fd726e0
xc_insert+0x36()
0xffffff0200000000()
cpudrv_monitor+0x1cb()
taskq_thread+0x285()
thread_start+8()
- 26. EXAMPLE 2: SOURCE CODE
From cpudrv_monitor()
/*
 * Adjust counts based on the delay added by timeout and taskq.
 */
idle_cnt = (idle_cnt * cur_spd->quant_cnt) / tick_cnt;
user_cnt = (user_cnt * cur_spd->quant_cnt) / tick_cnt;
A tick_cnt of 0 in either division would raise the #de Divide error reported in the panic.
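Both statements divide by `tick_cnt`, and an integer division by zero on x86 raises exactly the `#de` trap seen in this panic. A minimal sketch of a guarded version of that scaling step is shown below; the function name, the `uint64_t` types, and the choice to return the count unscaled are assumptions for illustration, not the actual cpudrv fix.

```c
#include <stdint.h>

/*
 * Scale a count by quant_cnt / tick_cnt, as in the excerpt above,
 * but skip the division when tick_cnt is zero.
 */
static uint64_t
sketch_scale_count(uint64_t cnt, uint64_t quant_cnt, uint64_t tick_cnt)
{
    if (tick_cnt == 0)
        return (cnt);       /* assumed fallback: leave the count unscaled */
    return ((cnt * quant_cnt) / tick_cnt);
}
```

Whether skipping the adjustment is the right behavior depends on what a zero tick count means to the driver, which the slides do not say.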
- 27. HARDWARE, FIRMWARE, OR SOFTWARE?
• Crash dumps are inconclusive on hardware errors
• Correlate to fmdump output
• PCI-X panics are the most common hardware-caused panic
• PCI Vendor Database http://pcidatabase.com
• KB Article: “Understanding and decoding PCI(-X) Express Fatal Error panics”
- 28. EXAMPLE 3: PANIC STRING AND STACK TRACE
panic[cpu7]/thread=ffffff005cbdbc60:
pcieb-3: PCI(-X) Express Fatal Error. (0x101)
ffffff005cbdbbb0 pcieb:pcieb_intr_handler+228 ()
ffffff005cbdbc00 unix:av_dispatch_autovect+7c ()
ffffff005cbdbc40 unix:dispatch_hardint+33 ()
ffffff005cbaba80 unix:switch_sp_and_call+13 ()
ffffff005cbabad0 unix:do_interrupt+b8 ()
ffffff005cbabae0 unix:_interrupt+b8 ()
ffffff005cbabbd0 unix:i86_mwait+d ()
ffffff005cbabc20 unix:cpu_idle_mwait+f1 ()
ffffff005cbabc40 unix:idle+114 ()
ffffff005cbabc50 unix:thread_start+8 ()
- 29. IDENTIFYING THE PCI-X COMPONENT
Mar 30 2011 00:53:53.606674454 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0xbcd565541a801401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3408@1
(end detector)
bdf = 0x8
device_id = 0x3408
vendor_id = 0x8086
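In this ereport the last node of the device path, `pci8086,3408@1`, carries the same two values as `vendor_id` and `device_id`. The sketch below pulls that pair out of a device path so it can be looked up in a PCI ID list. `parse_pci_node` is a hypothetical helper, and it assumes the node name has the form `pciXXXX,YYYY`; for other nodes the pair may identify something other than the vendor and device (for example subsystem IDs), so treat the result as a lookup hint.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the two hex IDs from the last "pciXXXX,YYYY" component of a
 * device path such as /pci@0,0/pci8086,3408@1.
 * Returns 0 on success, -1 if the last component has a different form.
 */
static int
parse_pci_node(const char *path, unsigned int *first_id,
    unsigned int *second_id)
{
    const char *node = strrchr(path, '/');

    if (node == NULL)
        return (-1);
    node++;     /* skip the '/' */

    /* Parsing stops at the '@' that begins the unit address. */
    if (sscanf(node, "pci%x,%x", first_id, second_id) != 2)
        return (-1);
    return (0);
}
```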
- 30. IDENTIFYING THE VENDOR
Device ID  Chip Description                   Vendor ID  Vendor Name
0x3408     Intel 7500 Chipset PCIe Root Port  0x8086     Intel Corporation
device-path = /pci@0,0/pci8086,3408@1
device-path = /pci@0,0/pci8086,3408@1/pci108e,484c@0
device-path = /pci@0,0/pci8086,3408@1/pci108e,484c@0,1
If neither the PCI vendor database nor `/usr/share/hwdata/pci.ids` has an entry, grep `/etc/path_to_inst`:
"/pci@0,0/pci8086,3408@1" 0 "pcie_pci"
"/pci@0,0/pci8086,3408@1/pci108e,484c@0" 0 "igb"
"/pci@0,0/pci8086,3408@1/pci108e,484c@0,1" 1 "igb"
igb is the Intel Gigabit NIC driver
- 31. DETERMINE DRIVER AND PACKAGE DETAILS
# dpkg -S igb | grep '/kernel'
sunwigb: /var/lib/dpkg/alien/sunwigb/reloc/kernel/drv/igb.conf
sunwigb: /kernel/drv/amd64/igb
sunwigb: /var/lib/dpkg/alien/sunwigb/reloc/kernel/drv
sunwigb: /kernel/drv/igb
sunwigb: /var/lib/dpkg/alien/sunwigb/reloc/kernel
Examine the package details:
# dpkg -l sunwigb
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-f/Unpacked/Failed-cfg/Half-inst/t-aWait/T-pend
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name                    Version                Description
+++-=======================-======================-======================================
ii  sunwigb                 5.11.134-31-8234-1     Intel 82575 1Gb PCI Express NIC Driver
- 32. A PCI-X CONCLUSION, OF SORTS
• Searching Redmine for “igb driver” will find a bug, but also check for any Intel 82575 gigabit issues
• Next, determine:
  • Is the driver down-revision?
  • Is the firmware down-revision?
• If the driver and firmware are current, then this is most likely a hardware problem
• Crash dump analysis (CDA) is inconclusive for proving hardware failures