These slides were presented during a technical event at my organization. They give an overview of how to find the root cause of unexpected system-down events, and are mainly useful for Linux or Unix system administrators. I have tried to cover all aspects of the topic. It took me more than 2 hours to present these slides, but they can also be covered in a shorter time. The gray slide background hides the company logo and preserves the confidentiality of the private template. However, the knowledge is not restricted :)
3. Why RCA is important
Business Impact
Loss of money due to outages.
Disruption in availability of services.
Risk of re-occurrence of the issue.
Finding the culprit behind the scenes.
Security breach or human error.
4. General Approach (Non-Technical)
RCA is a method of problem solving.
There can be more than one root cause behind an issue.
The purpose is to identify a solution or workaround that prevents
recurrence at the lowest cost and in the simplest way.
RCFA (Root Cause Failure Analysis) recognizes that complete
prevention of recurrence by one corrective action is not always
possible.
Well-known methods/tools: 5 Whys, Pareto analysis, the cause-and-effect
model, etc.
5. Technical Approach (Basic)
Is the unexpected reboot the effect of some planned activity?
Were there any recent configuration changes (software/hardware)?
What do the recent logs suggest?
Was any unusual behaviour (or log entry) spotted? (console)
Is there some relation between the occurrences of the events?
Do we have a reliable power source? (UPS)
6. Step Forward
Is it a virtual or physical system?
Check the logs recorded by the hypervisor and/or hardware (mcelog,
IML logs, ASR events, hypervisor utilization, etc.).
Is this system part of a cluster? Was any fencing event recorded?
Try to find out whether it is a real OS issue or whether it is related
to the application, network or storage.
Is this the result of some malicious script running on the system?
Is any anti-virus software installed and running on the system?
Which panic parameters are set on the system?
7. Deep Diving
Is there any known bug in the running kernel? (search Bugzilla)
Is the issue reproducible on demand? Is there any possible
workaround?
Does a replica of the system exhibit similar behaviour? Compare the
initramfs of the replicas to find any differences.
Is there a known issue with the combination of the OS version and a
particular application running on the system?
Is there any sign of abnormal resource utilization near the event?
Was a complete or partial dump captured for the reboot?
Check the vmcore-dmesg.txt log and search for known issues on the
vendor portal.
8. Server Hung Scenario
Do not confuse it with an application hang scenario. Do all the
checks. There is no standard definition of an OS hung situation.
Some facts regarding crash vs hang situations:
• A crash often immediately follows a problem in kernel space, such as a
programming error, defective hardware or an unsupported operation.
• During a crash, oops messages are displayed, which helps in diagnosis.
• A crash or panic is easier to troubleshoot: it provides a stack trace and the
details of the panicking task.
• System hangs are more subtle. The cause can be as simple as a temporary
performance issue caused by inefficient algorithms or as complicated as a
deadlock.
• No oops messages are displayed on the console, so we do not know which thread
caused the hang. This makes a hang more complicated to analyze.
Take a snapshot of the virtual guest and extract a memory dump.
Or trigger a panic using one of the available panic techniques, and make
sure the panic initiates the memory dump mechanism.
9. User Initiated
The "exiting on signal 15" message is the last message that syslog
service emits during normal shutdown.
The presence of this message in the messages file indicates a directed
shutdown of the system, initiated either by a user or by a program.
Is there any system health monitoring software running which may issue
the 'shutdown' command? For example:
• Automatic system recovery software.
• Hardware monitoring tools.
• UPS software with shutdown capability, etc.
To find out which user it was, set audit rules or use a script.
Check the secure logs and the users' bash history for the shutdown event.
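As a rough illustration of the first check on this slide, the Python sketch below scans log lines for the "exiting on signal 15" marker. The function name and the sample lines are made up for illustration; only the marker string comes from the slide, and a real check would read the messages file instead of a hard-coded list.

```python
def find_directed_shutdowns(lines):
    """Return the log lines that contain the syslog shutdown marker."""
    marker = "exiting on signal 15"
    return [line.rstrip("\n") for line in lines if marker in line]


# Hypothetical sample lines, not taken from a real system.
sample = [
    "host1 sshd[1200]: session opened for user admin\n",
    "host1 syslogd: exiting on signal 15\n",
    "host1 kernel: booting\n",
]

hits = find_directed_shutdowns(sample)
for hit in hits:
    print("directed shutdown marker:", hit)
```

If the marker is absent right before a reboot, the shutdown was probably not a clean, directed one, and the other slides (panic, fencing, hardware) become more relevant.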
10. Cluster Initiated
A cluster reboots a system using its fencing mechanism. Common clustering
software includes Oracle Clusterware, VCS and Red Hat Cluster.
Contrary to common belief, high availability is not the highest priority
of an HA cluster, but only the second; data integrity comes first.
There are two classes of fencing methods: one disables the node itself,
the other disallows access to resources such as shared disks.
A cluster can fall victim to conditions called split brain and amnesia.
Clusters use a process called “STONITH” (Shoot The Other Node In The Head)
to correct the issue; this simply means the healthy nodes kill the sick node.
I/O fencing is one of the important features of VCS, whereas Oracle RAC
simply gives the message "Please Reboot" to the sick node. The node
bounces itself and rejoins the cluster. Red Hat Cluster uses its fence
device configuration to handle fencing events.
One can also set a fence delay to allow the OS to capture a vmcore for
fencing events.
11. Hardware Faults
The most common hardware errors that are captured on the system are:
• Memory errors or Error Correction Code (ECC) problems.
• Inadequate cooling / processor over-heating.
• System bus errors. Cache errors in the processor or hardware.
• Firmware bugs, EDAC errors and NMIs.
The kernel takes the immediate actions (such as killing processes) and
mcelog decodes the errors.
mcelog is the user-space backend for logging machine check errors
reported by the hardware to the kernel.
• Typical MCE message: "HARDWARE ERROR. This is *NOT* a software problem!"
12. Panic Parameters
These are used to deliberately panic the system when certain
conditions are met. This is necessary for debugging purposes.
• 1) kernel.hung_task_panic
• 2) kernel.softlockup_panic
• 3) vm.panic_on_oom: This parameter will panic the kernel on oom-killer
events and capture a vmcore if kdump service is running as expected.
• 4) kernel.panic_on_io_nmi
• 5) kernel.unknown_nmi_panic: This feature makes use of the computer's NMI
switch to force a kernel panic on a hung system.
• 6) kernel.panic_on_oops
• 7) kernel.panic_on_unrecovered_nmi
• 8) kernel.nmi_watchdog: The NMI watchdog monitors system interrupts and
initiates a reboot if the system appears to have hung.
• 9) kernel.panic_on_stackoverflow
• 10) kernel.panic [secs]
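To see which of these parameters are set on a given machine, one option is to read them from /proc/sys. The Python sketch below is only an illustration: the helper names are made up, and it assumes the usual mapping from a dotted sysctl name to a file under /proc/sys. Parameters that the running kernel does not provide are reported as None.

```python
# Parameter names taken from the list on this slide.
PANIC_SYSCTLS = [
    "kernel.hung_task_panic",
    "kernel.softlockup_panic",
    "vm.panic_on_oom",
    "kernel.panic_on_io_nmi",
    "kernel.unknown_nmi_panic",
    "kernel.panic_on_oops",
    "kernel.panic_on_unrecovered_nmi",
    "kernel.nmi_watchdog",
    "kernel.panic_on_stackoverflow",
    "kernel.panic",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return root.rstrip("/") + "/" + name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        with open(sysctl_path(name, root)) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in PANIC_SYSCTLS:
        print(f"{name} = {read_sysctl(name)}")
```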
13. Panic Strings
These panic strings explain the cause of the panic, but they are not
always sufficient to determine the actual cause.
When a kernel panic occurs, the system usually displays a message on
the console and all system activity stops.
• Kernel BUG at net/sunrpc/sched.c:695!
• BUG: unable to handle kernel paging request at xxxxx
• BUG: unable to handle kernel NULL pointer dereference at xxxxx / (null)
• divide error: 0000 [#1] SMP
• Kernel panic - not syncing: softlockup: hung tasks / hung_task: blocked tasks
• Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
• Kernel panic - not syncing: out of memory, panic_on_oom is selected
• Kernel panic - not syncing: Out of memory and no killable processes..
• Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for
details.
• Kernel panic - not syncing: NMI IOCK error: Not continuing / NMI: Not continuing / nmi watchdog
• Kernel panic - not syncing: Fatal Machine check
• Kernel panic - not syncing: Attempted to kill init !
• Kernel panic - not syncing: GAB: Port h halting system due to client process failure
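A first triage of such strings can be automated. The Python sketch below maps a panic string to the area worth checking first. The groupings and labels are this sketch's own reading of the strings listed above, not an official taxonomy, and anything that does not match is simply reported as unclassified.

```python
import re

# (pattern, area to look at first); order matters, first match wins.
PANIC_HINTS = [
    (re.compile(r"out of memory", re.I), "memory exhaustion"),
    (re.compile(r"soft ?lockup|hard LOCKUP|hung_task", re.I), "lockup or hung task"),
    (re.compile(r"\bNMI\b|Machine check", re.I), "hardware or NMI"),
    (re.compile(r"NULL pointer dereference|paging request|kernel BUG at|divide error", re.I),
     "kernel or driver bug"),
    (re.compile(r"Attempted to kill init", re.I), "init process died"),
]


def classify_panic(message):
    """Return a coarse hint for a panic string, or 'unclassified'."""
    for pattern, hint in PANIC_HINTS:
        if pattern.search(message):
            return hint
    return "unclassified"


print(classify_panic("Kernel panic - not syncing: Fatal Machine check"))
```

The hint only narrows down where to start; as the slide says, the string alone is not always enough to determine the actual cause.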
14. Kernel logging
Syslog is a standard logging facility. It collects messages from various
programs and services, including the kernel, and stores them, depending
on the setup, in a set of log files typically under /var/log.
The “/var/log/messages” file stores valuable, non-debug and non-critical
messages. This log should be considered the "general system activity"
log.
Administrators use the log rotation facility to maintain historical data.
One can also change the logging level based on the requirements of the setup.
# Common call traces seen in messages are:
• OOM-killer and memory stats.
• Softlockup logs for various cores.
• Page allocation failures.
• Segfaults: signify an error in one particular process.
kernel: fmg[6335]: segfault at 0xffffd2dc rip 0xffffd2dc rsp 00000000ffffd1bc error X
• Trap divide error: application crash due to “divide by zero”.
kernel: nmupm[2792] trap divide error rip:804a39a rsp:ffa4eb24 error:X
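Segfault lines like the one above have a regular shape, so the process name, PID and addresses can be pulled out with a regular expression. The Python sketch below is written against the sample line on this slide; the pattern and field names are this sketch's own, and other kernel versions print slightly different layouts that it may not cover.

```python
import re

# Accepts both "rip/rsp" (as on the slide) and the shorter "ip/sp" spelling.
SEGFAULT = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]:? segfault at (?P<addr>\S+) "
    r"r?ip (?P<ip>\S+) r?sp (?P<sp>\S+) error\s*(?P<code>\S+)"
)


def parse_segfault(line):
    """Return the fields of a segfault log line as a dict, or None."""
    match = SEGFAULT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


sample = ("kernel: fmg[6335]: segfault at 0xffffd2dc rip 0xffffd2dc "
          "rsp 00000000ffffd1bc error X")
info = parse_segfault(sample)
print(info["comm"], info["pid"], info["addr"])
```

Grouping the parsed lines by process name quickly shows whether one application is crashing repeatedly.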
15. OOM call traces
The out_of_memory function is called when the system memory
(including swap) has been fully allocated to a point where regular system
activities cannot be performed until some of that memory is freed.
The code in mm/oom_kill.c terminates one or more processes based on their
badness() score, a heuristic that tries to avoid killing innocent tasks.
<snip/>
Node 0 DMA: 3*4kB 2*8kB 2*16kB 3*32kB 2*64kB 2*128kB ... 3*4096kB = 15132kB
Node 0 DMA32: 452*4kB ..
Node 0 Normal: 13315*4kB .. <<<
[..]
Free swap = 0kB <<<
Total swap = 8388604kB
[..]
kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0 <<<
kernel:
kernel: Call Trace:
[<ffffffff800c3a6a>] out_of_memory+0x8e/0x2f5
[<ffffffff8000f2eb>] __alloc_pages+0x245/0x2ce
[<ffffffff80012a62>] __do_page_cache_readahead+0x95/0x1d9
</snip>
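The lines marked with <<< in the excerpt above carry the key facts: which process triggered the OOM killer, the allocation order, and how much swap was left. The Python sketch below extracts them; the function name and dictionary keys are illustrative, and the patterns are written against the lines shown on this slide.

```python
import re

OOM_INVOKED = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+)[^,]*, order=(?P<order>\d+)"
)
FREE_SWAP = re.compile(r"Free swap\s*=\s*(?P<kb>\d+)\s*kB")


def summarize_oom(lines):
    """Collect the invoking process, order, gfp mask and free swap."""
    summary = {"invoker": None, "order": None, "gfp_mask": None, "free_swap_kb": None}
    for line in lines:
        invoked = OOM_INVOKED.search(line)
        if invoked:
            summary["invoker"] = invoked.group("comm")
            summary["order"] = int(invoked.group("order"))
            summary["gfp_mask"] = int(invoked.group("gfp"), 16)
        swap = FREE_SWAP.search(line)
        if swap:
            summary["free_swap_kb"] = int(swap.group("kb"))
    return summary


# Lines copied from the excerpt on this slide.
excerpt = [
    "Free swap = 0kB",
    "Total swap = 8388604kB",
    "kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0",
]
oom = summarize_oom(excerpt)
print(oom)
```

Note that the invoking process is only the one whose allocation triggered the OOM killer; it is not necessarily the process that consumed the memory.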
16. D-state call traces
These messages serve as a warning that something may not be
operating optimally. They do not necessarily indicate a serious problem
and any blocked processes should eventually proceed when the system
recovers.
The “khungtaskd” kernel thread detects tasks stuck in D-state
(uninterruptible sleep) for longer than a specified time period and
writes the following type of message to the system log:
<snip/>
INFO: task syslogd:2643 blocked for more than 120 seconds. <<<
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <<<
syslogd D ffff81000237eaa0 0 2643 1 2646 2634 (NOTLB) <<<
ffff8101352c3d88 0000000000000086 ffff8101352c3d98 ffffffff80063ff8
0000000000001000 0000000000000009 ffff81013d2c57e0 ffff810102ac1820
0000340b30708992 0000000000000571 ffff81013d2c59c8 000000010000089f
Call Trace: <<<
[<ffffffff80063ff8>] thread_return+0x62/0xfe
[...]
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
</snip>
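The first line of such a report names the blocked task, its PID and the timeout that was exceeded. A minimal Python sketch for extracting these fields is shown below; the function name and dictionary keys are illustrative, and the pattern is written against the message shown on this slide.

```python
import re

HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_task(line):
    """Return task name, PID and blocked time from a hung-task line, or None."""
    match = HUNG_TASK.search(line)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "blocked_secs": int(match.group("secs")),
    }


report = parse_hung_task("INFO: task syslogd:2643 blocked for more than 120 seconds.")
print(report)
```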
17. Soft-lockup call traces
Soft lockups are situations in which the kernel's scheduler subsystem has
not been given a chance to perform its job.
It can be caused by defects in the kernel, by hardware issues or by
extremely high workloads.
<snip/>
kernel: BUG: soft lockup - CPU#7 stuck for 206s! [sosreport:14372] <<<
kernel: Modules linked in: rpcsec_gss_krb5 nfsd..vsock(U) ipv6 .. vmware_balloon .. vmxnet3 .. dm_mod [last unloaded: speedstep_lib] <<<
[..]
/440BX Desktop Reference Platform
kernel: RIP: 0010:[<ffffffff81162cbd>] [<ffffffff81162cbd>] s_show+0x1ad/0x330 <<<
kernel: RSP: 0018:ffff8801e482fd98 EFLAGS: 00000202
kernel: RAX: 0000000000000000 RBX: ffff8801e482fe18 RCX: ffff88043febfb80 <<<
kernel: RDX: 0000000000000000 RSI: 00000000000036a7 RDI: ffff88043febfb60
[...]
kernel: <d> 00000000000036a7 ffff880437830f00 ffff8801e482fe18 ffff88031e3f1640
kernel: Call Trace:
kernel: [<ffffffff8119db87>] ? seq_read+0x267/0x3f0 <<<
kernel: [<ffffffff81054c30>] ? __dequeue_entity+0x30/0x50 .....
</snip>
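When soft lockups are logged for several cores, it helps to summarize them per CPU. The Python sketch below records the longest "stuck for" duration seen for each CPU. The function name is illustrative, the first sample line is the one from this slide, and the second is a made-up extra line added only to show the aggregation.

```python
import re

SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def longest_lockup_per_cpu(lines):
    """Map each CPU number to the longest stuck duration seen, in seconds."""
    worst = {}
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            cpu = int(match.group("cpu"))
            secs = int(match.group("secs"))
            worst[cpu] = max(secs, worst.get(cpu, 0))
    return worst


log = [
    "kernel: BUG: soft lockup - CPU#7 stuck for 206s! [sosreport:14372]",  # from the slide
    "kernel: BUG: soft lockup - CPU#7 stuck for 67s! [sosreport:14372]",   # made-up line
    "kernel: Call Trace:",
]
summary = longest_lockup_per_cpu(log)
print(summary)  # {7: 206}
```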
18. Page allocation failures
The kernel frequently needs to allocate chunks of memory for the
temporary storage of data and structures. Sometimes an allocation
demands many physically contiguous pages, which may not always be
available. In such cases the memory allocator may choose to fail the
allocation request.
Common causes are memory shortage, memory fragmentation, an exhausted
memory zone and drivers with different service routines.
• The usual workaround is to check the value of vm.min_free_kbytes and double it. Also,
setting vm.zone_reclaim_mode to 0 can help to avoid memory congestion issues.
<snip/>
kernel: swapper: page allocation failure. order:2, mode:0x20 <<<
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-220.4.1.el6.x86_64 #1
kernel: Call Trace:
kernel: <IRQ> [<ffffffff81123daf>] ? __alloc_pages_nodemask+0x77f/0x940
kernel: [<ffffffff8115dc62>] ? kmem_getpages+0x62/0x170
kernel: [<ffffffff8115e87a>] ? fallback_alloc+0x1ba/0x270
kernel: [<ffffffff8115e2cf>] ? cache_grow+0x2cf/0x320
kernel: [<ffffffff8115e5f9>] ? ____cache_alloc_node+0x99/0x160 ...
</snip>
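The "order" in the failure message tells how many physically contiguous pages were requested: an order-n allocation asks for 2^n pages. The Python sketch below parses the message and converts the order into bytes. It assumes a 4 kB page size, which is typical for x86_64 but is an assumption of this sketch; the function names are illustrative.

```python
import re

ALLOC_FAILURE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def contiguous_bytes(order, page_size=4096):
    """Size of an order-n request: 2**order contiguous pages."""
    return (1 << order) * page_size


def parse_alloc_failure(line, page_size=4096):
    """Return process name, order, mode and requested size, or None."""
    match = ALLOC_FAILURE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    return {
        "comm": match.group("comm"),
        "order": order,
        "mode": match.group("mode"),
        "bytes_requested": contiguous_bytes(order, page_size),
    }


failure = parse_alloc_failure(
    "kernel: swapper: page allocation failure. order:2, mode:0x20"
)
print(failure["bytes_requested"])  # 16384
```

For the message on this slide, order:2 means four contiguous pages, that is 16 kB under the 4 kB page assumption. Higher orders are harder to satisfy when memory is fragmented.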
19. SysRq
It is a 'magical' key combination that you can hit, and to which the kernel
will respond regardless of whatever else it is doing, even if the console
is unresponsive.
The SysRq key is one of the best (and sometimes the only) ways to
determine what a machine is really doing. It is useful when a server
appears to be "hung" or for diagnosing elusive, transient, kernel-related
problems.
For security reasons, the SysRq key is disabled by default.
• Enabling SysRq gives someone with physical console access extra
abilities. It is recommended to disable it when not troubleshooting a problem, or
to ensure that physical console access is properly secured.
There are several SysRq events (and ways to trigger them) that can be
used once the SysRq facility is enabled.
• # echo h > /proc/sysrq-trigger
Commonly used options are:
• m - dump info about memory allocation
• t - dump thread state information
• c - intentionally crash the system
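The echo command above can also be wrapped in a small script. The Python sketch below writes a SysRq key to the trigger file and, as a precaution of this sketch, refuses the 'c' key unless the caller explicitly confirms it. The function name, the allow_crash flag and the scratch-file dry run are all illustrative; writing to the real /proc/sysrq-trigger requires root.

```python
import os
import tempfile

# Keys taken from the list on this slide; 'c' is handled separately.
INFO_KEYS = {"h": "help", "m": "memory allocation info", "t": "thread state info"}


def trigger_sysrq(key, trigger_path="/proc/sysrq-trigger", allow_crash=False):
    """Write one SysRq key to the trigger file."""
    if key == "c":
        if not allow_crash:
            raise ValueError("'c' crashes the system; pass allow_crash=True to confirm")
    elif key not in INFO_KEYS:
        raise ValueError(f"unsupported SysRq key: {key!r}")
    with open(trigger_path, "w") as handle:
        handle.write(key)


# Dry run against a scratch file, so the sketch can be tried without root.
with tempfile.TemporaryDirectory() as scratch_dir:
    scratch = os.path.join(scratch_dir, "sysrq-trigger")
    trigger_sysrq("m", trigger_path=scratch)
    with open(scratch) as handle:
        dry_run_result = handle.read()

print(dry_run_result)
```

On a real system the output of 'm' and 't' appears in the kernel log, not on the terminal that issued the command.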
20. Kdump
Kdump is a mechanism that uses kexec to capture a crash dump. A crash
dump is also known as a “vmcore”; it can be captured using
kdump/diskdump/netdump/xendump/LKCD/vmss2core etc.
kexec is a fast-boot mechanism that allows booting a Linux kernel from
the context of an already running kernel without going through the BIOS.
A crash dump captures the state of the kernel at the moment of panic. It
is a snapshot of the physical memory at the time of the crash.
• A vmcore can be collected in the following ways:
• Automatically, when the kernel panics (per the panic parameters) or oopses. This can
be due to a bug in the kernel or in a third-party driver, memory corruption or hardware
problems.
• Manually, when the admin uses SysRq, the NMI switch or takes a snapshot.
• Limitations of a vmcore: it is not useful for analysing a healthy system; it cannot
capture historical logs; it is complex and requires expertise to analyse.
• Configuring kdump and starting the service is not sufficient; testing kdump is a must.
Also find out the supported and unsupported kdump targets for the particular OS vendor.
• Multiple factors affect vmcore generation, e.g. clustering, HP systems, bonding,
network cards/modules, virtualization, etc.
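Before a full kdump test, two quick sanity checks can show whether a capture kernel is in place at all. The Python sketch below is a rough illustration: it assumes that the kernel was booted with a crashkernel= parameter and that it exposes /sys/kernel/kexec_crash_loaded, neither of which is mentioned on the slide. The function names are made up, and passing these checks does not replace triggering a test panic.

```python
def has_crashkernel(cmdline):
    """True if the kernel command line reserves memory for a capture kernel."""
    return any(token.startswith("crashkernel=") for token in cmdline.split())


def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """True if the sysfs flag reports a loaded capture kernel."""
    try:
        with open(path) as handle:
            return handle.read().strip() == "1"
    except OSError:
        return False


def read_cmdline(path="/proc/cmdline"):
    """Return the running kernel's command line, or '' if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


if __name__ == "__main__":
    print("crashkernel reserved:", has_crashkernel(read_cmdline()))
    print("capture kernel loaded:", crash_kernel_loaded())
```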
21. Bugs
A software bug is a failure or flaw in a program that produces undesired
or incorrect results. It’s an error that prevents the application from
functioning as it should.
There are many reasons for software bugs. The most common reason is
human mistakes in software design and coding.
BUG_ON() acts similarly to panic, but is placed intentionally in
code to check for abnormal conditions.
The vmcore and vmcore-dmesg.txt help to identify bugs. Bugs can be
in any software, but a bug in a device driver or in the kernel can cause outages.
An example of a kernel bug is a divide by zero in the find_busiest_group()
function causing a kernel panic in RHEL6 kernels.
A deadlock bug in “vmtoolsd” causing a system hang is an example of an
external software bug leading to a system panic condition.
22. Preparing for Future
Configure kdump on all systems. Its main cost is the memory reserved for the capture kernel.
Configure audit rules based on business requirements.
Properly configure the cluster settings and test them.
Tune the system as per the guidelines of the application vendor.
Be ready with a backup plan.
Patch regularly.