TSRT Crashes

Troubleshooting Communications Manager Crashes, Cores, Service Restarts

Nikhil Phansalkar, Adam Frankel
Cisco Unified Communications

Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 1

Overview
In this presentation, we will focus on troubleshooting the following issues:

 Service Crashes
• Identify and debug coredumps
• Troubleshoot services not starting up properly
• Common issues that trigger service failures (licensing, DNS, etc.)

 Server Crashes
• Symptoms of hardware failure
• File system corruption
• Kernel Panic
• Using netdump to troubleshoot Kernel Panic
• ASR (Automatic Server Recovery)
• IMM (Integrated Management Module)

 Case Studies


Identifying an Application Core
How do you determine that a coredump has occurred on a system?
Here are the typical symptoms of a coredump:
 Server remained up, but service was temporarily affected.
 An alert generated from RTMT about a core file being generated.
 A message in Eventviewer – Application log.


Identifying an Application Core
How do you determine which application generated the coredump file?
Right click on the alert and select Alert Detail. This will show which application generated the core,
the time of the core, and the server that had the core.

Use the following CLI command to list all cores present on the system:
utils core list [for CUCM ver 5.x, 6.x]
utils core active list [for CUCM ver 7.x and later]

In the above examples, it’s the CCM application that generated the coredump.
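
The core file names that appear in this deck (for example, core.9800.6.ccm.1252843074) look as if they encode a process ID, a signal number, the application name, and a Unix timestamp. That field layout is an assumption inferred from the sample names, not documented behavior. Under that assumption, a minimal Python sketch for pulling out the application name could look like this; `CoreInfo` and `parse_core_name` are hypothetical helpers.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CoreInfo:
    pid: int
    signal: int
    app: str
    created_utc: datetime


def parse_core_name(name: str) -> CoreInfo:
    """Split a core file name, ASSUMING core.<pid>.<signal>.<app>.<epoch>."""
    parts = name.split(".")
    if len(parts) < 5 or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {name!r}")
    # The application name may itself contain dots, so join the middle fields.
    app = ".".join(parts[3:-1])
    created = datetime.fromtimestamp(int(parts[-1]), tz=timezone.utc)
    return CoreInfo(pid=int(parts[1]), signal=int(parts[2]),
                    app=app, created_utc=created)


if __name__ == "__main__":
    for core in ("core.9800.6.ccm.1252843074",
                 "core.3733.11.showtechCCMDB.s.1239075504"):
        info = parse_core_name(core)
        print(info.app, info.signal, info.created_utc.isoformat())
```

If the third field really is the signal number, 6 would correspond to SIGABRT and 11 to SIGSEGV on Linux, which can hint at whether the process aborted itself or hit a memory fault.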

Generating Backtrace
Use the following CLI command to generate a backtrace:
utils core analyze <CoreFilename> [for CUCM ver 5.x, 6.x]
utils core active analyze <CoreFilename> [for CUCM ver 7.x and later]

Option-1: Generate the backtrace using the CLI command in the customer
environment. The core analysis may cause a momentary increase in CPU
utilization. For busy systems, run this command during off-hours.
Option-2: Generate the backtrace on a lab server.
 Download and retrieve the core file from the production system.
 Upload the core file to /var/log/active/core on a lab server (requires root
access). The lab server should be running the exact same CUCM version.
 Execute the CLI command on the lab server.


Search Topic
Use the first 4 to 6 lines of the backtrace to formulate a search string for
Topic. Consider the following backtrace:

As a starting point, the following search string can be used:
_STL::list PickupMemberDnTable::findSubscribedMemberDnList PickupMonitoring::sendNotifyReq
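
Building the search string by hand is easy to get wrong on long frames, so here is a small Python sketch of the idea. It assumes gdb-style frames with one frame per line; `search_terms` is a hypothetical helper, and skipping frames that come from /lib/ is a design choice of this sketch, not a rule from the slides. The sample frames are taken from the Case study-2 backtrace in this deck, with the long reason string abbreviated.

```python
import re

# Matches gdb-style frames such as "#5 0x084c66c3 in Foo::bar () at file.cpp:1".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s*\(")


def search_terms(backtrace: str, max_frames: int = 6,
                 skip_system: bool = True) -> str:
    """Return a space-separated search string built from the top frames."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # continuation of a wrapped frame, or unrelated text
        if skip_system and " from /lib/" in line:
            continue  # libc/libpthread frames rarely identify the defect
        names.append(match.group(1))
        if len(names) == max_frames:
            break
    return " ".join(names)


SAMPLE = """\
#0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00d64815 in raise () from /lib/tls/libc.so.6
#2 0x00d66279 in abort () from /lib/tls/libc.so.6
#3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
#4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "...") at ProcessCMProcMon.cpp:106
#5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
"""

if __name__ == "__main__":
    print(search_terms(SAMPLE, max_frames=3))
```

Frames that wrap across several lines, as they do on some slides, would need to be joined before they are passed to this helper.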


Review Results of Topic Search
Check if there are any known bugs applicable to the customer’s CUCM version.


Troubleshoot Unresolved Coredumps
If the backtrace does not match an existing bug, then the following data
should be collected for analysis:
 Event Viewer-SystemLog
 Event Viewer-ApplicationLog
 RIS DataCollector PerfmonLog
 Logs (set to Detailed/Debug trace level) for the service that generated the
coredump. It’s a good idea to get CallManager logs even if it’s not the
application that crashed.
 Coredump file (required to submit an escalation to BU).


Troubleshoot Unresolved Coredumps
 The logs will provide an indication of the system activity prior to the crash.
 The intention is to isolate any unique events or errors that may have been a
factor in triggering the coredump.
 If the coredump has occurred multiple times, check for repeating patterns of
any particular event/error. Identifying the circumstances leading up to the
coredump typically expedites the resolution of these issues.
 Finally, open an escalation with the Business Unit. Use the template on the
escalation page to ensure that you have collected all the required information.
 If it’s not a known issue, then you could well be the proud submitter of
a new software defect!


Intentional Coredumps: Resource Starvation
The CallManager service may generate a coredump intentionally. This could be due to:
 High CPU utilization on the system. CCM may not get access to CPU
resources and may crash itself on purpose in order to recover from that state.
 A blocked thread that CCM is trying to use. CCM crashes itself in an
attempt to get out of this state.



Intentional Coredumps: Due to Mem Leak
 Sometimes, a memory leak may trigger a coredump.
 This is because, due to an OS limitation, any individual process can allocate
a maximum of 3 GB of memory.
 If the process tries to allocate memory beyond this limit, an intentional
coredump will be generated.
 Refer to the next slide to see what the backtrace looks like in this situation.


Intentional Coredumps: Due to Mem Leak
backtrace
===================================
#0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x01276825 in raise () from /lib/tls/libc.so.6
#2 0x01278289 in abort () from /lib/tls/libc.so.6
#3 0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from
/usr/local/cm/lib/libstlport.so.5.1
#4 0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1
#5 0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1
#6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1
#7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1
#8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105
#9 0x0a0014e2 in H245SessionManager::create (parentId={mSdlProcessName = 0x0, mSdlNodeId =
4, mSdlAppId = 100, mSdlProcessNumber = 150, mSdlProcessInstance = 2629},
vH245TerminalType=H245_Gateway, vH245TransportConnectionMode=H245Client,
vH245IpAddress=404699044, vH245IpPort=40076, vTCPTos=96, vPassThruMSD=false, vTCSTimeout=10,
vFastStartInd=0, vFsAudioOutgoingLCN=0, vFsAudioIncomingLCN=0, pktCaptureContext=0xbffab74d
"", allowTCPKeepAlivesForH323=true) at ProcessH245SessionManager.cpp:221
#10 0x08a5629c in H245Interface::start_Transition (this=0xbff99008, s=@0x5c70990) at
/vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:123
#11 0x08a99354 in H245Interface::fireSignal (this=0xbff99008, sdlSignal=@0x5c70990) at
/vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:175
#12 0x0a06c904 in SdlProcessBase::inputSignal (this=0xbff99008, rSignal=0x5c70990,
traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0,
veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:397
#13 0x0a0746ce in SdlRouter::callProcess (this=0xe225ac0, _sdlSignal=0x5c70990,
_deleteSignal=@0x36b8d07, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0,
_vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:371
#14 0x0a0740f3 in SdlRouter::scheduler (sdlRouter=0xe225ac0) at SdlRouter.cpp:281
#15 0x05514bd7 in ACE_OS_Thread_Adapter::invoke (this=0xfe57a30) at OS_Thread_Adapter.cpp:94
#16 0x054d5087 in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
#17 0x00db73cc in start_thread () from /lib/tls/libpthread.so.0
#18 0x0131a96e in clone () from /lib/tls/libc.so.6


Troubleshoot Intentional Coredumps
 Intentional coredumps typically generate similar backtraces.
 Searching Topic may yield several hits, but they may not always be
pertinent to the issue you are troubleshooting.
 Remember: an intentional coredump is a symptom of some other problem.
 If you see an intentional coredump, retrieving and analyzing PerfMonLogs is
crucial to figure out the CPU/Memory utilization prior to the coredump since
that will lead you to root cause.
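
As a rough triage aid, the two kinds of intentional coredump shown in this deck can be told apart by markers in the backtrace: the failed-allocation example contains `operator new` and `__cxa_throw`, while the resource-starvation case study contains `IntentionalAbort`. The Python sketch below only encodes those two observations; `classify_backtrace` and its return labels are hypothetical, and other releases may well produce different frames.

```python
def classify_backtrace(backtrace: str) -> str:
    """Rough triage using markers seen in this deck's example backtraces."""
    if "IntentionalAbort" in backtrace:
        # Seen when CCM aborts itself; review CPU usage and blocked threads.
        return "intentional-abort"
    if "operator new" in backtrace and "__cxa_throw" in backtrace:
        # Seen when an allocation fails; review memory counters for a leak.
        return "allocation-failure"
    return "unknown"


if __name__ == "__main__":
    starvation = "#4 0x084c4e92 in IntentionalAbort (reason=...) at ProcessCMProcMon.cpp:106"
    leak = ("#6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1\n"
            "#7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1")
    print(classify_backtrace(starvation))
    print(classify_backtrace(leak))
```

Either label is only a pointer to which PerfMonLog counters to review first, not a root cause.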


Services Not Starting

 A service not starting is different from a service crash.
Often, the service never started at system boot.
 Some Possible Culprits
• Licensing
• Database
• Disk Space
• services.conf corruption
• Software defect


Services Not Starting

 Run “utils service list” via the CLI:
• Is the service deactivated?
• Is the service “Commanded out of service”?
• Is the service in a “[STOPPED]” state?

 Assess which service(s) are expected to be started but are not.
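
When the service list is long, a saved copy of the output can be filtered for stopped services. The Python sketch below assumes each line has the form `<service name>[STATE]` followed by an optional note; that layout and the sample lines are illustrative assumptions, so adjust the pattern to the output you actually see. `stopped_services` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# ASSUMED line layout: "<service name>[STATE]  <optional note>".
STATE_RE = re.compile(r"^(?P<name>.+?)\[(?P<state>[A-Z]+)\]\s*(?P<note>.*)$")


def stopped_services(output: str) -> List[Tuple[str, str]]:
    """Return (service, note) pairs for every service reported as STOPPED."""
    result = []
    for line in output.splitlines():
        match = STATE_RE.match(line.strip())
        if match and match.group("state") == "STOPPED":
            result.append((match.group("name").strip(),
                           match.group("note").strip()))
    return result


# Illustrative sample only; not captured from a real system.
SAMPLE_OUTPUT = """\
A Cisco DB[STARTED]
Cisco CallManager[STOPPED]  Commanded Out of Service
Cisco Tftp[STARTED]
Cisco CTIManager[STOPPED]  Service Not Activated
"""

if __name__ == "__main__":
    for name, note in stopped_services(SAMPLE_OUTPUT):
        print(f"{name}: {note or 'no reason given'}")
```

A stopped service with a "not activated" style note is usually expected, whereas a stopped service that should be running is the one to investigate.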


Licensing

 If CCM is not starting, verify in the License Unit Report that the
SW_Feature license is loaded and sufficient NODE
licenses are available


Verify disk space

 ‘show status’ will display disk usage for active, inactive,
and common partitions
 Verify that none are above 97% disk usage

 Some services require disk space on the active partition
to start and on the common partition for logging
purposes
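
The 97% rule above is simple enough to express as a check. The Python sketch below takes usage percentages that have already been read off the ‘show status’ output; it does not parse that output, and the partition values in the example are made up. `partitions_over_limit` is a hypothetical helper.

```python
from typing import Dict, List


def partitions_over_limit(usage_percent: Dict[str, float],
                          limit: float = 97.0) -> List[str]:
    """Return the partitions whose disk usage is above the limit."""
    return sorted(name for name, used in usage_percent.items() if used > limit)


if __name__ == "__main__":
    # Hypothetical values read manually from 'show status'.
    usage = {"active": 84, "inactive": 12, "common": 98}
    print(partitions_over_limit(usage))
```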


Symptoms of DB Problems

 If multiple services will not start and no logs are being
written, there may be a problem with Informix
 Verify if “A Cisco DB” has started
 Run ‘show tech dbstateinfo’
• Determine if Informix is online (first line).
• Find #RSAM to compare the number of DB sessions and used
DB memory per user, similar to ‘onstat -g ses’.

 Check the Informix logs for DB errors:
activelog cm/log/informix/ccm.log


Symptoms of DB Problems

Check for any user with excess sessions open or if any single session is
using excess DB memory. This may identify a process that needs to be
investigated further.


Informix/DNS
 CSCsw88022 - Database should still start and function
when DNS is unavailable. This is fixed as of 7.1(1), as
sqlhosts no longer uses DNS.
 If “dns” is present in the “hosts” line of /etc/nsswitch.conf, then
Informix relies on DNS to start up properly (pre-7.1).
 Check ‘utils network host [fqdn/ip]’.
Make sure that external resolution resolves properly for all CUCM
servers, forward and reverse.
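
Both checks on this slide can be sketched in Python for use on a workstation or lab machine: one looks for the dns source on the hosts line of a copy of nsswitch.conf, the other verifies that a name resolves forward and back to itself. The helper names and the example host name are hypothetical; the resolver functions are injectable so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Tuple


def hosts_line_uses_dns(nsswitch_text: str) -> bool:
    """Return True if the 'hosts:' line of nsswitch.conf lists the dns source."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            return "dns" in line[len("hosts:"):].split()
    return False


def forward_reverse_ok(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], Tuple] = socket.gethostbyaddr,
) -> bool:
    """Check that fqdn resolves to an IP and that the IP maps back to fqdn."""
    try:
        ip = forward(fqdn)
        name = reverse(ip)[0]
    except OSError:  # covers socket.gaierror and socket.herror
        return False
    return name.rstrip(".").lower() == fqdn.rstrip(".").lower()


if __name__ == "__main__":
    print(hosts_line_uses_dns("hosts: files dns\n"))
    # Hypothetical host name; replace with each CUCM server's FQDN.
    print(forward_reverse_ok("cucm-pub.example.com"))
```

Run the forward and reverse check for every server in the cluster, since a single missing PTR record is enough to cause a mismatch.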


Services Deactivated After Reboot
 The ‘services.conf’ is located in /usr/local/platform/conf
 It contains a list of which services to activate on boot

 If the disk is full, this file might be recreated as a zero-byte file.
This will cause all services to be deactivated on startup.
 Remedy the disk space situation.
 As a workaround, restore services.conf from another server or a lab
server running the same version.
 After service is restored, advise the customer to rebuild the corrupted node.

Troubleshoot Server Freezes

Problem Symptoms:
 The server ran fine for minutes, months, or years
and then suddenly stopped responding.
 The server cannot be accessed via the web, ssh, or the console.
 All CUCM services stopped responding.


Troubleshoot Server Freezes

 Check the console for any messages. For example:
EXT3-fs error (device sda6) in start_transaction: Journal has aborted

 The errors may also be written to Eventviewer-SystemLog. But, this can
only be viewed after system reboot. Note that it may not capture all
messages displayed on the console.
Note: you can access the console using iLO (on HP servers) or using IMM (on supported
IBM servers).

 Reboot the server. A recovery disc may be required to ensure that the file
system has fully recovered.
 Check for hardware issues.
 If none of the above reveals the cause, then enable netdump using the CLI
to gather information for subsequent failures.


dmesg

 dmesg (for "display message") can be used to print the message buffer of
the kernel.
 It contains diagnostic messages (for example, when I/O devices encounter
errors). The messages are typically displayed on the console, but the
console output can quickly get overwritten.
 If the file system becomes read-only, syslog messages are no longer written
to the syslog file on disk, but the messages still exist in kernel memory.
 dmesg provides a mechanism to review these messages at a later time.
 Currently, this command has to be executed as root. There is an
enhancement defect CSCtc59353 to get this information directly from the
admin CLI.
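
Once the kernel messages have been saved to a file, file system errors can be picked out mechanically. The Python sketch below searches saved text for the EXT3 error format quoted earlier in this deck; how the text is collected is out of scope, and `find_ext3_errors` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Pattern based on the console message quoted in this deck.
FS_ERROR_RE = re.compile(r"EXT3-fs error \(device (?P<device>[^)]+)\)(?P<rest>.*)")


def find_ext3_errors(log_text: str) -> List[Dict[str, object]]:
    """Return the device and journal status for each EXT3 error line."""
    hits = []
    for line in log_text.splitlines():
        match = FS_ERROR_RE.search(line)
        if match:
            hits.append({
                "device": match.group("device"),
                "journal_aborted": "Journal has aborted" in match.group("rest"),
            })
    return hits


if __name__ == "__main__":
    saved = "EXT3-fs error (device sda6) in start_transaction: Journal has aborted\n"
    for hit in find_ext3_errors(saved):
        print(hit)
```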


Hardware Problems: Server Self Diagnostics
Power on Self Test (POST)
 During boot up, the server tests all hardware for functionality.
 Failure of any device results in a POST error, which is displayed on screen,
signaled by an audible error (beeps), or indicated by an amber/red light.
 Hard drives have an indicator light: green is the normal running state; amber
or red indicates a problem.
 Inspect the hardware report for SMART errors. These may occur if a disk has a
large number of bad sectors. In this case, the light may still be green.
 Lights on the front of the server and on the motherboard can help indicate
failing hardware.
If there is a red or amber light on the front of the server, run the vendor
diagnostics to get more details.


Vendor Diagnostics (HP/IBM)

 IBM and HP require bootable hardware diagnostics discs to be
run.
 IBM servers require DSA.
 HP servers require SmartStart.
 Detailed steps are provided in the email templates on TAC-Wiki:
 http://tac-wiki/Communications_Manager_Hardware_failure


File System Issues

A forced reboot or hard reset can damage the file systems and
prevent the server from booting. This can also be caused by a
firmware bug or a hardware problem (e.g., a bad hard drive).
Symptoms:
 Server does not boot completely. Console may indicate:
*** An error occured during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.
Give root password for maintenance
(or type Control-D to continue):

 Server displays file system related errors on boot:
EXT3-fs error (device ...) in start_transaction: Journal has aborted

 Server indicates a manual file system check (FSCK) is required


File System Issues
Resolution:
 Boot the server using the CUCM recovery disc.
 Execute the automatic and manual file system checks.
 Always use the latest recovery disc, regardless of product
version.
Note: Prior to CUCM 6.1.4 and CUCM 7.0.2, the recovery disc contained manual [m] and
automatic [f] fsck options. The automatic option [f] was not effective and sometimes did not
resolve the issue, whereas the manual option [m] worked in all cases. Starting with CUCM 6.1.4
and CUCM 7.0.2, the fsck logic was enhanced and the recovery CD menu was
updated to contain the automatic option only [refer to CSCsu08170].

 Not all file system corruptions can be fixed. You might have to perform a
fresh install and execute a DRS restore.
 If the system is still experiencing issues, this points to hardware failure.
Install new hard drives and then perform a fresh install with DRS recovery.
 A frequently observed bug is CSCta73022. If /common partition is affected,
BU recommends rebuilding the server.


Kernel Panic

 A kernel panic is an action taken by an operating system upon detecting
an internal fatal error from which it cannot safely recover.
 Attempts by the operating system to read an invalid or
non-permitted memory address are a common source of kernel panics.
 In many cases, the operating system could continue operation after
memory violations have occurred. However, the system is in an unstable
state and rather than risking security breaches and data corruption, the
operating system stops to prevent further damage and facilitate diagnosis
of the error.
 A kernel panic may also occur as a result of a hardware failure or a bug in
the operating system.
 This is similar to Windows "Bug Check" (aka: "Blue Screen of Death").
 IPVMS, CSA and FIOR are the Cisco kernel modules that may cause
Kernel Panic. You can try disabling them as a workaround.


Netdump
 Use netdump to troubleshoot kernel panic issues.
 Netdump uses UDP port 6666.
 The netdump output contains information that indicates where the kernel panicked.
 Utilizes a client-server model.
 Does not work with NIC-teaming enabled.


Configuring Netdump
Configure the Netdump server
1. Log in to the server designated as the netdump server.
2. Start the netdump server:
utils netdump server start

3. Enter the following command for each netdump client machine:
utils netdump server add-client <Ip-Addr-of-netdump-client>

4. Enter the following command to verify the status of the netdump server:
utils netdump server status

5. Use the following command to verify the clients on the list:
utils netdump server list-clients



Configuring Netdump
Configure the Netdump client
1. Log in to the server designated as the netdump client.
2. Start the netdump client:
utils netdump client start <Ip-Addr-of-netdump-server>

3. Enter the following command to verify the status of the netdump client:
utils netdump client status


Configuring Netdump
Verify that the client and server are communicating.
 After configuring the netdump server and netdump client, execute the
following command on the netdump server:
file list activelog crash/

 You should see a new sub-directory which has the client IP address and
the date-timestamp when it started:

admin:file list activelog crash/
<dir> 14.48.60.80-2010-03-05-11:30
<dir> magic
<dir> scripts
dir count = 3, file count = 0
admin:

 A new sub-directory will be created each time the netdump client is
restarted.
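
With several clients, it can help to extract the client IP and start time from each sub-directory name. The Python sketch below parses the listing format shown on this slide, assuming the directory name is always `<client-ip>-<YYYY-MM-DD-HH:MM>`; `netdump_client_dirs` is a hypothetical helper.

```python
import re
from datetime import datetime
from typing import List, Tuple

# ASSUMED directory name layout: <client-ip>-<YYYY-MM-DD-HH:MM>
CLIENT_DIR_RE = re.compile(
    r"^<dir>\s+(\d{1,3}(?:\.\d{1,3}){3})-(\d{4}-\d{2}-\d{2}-\d{2}:\d{2})$"
)


def netdump_client_dirs(listing: str) -> List[Tuple[str, datetime]]:
    """Return (client IP, start time) for each per-client crash directory."""
    found = []
    for line in listing.splitlines():
        match = CLIENT_DIR_RE.match(line.strip())
        if match:
            started = datetime.strptime(match.group(2), "%Y-%m-%d-%H:%M")
            found.append((match.group(1), started))
    return found


LISTING = """\
admin:file list activelog crash/
<dir> 14.48.60.80-2010-03-05-11:30
<dir> magic
<dir> scripts
dir count = 3, file count = 0
"""

if __name__ == "__main__":
    for ip, started in netdump_client_dirs(LISTING):
        print(ip, started.isoformat())
```

An expected client IP that is missing from the result suggests that the client and server are not yet communicating.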


Netdump: Example
!!DO NOT TRY THIS IN A PRODUCTION ENVIRONMENT!!

On netdump client machine, trigger a kernel panic:

The console displays:


Netdump: Example
The netdump diagnostic information gets stored in a sub-directory at the
/var/crash location on the netdump server:

Contents of the log file:


ASR: Automatic Server Recovery

 Applicable only to HP servers. Enabled by default.
 ASR is implemented via the HP ASM (Advanced System Management) driver.
 ASR uses a 10-minute countdown timer.
 During regular operation, the ASM driver frequently resets this timer to
prevent it from counting down to zero.
 If the timer counts down to zero, it is assumed that the operating system is
locked up, and the system automatically attempts to reboot.
 Collect the IML (Integrated Management Log) logs from the system
using the following command:
file view system-management-log

ID Severity Initial Time Update Time Count
-------------------------------------------------------------
0000 Critical 20:44 04/02/2007 20:44 04/02/2007 0001
LOG: ASR Detected by System ROM
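
The countdown behavior described above is a classic watchdog pattern. The Python sketch below is a toy model of that pattern only, not HP's implementation; the `WatchdogTimer` class and the 60-second reset interval in the example are illustrative assumptions.

```python
class WatchdogTimer:
    """Toy model of a countdown watchdog; not HP's actual implementation."""

    def __init__(self, timeout_seconds: int = 600):
        self.timeout_seconds = timeout_seconds  # 10 minutes, as on the slide
        self.remaining = timeout_seconds
        self.reboot_requested = False

    def reset(self) -> None:
        """Called by a healthy OS-side driver to restart the countdown."""
        self.remaining = self.timeout_seconds

    def tick(self, seconds: int) -> None:
        """Advance time; request a reboot once the countdown reaches zero."""
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.reboot_requested = True


if __name__ == "__main__":
    timer = WatchdogTimer()
    for _ in range(5):        # healthy system: the driver keeps resetting
        timer.tick(60)
        timer.reset()
    print("healthy:", timer.reboot_requested)
    timer.tick(600)           # hung system: no reset for 10 minutes
    print("hung:", timer.reboot_requested)
```

The key point is that a reboot is triggered by the absence of resets, which is why an ASR entry in the log indicates a hang rather than identifying its cause.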


IMM: Integrated Management Module

 Newer IBM servers such as the 7835-I3 and the 7845-I3 include IBM’s
IMM.
 The IMM has an OS Watchdog feature that is similar to HP’s ASR. This
feature is disabled by default.
 Refer to CSCte05285 which tracks the enhancement request to include
the server recovery functionality into the new IBM servers.
 You can access IMM using its own Ethernet port (labelled ‘System Mgmt’).



The IMM is set initially with a user name of USERID and password of
PASSW0RD (with a zero, not the letter O).


HP vs. IBM

Automated Recovery
• HP: Enabled in all HP servers by default. To view corresponding logs:
‘file view system-management-log’
• IBM: Supported in newer IBM servers only [7835-I3 and 7845-I3] via IMM.
Disabled by default. To view corresponding logs: <TBD>

In-depth vendor diagnostics (requires downtime)
• HP: SmartStart CD (bootable)
• IBM: DSA CD (bootable)

High-level system diagnostics (does not require downtime)
• HP and IBM: the same CLI commands apply:
utils create report hardware
utils diagnose test
show hardware
show environment


Case study-1
 TAC case: 611181361.
 Problem Description: Customer created TAC case to investigate following
alarm:
04/06/2009 20:38:26.455 LPM|GenAlarm: AlarmName = CoreDumpFileFound, DeviceName = fm11d-bq50vcm1,
AlarmMsg = CoreDumpFileFound
TotalCoresFound : 1
CoreDetails : The following lists up to 6 cores dumped by corresponding applications.
Core1 : Unknown (core.3733.11.showtechCCMDB.s.1239075504)
AppID : Cisco Log Partition Monitoring Tool
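
The core file name can be pulled out of an alarm like this one mechanically. The Python sketch below assumes the `CoreN : <application> (<core file>)` layout seen in this alarm; `cores_in_alarm` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# ASSUMED layout, based on the alarm above: "CoreN : <application> (<core file>)"
CORE_LINE_RE = re.compile(
    r"Core\d+\s*:\s*(?P<app>.+?)\s*\((?P<file>core\.[^)\s]+)\)"
)


def cores_in_alarm(alarm_text: str) -> List[Tuple[str, str]]:
    """Return (application, core file name) pairs listed in the alarm text."""
    return [(m.group("app"), m.group("file"))
            for m in CORE_LINE_RE.finditer(alarm_text)]


ALARM = """\
AlarmMsg = CoreDumpFileFound
TotalCoresFound : 1
CoreDetails : The following lists up to 6 cores dumped by corresponding applications.
Core1 : Unknown (core.3733.11.showtechCCMDB.s.1239075504)
AppID : Cisco Log Partition Monitoring Tool
"""

if __name__ == "__main__":
    for app, core_file in cores_in_alarm(ALARM):
        print(app, core_file)
```

When the application is reported as Unknown, as it is here, the core file name itself is the better clue to what crashed.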


Case study-1
 Backtrace:
#0 0x080ba54c in glob_filename ()
#1 0x080ba5a2 in glob_filename ()
#2 0x080ba5a2 in glob_filename ()
#3 0x080ba5a2 in glob_filename ()
#4 0x080ba5a2 in glob_filename ()
#6 0x080ba5a2 in glob_filename ()
#7 0x080ba5a2 in glob_filename ()
#9 0x080823b2 in shell_glob_filename ()
#10 0x0807ed3d in expand_words_shellexp ()
#11 0x0807f26c in expand_words_shellexp ()
#12 0x0807ec19 in expand_words ()
#13 0x08069766 in execute_command_internal ()
#14 0x08066d9c in execute_command_internal ()
#15 0x08094822 in parse_and_execute ()
#16 0x0807b3b2 in command_substitute ()
#17 0x0807e223 in pat_subst ()
#18 0x08079700 in cond_expand_word ()
#19 0x080797c1 in cond_expand_word ()
#20 0x08079819 in expand_string_unsplit ()
#21 0x08079478 in string_rest_of_args ()
#22 0x08078f8c in strip_trailing_ifs_whitespace ()
#23 0x08079029 in do_assignment ()
#24 0x0807f2b4 in expand_words_shellexp ()
#25 0x0807ec19 in expand_words ()
#26 0x08069766 in execute_command_internal ()
#27 0x08066d9c in execute_command_internal ()
#28 0x08067f09 in execute_command_internal ()
#29 0x08066fde in execute_command_internal ()
#30 0x080668a4 in execute_command ()
#31 0x08067ed2 in execute_command_internal ()
#32 0x08066fde in execute_command_internal ()
#33 0x080668a4 in execute_command ()
#34 0x08067ed2 in execute_command_internal ()
#35 0x08066fde in execute_command_internal ()
#36 0x080668a4 in execute_command ()
#37 0x08067ed2 in execute_command_internal ()
#38 0x08066fde in execute_command_internal ()
#40 0x08068e94 in execute_command_internal ()
#41 0x08066f6d in execute_command_internal ()
#43 0x0805c969 in reader_loop ()
#44 0x0805ae9b in main ()


Case study-1
 The backtrace contained strings such as ‘execute_command_internal’,
‘parse_and_execute’ , ‘expand_words_shellexp’.
 This most likely meant that the coredump was related to a CLI command.
 Next, the following traces were retrieved and analyzed:
- Cisco CallManager Admin
- IPT Platform CLI Logs

 The IPT Platform CLI logs revealed that “show tech locales” was the
last CLI command executed just prior to the coredump.
 Topic search did not yield any known bugs.
 An escalation was submitted to Business Unit.
 CSCsz24566 was then filed. It was eventually resolved by the BU.


Case study-2
 TAC case: 612476435.
 Problem Description: CallManager service coredumps every two and a half
days
admin:utils core active list

Size Date Core File Name
=================================================================
2009-09-13 08:03:25 core.9800.6.ccm.1252843074
2009-09-15 15:58:52 core.2497.6.ccm.1253044183
2009-09-18 00:03:38 core.3564.6.ccm.1253245847
2009-09-20 08:00:16 core.6676.6.ccm.1253447596
2009-09-22 16:00:18 core.8282.6.ccm.1253649103

 Backtrace:
#0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00d64815 in raise () from /lib/tls/libc.so.6
#2 0x00d66279 in abort () from /lib/tls/libc.so.6
#3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
#4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManager's timers appear
incorrect. This may be due to CPU or blocked function. Attempting to restart
CallManager.") at ProcessCMProcMon.cpp:106
#5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
#6 0x084c7035 in CMProcMon::callManagerMonitorThread (cmProcMon=0xec122d0) at
ProcessCMProcMon.cpp:439
#7 0x0107e5fb in ACE_OS_Thread_Adapter::invoke (this=0xf3ef3b8) at
OS_Thread_Adapter.cpp:94
#8 0x01040cbf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
#9 0x002dc3cc in start_thread () from /lib/tls/libpthread.so.0
#10 0x00e061ae in clone () from /lib/tls/libc.so.6
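
When a coredump repeats, the spacing between occurrences is worth quantifying. The Python sketch below computes the gaps between the five core timestamps listed in this case study; `hours_between` is a hypothetical helper, and the timestamps are copied from the listing above.

```python
from datetime import datetime
from typing import List


def hours_between(timestamps: List[str]) -> List[float]:
    """Return the gap in hours between consecutive timestamps."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [round((later - earlier).total_seconds() / 3600, 1)
            for earlier, later in zip(times, times[1:])]


# Dates taken from the 'utils core active list' output in this case study.
CORE_TIMES = [
    "2009-09-13 08:03:25",
    "2009-09-15 15:58:52",
    "2009-09-18 00:03:38",
    "2009-09-20 08:00:16",
    "2009-09-22 16:00:18",
]

if __name__ == "__main__":
    print(hours_between(CORE_TIMES))
```

The gaps all come out close to 56 hours. Such a regular cadence is consistent with a resource that is consumed at a steady rate until a limit is reached, which is one reason to look at memory counters next.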


Case study-2
 The backtrace indicates that it’s an intentional coredump.
 Hence, the Perfmon data needs to be reviewed next to check for:
• CPU Utilization
• Memory Leaks

 The CPU utilization looks steady prior to the coredump.


Case study-2
 The %VM Used counter appears to be high

 The VMSize for CCM is high. Also, note how
the line slopes upwards, which signifies increasing
memory usage over time.

=> Data points to a CCM memory leak.
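
The upward slope seen in the graph can also be estimated numerically from exported counter samples. The Python sketch below fits an ordinary least-squares line to (time, VmSize) pairs; the sample values are made up for illustration, and `slope` is a hypothetical helper rather than part of any Cisco tool.

```python
from typing import Sequence


def slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys against xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all x values are identical")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical samples: hours since service start, process size in MB.
    hours = [0, 6, 12, 18, 24]
    vmsize_mb = [1500, 1560, 1620, 1680, 1740]
    print(f"growth: {slope(hours, vmsize_mb):.1f} MB/hour")
```

A slope that stays positive across a full business cycle, instead of levelling off, supports the memory leak hypothesis; a single busy period on its own does not.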


Case study-2
 Escalation was submitted to the Business Unit (BU).

 Filed a software defect CSCtc70568 with BU recommendation.

 High-level analysis of why CCM coredumped:
Due to the memory leak, an internal data structure became large in size. A
new entry was subsequently added to this data structure. The data
structure had to be re-sized to accommodate the new element. The re-size
operation took a long time and the CallManager service coredumped as a
result of that.

 CSCtc70568 ended up being marked as a duplicate of CSCsx25778.


Commonly Found Crash Defects

 CSCsv49493 – 7828-H3 goes down with journal aborted error
 CSCta73022 – 7835-I2/7845-I2 file system read-only mode journal aborted
error
 CSCtb89163 – CER defect for above
 CSCtb79203 – 7845H server read only
 CSCte19556 – Core while deleting H323 Gateway part of RG
 CSCtd58872 – Cdcc to check the return value from getSideGivenCI
prevent CCM core
 CSCte44391 – kpml message over 24 character causes ccm coredump
 CSCsl74589 – HardwareFailureAlert is raised due to iLO 2 Comm Error
 CSCsl01006 – CCM core when making call while updating pickup group
 CSCsk21012 – process core due to File size limit exceeded


Q/A

 Questions?

