Troubleshooting
          Communications Manager
          Crashes, Cores, Service
          Restarts



          Nikhil Phansalkar, Adam Frankel
          Cisco Unified Communications




Presentation_ID   © 2006 Cisco Systems, Inc. All rights reserved.   Cisco Confidential   1
Overview
         In this Presentation we will focus on troubleshooting the following issues:

          Service Crashes
            • Identify and debug coredumps
            • Troubleshoot services not starting up properly
            • Common issues that trigger service failures (licensing, DNS, etc.)

          Server Crashes
            • Symptoms of hardware failure
            • File system corruption
            • Kernel Panic
            • Using netdump to troubleshoot Kernel Panic
            • ASR (Automatic Server Recovery)
            • IMM (Integrated Management Module)


          Case Studies




Identifying an Application Core
         How do you determine that a coredump has occurred on a system?
         Here are the typical symptoms of a coredump:
            Server remained up, but service was temporarily affected.
            An alert generated from RTMT about a core file being generated.
            A message in Eventviewer – Application log.




Identifying an Application Core
         How do you determine which application generated the coredump file?
         Right click on the alert and select Alert Detail. This will show which application generated the core,
         the time of the core, and the server that had the core.




         Use the following CLI command to list all cores present on the system:
                     utils core list                                                 [for CUCM ver 5.x, 6.x]
                     utils core active list                                          [for CUCM ver 7.x and later]




                  In the above examples, it’s the CCM application that generated the coredump.
Generating Backtrace
           Use the following CLI command to generate a backtrace:
                       utils core analyze <CoreFilename>                                 [for CUCM ver 5.x, 6.x]
                       utils core active analyze <CoreFilename>                          [for CUCM ver 7.x and later]

           Option-1: Generate the backtrace using the CLI command in the customer
             environment. The core analysis may cause momentary increase in CPU
             utilization. For busy systems, it is advised to run this command during off-hours.
           Option-2: Generate the backtrace on a lab server.
            Download and retrieve the core file from the production system.
            Upload the core file to /var/log/active/core on a lab server (requires root
             access). The lab server should be running the exact same CUCM version.
            Execute the CLI command on the lab server.
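
           The version split above can be captured in a small helper. This is a minimal
           sketch: the function name core_commands and its return shape are illustrative
           assumptions; only the command strings come from the slides.

```python
def core_commands(cucm_version: str, core_filename: str = "<CoreFilename>") -> dict:
    """Return the core list/analyze CLI commands for a CUCM version such as '6.1.4'.

    Hypothetical helper: only the command strings are taken from the slides.
    """
    major = int(cucm_version.split(".")[0])
    if major < 5:
        raise ValueError("the slides only cover CUCM 5.x and later")
    # 5.x and 6.x use 'utils core ...'; 7.x and later use 'utils core active ...'
    prefix = "utils core" if major in (5, 6) else "utils core active"
    return {
        "list": f"{prefix} list",
        "analyze": f"{prefix} analyze {core_filename}",
    }
```

           For example, core_commands("7.0.2")["list"] selects the 7.x form of the list command.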




Search Topic
         Use the first 4 to 6 lines of the backtrace to formulate a search string for
         Topic. Consider the following backtrace:




         As a starting point, the following search string can be used:
         _STL::list PickupMemberDnTable::findSubscribedMemberDnList PickupMonitoring::sendNotifyReq
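
         One way to build such a search string mechanically is sketched below. It assumes
         gdb-style frame lines ("#N 0xADDR in function (...)"); the regular expression, the
         NOISE set of generic frames to skip, and the name search_string are illustrative
         assumptions, not part of any Cisco tool.

```python
import re

# Matches gdb-style frames such as '#3 0x0050d58b in ns::func () from lib.so'
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s*\(")

# Generic frames that say little about the failing code (assumed list)
NOISE = {"_dl_sysinfo_int80", "raise", "abort"}


def search_string(backtrace: str, max_frames: int = 6) -> str:
    """Join the function names of the first few distinctive frames."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # continuation of a wrapped frame, or unrelated text
        name = match.group(1)
        if name in NOISE or name in names:
            continue
        names.append(name)
        if len(names) == max_frames:
            break
    return " ".join(names)
```

         Start with 4 to 6 frames and drop trailing names if the search returns nothing.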

Review Results of Topic Search
         Check if there are any known bugs applicable to the customer’s CUCM version.




Troubleshoot Unresolved Coredumps
        If the backtrace does not match an existing bug, then the following data
        should be collected for analysis:
         Event Viewer-SystemLog
         Event Viewer-ApplicationLog
         RIS DataCollector PerfmonLog
         Logs (set to Detailed/Debug trace level) for the service that generated the
        coredump. It’s a good idea to get CallManager logs even if it’s not the
        application that crashed.
         Coredump file (required to submit an escalation to BU).




Troubleshoot Unresolved Coredumps
         The logs will provide an indication of the system activity prior to the crash.
         The intention is to isolate any unique events or errors that may have been a
        factor in triggering the coredump.
         If the coredump has occurred multiple times, check for repeating patterns of
        any particular event/error. Identifying the circumstances leading up to the
        coredump typically expedites the resolution of these issues.
         Finally, open an escalation with the Business Unit. Use the template on the
        escalation page to ensure that you have collected all the required information.
         If it’s not a known issue, then most likely you could be the proud submitter of
        a new software defect!
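
         The check for repeating patterns across multiple coredumps can be sketched as
         follows. It assumes the relevant log entries have already been parsed into
         (timestamp, message) pairs; the function name recurring_events and the 5-minute
         default window are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def recurring_events(events, crash_times, window=timedelta(minutes=5)):
    """Return messages seen in the window before every one of the crashes.

    events: iterable of (datetime, message) pairs already parsed from traces.
    crash_times: list of datetimes at which a coredump occurred.
    """
    events = list(events)
    crashes_preceded = Counter()
    for crash in crash_times:
        # Distinct messages logged shortly before this crash
        preceding = {msg for ts, msg in events if crash - window <= ts < crash}
        crashes_preceded.update(preceding)
    return sorted(msg for msg, n in crashes_preceded.items() if n == len(crash_times))
```

         A message that precedes every coredump is a candidate trigger, not proof of cause.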




Intentional Coredumps: Resource Starvation
         A CallManager service may generate a coredump intentionally. This could be due to:
          High CPU utilization on the system. CCM may not get access to CPU
         resources and may crash itself on purpose in order to recover from that state.
          A blocked thread. If a thread that CCM is trying to use is blocked, CCM may
         crash itself in an attempt to get out of that state.




Intentional Core Dumps: Due to Mem Leak
          Sometimes, a memory leak may trigger a coredump.
          Because of an OS limitation, an individual process can allocate at most
         3 GB of memory.
          If the process tries to allocate memory beyond this limit, an intentional
         coredump will be generated.
          Refer to the next slide to see what the backtrace looks like in this situation.




Intentional Coredumps: Due to Mem Leak
         backtrace
         ===================================
         #0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
         #1 0x01276825 in raise () from /lib/tls/libc.so.6
         #2 0x01278289 in abort () from /lib/tls/libc.so.6
         #3 0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from
         /usr/local/cm/lib/libstlport.so.5.1
         #4 0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1
         #5 0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1
         #6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1
         #7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1
         #8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105
         #9 0x0a0014e2 in H245SessionManager::create (parentId={mSdlProcessName = 0x0, mSdlNodeId =
         4, mSdlAppId = 100, mSdlProcessNumber = 150, mSdlProcessInstance = 2629},
         vH245TerminalType=H245_Gateway, vH245TransportConnectionMode=H245Client,
         vH245IpAddress=404699044, vH245IpPort=40076, vTCPTos=96, vPassThruMSD=false, vTCSTimeout=10,
         vFastStartInd=0, vFsAudioOutgoingLCN=0, vFsAudioIncomingLCN=0, pktCaptureContext=0xbffab74d
         "", allowTCPKeepAlivesForH323=true) at ProcessH245SessionManager.cpp:221
         #10 0x08a5629c in H245Interface::start_Transition (this=0xbff99008, s=@0x5c70990) at
         /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:123
         #11 0x08a99354 in H245Interface::fireSignal (this=0xbff99008, sdlSignal=@0x5c70990) at
         /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:175
         #12 0x0a06c904 in SdlProcessBase::inputSignal (this=0xbff99008, rSignal=0x5c70990,
         traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0,
         veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:397
         #13 0x0a0746ce in SdlRouter::callProcess (this=0xe225ac0, _sdlSignal=0x5c70990,
         _deleteSignal=@0x36b8d07, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0,
         _vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:371
         #14 0x0a0740f3 in SdlRouter::scheduler (sdlRouter=0xe225ac0) at SdlRouter.cpp:281
         #15 0x05514bd7 in ACE_OS_Thread_Adapter::invoke (this=0xfe57a30) at OS_Thread_Adapter.cpp:94
         #16 0x054d5087 in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
         #17 0x00db73cc in start_thread () from /lib/tls/libpthread.so.0
         #18 0x0131a96e in clone () from /lib/tls/libc.so.6


Troubleshoot Intentional Coredumps
          Intentional coredumps typically generate similar backtraces.
          Searching Topic may yield several hits, but they may not always be
         pertinent to the issue you are troubleshooting.
          Remember: intentional coredump is a symptom of some other problem.
          If you see an intentional coredump, retrieving and analyzing PerfMonLogs is
         crucial to figure out the CPU/Memory utilization prior to the coredump since
         that will lead you to root cause.




Services Not Starting

           A service not starting is different from a service crash.
            Often times the service never started on system boot.
           Some Possible Culprits
                  • Licensing
                  • Database
                  • Disk Space
                  • services.conf corruption
                  • Software defect




Services Not Starting

           Run “utils service list” via the CLI
                  Is the service deactivated?
                  Is the service “Commanded out of service”?
                  Is the service in a “[STOPPED]” state?

           Make an assessment as to which service(s) is expected to be started but is not


Licensing

           If CCM is not starting, verify in the License Unit Report that the
            SW_Feature license is loaded and that sufficient NODE
            licenses are available




Verify disk space

           ‘show status’ will display disk usage for active, inactive,
            and common partitions
           Verify that none are above 97% disk usage




           Some services require disk space on the active partition
            to start and on the common partition for logging
            purposes
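
           The 97% rule can be expressed as a simple threshold check. This sketch assumes
           the usage percentages have already been read from ‘show status’ and typed into a
           mapping by hand; the function name and the partition labels are illustrative.

```python
def partitions_over_limit(usage_percent: dict, limit: float = 97.0) -> list:
    """Return the names of partitions whose disk usage exceeds the limit.

    usage_percent maps a partition label (for example 'active') to its
    used-space percentage as read from the CLI output.
    """
    return sorted(name for name, pct in usage_percent.items() if pct > limit)
```

           An empty result means no partition is above the limit; 97% itself is not flagged.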



Symptoms of DB Problems

           If multiple services will not start and no logs are being
            written, there may be a problem with Informix
           Verify if “A Cisco DB” has started
           Run ‘show tech dbstateinfo’
                  • Determine if Informix is online (first line)
                  • Find #RSAM to compare the number of DB sessions and used
                    DB memory per user, similar to ‘onstat -g ses’

           Check the Informix logs for DB errors
                  activelog cm/log/informix/ccm.log



Symptoms of DB Problems




                  Check for any user with excess sessions open or if any single session is
                  using excess DB memory. This may identify a process that needs to be
                  investigated further.




Informix/DNS
           CSCsw88022 - Database should still start and function
          when DNS is unavailable. This is fixed as of 7.1(1), as
          sqlhosts no longer uses DNS
            If “dns” is present in the “hosts” line of /etc/nsswitch.conf, then
             Informix relies on DNS to start up properly (pre-7.1)
            Check ‘utils network host [fqdn/ip]’
           Make sure that external resolution resolves properly for all CUCM
           servers, forward and reverse.
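
           Both checks can be sketched in Python for use on a workstation or lab host with
           access to the same DNS servers. The function names are illustrative assumptions;
           the resolver arguments default to the standard library calls and exist only so
           the logic can be exercised without a network.

```python
import socket


def hosts_line_uses_dns(nsswitch_text: str) -> bool:
    """Return True if the 'hosts:' line of nsswitch.conf content lists 'dns'."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            return "dns" in line.split(":", 1)[1].split()
    return False


def forward_and_reverse_ok(fqdn, forward=socket.gethostbyname,
                           reverse=socket.gethostbyaddr) -> bool:
    """Resolve fqdn to an IP, resolve the IP back, and compare the names."""
    try:
        ip = forward(fqdn)
        name, _aliases, _addresses = reverse(ip)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return False
    return name.lower().rstrip(".") == fqdn.lower().rstrip(".")
```

           Run forward_and_reverse_ok once per CUCM node; a False result for any node is worth
           investigating before suspecting Informix itself.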




Services Deactivated After Reboot
           The ‘services.conf’ is located in /usr/local/platform/conf
           It contains a list of which services to activate on boot




           If the disk is full this file might be recreated as a zero byte file.
            This will cause all services to be deactivated on startup.
           Remedy the disk situation
           Restore the services.conf from another server or lab server of same
            version as a workaround
           After service is restored advise customer to rebuild corrupted node
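
           The zero-byte condition is easy to test for once you have root access to the
           file system. A minimal sketch, with the function name as an assumption and the
           path passed in by the caller:

```python
import os


def is_zero_byte_file(path: str) -> bool:
    """Return True if path is an existing regular file of size 0."""
    return os.path.isfile(path) and os.path.getsize(path) == 0
```

           On an affected node this would be called with the services.conf path given above.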
Troubleshoot Server Freezes

          Problem Symptoms:
           The server was running fine for a number of minutes, months, or years
            and then suddenly stops responding.
           The server cannot be accessed via the web, ssh, or the console.
           All CUCM services stopped responding.




Troubleshoot Server Freezes

           Check the console for any messages. Eg:
                  EXT3-fs error (device sda6) in start_transaction: Journal has aborted

           The errors may also be written to Eventviewer-SystemLog. But, this can
            only be viewed after system reboot. Note that it may not capture all
            messages displayed on the console.
                  Note: you can access the console using iLO (on HP servers) or using IMM (on supported
                   IBM servers).

           Reboot the server. A recovery disc may be required to ensure that the file
            system has fully recovered.
           Check for hardware issues.
           If none of the above reveals the cause, then enable netdump using the CLI
            to gather information for subsequent failures.




dmesg

           dmesg (for "display message") can be used to print the message buffer of
            the kernel.
           This contains diagnostic messages (for example, when I/O devices encounter
            errors). The messages are typically displayed on the console, but the
            console output can quickly get overwritten.
           If the file system becomes read-only, syslog messages are no longer written to
            the syslog file on disk, but the messages still exist in kernel memory.
           dmesg provides a mechanism to review these messages at a later time.
           Currently, this command has to be executed as root. There is an
            enhancement defect CSCtc59353 to get this information directly from the
            admin CLI.




Hardware Problems: Server Self Diagnostics
   Power On Self Test (POST)
    During boot, the server tests all hardware for functionality
    Failure of any device results in a POST error, which is indicated by a message on screen,
     an audible error (beeps), or an amber/red light
    Hard drives have an indicator light: green is the normal running state, amber or red
     indicates a problem
    Inspect the hardware report for SMART errors. These may occur if a disk has a large
     number of bad sectors. In this case the light may still be green.
    Lights on the front of the server and on the motherboard can help indicate failing
     hardware
                  If there is a red or amber light on the front of the server, run vendor diagnostics to get
                  more details




Vendor Diagnostics (HP/IBM)

          IBM and HP require bootable hardware diagnostics discs to be
           run.
          IBM Servers require DSA
          HP Servers require Smart Start
          Detailed Steps are provided in the email templates on TAC-Wiki
          http://tac-wiki/Communications_Manager_Hardware_failure




File System Issues

          A forced reboot or hard reset can damage the file systems and
            prevent the server from booting. This can also be caused by a
            firmware bug or a hardware problem (e.g., a bad hard drive).
          Symptoms:
           Server does not boot completely. Console may indicate:
                  *** An error occured during the file system check.
                  *** Dropping you to a shell; the system will reboot
                  *** when you leave the shell.
                  Give root password for maintenance
                  (or type Control-D to continue):

           Server displays file system related errors on boot:
                  EXT3-fs error (device ...) in start_transaction: Journal has aborted

           Server indicates a manual file system check (FSCK) is required




File System Issues
          Resolution:
           Boot the server using the CUCM recovery disk.
           Execute the automatic and manual file system check.
           It is always suggested to use the latest recovery disc regardless of product
            version.
                  Note: Prior to CUCM 6.1.4 and CUCM 7.0.2, the recovery disc contained manual [m] and
                   automatic [f] fsck options. The automatic option [f] was not effective and sometimes did not
                   resolve the issue. The manual option [m] worked fine in all cases. Starting with CUCM 6.1.4
                   and CUCM 7.0.2, the fsck logic was enhanced and the recovery CD menu was
                   updated to contain the automatic option only [refer to CSCsu08170].

           Not all file system corruptions can be fixed. You might have to fresh install
            and execute a DRS restore.
           If the system is still experiencing issues, this points to hardware failure.
            Install new hard drives and then perform a fresh install with DRS recovery.
           A frequently observed bug is CSCta73022. If /common partition is affected,
            BU recommends rebuilding the server.

Kernel Panic

           A kernel panic is an action taken by an operating system upon detecting
            an internal fatal error from which it cannot safely recover.
           Attempts by the operating system to read an invalid or non-permitted
            memory address are a common source of kernel panics.
           In many cases, the operating system could continue operation after
            memory violations have occurred. However, the system is in an unstable
            state and rather than risking security breaches and data corruption, the
            operating system stops to prevent further damage and facilitate diagnosis
            of the error.
           A kernel panic may also occur as a result of a hardware failure or a bug in
            the operating system.
           This is similar to Windows "Bug Check" (aka: "Blue Screen of Death").
           IPVMS, CSA and FIOR are the Cisco kernel modules that may cause
            Kernel Panic. You can try disabling them as a workaround.

Netdump
           Use netdump to troubleshoot kernel panic issues.
           Netdump uses UDP port 6666.
           Contains information that indicates where the kernel panicked.
           Utilizes a client-server model.
           Does not work with NIC-teaming enabled.




Configuring Netdump
          Configure the Netdump server
          1. Log in to the server designated as the netdump server.
          2. Start the netdump server:
                  utils netdump server start

          3. Enter the following command for each netdump client machine:
                  utils netdump server add-client <Ip-Addr-of-netdump-client>

          4. Enter the following command to verify the status of the netdump server:
                  utils netdump server status

          5. Use the following command to verify the clients on the list:
                  utils netdump server list-clients
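
          When several clients need to be registered, the server-side command sequence can
          be generated up front and pasted into the CLI session. A minimal sketch; the
          function name is an assumption and only the command strings come from the steps
          above.

```python
from ipaddress import ip_address


def netdump_server_commands(client_ips):
    """Build the ordered CLI commands to run on the netdump server."""
    commands = ["utils netdump server start"]
    for ip in client_ips:
        ip_address(ip)  # raises ValueError on a malformed address
        commands.append(f"utils netdump server add-client {ip}")
    commands.append("utils netdump server status")
    commands.append("utils netdump server list-clients")
    return commands
```

          Validating each address first avoids registering a mistyped client.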




Configuring Netdump
          Configure the Netdump client
          1. Log in to the server designated as the netdump client.
          2. Start the netdump client:
                  utils netdump client start <Ip-Addr-of-netdump-server>

          3. Enter the following command to verify the status of the netdump client:
                  utils netdump client status




Configuring Netdump
          Verify that the client and server are communicating.
           After configuring the netdump server and netdump client, execute the
            following command on the netdump server:
                  file list activelog crash/



           You should see a new sub-directory which has the client IP address and
            the date-timestamp when it started:

                   admin:file list activelog crash/
                   <dir>   14.48.60.80-2010-03-05-11:30
                   <dir>   magic
                   <dir>   scripts
                   dir count = 3, file count = 0
                   admin:



           A new sub-directory will be created each time the netdump client is
            restarted.
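
           The sub-directory names can be split back into client address and start time when
           sorting through many of them. This sketch assumes the name format seen in the
           listing above (IP address, then year-month-day-hour:minute); the function name is
           an assumption.

```python
from datetime import datetime
from ipaddress import ip_address


def parse_crash_dir(name: str):
    """Split '<client-ip>-<YYYY-MM-DD-HH:MM>' into (ip, datetime).

    Returns None for entries that do not follow that pattern,
    such as the 'magic' and 'scripts' directories.
    """
    ip_text, _, stamp = name.partition("-")
    try:
        return ip_address(ip_text), datetime.strptime(stamp, "%Y-%m-%d-%H:%M")
    except ValueError:
        return None
```

           Sorting the parsed results by timestamp shows the most recent client restart first or last as needed.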


Netdump: Example
                         !!DO NOT TRY THIS IN A PRODUCTION ENVIRONMENT!!

          On netdump client machine, trigger a kernel panic:




          The console displays:




Netdump: Example
          The netdump diagnostic information gets stored in a sub-directory at the
            /var/crash location on the netdump server:




          Contents of the log file:




ASR: Automatic Server Recovery

           Applicable only to HP servers. Enabled by default.
           ASR is implemented via the HP ASM (Advanced System Management) driver.
           ASR uses a 10-minute countdown timer.
           During regular operation, the ASM driver frequently resets this timer to
            prevent it from counting down to zero.
           If the timer counts down to 0, it is assumed that the operating system is
            locked up and the system automatically attempts to reboot.
           Collect the IML (Integrated Management Log) from the system
            using the following command:
                  file view system-management-log

                  ID   Severity       Initial Time      Update Time       Count
                  -------------------------------------------------------------
                  0000 Critical       20:44 04/02/2007 20:44 04/02/2007 0001
                  LOG: ASR Detected by System ROM
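
           The countdown behaviour described above is a classic watchdog timer. The toy model
           below illustrates the idea only; it is not HP's implementation, and the class and
           method names are assumptions.

```python
class WatchdogTimer:
    """Toy model of a countdown watchdog such as the 10-minute ASR timer."""

    def __init__(self, timeout_s: float = 600.0):
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s

    def kick(self) -> None:
        """Called by a healthy OS driver to restart the countdown."""
        self.remaining_s = self.timeout_s

    def tick(self, elapsed_s: float) -> bool:
        """Advance time; return True once the countdown reaches zero."""
        self.remaining_s = max(0.0, self.remaining_s - elapsed_s)
        return self.remaining_s == 0.0
```

           As long as kick() keeps arriving, tick() never returns True. A hung operating system
           stops kicking, the countdown expires, and the hardware forces a reboot.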




IMM: Integrated Management Module

           Newer IBM servers such as the 7835-I3 and the 7845-I3 include IBM’s
            IMM.
           IMMs have an OS Watchdog feature that is similar to HP’s ASR. This
            feature is disabled by default.
           Refer to CSCte05285 which tracks the enhancement request to include
            the server recovery functionality into the new IBM servers.
           You can access IMM using its own Ethernet port (labelled ‘System Mgmt’).





          The IMM is set initially with a user name of USERID and password of
            PASSW0RD (with a zero, not the letter O).




HP vs. IBM

                                     HP                                    IBM

  Automated Recovery                 Enabled in all HP servers by          Supported in newer IBM servers only
                                     default.                              [7835-I3 and 7845-I3] via IMM.
                                                                           Disabled by default.
                                     To view corresponding logs:           To view corresponding logs: <TBD>
                                     ‘file view system-management-log’

  In-depth vendor diagnostics        SmartStart CD (bootable)              DSA CD (bootable)
  (requires downtime)

  High-level system diagnostics      CLI commands:                         CLI commands:
  (does not require downtime)        utils create report hardware          utils create report hardware
                                     utils diagnose test                   utils diagnose test
                                     show hardware                         show hardware
                                     show environment                      show environment




Case study-1
          TAC case: 611181361.
          Problem Description: Customer created TAC case to investigate following
           alarm:
         04/06/2009 20:38:26.455 LPM|GenAlarm: AlarmName = CoreDumpFileFound, DeviceName = fm11d-bq50vcm1,
            AlarmMsg = CoreDumpFileFound
         TotalCoresFound : 1
         CoreDetails : The following lists up to 6 cores dumped by corresponding applications.
         Core1 : Unknown (core.3733.11.showtechCCMDB.s.1239075504)
         AppID : Cisco Log Partition Monitoring Tool
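
          The core file name itself carries useful detail. The sketch below assumes the
          layout core.<pid>.<signal>.<program>.<epoch-seconds>, which is inferred from this
          one example rather than from documentation; the function name is also an
          assumption.

```python
from datetime import datetime, timezone


def parse_core_name(filename: str) -> dict:
    """Split a core file name into pid, signal number, program, and dump time (UTC)."""
    parts = filename.split(".")
    if len(parts) < 5 or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {filename!r}")
    return {
        "pid": int(parts[1]),
        "signal": int(parts[2]),
        # The program name may itself contain dots, so rejoin the middle fields
        "program": ".".join(parts[3:-1]),
        "dumped_at": datetime.fromtimestamp(int(parts[-1]), tz=timezone.utc),
    }
```

          For the file in this alarm the fields decode to pid 3733, signal 11 (SIGSEGV on
          Linux), program "showtechCCMDB.s" (which may be a truncated name), and a dump time
          of 2009-04-07 03:38:24 UTC.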




Case study-1
          Backtrace:
                  #0 0x080ba54c in glob_filename ()
                  #1 0x080ba5a2 in glob_filename ()
                  #2 0x080ba5a2 in glob_filename ()
                  #3 0x080ba5a2 in glob_filename ()
                  #4 0x080ba5a2 in glob_filename ()
                  #5 0x080ba5a2 in glob_filename ()
                  #6 0x080ba5a2 in glob_filename ()
                  #7 0x080ba5a2 in glob_filename ()
                  #8 0x080ba5a2 in glob_filename ()
                  #9 0x080823b2 in shell_glob_filename ()
                  #10 0x0807ed3d in expand_words_shellexp ()
                  #11 0x0807f26c in expand_words_shellexp ()
                  #12 0x0807ec19 in expand_words ()
                  #13 0x08069766 in execute_command_internal ()
                  #14 0x08066d9c in execute_command_internal ()
                  #15 0x08094822 in parse_and_execute ()
                  #16 0x0807b3b2 in command_substitute ()
                  #17 0x0807e223 in pat_subst ()
                  #18 0x08079700 in cond_expand_word ()
                  #19 0x080797c1 in cond_expand_word ()
                  #20 0x08079819 in expand_string_unsplit ()
                  #21 0x08079478 in string_rest_of_args ()
                  #22 0x08078f8c in strip_trailing_ifs_whitespace ()
                  #23 0x08079029 in do_assignment ()
                  #24 0x0807f2b4 in expand_words_shellexp ()
                  #25 0x0807ec19 in expand_words ()
                  #26 0x08069766 in execute_command_internal ()
                  #27 0x08066d9c in execute_command_internal ()
                  #28 0x08067f09 in execute_command_internal ()
                  #29 0x08066fde in execute_command_internal ()
                  #30 0x080668a4 in execute_command ()
                  #31 0x08067ed2 in execute_command_internal ()
                  #32 0x08066fde in execute_command_internal ()
                  #33 0x080668a4 in execute_command ()
                  #34 0x08067ed2 in execute_command_internal ()
                  #35 0x08066fde in execute_command_internal ()
                  #36 0x080668a4 in execute_command ()
                  #37 0x08067ed2 in execute_command_internal ()
                  #38 0x08066fde in execute_command_internal ()
                  #39 0x080668a4 in execute_command ()
                  #40 0x08068e94 in execute_command_internal ()
                  #41 0x08066f6d in execute_command_internal ()
                  #42 0x080668a4 in execute_command ()
                  #43 0x0805c969 in reader_loop ()
                  #44 0x0805ae9b in main ()




Case study-1
          The backtrace contained function names such as ‘execute_command_internal’,
           ‘parse_and_execute’, and ‘expand_words_shellexp’.
          This most likely meant that the coredump was triggered by a CLI command.
          Next, the following traces were retrieved and analyzed:
                  - Cisco CallManager Admin
                  - IPT Platform CLI Logs


          The IPT Platform CLI logs revealed that “show tech locales” was the
           last CLI command executed just before the coredump occurred.
          A Topic search did not yield any known bugs.
          An escalation was submitted to the Business Unit (BU).
          CSCsz24566 was then filed and was eventually resolved by the BU.
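
The first step of this case study, spotting shell-related function names in a backtrace, can be sketched in Python. This is only an illustration under stated assumptions: the helper names `frame_functions` and `looks_like_shell_core` and the `SHELL_HINTS` set are hypothetical, and the hint list simply reuses the three names called out on the slide.

```python
import re
from typing import List

# Matches gdb-style frames such as "#13 0x08069766 in execute_command_internal ()".
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+([\w:~]+)")

# Hypothetical hint list for illustration: the three names the slide calls out.
SHELL_HINTS = {
    "execute_command_internal",
    "parse_and_execute",
    "expand_words_shellexp",
}


def frame_functions(backtrace: str) -> List[str]:
    """Return the function name of every frame, in the order it appears."""
    return [match.group(2) for match in FRAME_RE.finditer(backtrace)]


def looks_like_shell_core(backtrace: str) -> bool:
    """True when at least one frame matches a shell hint name."""
    return any(name in SHELL_HINTS for name in frame_functions(backtrace))


# Three frames copied from the backtrace above.
SAMPLE = """\
#13 0x08069766 in execute_command_internal ()
#15 0x08094822 in parse_and_execute ()
#16 0x0807b3b2 in command_substitute ()
"""

if __name__ == "__main__":
    print(frame_functions(SAMPLE))
    print(looks_like_shell_core(SAMPLE))
```

A match only suggests where to look next; the CLI logs are still needed to find the actual command.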




Case study-2
          TAC case: 612476435.
          Problem Description: The CallManager service coredumps roughly every
           two and a half days.
                  admin:utils core active list

                        Size         Date            Core File Name
                  =================================================================
                               2009-09-13 08:03:25   core.9800.6.ccm.1252843074
                               2009-09-15 15:58:52   core.2497.6.ccm.1253044183
                               2009-09-18 00:03:38   core.3564.6.ccm.1253245847
                               2009-09-20 08:00:16   core.6676.6.ccm.1253447596
                               2009-09-22 16:00:18   core.8282.6.ccm.1253649103
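
The spacing of the cores listed above can be checked with a few lines of Python. This is a minimal sketch: the function name `hours_between_cores` is made up for illustration, and the timestamps are copied from the listing.

```python
from datetime import datetime
from typing import List

# Timestamps copied from the "utils core active list" output above.
CORE_TIMES = [
    "2009-09-13 08:03:25",
    "2009-09-15 15:58:52",
    "2009-09-18 00:03:38",
    "2009-09-20 08:00:16",
    "2009-09-22 16:00:18",
]


def hours_between_cores(stamps: List[str]) -> List[float]:
    """Return the gap in hours between each pair of consecutive cores."""
    times = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps)
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in hours_between_cores(CORE_TIMES):
        print(f"{gap:.2f} h")
```

For these five cores every gap comes out close to 56 hours, a little under two and a half days. Such a regular interval hints at a condition that builds up steadily rather than a random trigger.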

          Backtrace:
                  #0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
                  #1 0x00d64815 in raise () from /lib/tls/libc.so.6
                  #2 0x00d66279 in abort () from /lib/tls/libc.so.6
                  #3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
                  #4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManager's timers appear
                  incorrect. This may be due to CPU or blocked function. Attempting to restart
                  CallManager.") at ProcessCMProcMon.cpp:106
                  #5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
                  #6 0x084c7035 in CMProcMon::callManagerMonitorThread (cmProcMon=0xec122d0) at
                  ProcessCMProcMon.cpp:439
                  #7 0x0107e5fb in ACE_OS_Thread_Adapter::invoke (this=0xf3ef3b8) at
                  OS_Thread_Adapter.cpp:94
                  #8 0x01040cbf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
                  #9 0x002dc3cc in start_thread () from /lib/tls/libpthread.so.0
                  #10 0x00e061ae in clone () from /lib/tls/libc.so.6
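
An intentional abort like the one in frame #4 can be picked out of a backtrace automatically. The sketch below is an assumption-laden illustration: the helper name `intentional_abort_reason` is hypothetical, and the pattern is written only against the frame format shown above.

```python
import re
from typing import Optional

# Matches: IntentionalAbort (reason=0x... "<message, possibly wrapped>")
ABORT_RE = re.compile(r'IntentionalAbort\s*\(reason=0x[0-9a-fA-F]+\s+"([^"]*)"')


def intentional_abort_reason(backtrace: str) -> Optional[str]:
    """Return the abort message with line wraps collapsed, or None if absent."""
    match = ABORT_RE.search(backtrace)
    if match is None:
        return None
    return " ".join(match.group(1).split())


# Frame #4 copied from the backtrace above, including its line wraps.
FRAME_4 = """#4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManager's timers appear
incorrect. This may be due to CPU or blocked function. Attempting to restart
CallManager.") at ProcessCMProcMon.cpp:106"""

if __name__ == "__main__":
    print(intentional_abort_reason(FRAME_4))
```

When the function returns a message, the core is a symptom of some other problem, so the next step is the Perfmon data rather than a Topic search on the backtrace alone.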



Case study-2
            The backtrace indicates that it is an intentional coredump.
            Hence, the Perfmon data needs to be reviewed next to check for:
                  • CPU Utilization
                  • Memory Leaks

            The CPU utilization looks steady prior to the coredump.




Case study-2
            The %VM Used counter appears to be high




            The VMSize for CCM is high. Also, note how
           the line slopes upwards, which signifies increasing
           memory usage over time.


           => The data points to a CCM memory leak.
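
The "line slopes upwards" observation can be turned into a number by fitting a straight line to the VMSize samples. The sketch below uses a plain least-squares slope. The function name `slope_per_sample` is hypothetical, and both sample series are made-up values for illustration, not data from this case.

```python
from typing import Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up VMSize readings in MB, one per sampling interval (NOT from the case).
STEADY = [1500, 1502, 1499, 1501, 1500, 1498]
LEAKING = [1500, 1620, 1745, 1860, 1990, 2110]

if __name__ == "__main__":
    print(f"steady:  {slope_per_sample(STEADY):+.1f} MB per sample")
    print(f"leaking: {slope_per_sample(LEAKING):+.1f} MB per sample")
```

A slope near zero matches a healthy process, while a persistently positive slope over many hours is the pattern that points to a leak. Where to set the threshold is a judgment call that depends on the sampling interval.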


Case study-2
          Escalation was submitted to the Business Unit (BU).


          A software defect, CSCtc70568, was filed on the BU's recommendation.


          High-level analysis of why CCM coredumped:
                  Due to the memory leak, an internal data structure grew very large.
                  When a new entry was subsequently added, the data structure had to
                  be resized to accommodate it. The resize operation took a long time,
                  and the CallManager service coredumped as a result.


          CSCtc70568 ended up being marked as a duplicate of CSCsx25778.




Commonly Found Crash Defects

           CSCsv49493 – 7828-H3 goes down with journal aborted error
           CSCta73022 – 7835-I2/7845-I2 file system read-only mode journal aborted
            error
           CSCtb89163 – CER defect for above
           CSCtb79203 – 7845H server read only
           CSCte19556 – Core while deleting H323 Gateway part of RG
           CSCtd58872 – Cdcc to check the return value from getSideGivenCI
            prevent CCM core
           CSCte44391 – kpml message over 24 character causes ccm coredump
           CSCsl74589 – HardwareFailureAlert is raised due to iLO 2 Comm Error
           CSCsl01006 – CCM core when making call while updating pickup group
           CSCsk21012 – process core due to File size limit exceeded

Q/A

           Questions?





TSRT Crashes

  • 1. Troubleshooting Communications Manager Crashes, Cores, Service Restarts Nikhil Phansalkar, Adam Frankel Cisco Unified Communications Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 1
  • 2. Overview In this Presentation we will focus on troubleshooting the following issues:  Service Crashes • Identify and debug coredumps •Troubleshoot services not starting up properly • Common issues that trigger service failures (Licensing, DNS etc)  Server Crashes • Symptoms of hardware failure • File system corruption • Kernel Panic • Using netdump to troubleshoot Kernel Panic • ASR (Automatic Server Recovery) • IMM (Integrated Management Module)  Case Studies Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 2
  • 3. Identifying an Application Core How to determine that a coredump has occurred on a system ? Here are the typical symptoms of a coredump:  Server remained up, but service was temporarily affected.  An alert generated from RTMT about a core file being generated.  A message in Eventviewer – Application log. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 3
  • 4. Identifying an Application Core How to determine which application has generated the coredump file ? Right click on the alert and select Alert Detail. This will show which application generated the core, the time of the core, and the server that had the core. Use the CLI command to list all cores present on the system.: utils core list [for CUCM ver 5.x, 6.x] utils core active list [for CUCM ver 7.x and later] In the above examples, it’s the CCM application that generated the coredump. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 4
  • 5. Generating Backtrace Use the following CLI command to generate a backtrace: utils core analyze <CoreFilename> [for CUCM ver 5.x, 6.x] utils core active analyze <CoreFilename> [for CUCM ver 7.x and later] Option-1: Generate the backtrace using the CLI command in the customer environment. The core analysis may cause momentary increase in CPU utilization. For busy systems, it is advised to run this command during off-hours. Option-2: Generate the backtrace on a lab server.  Download and retrieve the core file from the production system.  Upload the core file to /var/log/active/core on a lab server (requires root access). The lab server should be running the exact same CUCM version.  Execute the CLI command on the lab server. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 5
  • 6. Search Topic Using the first 4 to 6 lines of the backtrace to formulate a search string for Topic. Consider the following backtrace: As a starting point, the following search string can be used: _STL::list PickupMemberDnTable::findSubscribedMemberDnList PickupMonitoring::sendNotifyReq Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 6
  • 7. Review Results of Topic Search Check if there are any known bugs applicable to the customer’s CUCM version. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 7
  • 8. Troubleshoot Unresolved Coredumps If the backtrace does not match an existing bug, then the following data should be collected for analysis:  Event Viewer-SystemLog  Event Viewer-ApplicationLog  RIS DataCollector PerfmonLog  Logs (set to Detailed/Debug trace level) for the service that generated the coredump. It’s a good idea to get CallManager logs even if its not the application that crashed.  Coredump file (required to submit an escalation to BU). Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 8
  • 9. Troubleshoot Unresolved Coredumps  The logs will provide an indication of the system activity prior to the crash.  The intention is to isolate any unique events or errors that may have been a factor in triggering the coredump.  If the coredump has occurred multiple times, check for repeating patterns of any particular event/error. Identifying the circumstances leading up to the coredump typically expedites the resolution of these issues.  Finally, open an escalation with the Business Unit. Use the template on the escalation page to ensure that you have collected all the required information.  If its not a known issue, then most likely you could be the proud submitter of a new software defect! Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 9
  • 10. Intentional Coredumps: Resource Starvation An CallManager service may generate a coredump intentionally. This could be due to:  High CPU utilization on the system. Thus CCM may get not access to the CPU resources and may crash itself on purpose in order to recover from that state.  This also can indicate some thread that the CCM is trying to use is blocked and thus CCM crashes to attempt to get it out of this state. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 10
  • 11. Intentional Coredumps : Resource Starvation Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 11
  • 12. Intentional Core Dumps: Due to Mem Leak  Sometimes, a memory leak may trigger a coredump.  This is because due to OS limitation, any individual process can allocate max 3 Gb memory.  If the process tries to allocate memory beyond this limit, an intentional coredump will be generated.  Refer next slide to see what the backtrace will look like in this situation. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 12
  • 13. Intentional Coredumps: Due to Mem Leak backtrace =================================== #0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x01276825 in raise () from /lib/tls/libc.so.6 #2 0x01278289 in abort () from /lib/tls/libc.so.6 #3 0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from /usr/local/cm/lib/libstlport.so.5.1 #4 0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1 #5 0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1 #6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1 #7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1 #8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105 #9 0x0a0014e2 in H245SessionManager::create (parentId={mSdlProcessName = 0x0, mSdlNodeId = 4, mSdlAppId = 100, mSdlProcessNumber = 150, mSdlProcessInstance = 2629}, vH245TerminalType=H245_Gateway, vH245TransportConnectionMode=H245Client, vH245IpAddress=404699044, vH245IpPort=40076, vTCPTos=96, vPassThruMSD=false, vTCSTimeout=10, vFastStartInd=0, vFsAudioOutgoingLCN=0, vFsAudioIncomingLCN=0, pktCaptureContext=0xbffab74d "", allowTCPKeepAlivesForH323=true) at ProcessH245SessionManager.cpp:221 #10 0x08a5629c in H245Interface::start_Transition (this=0xbff99008, s=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:123 #11 0x08a99354 in H245Interface::fireSignal (this=0xbff99008, sdlSignal=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:175 #12 0x0a06c904 in SdlProcessBase::inputSignal (this=0xbff99008, rSignal=0x5c70990, traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0, veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:397 #13 0x0a0746ce in SdlRouter::callProcess (this=0xe225ac0, _sdlSignal=0x5c70990, _deleteSignal=@0x36b8d07, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0, _vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:371 #14 
0x0a0740f3 in SdlRouter::scheduler (sdlRouter=0xe225ac0) at SdlRouter.cpp:281 #15 0x05514bd7 in ACE_OS_Thread_Adapter::invoke (this=0xfe57a30) at OS_Thread_Adapter.cpp:94 #16 0x054d5087 in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137 #17 0x00db73cc in start_thread () from /lib/tls/libpthread.so.0 #18 0x0131a96e in clone () from /lib/tls/libc.so.6 Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 13
  • 14. Troubleshoot Intentional Coredumps  Intentional coredumps typically generate similar backtraces.  Searching topic may yield several several hits. But, they may not always be pertinent to the issue you are troubleshooting.  Remember: intentional coredump is a symptom of some other problem.  If you see an intentional coredump, retrieving and analyzing PerfMonLogs is crucial to figure out the CPU/Memory utilization prior to the coredump since that will lead you to root cause. Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 14
  • 15. Services Not Starting  A service not starting is different from a service crash. Often times the service never started on system boot.  Some Possible Culprits • Licensing • Database • Disk Space • services.conf corruption • Software defect Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 15
  • 16. Services Not Starting  Perform a “utils service list” via CLI Is the service deactivated? Is the service “Commanded out of service”? Is the service in a “[STOPPED]” state?  Make an assessment as to which service(s) is expected to be started but is not Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 16
  • 17. Licensing  If CCM is not starting, verify License Unit Report that SW_Feature License is loaded and sufficient NODE Licenses are available Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 17
  • 18. Verify disk space  ‘show status’ will display disk usage for active, inactive, and common partitions  Verify that none are above 97% disk usage  Some services require disk space on the active partition to start and on the common partition for logging purposes Presentation_ID © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential 18
  • 19. Symptoms of DB Problems
      If multiple services will not start and no logs are being written, there may be a problem with Informix.
      Verify that “A Cisco DB” has started.
      Run ‘show tech dbstateinfo’:
        • Determine if Informix is online (first line).
        • Find #RSAM to compare the number of DB sessions and the DB memory used per user, similar to ‘onstat -g ses’.
      Check the Informix logs for DB errors:
        activelog cm/log/informix/ccm.log
  • 20. Symptoms of DB Problems
      Check whether any user has an excessive number of sessions open, or whether any single session is using excessive DB memory. This may identify a process that needs to be investigated further.
  • 21. Informix/DNS
      CSCsw88022: Database should still start and function when DNS is unavailable. This is fixed as of 7.1(1), because sqlhosts no longer uses DNS.
      If “dns” is present in the “hosts” line of /etc/nsswitch.conf, then Informix relies on DNS to start up properly (pre 7.1).
      Check ‘utils network host [fqdn/ip]’.
      Make sure that external resolution works properly for all CUCM servers, both forward and reverse.
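Forward and reverse resolution can also be cross-checked from a workstation that uses the same DNS servers as the cluster. This Python sketch is an illustration, not a replacement for ‘utils network host’; the function name, the injectable resolver parameters, and the host names in the usage example are assumptions.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then confirm the IP resolves back to it.

    The resolver functions are parameters so the check can be exercised
    without a live DNS server.
    """
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:
        return False, f"reverse lookup of {ip} failed: {exc}"
    known = {n.lower().rstrip(".") for n in [name, *aliases]}
    if hostname.lower().rstrip(".") not in known:
        return False, f"{ip} resolves back to {name}, not {hostname}"
    return True, f"{hostname} <-> {ip}"


if __name__ == "__main__":
    # Hypothetical node names; replace with every server in the cluster.
    for node in ["cucm-pub.example.com", "cucm-sub1.example.com"]:
        print(check_forward_reverse(node))
```

A mismatch between the forward and reverse answers is worth fixing even on versions where the database no longer depends on DNS.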
  • 22. Services Deactivated After Reboot
      The ‘services.conf’ file is located in /usr/local/platform/conf.
      It contains the list of services to activate on boot.
      If the disk is full, this file might be recreated as a zero-byte file. This causes all services to be deactivated on startup.
      Remedy the disk space situation first.
      As a workaround, restore services.conf from another server or a lab server running the same version.
      After service is restored, advise the customer to rebuild the corrupted node.
  • 23. Troubleshoot Server Freezes
      Problem symptoms:
      The server was running fine for a number of minutes, months, or years and then suddenly stopped responding.
      The server cannot be accessed via the web, SSH, or the console.
      All CUCM services stopped responding.
  • 24. Troubleshoot Server Freezes
      Check the console for any messages, for example:
        EXT3-fs error (device sda6) in start_transaction: Journal has aborted
      The errors may also be written to Eventviewer-SystemLog, but this can only be viewed after a system reboot, and it may not capture all messages displayed on the console.
      Note: you can access the console using iLO (on HP servers) or IMM (on supported IBM servers).
      Reboot the server. A recovery disc may be required to ensure that the file system has fully recovered.
      Check for hardware issues.
      If none of the above reveals the cause, enable netdump using the CLI to gather information for subsequent failures.
  • 25. dmesg
      dmesg (for "display message") prints the message buffer of the kernel.
      The buffer contains diagnostic messages (for example, when I/O devices encounter errors). The messages are typically displayed on the console, but the console output can quickly get overwritten.
      If the file system becomes read-only, syslog messages are no longer written to the syslog file on disk, but the messages still exist in kernel memory.
      dmesg provides a mechanism to review these messages at a later time.
      Currently, this command has to be executed as root. Enhancement defect CSCtc59353 requests access to this information directly from the admin CLI.
  • 26. Hardware Problems: Server Self Diagnostics
      Power-On Self-Test (POST)
      During boot-up, the server tests all hardware for functionality.
      Failure of any device results in a POST error, which is reported on screen, by an audible error (beeps), or by an amber/red light.
      Hard drives have an indicator light: green is the normal running state; amber or red indicates a problem.
      Inspect the hardware report for SMART errors. These may occur if a disk has a large number of bad sectors; in this case the light may still be green.
      Lights on the front of the server and on the motherboard can help identify failing hardware.
      If there is a red or amber light on the front of the server, run the vendor diagnostics to get more details.
  • 27. Vendor Diagnostics (HP/IBM)
      IBM and HP servers require bootable hardware diagnostics discs to be run.
      IBM servers require DSA.
      HP servers require SmartStart.
      Detailed steps are provided in the email templates on TAC-Wiki:
        http://tac-wiki/Communications_Manager_Hardware_failure
  • 28. File System Issues
      A forced reboot or hard reset can damage the file systems and prevent the server from booting. The same damage can also be caused by a firmware bug or a hardware problem (e.g. a bad hard drive).
      Symptoms:
      The server does not boot completely. The console may indicate:
        *** An error occurred during the file system check.
        *** Dropping you to a shell; the system will reboot
        *** when you leave the shell.
        Give root password for maintenance (or type Control-D to continue):
      The server displays file system related errors on boot:
        EXT3-fs error (device ...) in start_transaction: Journal has aborted
      The server indicates that a manual file system check (FSCK) is required.
  • 29. File System Issues
      Resolution:
      Boot the server using the CUCM recovery disc.
      Execute the automatic and manual file system checks.
      Always use the latest recovery disc, regardless of product version.
      Note: Prior to CUCM 6.1.4 and CUCM 7.0.2, the recovery disc contained manual [m] and automatic [f] fsck options. The automatic option [f] was not effective and sometimes did not resolve the issue. The manual option [m] worked fine in all cases. Starting with CUCM 6.1.4 and CUCM 7.0.2, the fsck logic was enhanced and the recovery disc menu was updated to contain the automatic option only (refer to CSCsu08170).
      Not all file system corruptions can be fixed. You might have to perform a fresh install and execute a DRS restore.
      If the system is still experiencing issues, this points to a hardware failure. Install new hard drives and then perform a fresh install with DRS recovery.
      A frequently observed bug is CSCta73022. If the /common partition is affected, the BU recommends rebuilding the server.
  • 30. Kernel Panic
      A kernel panic is an action taken by an operating system upon detecting an internal fatal error from which it cannot safely recover.
      Attempts by the operating system to read an invalid or non-permitted memory address are a common source of kernel panics.
      In many cases, the operating system could continue operation after memory violations have occurred. However, the system is then in an unstable state, and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and to facilitate diagnosis of the error.
      A kernel panic may also occur as a result of a hardware failure or a bug in the operating system.
      This is similar to a Windows "Bug Check" (aka "Blue Screen of Death").
      IPVMS, CSA and FIOR are the Cisco kernel modules that may cause a kernel panic. You can try disabling them as a workaround.
  • 31. Netdump
      Use netdump to troubleshoot kernel panic issues.
      Netdump uses UDP port 6666.
      The dump contains information that indicates where the kernel panicked.
      Netdump uses a client-server model.
      Netdump does not work with NIC teaming enabled.
  • 32. Configuring Netdump
      Configure the netdump server:
      1. Log in to the server designated as the netdump server.
      2. Start the netdump server:
           utils netdump server start
      3. Enter the following command for each of the netdump client machines:
           utils netdump server add-client <Ip-Addr-of-netdump-client>
      4. Enter the following command to verify the status of the netdump server:
           utils netdump server status
      5. Use the following command to verify the clients on the list:
           utils netdump server list-clients
  • 33. Configuring Netdump
  • 34. Configuring Netdump
      Configure the netdump client:
      1. Log in to the server designated as the netdump client.
      2. Start the netdump client:
           utils netdump client start <Ip-Addr-of-netdump-server>
      3. Enter the following command to verify the status of the netdump client:
           utils netdump client status
  • 35. Configuring Netdump
      Verify that the client and server are communicating.
      After configuring the netdump server and netdump client, execute the following command on the netdump server:
        file list activelog crash/
      You should see a new sub-directory named with the client IP address and the date-timestamp of when the client started:
        admin:file list activelog crash/
        <dir> 14.48.60.80-2010-03-05-11:30
        <dir> magic
        <dir> scripts
        dir count = 3, file count = 0
        admin:
      A new sub-directory is created each time the netdump client is restarted.
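When several clients report to one netdump server, the sub-directory names can be split into client IP and start time to see which clients have checked in and when. This Python sketch assumes the `<ip>-YYYY-MM-DD-HH:MM` naming shown in the listing above; the function name is illustrative.

```python
import re
from datetime import datetime

# Directory name pattern assumed from the listing: <IPv4>-<YYYY-MM-DD-HH:MM>
DIR_RE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})-(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2})$"
)

LISTING = """\
<dir> 14.48.60.80-2010-03-05-11:30
<dir> magic
<dir> scripts
dir count = 3, file count = 0
"""


def parse_crash_dirs(listing):
    """Return (client_ip, start_time) for every netdump client sub-directory."""
    sessions = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] != "<dir>":
            continue  # not a directory entry
        match = DIR_RE.match(parts[1])
        if match:
            started = datetime.strptime(match.group("ts"), "%Y-%m-%d-%H:%M")
            sessions.append((match.group("ip"), started))
    return sessions


if __name__ == "__main__":
    for ip, started in parse_crash_dirs(LISTING):
        print(ip, started.isoformat())
```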
  • 36. Netdump: Example
      !!DO NOT TRY THIS IN A PRODUCTION ENVIRONMENT!!
      On the netdump client machine, trigger a kernel panic:
      The console displays:
  • 37. Netdump: Example
      The netdump diagnostic information is stored in a sub-directory under /var/crash on the netdump server:
      Contents of the log file:
  • 38. ASR: Automatic Server Recovery
      Applicable only to HP servers. Enabled by default.
      ASR is implemented via the HP ASM (Advanced System Management) driver.
      ASR uses a 10-minute countdown timer.
      During regular operation, the ASM driver frequently resets this timer to prevent it from counting down to zero.
      If the timer counts down to 0, it is assumed that the operating system is locked up, and the system automatically attempts to reboot.
      Collect the IML (Integrated Management Log) from the system using the following command:
        file view system-management-log
        ID   Severity  Initial Time       Update Time        Count
        -------------------------------------------------------------
        0000 Critical  20:44 04/02/2007   20:44 04/02/2007   0001
        LOG: ASR Detected by System ROM
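The countdown-and-reset behaviour described above is a standard watchdog pattern. The following Python sketch simulates it with a 600-second timer; it is an illustration of the concept only, not HP's implementation, and the class and method names are hypothetical.

```python
class WatchdogTimer:
    """Toy model of an ASR-style watchdog: expires unless reset in time."""

    def __init__(self, timeout_s=600):  # 10 minutes, as on the slide
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s
        self.expired = False

    def kick(self):
        """Called by a healthy OS driver to restart the countdown."""
        if not self.expired:
            self.remaining_s = self.timeout_s

    def tick(self, elapsed_s):
        """Advance time; returns True once the timer has reached zero."""
        if self.expired:
            return True
        self.remaining_s -= elapsed_s
        if self.remaining_s <= 0:
            self.remaining_s = 0
            self.expired = True  # real hardware would now reboot the server
        return self.expired


if __name__ == "__main__":
    wd = WatchdogTimer()
    for minute in range(1, 31):
        wd.tick(60)
        if minute <= 15:
            wd.kick()  # the driver stops resetting after minute 15 (simulated hang)
        if wd.expired:
            print(f"watchdog expired at minute {minute}: reboot would be attempted")
            break
```

In the simulation the last reset happens at minute 15, so the timer reaches zero ten minutes later, which mirrors why an ASR event in the IML points to an operating system that stopped responding some minutes before the reboot.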
  • 39. IMM: Integrated Management Module
      Newer IBM servers such as the 7835-I3 and the 7845-I3 include IBM’s IMM.
      The IMM has an OS Watchdog feature that is similar to HP’s ASR. This feature is disabled by default.
      Refer to CSCte05285, which tracks the enhancement request to include the server recovery functionality in the new IBM servers.
      You can access the IMM using its own Ethernet port (labelled ‘System Mgmt’).
  • 40. IMM: Integrated Management Module
  • 41. IMM: Integrated Management Module
  • 42. IMM: Integrated Management Module
  • 43. IMM: Integrated Management Module
  • 44. IMM: Integrated Management Module
  • 45. IMM: Integrated Management Module
      The IMM is set initially with a user name of USERID and a password of PASSW0RD (with a zero, not the letter O).
  • 46. IMM: Integrated Management Module
  • 47. IMM: Integrated Management Module
  • 48. IMM: Integrated Management Module
  • 49. IMM: Integrated Management Module
  • 50. IMM: Integrated Management Module
  • 51. IMM: Integrated Management Module
  • 52. HP vs. IBM
                                  HP                                   IBM
      Automated Recovery          Enabled in all HP servers by         Supported in newer IBM servers only
                                  default.                             [7835-I3 and 7845-I3] via IMM.
                                                                       Disabled by default.
                                  To view corresponding logs:          To view corresponding logs:
                                  ‘file view system-management-log’    <TBD>

      In-depth vendor             SmartStart CD (bootable)             DSA CD (bootable)
      diagnostics
      (requires downtime)

      High-level system           CLI commands:                        CLI commands:
      diagnostics (does             utils create report hardware         utils create report hardware
      not require downtime)         utils diagnose test                  utils diagnose test
                                    show hardware                        show hardware
                                    show environment                     show environment
  • 53. Case study-1
      TAC case: 611181361.
      Problem description: the customer created a TAC case to investigate the following alarm:
        04/06/2009 20:38:26.455 LPM|GenAlarm: AlarmName = CoreDumpFileFound, DeviceName = fm11d-bq50vcm1, AlarmMsg = CoreDumpFileFound
        TotalCoresFound : 1
        CoreDetails : The following lists up to 6 cores dumped by corresponding applications.
        Core1 : Unknown (core.3733.11.showtechCCMDB.s.1239075504)
        AppID : Cisco Log Partition Monitoring Tool
  • 54. Case study-1
      Backtrace:
        #0 0x080ba54c in glob_filename ()
        #1 0x080ba5a2 in glob_filename ()
        #2 0x080ba5a2 in glob_filename ()
        #3 0x080ba5a2 in glob_filename ()
        #4 0x080ba5a2 in glob_filename ()
        #5 0x080ba5a2 in glob_filename ()
        #6 0x080ba5a2 in glob_filename ()
        #7 0x080ba5a2 in glob_filename ()
        #8 0x080ba5a2 in glob_filename ()
        #9 0x080823b2 in shell_glob_filename ()
        #10 0x0807ed3d in expand_words_shellexp ()
        #11 0x0807f26c in expand_words_shellexp ()
        #12 0x0807ec19 in expand_words ()
        #13 0x08069766 in execute_command_internal ()
        #14 0x08066d9c in execute_command_internal ()
        #15 0x08094822 in parse_and_execute ()
        #16 0x0807b3b2 in command_substitute ()
        #17 0x0807e223 in pat_subst ()
        #18 0x08079700 in cond_expand_word ()
        #19 0x080797c1 in cond_expand_word ()
        #20 0x08079819 in expand_string_unsplit ()
        #21 0x08079478 in string_rest_of_args ()
        #22 0x08078f8c in strip_trailing_ifs_whitespace ()
        #23 0x08079029 in do_assignment ()
        #24 0x0807f2b4 in expand_words_shellexp ()
        #25 0x0807ec19 in expand_words ()
        #26 0x08069766 in execute_command_internal ()
        #27 0x08066d9c in execute_command_internal ()
        #28 0x08067f09 in execute_command_internal ()
        #29 0x08066fde in execute_command_internal ()
        #30 0x080668a4 in execute_command ()
        #31 0x08067ed2 in execute_command_internal ()
        #32 0x08066fde in execute_command_internal ()
        #33 0x080668a4 in execute_command ()
        #34 0x08067ed2 in execute_command_internal ()
        #35 0x08066fde in execute_command_internal ()
        #36 0x080668a4 in execute_command ()
        #37 0x08067ed2 in execute_command_internal ()
        #38 0x08066fde in execute_command_internal ()
        #39 0x080668a4 in execute_command ()
        #40 0x08068e94 in execute_command_internal ()
        #41 0x08066f6d in execute_command_internal ()
        #42 0x080668a4 in execute_command ()
        #43 0x0805c969 in reader_loop ()
        #44 0x0805ae9b in main ()
  • 55. Case study-1
      The backtrace contained strings such as ‘execute_command_internal’, ‘parse_and_execute’, and ‘expand_words_shellexp’.
      This most likely meant that the coredump was related to a CLI command.
      Next, the following traces were retrieved and analyzed:
        - Cisco CallManager Admin
        - IPT Platform CLI Logs
      The IPT Platform CLI logs revealed that “show tech locales” was the last CLI command executed just prior to the coredump.
      A Topic search did not yield any known bugs.
      An escalation was submitted to the Business Unit.
      CSCsz24566 was then filed. It was eventually resolved by the BU.
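The reasoning in this case study, recognising shell-interpreter function names in a backtrace, can be expressed as a small helper. This Python sketch counts function names in gdb-style frames and flags a backtrace that contains all three marker names quoted on the slide; the marker set and the function name `summarize_backtrace` are illustrative choices, and a match is a hint rather than proof.

```python
import re
from collections import Counter

# Matches gdb-style frames such as '#13 0x08069766 in execute_command_internal ()'
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+([\w:~]+)\s*\(")

# Function names the case study treated as signs of a shell/CLI command.
SHELL_MARKERS = {"execute_command_internal", "parse_and_execute", "expand_words_shellexp"}

# Excerpt of the backtrace from this case study.
EXCERPT = """\
#9 0x080823b2 in shell_glob_filename ()
#10 0x0807ed3d in expand_words_shellexp ()
#13 0x08069766 in execute_command_internal ()
#14 0x08066d9c in execute_command_internal ()
#15 0x08094822 in parse_and_execute ()
#44 0x0805ae9b in main ()
"""


def summarize_backtrace(text):
    """Return (function name counts, True if every shell marker is present)."""
    counts = Counter(match.group(2) for match in FRAME_RE.finditer(text))
    looks_like_shell = SHELL_MARKERS.issubset(counts)
    return counts, looks_like_shell


if __name__ == "__main__":
    counts, shell = summarize_backtrace(EXCERPT)
    print(counts.most_common(3), "shell-related:", shell)
```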
  • 56. Case study-2
      TAC case: 612476435.
      Problem description: the CallManager service coredumps approximately every two and a half days.
        admin:utils core active list
          Size   Date                  Core File Name
        =================================================================
                 2009-09-13 08:03:25   core.9800.6.ccm.1252843074
                 2009-09-15 15:58:52   core.2497.6.ccm.1253044183
                 2009-09-18 00:03:38   core.3564.6.ccm.1253245847
                 2009-09-20 08:00:16   core.6676.6.ccm.1253447596
                 2009-09-22 16:00:18   core.8282.6.ccm.1253649103
      Backtrace:
        #0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
        #1 0x00d64815 in raise () from /lib/tls/libc.so.6
        #2 0x00d66279 in abort () from /lib/tls/libc.so.6
        #3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
        #4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManager's timers appear incorrect. This may be due to CPU or blocked function. Attempting to restart CallManager.") at ProcessCMProcMon.cpp:106
        #5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
        #6 0x084c7035 in CMProcMon::callManagerMonitorThread (cmProcMon=0xec122d0) at ProcessCMProcMon.cpp:439
        #7 0x0107e5fb in ACE_OS_Thread_Adapter::invoke (this=0xf3ef3b8) at OS_Thread_Adapter.cpp:94
        #8 0x01040cbf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
        #9 0x002dc3cc in start_thread () from /lib/tls/libpthread.so.0
        #10 0x00e061ae in clone () from /lib/tls/libc.so.6
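The regularity of the cores can be confirmed from the file names alone. This Python sketch assumes the naming convention `core.<pid>.<signal>.<application>.<epoch seconds>`, which is inferred from the names in the listing and not stated in the deck; the function names are illustrative.

```python
import re

# Assumed convention: core.<pid>.<signal>.<application>.<epoch seconds>
CORE_RE = re.compile(r"^core\.(?P<pid>\d+)\.(?P<sig>\d+)\.(?P<app>.+)\.(?P<epoch>\d+)$")

CORE_NAMES = [
    "core.9800.6.ccm.1252843074",
    "core.2497.6.ccm.1253044183",
    "core.3564.6.ccm.1253245847",
    "core.6676.6.ccm.1253447596",
    "core.8282.6.ccm.1253649103",
]


def parse_core_name(name):
    """Split a core file name into its assumed fields."""
    match = CORE_RE.match(name)
    if not match:
        raise ValueError(f"unrecognized core file name: {name}")
    return {
        "pid": int(match.group("pid")),
        "signal": int(match.group("sig")),  # 6 is SIGABRT on Linux
        "app": match.group("app"),          # may itself contain dots
        "epoch": int(match.group("epoch")),
    }


def intervals_in_days(names):
    """Days elapsed between consecutive cores, ordered by embedded timestamp."""
    epochs = sorted(parse_core_name(n)["epoch"] for n in names)
    return [(later - earlier) / 86400 for earlier, later in zip(epochs, epochs[1:])]


if __name__ == "__main__":
    print([round(days, 2) for days in intervals_in_days(CORE_NAMES)])
```

Intervals that are nearly identical, as they are here, suggest a resource that grows at a steady rate until a limit is reached, which is consistent with the memory leak found later in this case study.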
  • 57. Case study-2
      The backtrace indicates that it is an intentional coredump.
      Hence, the next step is to review the Perfmon data to check for:
        • CPU utilization
        • Memory leaks
      The CPU utilization looks steady prior to the coredump.
  • 58. Case study-2
      The %VM Used counter appears to be high.
      The VMSize for CCM is high. Also note how the line slopes upwards, which signifies increasing memory usage over time.
      => The data points to a CCM memory leak.
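An upward slope can be quantified instead of judged by eye. This Python sketch fits a least-squares line to VMSize samples exported from the Perfmon logs; the sample values are invented for illustration, and what growth rate counts as a leak is a judgment call that this sketch does not make.

```python
def linear_slope(xs, ys):
    """Least-squares slope of ys against xs (for example, MB per hour)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical samples: hours since service start, CCM VMSize in MB.
    hours = [0, 6, 12, 18, 24]
    vmsize_mb = [1000, 1060, 1120, 1180, 1240]
    print(f"VMSize growth: {linear_slope(hours, vmsize_mb):.1f} MB/hour")
```

A slope that stays positive across the whole interval between two coredumps, while call volume follows its normal daily pattern, supports the memory-leak reading of the graph.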
  • 59. Case study-2
      An escalation was submitted to the Business Unit (BU).
      Software defect CSCtc70568 was filed on the BU's recommendation.
      High-level analysis of why CCM coredumped: due to the memory leak, an internal data structure became very large. When a new entry was subsequently added, the data structure had to be resized to accommodate the new element. The resize operation took a long time, and the CallManager service coredumped as a result.
      CSCtc70568 ended up being marked as a duplicate of CSCsx25778.
  • 60. Commonly Found Crash Defects
      CSCsv49493 – 7828-H3 goes down with journal aborted error
      CSCta73022 – 7835-I2/7845-I2 file system read-only mode journal aborted error
      CSCtb89163 – CER defect for above
      CSCtb79203 – 7845H server read only
      CSCte19556 – Core while deleting H323 Gateway part of RG
      CSCtd58872 – Cdcc to check the return value from getSideGivenCI prevent CCM core
      CSCte44391 – kpml message over 24 character causes ccm coredump
      CSCsl74589 – HardwareFailureAlert is raised due to iLO 2 Comm Error
      CSCsl01006 – CCM core when making call while updating pickup group
      CSCsk21012 – process core due to File size limit exceeded
  • 61. Q/A
      Questions?