How to walk away from your Outage looking like a HERO




            Teresa Dietrich, Vice President Technology
            Derek Chang, Director Site Reliability Engineering
Who we are and why we are here
 Teresa Dietrich – VP of Technical Operations @ WebMD,
 previously with AOL, @teresadg (Twitter),
 www.teresadietrich.net

 Derek Chang – Director of Site Reliability Engineering aka SRE
 @WebMD, experience in Development, WebOps and CMS
 www.derekchang.me

 We are passionate about Outages, Process & Procedures and
 Always making new mistakes!!




About WebMD




• Most Recognized & Trusted Brand of Health Information
• Serves consumers, physicians, other healthcare professionals, employers and health
  plans.
• 107 million visitors/month on both desktop and mobile platforms
• 2.5 billion page views/month

What is an Outage?
• Service is unavailable to all users or to a subset of users
• Service is unable to function as designed and implemented
• Service is degraded to the point that the resource is unusable (per defined SLAs)




Why do Outages happen?
 Bugs in OS, middleware, and application
 Hardware failure
 Infrastructure failure (Network, SAN)
 Environment failures (Power, Cooling)
 Human Error
 Demand exceeds capacity
 Malicious attacks


How are Outages exacerbated?
  Too long for monitoring to catch the issue
  Monitoring does not catch the issue, humans eventually do
  Too long to alert appropriate people of issue
  Too long for people to respond to alerts
  Too long to find the cause or source of the issue
  Too long to resolve the issue
  Lack of communication to Internal and External customers
  Multiple failure scenario
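
Most of the items above are delays between phases of an incident, so they can be measured directly from the incident timeline. The sketch below is one way to do that, assuming a hypothetical timeline with six markers (started, detected, alerted, acknowledged, diagnosed, resolved). The marker names and the sample timestamps are made up for illustration and do not come from the WebMD template.

```python
from datetime import datetime

# Hypothetical phase markers, in the order they occur during an incident.
PHASES = ["started", "detected", "alerted", "acknowledged", "diagnosed", "resolved"]

def phase_durations(timeline):
    """Return minutes spent between each pair of consecutive phase markers."""
    durations = {}
    for earlier, later in zip(PHASES, PHASES[1:]):
        delta = timeline[later] - timeline[earlier]
        durations[f"{earlier}->{later}"] = delta.total_seconds() / 60
    return durations

def slowest_phase(timeline):
    """Name the transition that consumed the most time."""
    durations = phase_durations(timeline)
    return max(durations, key=durations.get)

# Illustrative timestamps only (not real incident data).
example_timeline = {
    "started": datetime(2012, 1, 1, 10, 0),
    "detected": datetime(2012, 1, 1, 10, 20),
    "alerted": datetime(2012, 1, 1, 10, 25),
    "acknowledged": datetime(2012, 1, 1, 10, 40),
    "diagnosed": datetime(2012, 1, 1, 11, 40),
    "resolved": datetime(2012, 1, 1, 11, 55),
}
```

Calling slowest_phase(example_timeline) points at the transition where the most time was lost, which is usually the first place to look for a process or monitoring improvement.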




A different way to do a Post Mortem
  • Focus on improving processes and systems for the future, not on assigning
    responsibility for the outage.
  • Structure, structure, structure!
  • Discover, Analyze and Review.
  • Analysis is done by a third-party engineer with DevOps experience @ WebMD.
  • Data is collected in a prescribed and orderly fashion, using a template.
  • Recommendations for improvement are owned, assigned and tracked through
    to resolution.



Incident Analysis Template 1




  You can download the template @ www.teresadietrich.net



Incident Analysis Template 2




Incident 1 – background info
Incident 1 – outage resolution
Incident 1 – timeline analysis
Incident 1 – recent application builds, changes and maintenance
Incident 1 – log analysis
Incident 1 – monitoring correlation

Incident 1 – root cause analysis
The incident was caused by known Oracle bug 5181800, which specifically affects Oracle version 10.2.0.2.

About LNS: The LNS (log-write network-server) and ARCH (archiver) processes running on the primary database (PHX1) select archived redo
logs and send them to the standby database (IAD1), where the RFS (remote file server) background process within the Oracle
instance receives them.

Incident 1 – review and recommendation
RR01 (Type: Process)
  Review:         No ON clear was sent after the outage was cleared.
  Description:    Outage update 4 was the last communication.
  Recommendation: 1. Better process for outage communication.
                  2. firstaid NMS - notification management system.

RR03 (Type: Monitoring detection)
  Review:         Inadequate monitoring on Oracle infrastructure.
  Description:    Currently Oracle relies on a home-grown script to monitor the Oracle event queue and
                  send email upon errors. The fact that the IAD1 RAC problem (which is the origin of the
                  control file lock in PHX1) didn't catch our attention made troubleshooting a more
                  difficult and longer process.
  Recommendation: We should look to a third-party monitoring tool at hand (e.g. Zenoss) to monitor Oracle
                  components, and implement Oracle GRID control to provide additional monitoring.

RR04 (Type: Monitor alert)
  Review:         Inadequate monitoring on user experience.
  Description:    No alert was sent before/during the outage from Gomez and Truesight.
  Recommendation: We should set up alerts from Gomez and Truesight.

RR05 (Type: Development request)
  Review:         Excessive errors in the application log make it extremely difficult to troubleshoot by
                  log and in turn impact the recovery time.
  Description:    15000 errors on 1/25, 28000 errors on 1/26 and 10000 errors on 1/27 on a single
                  Tomcat server.
  Recommendation: 1. Review current logging implementation.
                  2. Log clean up.
                  3. Operations should review logs and provide a report with engineering regularly
                     (bi-weekly or monthly).

RR06 (Type: Ops request)
  Review:         Potential log rotation problem on Tomcat server (Medscape www backend farm).
  Description:    Several logs are only 1 kilobyte in size.
  Recommendation: Review/correct log setting and rotation script.
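
Findings like RR05 and RR06 can be checked mechanically rather than by eye. The following is a minimal sketch, assuming a hypothetical log format in which each line starts with a YYYY-MM-DD date and marks errors with the word ERROR, and a hypothetical 1 KB threshold for suspiciously small rotated logs. The sample lines, file names and sizes are invented for illustration and do not reflect the actual Tomcat setup.

```python
from collections import Counter

def errors_per_day(lines):
    """Count ERROR lines per day; assumes each line starts with YYYY-MM-DD."""
    counts = Counter()
    for line in lines:
        if " ERROR " in line:
            counts[line[:10]] += 1
    return dict(counts)

def suspiciously_small(log_sizes, min_bytes=1024):
    """Return names of logs at or below min_bytes, a hint that rotation may be broken."""
    return sorted(name for name, size in log_sizes.items() if size <= min_bytes)

# Invented sample data, for illustration only.
sample_lines = [
    "2012-01-25 10:00:01 ERROR NullPointerException while rendering page",
    "2012-01-25 10:00:02 INFO request served",
    "2012-01-26 09:15:00 ERROR timeout contacting backend",
    "2012-01-26 09:15:03 ERROR timeout contacting backend",
]
sample_sizes = {
    "catalina.1.log": 1024,
    "catalina.2.log": 52_428_800,
    "catalina.3.log": 900,
}
```

A daily run of errors_per_day over each server's log would give operations the trend data for the regular review that RR05 recommends.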




Investigation Procedures

Incident 2 – background information
Incident 2 – Timeline analysis and application profiling
Incident 2 – root cause
Incident 2 – resolution

Incident 2 – Resolution rollout
•   Research: Further research revealed that JSP compilation metadata is stored in the JVM only when the Tomcat
    Jasper engine runs in development mode.
•   Potential business impact: The teams agreed to turn off development mode, on the assumption
    that there is no business impact: PJSP updates will still function properly.
•   POC: A brief POC test showed that non-development mode does reduce the memory footprint (memory usage dropped
    from 196.2 MB to 61.3 MB and total objects in memory dropped from 2.6M to 876K), and all PJSP updates are
    recompiled and ready to serve within a short time.
•   Deployment: The Zenoss JMX chart showed memory dropping back close to initial consumption (0.2-0.3 GB)
    after each GC cycle. With development mode on, memory inflated to 1 GB within a couple of days, GC could
    not reclaim the space, and Tomcat had to be restarted.
•   Fix verification: The fix was applied to the whole production farm. Since then the result has been good: no more
    restarts due to running out of memory, and average view-article response time in Truesight is about 30% lower
    (avg. 109.5 ms compared to 155.9 ms before).
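
The improvement figures quoted for Incident 2 can be sanity-checked with simple arithmetic. This short sketch uses only the numbers stated on the slides; the helper name percent_reduction is our own.

```python
def percent_reduction(before, after):
    """Percentage by which a value dropped from before to after."""
    return (before - after) / before * 100

memory_drop = percent_reduction(196.2, 61.3)          # memory usage in MB, from the POC
object_drop = percent_reduction(2_600_000, 876_000)   # objects in memory, from the POC
latency_drop = percent_reduction(155.9, 109.5)        # avg view-article time in ms

print(f"memory: {memory_drop:.1f}%  objects: {object_drop:.1f}%  latency: {latency_drop:.1f}%")
```

Memory usage and object count each fall by roughly two thirds, and the average response time falls by just under 30%.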




Incident 2 – review and recommendation




Change people’s reaction to “Post Mortem”

  • Removing the emotion and blame from the Post Mortem process helps minimize
    the dread and lack of participation.
  • Standard procedures and templates shape people’s expectations and
    perceptions of the Post Mortem process.
  • Because the lead engineer of the investigation has no day-to-day
    responsibility for the product in question, we can greatly reduce the
    defensiveness and political stances of those involved.


Ensure the lessons are learned
  • Publishing the results first to the teams involved and then to the entire
    technology organization helps with education, openness about the process
    and accountability for the recommended changes.
  • Take the recommendations, once agreed and approved, and turn them into
    actionable items: Dev Change Requests, Ops Tickets, Process Updates and
    Communication, Monitoring Changes.
  • A single person should own turning the recommendations into action items
    and be responsible for seeing them through to completion. Don’t let them
    fall by the wayside.
  • During the next outage, try to highlight how the previous lessons improved
    the response; do your own PR for your process.
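
Tracking recommendations through to completion is easier when each one is a record with an owner and a status. Below is a minimal sketch of such a tracker; the field names, the owner names and the done flag are assumptions for illustration, while the IDs and summaries are borrowed from the Incident 1 review.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    rec_id: str        # e.g. "RR04"
    action_type: str   # e.g. "Ops Ticket", "Monitoring Change"
    summary: str
    owner: str = ""    # person responsible for seeing it through (hypothetical names below)
    done: bool = False

def needs_attention(recommendations):
    """Return IDs of items that are unowned or still open, sorted for reporting."""
    return sorted(r.rec_id for r in recommendations if not r.owner or not r.done)

items = [
    Recommendation("RR04", "Monitoring Change", "Set up alerts from Gomez and Truesight",
                   owner="alice", done=True),
    Recommendation("RR05", "Dev Change Request", "Review current logging implementation",
                   owner="bob"),
    Recommendation("RR06", "Ops Ticket", "Review/correct log setting and rotation script"),
]
```

Here needs_attention(items) flags RR05 (owned but open) and RR06 (no owner yet), which is the list the single accountable person would chase.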




Questions
               Time permitting

                     OR

                 Office hours

            Tuesday June 26 @ 1pm


Appendix - Investigation Procedures
1.   Collect background information
     – Scope of impact
     – Information about the product(s) impacted
     – Interview personnel involved
2.   Initial interpretation
     – Type of incident – outage, service degradation
     – Expectation from senior management
     – Depth and scope of investigation
     – Resource planning




3.   In-depth analysis
     – Timeline analysis
     – Change analysis
     – Log analysis
     – Monitoring data correlation
4.   Research
     – Vendor documentation and white paper
     – Architecture review
     – Code review and application profiling
     – Infrastructure review
5.   Resolution and recommendation
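
The change analysis step in the procedure above amounts to lining up recent changes against the start of the incident. This is a minimal sketch under stated assumptions: the change log is a list of (timestamp, description) pairs, the 72-hour look-back window is arbitrary, and all sample entries are invented.

```python
from datetime import datetime, timedelta

def changes_before(incident_start, changes, window_hours=72):
    """Return descriptions of changes made within window_hours before the incident, newest first."""
    window_start = incident_start - timedelta(hours=window_hours)
    recent = [(when, what) for when, what in changes
              if window_start <= when <= incident_start]
    return [what for when, what in sorted(recent, reverse=True)]

# Invented sample data, for illustration only.
incident_start = datetime(2012, 1, 26, 14, 0)
change_log = [
    (datetime(2012, 1, 20, 9, 0), "kernel patch on db hosts"),
    (datetime(2012, 1, 25, 22, 0), "application build deployed"),
    (datetime(2012, 1, 26, 13, 30), "load balancer config change"),
    (datetime(2012, 1, 27, 8, 0), "post-incident hotfix"),
]
```

Listing the newest change first puts the most likely suspects at the top, while changes outside the window or after the incident start are left out.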


Mais conteúdo relacionado

Destaque

Post mortem report
Post mortem reportPost mortem report
Post mortem reportKuaci Pedas
 
DevOps Incident Handling - Making friends not enemies.
DevOps Incident Handling - Making friends not enemies.DevOps Incident Handling - Making friends not enemies.
DevOps Incident Handling - Making friends not enemies.Server Density
 
HSSE Management - JMT M99 Incident Analysis
HSSE Management - JMT M99 Incident AnalysisHSSE Management - JMT M99 Incident Analysis
HSSE Management - JMT M99 Incident AnalysisYang Ming
 
Critical incident analysis of knm
Critical incident analysis of knmCritical incident analysis of knm
Critical incident analysis of knmSoe Lu Kyaw
 
Project post-mortem analysis
Project post-mortem analysisProject post-mortem analysis
Project post-mortem analysisJaiveer Singh
 
Colorado Cyber TTX attack AAR After Action Report ESF 18
Colorado Cyber TTX attack AAR After Action Report   ESF 18Colorado Cyber TTX attack AAR After Action Report   ESF 18
Colorado Cyber TTX attack AAR After Action Report ESF 18David Sweigert
 
ExCeed Community Economic And Entrepreneurial Development
ExCeed Community Economic And Entrepreneurial DevelopmentExCeed Community Economic And Entrepreneurial Development
ExCeed Community Economic And Entrepreneurial DevelopmentCommunity Development Society
 
Vulnerability Management: What You Need to Know to Prioritize Risk
Vulnerability Management: What You Need to Know to Prioritize RiskVulnerability Management: What You Need to Know to Prioritize Risk
Vulnerability Management: What You Need to Know to Prioritize RiskAlienVault
 
Module 3: Incident Analysis as part of the Incident Management Continuum
Module 3: Incident Analysis as part of the Incident Management ContinuumModule 3: Incident Analysis as part of the Incident Management Continuum
Module 3: Incident Analysis as part of the Incident Management ContinuumCanadian Patient Safety Institute
 
Responsible use of ict brief project report - feb 2011
Responsible use of ict   brief project report - feb 2011Responsible use of ict   brief project report - feb 2011
Responsible use of ict brief project report - feb 2011Mel Tan
 
Implementing Vulnerability Management
Implementing Vulnerability Management Implementing Vulnerability Management
Implementing Vulnerability Management Argyle Executive Forum
 
Emergency preparedness plan for mines
Emergency preparedness plan for minesEmergency preparedness plan for mines
Emergency preparedness plan for minesKrishna Deo Prasad
 
Sap tech ed_Delivering Continuous SAP Solution Availability
Sap tech ed_Delivering Continuous SAP Solution Availability Sap tech ed_Delivering Continuous SAP Solution Availability
Sap tech ed_Delivering Continuous SAP Solution Availability Robert Max
 
Incident investigation and Root Cause Analysis
Incident investigation and Root Cause AnalysisIncident investigation and Root Cause Analysis
Incident investigation and Root Cause AnalysisHeatherawarens
 
International Business Management, Disney Land Case
International Business Management, Disney Land CaseInternational Business Management, Disney Land Case
International Business Management, Disney Land CaseSoe Lu Kyaw
 

Destaque (20)

Post mortem report
Post mortem reportPost mortem report
Post mortem report
 
DevOps Incident Handling - Making friends not enemies.
DevOps Incident Handling - Making friends not enemies.DevOps Incident Handling - Making friends not enemies.
DevOps Incident Handling - Making friends not enemies.
 
HSSE Management - JMT M99 Incident Analysis
HSSE Management - JMT M99 Incident AnalysisHSSE Management - JMT M99 Incident Analysis
HSSE Management - JMT M99 Incident Analysis
 
Critical incident analysis of knm
Critical incident analysis of knmCritical incident analysis of knm
Critical incident analysis of knm
 
Project post-mortem analysis
Project post-mortem analysisProject post-mortem analysis
Project post-mortem analysis
 
Colorado Cyber TTX attack AAR After Action Report ESF 18
Colorado Cyber TTX attack AAR After Action Report   ESF 18Colorado Cyber TTX attack AAR After Action Report   ESF 18
Colorado Cyber TTX attack AAR After Action Report ESF 18
 
ExCeed Community Economic And Entrepreneurial Development
ExCeed Community Economic And Entrepreneurial DevelopmentExCeed Community Economic And Entrepreneurial Development
ExCeed Community Economic And Entrepreneurial Development
 
The Importance Of After Action Reports
The Importance Of After Action ReportsThe Importance Of After Action Reports
The Importance Of After Action Reports
 
Knowledge Management: leveraging NGO Resources
Knowledge Management: leveraging NGO Resources Knowledge Management: leveraging NGO Resources
Knowledge Management: leveraging NGO Resources
 
Tables for april 2015 release
Tables for april 2015 releaseTables for april 2015 release
Tables for april 2015 release
 
Vulnerability Management: What You Need to Know to Prioritize Risk
Vulnerability Management: What You Need to Know to Prioritize RiskVulnerability Management: What You Need to Know to Prioritize Risk
Vulnerability Management: What You Need to Know to Prioritize Risk
 
Module 3: Incident Analysis as part of the Incident Management Continuum
Module 3: Incident Analysis as part of the Incident Management ContinuumModule 3: Incident Analysis as part of the Incident Management Continuum
Module 3: Incident Analysis as part of the Incident Management Continuum
 
Vulnerability Management
Vulnerability ManagementVulnerability Management
Vulnerability Management
 
Responsible use of ict brief project report - feb 2011
Responsible use of ict   brief project report - feb 2011Responsible use of ict   brief project report - feb 2011
Responsible use of ict brief project report - feb 2011
 
Implementing Vulnerability Management
Implementing Vulnerability Management Implementing Vulnerability Management
Implementing Vulnerability Management
 
Emergency preparedness plan for mines
Emergency preparedness plan for minesEmergency preparedness plan for mines
Emergency preparedness plan for mines
 
IT Service Desk Software RFP Template
IT Service Desk Software RFP TemplateIT Service Desk Software RFP Template
IT Service Desk Software RFP Template
 
Sap tech ed_Delivering Continuous SAP Solution Availability
Sap tech ed_Delivering Continuous SAP Solution Availability Sap tech ed_Delivering Continuous SAP Solution Availability
Sap tech ed_Delivering Continuous SAP Solution Availability
 
Incident investigation and Root Cause Analysis
Incident investigation and Root Cause AnalysisIncident investigation and Root Cause Analysis
Incident investigation and Root Cause Analysis
 
International Business Management, Disney Land Case
International Business Management, Disney Land CaseInternational Business Management, Disney Land Case
International Business Management, Disney Land Case
 

Semelhante a How to walk away from your Outage looking like a HERO

Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environmentsDocker, Inc.
 
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...South Tyrol Free Software Conference
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?Jagadish Venkatraman
 
Quantstamp Report - LINKSWAP
Quantstamp Report - LINKSWAPQuantstamp Report - LINKSWAP
Quantstamp Report - LINKSWAPRoy Blackstone
 
Dyna Trace Whitepaper Performance
Dyna Trace Whitepaper PerformanceDyna Trace Whitepaper Performance
Dyna Trace Whitepaper Performancegopi1985
 
15 Troubleshooting tips and Tricks for Database 21c - KSAOUG
15 Troubleshooting tips and Tricks for Database 21c - KSAOUG15 Troubleshooting tips and Tricks for Database 21c - KSAOUG
15 Troubleshooting tips and Tricks for Database 21c - KSAOUGSandesh Rao
 
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUGSandesh Rao
 
Safe Peak Technical Ppt W Product Publish
Safe Peak Technical Ppt W Product   PublishSafe Peak Technical Ppt W Product   Publish
Safe Peak Technical Ppt W Product Publishsqlserver.co.il
 
Monitoring Clusters and Load Balancers
Monitoring Clusters and Load BalancersMonitoring Clusters and Load Balancers
Monitoring Clusters and Load BalancersPrince JabaKumar
 
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011IndicThreads
 
PreMonR - A Reactive Platform To Monitor Reactive Application
PreMonR - A Reactive Platform To Monitor Reactive ApplicationPreMonR - A Reactive Platform To Monitor Reactive Application
PreMonR - A Reactive Platform To Monitor Reactive ApplicationKnoldus Inc.
 
Performance tuningtoolkitintroduction
Performance tuningtoolkitintroductionPerformance tuningtoolkitintroduction
Performance tuningtoolkitintroductionRohit Kelapure
 
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic AnalyticsSAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic AnalyticsQin Liu
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wilddatamantra
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportQAware GmbH
 
2 20613 qualys_top_10_reports_vm
2 20613 qualys_top_10_reports_vm2 20613 qualys_top_10_reports_vm
2 20613 qualys_top_10_reports_vmazfayel
 

Semelhante a How to walk away from your Outage looking like a HERO (20)

Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
 
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
Quantstamp Report - LINKSWAP
Quantstamp Report - LINKSWAPQuantstamp Report - LINKSWAP
Quantstamp Report - LINKSWAP
 
Dyna Trace Whitepaper Performance
Dyna Trace Whitepaper PerformanceDyna Trace Whitepaper Performance
Dyna Trace Whitepaper Performance
 
15 Troubleshooting tips and Tricks for Database 21c - KSAOUG
15 Troubleshooting tips and Tricks for Database 21c - KSAOUG15 Troubleshooting tips and Tricks for Database 21c - KSAOUG
15 Troubleshooting tips and Tricks for Database 21c - KSAOUG
 
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG
15 Troubleshooting Tips and Tricks for database 21c - OGBEMEA KSAOUG
 
Safe Peak Technical Ppt W Product Publish
Safe Peak Technical Ppt W Product   PublishSafe Peak Technical Ppt W Product   Publish
Safe Peak Technical Ppt W Product Publish
 
Monitoring Clusters and Load Balancers
Monitoring Clusters and Load BalancersMonitoring Clusters and Load Balancers
Monitoring Clusters and Load Balancers
 
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
 
PacketsNeverLie
PacketsNeverLiePacketsNeverLie
PacketsNeverLie
 
PreMonR - A Reactive Platform To Monitor Reactive Application
PreMonR - A Reactive Platform To Monitor Reactive ApplicationPreMonR - A Reactive Platform To Monitor Reactive Application
PreMonR - A Reactive Platform To Monitor Reactive Application
 
Performance tuningtoolkitintroduction
Performance tuningtoolkitintroductionPerformance tuningtoolkitintroduction
Performance tuningtoolkitintroduction
 
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic AnalyticsSAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
Srs
SrsSrs
Srs
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
 
2 20613 qualys_top_10_reports_vm
2 20613 qualys_top_10_reports_vm2 20613 qualys_top_10_reports_vm
2 20613 qualys_top_10_reports_vm
 
Fault tolerance techniques
Fault tolerance techniquesFault tolerance techniques
Fault tolerance techniques
 

How to walk away from your Outage looking like a HERO

  • 1. How to walk away from your Outage looking like a HERO Teresa Dietrich, Vice President Technology Derek Chang, Director Site Reliability Engineering
  • 2. Who we are and Why we are here…. Teresa Dietrich – VP of Technical Operations @ WebMD, previously with AOL, @teresadg (Twitter), www.teresadietrich.net Derek Chang – Director of Site Reliability Engineering aka SRE @WebMD, experience in Development, WebOps and CMS www.derekchang.me We are passionate about Outages, Process & Procedures and Always making new mistakes!! 2
  • 3. About WebMD • Most Recognized & Trusted Brand of Health Information • Serves consumers, physicians, other healthcare professionals, employers and health plans. • 107 million visitors/month on both desktop and mobile platforms • 2.5 billion page views/month 3
  • 4. What is An Outage? Service is unavailable to users or to a subset of users Service is unable to function as designed and implemented Degradation of service to the point the resource is unusable (Defined SLAs) 4
  • 5. Why do Outages happen? Bugs in OS, middleware, and application Hardware failure Infrastructure failure (Network, SAN) Environment failures (Power, Cooling) Human Error Demand exceeds capacity Malicious attacks 5
  • 6. How are Outages exacerbated? Too long for monitoring to catch the issue Monitoring does not catch the issue, humans eventually do Too long to alert appropriate people of issue Too long for people to respond to alerts Too long to find the cause or source of the issue To long to resolve the issue Lack of communication to Internal and External customers Multiple failure scenario 6
  • 7. A different way to do a Post Mortem Focus on improving processes and systems for future, not assigning responsibility for the outage. Structure, structure, structure! Discover, Analyze and Review Analysis done by a third party engineer with DevOps experience @ WebMD. Data collected in a prescribed and orderly fashion, using a template. Recommendations for improvement owned, assigned and tracked through resolution. 7
  • 8. Incident Analysis Template 1 You can download the template @ www.teresadietrich.net 8
  • 9. Incident Analysis Template 2 You can download the template @ www.teresadietrich.net 9
  • 10. Incident 1 – background info 10
  • 11. Incident 1 – outage resolution 11
  • 12. Incident 1 – timeline analysis 12
  • 13. Incident 1 – timeline analysis 13
  • 14. Incident 1 – recent application builds, changes and maintenance 14
  • 15. Incident 1 – log analysis 15
  • 16. Incident 1 – log analysis 16
  • 17. Incident 1 – monitoring correlation 17
  • 18. Incident 1 – monitoring correlation 18
  • 19. Incident 1 – root cause analysis 19
  • 20. Incident 1 – root cause analysis 20
  • 21. Incident 1 – root cause analysis It's caused by a known Oracle bug 5181800 specifically on oracle version 10.2.0.2. About LNS: LNS (log-write network-server) and ARCH (archiver) processes running on the primary database select archived redo logs and send them to the standby database (IAD1) where the RFS (remote file server) background process within the Oracle instance performs the task of receiving archived redo-logs originating from the primary database (PHX1) 21
  • 22. Incident 1 – review and recommendation # Type Review Description Recommendation Process no ON clear was sent after outage update 4 was the last 1. Better process for outage communication RR01 outage is cleared communication 2. firstaid NMS - notification management system Monitoring Currently oracle relies on home-grown detection script to monitor oracle event queue and We should look to third party monitoring tool at hand send email upon errors. The fact that IAD1 inadequate monitoring on (e.g. Zenoss) to monitor oracle components and RR03 RAC problem (which is the origin of oracle infrastructure implement oracle GRID control to provide additional control file lock in PHX1) didn't catch our monitoring attention made the troubleshooting a more difficult and longer process. Monitor alert inadequate monitoring on no alert was sent before/during outage RR04 We should set up alert from Gomez and Truesight. user experience from Gomez and Truesight. Development excessive errors in the request application log make it 1. review current logging implementation 15000 errors on 1/25, 28000 errors on extremely difficult to 2. log clean up RR05 1/26 and 10000 errors on 1/27 on a single troubleshoot by log and in 3. operations should review log and provide report tomcat server turn impact the recovery with engineering regularly (bi-weekly or monthly) time Ops request potential log rotation problem on tomcat server RR06 several logs are only 1 kilobytes in size review/correct log setting and rotation script. (Medscape www backend farm) 22
  • 26. Incident 2 – background information 26
  • 27. Incident 2 – Timeline analysis and application profiling 27
  • 28. Incident 2 – root cause 28
  • 29. Incident 2 - resolution 29
  • 30. Incident 2 – Resolution rollout • Research: Further research revealed the Jsp compilation meta data are only stored in JVM when the Tomcat Jasper engine runs at development mode • Potential business impact: Teams agreed the solution to turn-off development mode under the assumption that there is no business impact – PJSP update will still function properly • POC: A brief POC test showed non-development mode does reduce memory footprint (memory usage dropped from 196.2Mb to 61.3Mb and total objects in memory dropped from 2.6m to 876k) and all PJSP updates are recompiled and ready to serve in a short moment. • Deployment: Zenoss JMX chart showed the memory dropped back close to initial consumption (0.2-0.3Gb) after each GC cycle while with development mode, the memory inflated to 1G in a couple days and GC could not reclaim memory space and tomcat needed to be restarted. 30
  • 31. Incident 2 – Resolution rollout
      Fix verification: The fix was applied to the whole production farm. Since then the results have been good: no more restarts due to running out of memory, and view-article performance is more than 30% better in Truesight (avg. 109.5 ms compared to 155.9 ms before).
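The before/after figures quoted for the POC and the fix verification can be turned into percentages in two ways, and the result depends on which baseline is used. The sketch below is only an illustration of that arithmetic; the helper names `percent_reduction` and `percent_speedup` are made up for this example.

```python
def percent_reduction(before, after):
    """Percentage drop relative to the 'before' value."""
    return (before - after) / before * 100.0


def percent_speedup(before, after):
    """Percentage gain in rate when a duration shrinks from 'before' to 'after'."""
    return (before / after - 1.0) * 100.0


if __name__ == "__main__":
    # Figures taken from the slides above.
    print(f"memory footprint reduction: {percent_reduction(196.2, 61.3):.1f}%")
    print(f"object count reduction:     {percent_reduction(2_600_000, 876_000):.1f}%")
    print(f"latency reduction:          {percent_reduction(155.9, 109.5):.1f}%")
    print(f"equivalent speedup:         {percent_speedup(155.9, 109.5):.1f}%")
```

Measured as a drop in average latency, 155.9 ms to 109.5 ms is roughly 29.8%; measured as a speedup, it is roughly 42.4%. Stating which of the two is meant avoids confusion when the result is published.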
  • 32. Incident 2 – review and recommendation
  • 33. Change people’s reaction to “Post Mortem”
      Removing the emotion and blame from the Post Mortem process helps minimize the dread and lack of participation.
      Standard procedures and templates shape people’s expectations and perceptions of the Post Mortem process.
      When the lead engineer of the investigation has no day-to-day responsibility for the product in question, we can greatly reduce the defensiveness and political stances of those involved.
  • 34. Ensure the lessons are learned
      Publishing the results first to the teams involved and then to the entire technology organization helps with education, openness about the process and accountability for the changes recommended.
      Take the recommendations, once agreed and approved, and turn them into actionable items: Dev Change Requests, Ops Tickets, Process Update and Communication, Monitoring Change.
      A single person should own turning the recommendations into action items and be responsible for seeing them through to completion. Don’t let them fall by the wayside.
      During the next outage, try to highlight how the lessons from the previous one improved the response; do your own PR for your process.
  • 35. Questions
      Time permitting OR Office hours Tuesday June 26 @ 1pm
  • 36. Appendix - Investigation Procedures
      1. Collect background information
          – Scope of impact
          – Information about the product(s) impacted
          – Interview personnel involved
      2. Initial interpretation
          – Type of incident – outage, service degradation
          – Expectation from senior management
          – Depth and scope of investigation
          – Resource planning
  • 37. Appendix - Investigation Procedures
      3. In-depth analysis
          – Timeline analysis
          – Change analysis
          – Log analysis
          – Monitoring data correlation
      4. Research
          – Vendor documentation and white paper
          – Architecture review
          – Code review and application profiling
          – Infrastructure review
      5. Resolution and recommendation
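Timeline analysis and change analysis from the procedure above often come down to merging events from several sources into one ordered list and then looking at which changes landed shortly before the incident. The sketch below shows one possible way to do that; the event tuple shape, the `"change"` label, the function names and the 24-hour window are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def build_timeline(*sources):
    """Merge (timestamp, kind, description) events from several sources, oldest first."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


def changes_before(timeline, incident_start, window=timedelta(hours=24)):
    """Return 'change' events that fall within the window leading up to the incident."""
    earliest = incident_start - window
    return [
        event
        for event in timeline
        if event[1] == "change" and earliest <= event[0] <= incident_start
    ]
```

Deployments, maintenance records and monitoring alerts can each be passed as a separate source. The output is only a list of candidates to examine; a change that precedes an incident is not by itself the cause.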

Editor's Notes

  1. Request investigation: initiated by business or senior management
      Planning – initial interpretation (Derek); scope/depth of investigation and resources
      Communicate – to stakeholders and potential contributors (investigation can be expensive), coordination
      DATA – discover, analyze, review
      Recommendation – dev request, ops request, process improvement
      Communicate – investigation results
      Implement changes – monitoring improvement
  2. Quantitative approach/analysis
      Know existing systems: products and infrastructure
      Data sources
      Interviews (listen to all parties): advantage and disadvantage – we don’t know the product. Be just
      Timeline/events: build deployments, changes and maintenance
      Architecture and vendor documentation
      Server/app log analysis
      Code and config review – application profiling
      Data correlation
      Recurring cycles of discover, analyze and review
  3. Planning – outline the investigation (scope/depth)
  4. Gradual memory degradation: 5 weeks → 4 weeks → 2 weeks. A restart every 2 weeks is where we stood when SRE was engaged
      Memory consumption (old gen) quickly built up within 48 hours after restart
      Overall performance (host latency) improved by 30-40%
  5. Figure 1 – JVM memory consumption trend – JMX (Java Management Extensions) export to Zenoss
      80% of JVM heap occupied by JSP servlet compilation hash maps
  6. 1. Jasper stores JSP compilation metadata for developers to review when an error occurs
      2. Jasper checks for a JSP timestamp update on EVERY page request. Easier/faster for developers to verify JSP changes.
      3. Because of 1, the metadata cannot be GC’ed
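Memory that cannot be GC’ed shows up in a heap chart as a post-GC floor that keeps rising between restarts. One rough way to flag that pattern from exported samples is to fit a straight line to the post-GC readings and look at the slope. This is a sketch under stated assumptions: the `(hours_since_restart, heap_mb_after_gc)` sample shape, the function names and the 1 MB/hour threshold are illustrative, and nothing here is taken from the Zenoss setup described in the talk.

```python
def floor_slope(samples):
    """Least-squares slope (MB per hour) of post-GC heap readings.

    samples: list of (hours_since_restart, heap_mb_after_gc) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov_xy / var_x


def floor_is_rising(samples, threshold_mb_per_hour=1.0):
    """True when the post-GC floor grows faster than the (arbitrary) threshold."""
    return floor_slope(samples) > threshold_mb_per_hour
```

A floor that returns to roughly the same level after each GC cycle yields a slope near zero, while a heap that climbs steadily over 48 hours yields a clearly positive one. The threshold needs tuning per application before it is used for alerting.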
  7. Here I am going to share the approach and procedures my team follows during an investigation.
      The first step is to collect information to estimate the scope of impact and details about the products that are impacted. Sources of information are documentation, email communication and, most important of all, interviews with the people who were involved (listen to all parties before you set the direction and expectations).
      Secondly, we classify the incident: whether it is an outage or a gradual service degradation (which can possibly turn into an outage).
      Then we meet with senior management about our findings and decide the scope and depth of the investigation: we are not alone in the investigation, and an investigation can be expensive.
  8. Quantitative approach/analysis – choose the right tools: Splunk, XpoLog, Dynatrace
      Data collection, correlation and interpretation
      Learn, research and review