How to walk away from your Outage looking like a HERO
1. How to walk away from your Outage looking like a HERO
Teresa Dietrich, Vice President Technology
Derek Chang, Director Site Reliability Engineering
2. Who we are and Why we are here….
Teresa Dietrich – VP of Technical Operations @ WebMD,
previously with AOL, @teresadg (Twitter),
www.teresadietrich.net
Derek Chang – Director of Site Reliability Engineering aka SRE
@WebMD, experience in Development, WebOps and CMS
www.derekchang.me
We are passionate about Outages, Process & Procedures, and always making new mistakes!!
3. About WebMD
• Most Recognized & Trusted Brand of Health Information
• Serves consumers, physicians, other healthcare professionals, employers and health
plans.
• 107 million visitors/month on both desktop and mobile platforms
• 2.5 billion page views/month
4. What is an Outage?
Service is unavailable to users or to a subset
of users
Service is unable to function as designed and
implemented
Degradation of service to the point that the resource is unusable (per defined SLAs)
5. Why do Outages happen?
Bugs in OS, middleware, and application
Hardware failure
Infrastructure failure (Network, SAN)
Environment failures (Power, Cooling)
Human Error
Demand exceeds capacity
Malicious attacks
6. How are Outages exacerbated?
Too long for monitoring to catch the issue
Monitoring does not catch the issue, humans eventually do
Too long to alert appropriate people of issue
Too long for people to respond to alerts
Too long to find the cause or source of the issue
Too long to resolve the issue
Lack of communication to internal and external customers
Multiple failure scenarios
7. A different way to do a Post Mortem
Focus on improving processes and systems for the future,
not on assigning responsibility for the outage.
Structure, structure, structure!
Discover, Analyze and Review
Analysis done by a third party engineer with DevOps
experience @ WebMD.
Data collected in a prescribed and orderly fashion, using a
template.
Recommendations for improvement owned, assigned and
tracked through resolution.
21. Incident 1 – root cause analysis
It was caused by known Oracle bug 5181800, which is specific to Oracle version 10.2.0.2.
About LNS: the LNS (log-write network-server) and ARCH (archiver) processes running on the primary database (PHX1) select archived redo
logs and send them to the standby database (IAD1), where the RFS (remote file server) background process within the Oracle
instance receives the archived redo logs originating from the primary.
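For readers less familiar with these Data Guard processes, here is a minimal sketch, not from the deck, of how one might spot-check that LNS/ARCH on the primary (PHX1) and RFS on the standby (IAD1) are healthy. It assumes the cx_Oracle driver and a monitoring account with access to v$managed_standby; the DSN is a placeholder.

```python
# Hypothetical spot-check of the Data Guard redo-shipping processes described
# above. Assumes the cx_Oracle driver and a monitoring account; the DSN below
# is a placeholder, not a real credential from the deck.
import cx_Oracle

MONITOR_DSN = "monitor/secret@PHX1"  # run against PHX1 for LNS/ARCH, IAD1 for RFS

def managed_standby_status(dsn):
    """Return (process, status, sequence#) rows from v$managed_standby."""
    conn = cx_Oracle.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT process, status, sequence# "
            "FROM v$managed_standby ORDER BY process"
        )
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for process, status, sequence in managed_standby_status(MONITOR_DSN):
        # A stuck LNS/ARCH (primary) or RFS (standby) shows up here long before
        # the standby falls noticeably behind.
        print("%-6s %-12s seq=%s" % (process, status, sequence))
```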
22. Incident 1 – review and recommendation
RR01 (Process)
Review: no ON clear was sent after the outage was cleared
Description: update 4 was the last communication
Recommendation: 1. better process for outage communication; 2. firstaid NMS - notification management system

RR03 (Monitoring detection)
Review: inadequate monitoring on oracle infrastructure
Description: Currently oracle relies on a home-grown script to monitor the oracle event queue and send email upon errors. The fact that the IAD1 RAC problem (which is the origin of the control file lock in PHX1) didn't catch our attention made the troubleshooting a more difficult and longer process.
Recommendation: We should look to the third party monitoring tool at hand (e.g. Zenoss) to monitor oracle components and implement oracle GRID control to provide additional monitoring.

RR04 (Monitor alert)
Review: inadequate monitoring on user experience
Description: no alert was sent before/during the outage from Gomez and Truesight
Recommendation: We should set up alerts from Gomez and Truesight.

RR05 (Development request)
Review: excessive errors in the application log make it extremely difficult to troubleshoot by log and in turn impact the recovery time
Description: 15000 errors on 1/25, 28000 errors on 1/26 and 10000 errors on 1/27 on a single tomcat server
Recommendation: 1. review current logging implementation; 2. log clean up; 3. operations should review logs and provide a report to engineering regularly (bi-weekly or monthly) (a log-review sketch follows this table)

RR06 (Ops request)
Review: potential log rotation problem on tomcat servers (Medscape www backend farm)
Description: several logs are only 1 kilobyte in size
Recommendation: review/correct log settings and rotation script.
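To make the regular log review in recommendation RR05 concrete, here is a rough sketch, not from the deck, of the per-day error tally behind numbers like "15000 errors on 1/25". The log path, date format and ERROR marker are assumptions for illustration and would need to match the real logging pattern in use.

```python
# Rough sketch of a per-day error tally for a Tomcat application log.
# The log path and the ERROR marker/date format are assumptions for
# illustration; adjust them to the real logging pattern in use.
import re
from collections import Counter

LOG_FILE = "/var/log/tomcat/application.log"              # hypothetical path
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}).*\bERROR\b")  # e.g. "2012-01-25 ... ERROR ..."

def errors_per_day(log_path):
    """Count ERROR lines per calendar day."""
    counts = Counter()
    with open(log_path, errors="replace") as handle:
        for line in handle:
            match = LINE_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for day, count in sorted(errors_per_day(LOG_FILE).items()):
        print("%s: %d errors" % (day, count))
```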
30. Incident 2 – Resolution rollout
• Research: Further research revealed that JSP compilation metadata is only stored in the JVM when the Tomcat
Jasper engine runs in development mode
• Potential business impact: Teams agreed on the solution of turning off development mode, under the assumption
that there is no business impact – PJSP updates will still function properly
• POC: A brief POC test showed that non-development mode does reduce the memory footprint (memory usage dropped
from 196.2MB to 61.3MB and total objects in memory dropped from 2.6m to 876k) and that all PJSP updates are
recompiled and ready to serve within moments
• Deployment: The Zenoss JMX chart showed memory dropping back close to initial consumption (0.2-0.3GB)
after each GC cycle, whereas with development mode the memory inflated to 1GB within a couple of days, GC could
not reclaim memory space, and Tomcat needed to be restarted (a sketch of this check follows this list)
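The deck makes this comparison from the Zenoss JMX chart; below is a rough sketch, not from the deck, of answering the same "does memory fall back after each GC?" question from raw data. It assumes a hypothetical CSV export of timestamped old-gen heap samples.

```python
# Sketch: given timestamped old-gen heap samples (for example, exported from a
# JMX poller into CSV), report the lowest usage seen each day. A post-GC
# "floor" that keeps climbing is the leak pattern seen with development mode
# on; a flat floor matches the behaviour after it was turned off.
# The CSV layout (timestamp,used_bytes) is an assumption for illustration.
import csv
from collections import defaultdict

def daily_heap_floor(csv_path):
    """Return the minimum old-gen usage (in GB) observed per day."""
    floors = defaultdict(lambda: float("inf"))
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):      # expects columns: timestamp, used_bytes
            day = row["timestamp"][:10]         # "2012-02-03T14:00:00" -> "2012-02-03"
            used_gb = int(row["used_bytes"]) / (1024 ** 3)
            floors[day] = min(floors[day], used_gb)
    return dict(floors)

if __name__ == "__main__":
    for day, floor in sorted(daily_heap_floor("oldgen_samples.csv").items()):
        print("%s: lowest old-gen usage ~%.2f GB" % (day, floor))
```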
31. Incident 2 – Resolution rollout
Fix verification: The fix was applied to the whole farm in production. Since then, the results have been good: no more restarts due
to running out of memory space, and view-article performance is more than 30% better in Truesight (avg. 109.5ms compared to
155.9ms before)
33. Change people’s reaction to “Post Mortem”
Removing the emotion and blame from the Post
Mortem process helps minimize the dread and lack of
participation.
Standard procedures and templates shape people’s
expectations and perceptions of the Post Mortem
process.
With the lead engineer of the investigation having no
day-to-day responsibility for the product in
question, we greatly reduce the defensiveness
and political posturing of those involved.
34. Ensure the lessons are learned
Publishing the results first to the teams involved and then to the
entire technology organization helps with education, openness
about the process and accountability for the recommended
changes.
Take the recommendations, once agreed and approved, and turn
them into actionable items: Dev Change Requests, Ops Tickets,
Process Updates and Communication, Monitoring Changes.
A single person should own turning the recommendations into action
items and take responsibility for seeing them through to completion. Don't
let them fall by the wayside. During the next outage, try to
highlight how the previous lessons improved the response; do
your own PR for your process.
35. Questions
Time permitting
OR
Office hours
Tuesday June 26 @ 1pm
36. Appendix - Investigation Procedures
1. Collect background information
– Scope of impact
– Information about the product(s) impacted
– Interview personnel involved
2. Initial interpretation
– Type of incident – outage, service degradation
– Expectation from senior management
– Depth and scope of investigation
– Resource planning
37. Appendix - Investigation Procedures
3. In-depth analysis
– Timeline analysis (see the sketch after this slide)
– Change analysis
– Log analysis
– Monitoring data correlation
4. Research
– Vendor documentation and white paper
– Architecture review
– Code review and application profiling
– Infrastructure review
5. Resolution and recommendation
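As a rough illustration of the timeline-analysis and monitoring-data-correlation steps above (not from the deck; the event sources and sample entries are hypothetical stand-ins), a sketch that merges change records, alerts and log errors into one ordered timeline:

```python
# Sketch of the timeline-analysis step: merge change records, alerts and log
# events into one chronologically ordered list, so "what changed right before
# the symptom started?" can be answered at a glance. The sample entries and
# their sources are hypothetical stand-ins.
from datetime import datetime

changes = [(datetime(2012, 1, 25, 9, 30), "change", "deployed new build to www farm")]
alerts = [(datetime(2012, 1, 25, 11, 5), "alert", "Gomez: page latency above SLA")]
log_events = [(datetime(2012, 1, 25, 10, 50), "log", "first spike of Tomcat ERROR lines")]

def build_timeline(*sources):
    """Flatten several (timestamp, kind, description) lists into one sorted timeline."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])

if __name__ == "__main__":
    for when, kind, description in build_timeline(changes, alerts, log_events):
        print("%s  [%-6s] %s" % (when.strftime("%Y-%m-%d %H:%M"), kind, description))
```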
Editor's Notes
Request investigation: initiated by business or senior management
Planning – initial interpretation (Derek); scope/depth of investigation and resources
Communicate – to stakeholders and potential contributors (investigation can be expensive), coordination
Data – discover, analyze, review
Recommendation – dev request, ops request, process improvement
Communicate – investigation results
Implement changes – monitoring improvement
Quantitative approach/analysis
Know existing systems, products and infrastructure
Data sources
Interviews (listen to all parties) – advantage and disadvantage: we don't know the product. Be just.
Timeline/events: build deployments, changes and maintenance
Architecture and vendor documentation
Server/app log analysis
Code and config review – application profiling
Data correlation
Recurring cycles of discover, analyze and review
Gradual memory degradation: 5 weeks, 4 weeks, 2 weeks. Restarting every 2 weeks is where we stand when SRE is engaged.
Memory consumption (old gen) quickly built up within 48 hours after restart.
Overall performance (host latency) improved by 30-40%.
Figure 1 – JVM memory consumption trend – JMX (Java Management Extensions) export to Zenoss.
80% of the JVM heap was occupied by JspServlet compilation hash maps.
Jasper stores JSP compilation metadata for developers to review when an error occurs.
Jasper checks for JSP timestamp updates on EVERY page request – easier/faster for developers to verify JSP changes.
Because of 1, the metadata cannot be GC'ed.
Here I am going to share the approach and the procedures my team takes during an investigation.
Certainly the first step is to collect information to estimate the scope of impact and details about the products that are impacted. Sources of information are documentation and email communication and, most important of all, interviews with the people who are involved (listen to all parties before you set the direction and expectation).
Secondly, we classify the incident: whether it's an outage or a gradual service degradation (which can possibly turn into an outage). Then we meet with senior management about our findings and decide the scope and the depth of the investigation; we're not alone in the investigation. Investigation can be expensive.
Quantitative approach/analysis – choose the right tools: Splunk, expolog, dynatrace. Data collection, correlation and interpretation. Learn, research and review.