Normal accidents and outpatient surgeries

Resilience Engineering Done Right

Safety in a Complex and Changing Environment
"...so safety isn't about the absence of something...that
you need to count errors or monitor violations. But
the presence of something. But the presence of what?
When we need to find that things go right under difficult
circumstances, it's mostly because of people's adaptive
capability; their ability to recognize, adapt to, and absorb
changes and disruptions, some of which might fall outside
of what the system is designed or trained to handle"
-Sidney Dekker

RESILIENCE

Vocabulary Lesson
Continuous Integration: The ability to quickly make sure the
system is ready for production.

Resilience: The intrinsic ability of a system to adjust its
functioning prior to, during, or following changes and
disturbances in order to sustain required operations.

Maintainability: Characteristic of design and installation
which determines the probability that a failed piece of
equipment, machine, or system can be restored to its normal
state within a given timeframe.

The SYSTEM includes all the hardware and software, but also
all of the PEOPLE involved.

Maintainability = Uptime Goodness
MTTR vs. MTBF

MTTR vs. MTBF
Low MTTR > High MTBF
Low MTTR = Better Uptime for most types of failure (F)
Low MTTR Requires:
• more useful metrics
• intelligent data analysis
• pre-planned, purposeful resilience
• cooperation between application and infrastructure
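
The uptime claim above can be checked against the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). A minimal sketch follows; the function name and all numbers are made up for illustration and do not come from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical systems: one fails often but recovers in minutes,
# the other fails rarely but takes hours to restore.
fails_often_recovers_fast = availability(mtbf_hours=100, mttr_hours=0.1)
fails_rarely_recovers_slow = availability(mtbf_hours=1000, mttr_hours=4)

print(f"fails often, recovers fast:  {fails_often_recovers_fast:.4%}")
print(f"fails rarely, recovers slow: {fails_rarely_recovers_slow:.4%}")
```

With these illustrative numbers the system that fails ten times as often still has the better availability, because each failure costs so little downtime.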

Your Average Operations Engineer

Automation as a Default:
"One of the best ways to eliminate human problems is to
take the human out of the problem. Machines are very
good at doing things repeatedly and doing them the
same way every single time. Humans are not good at
this. Let the machines do it."
Rapid Recovery:
"Do we spend an unpredictable amount of time trying to
solve some obscure issue, or do we simply recreate the
instance providing the service from configuration
management"
blog.lusis.org/blog/2011/10/18/deploy-all-the-things/

PUPPET + KICKSTART + Network Automation
ESPER + HEALTHCHECK + NAGIOS + SPLUNK + OHSHIT

Comfortable Changes
1) Are Small
• Many Small Changes = Fewer Incidents with lower MTTR

2) Are Reproducible
RPM:
• Really Peaceful Mornings
• Reduce Paging Monitors
• Reusable Provisioning Methods

Rule #81: If you are logging into servers, you are doing it
wrong.

3) Are easily understood by your most junior team members

Rule #4: Keep it Simple, because you are smart. Do not
make it overly complex because you can.

4) Can be deployed to a subset of production systems
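
One way to deploy to a subset of production first is to pick a small, stable group of hosts that always receives changes ahead of the rest. The sketch below is only an illustration: the function name `canary_subset`, the host names, and the 10% fraction are assumptions, not anything prescribed by the talk.

```python
import hashlib


def canary_subset(hosts, fraction):
    """Pick a stable subset of hosts to receive a change first.

    Hosts are ranked by a hash of their name, so the same hosts are
    chosen on every run regardless of the order of the input list.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, round(len(hosts) * fraction))
    ranked = sorted(hosts, key=lambda h: hashlib.sha256(h.encode()).hexdigest())
    return ranked[:count]


# Hypothetical fleet of 20 front-end servers; change 10% of them first.
fleet = [f"web{i:02d}" for i in range(20)]
first_wave = canary_subset(fleet, 0.1)
print(first_wave)
```

If the first wave stays healthy, the same change goes to the remaining hosts; if not, only a small part of production was affected.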

5) Follow Process
Change control, deployment processes, and peer review all
matter for a world-class OPS organization.

6) Have been approved by a GO / NO-GO process with all
relevant parties checking in.
Ensure that all teams involved in a change have signed off,
including ON-CALL and CUSTOMER SERVICE
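
A GO / NO-GO gate can be reduced to a simple check that every required party has explicitly signed off. This is a minimal sketch; the team names in `REQUIRED_SIGNOFFS` and the function `go_no_go` are hypothetical placeholders for whatever your change process tracks.

```python
from __future__ import annotations

# Hypothetical list of parties that must approve a change.
REQUIRED_SIGNOFFS = frozenset({"ops", "on-call", "customer-service"})


def go_no_go(signoffs: dict[str, bool],
             required: frozenset[str] = REQUIRED_SIGNOFFS) -> tuple[bool, list[str]]:
    """Return (go, missing): go is True only if every required team said yes."""
    missing = sorted(team for team in required if not signoffs.get(team, False))
    return (not missing, missing)


# A team that has not answered counts the same as a team that said no.
decision, waiting_on = go_no_go({"ops": True, "on-call": False})
print("GO" if decision else f"NO-GO, waiting on: {', '.join(waiting_on)}")
```

Treating silence as NO-GO is a deliberate choice here: a change should not ship because someone forgot to ask on-call or customer service.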

Small Changes
John Allspaw presented these graphs
of data gathered at Etsy.
More, smaller deployments
mean
lower MTTR,
which means
fewer minutes of disruption.

Operations Meta-Metrics
When in doubt, COLLECT DATA. Build a Timeline!
Things to Monitor:
• Changes (who/what/when/type)
• Incidents (type/severity/duration)
• Responses to Incidents (TTD/TTR)
Things to Collect:
• IRC/Jabber logs
• Jira logs
Search your Data: Use HBASE + PIG/HIVE, ESPER, SOLR, and SPLUNK.
Store everything, even stuff you don't yet know how to use.
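
Building a timeline mostly means merging records from different sources into one time-ordered view. The sketch below assumes each source can be reduced to a timestamp, a source label, and a free-text detail; the `Event` type, the labels, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    when: datetime
    source: str  # e.g. "change", "incident", "chat" (labels are illustrative)
    detail: str


def build_timeline(*streams):
    """Merge any number of event streams into one list ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.when)


# Hypothetical data: one deploy, one chat message, one incident.
changes = [Event(datetime(2012, 1, 5, 14, 0), "change", "deploy web frontend")]
incidents = [Event(datetime(2012, 1, 5, 14, 7), "incident", "CPU at 100% on front ends")]
chat = [Event(datetime(2012, 1, 5, 14, 3), "chat", "anyone else seeing slow pages?")]

timeline = build_timeline(changes, incidents, chat)
for event in timeline:
    print(event.when.isoformat(), event.source, event.detail)
```

Seeing the change, the first human report, and the alert on one line of time is what makes the who/what/when questions answerable after the fact.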

Tracking Incidents - MTTD
1. Frequency
2. Severity
3. Root Cause: Five Whys Mentality
   o Why was the website down? The CPU utilization on all our front-end servers went to 100%.
   o Why did the CPU usage spike? A new bit of code contained an infinite loop!
   o Why did that code get written? So-and-so made a mistake.
   o Why did his mistake get checked in? He didn't write a unit test for the feature.
   o Why didn't he write a unit test? He's a new employee, and he was not properly trained.
4. Time-to-Detect
5. Time-to-Resolve
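
Time-to-Detect and Time-to-Resolve can be averaged across incident records to get MTTD and MTTR. In this sketch, TTD is measured from the start of the problem to detection and TTR from detection to resolution; that split is an assumption (some teams measure TTR from the start), and the `Incident` type and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was confirmed stable again


def _mean(deltas):
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to resolve: average of (resolved - detected)."""
    return _mean([i.resolved - i.detected for i in incidents])


# Hypothetical incident log.
log = [
    Incident(datetime(2012, 1, 5, 10, 0), datetime(2012, 1, 5, 10, 5), datetime(2012, 1, 5, 10, 35)),
    Incident(datetime(2012, 1, 9, 14, 0), datetime(2012, 1, 9, 14, 15), datetime(2012, 1, 9, 14, 25)),
]
print("MTTD:", mttd(log))
print("MTTR:", mttr(log))
```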

Rule #18: Monitor EVERYTHING, alert on actionable items
only, record the rest for trend information.
Rule #20: Do not make the monitoring system so noisy it is
useless.
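
These two rules amount to a routing decision: every measurement is recorded, but only actionable breaches page a human. A minimal sketch, assuming each check carries a hypothetical `actionable` flag meaning "someone can do something about this right now"; the check names and thresholds are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    threshold: float
    actionable: bool  # hypothetical flag: is there a step a human can take?


def process(checks):
    """Record every check for trending; page only on actionable breaches."""
    trend_log = []
    pages = []
    for check in checks:
        trend_log.append((check.name, check.value))
        if check.actionable and check.value > check.threshold:
            pages.append(check.name)
    return pages, trend_log


# Hypothetical readings: only the full disk is both over threshold and actionable.
readings = [
    Check("disk_used_pct", 97, 90, actionable=True),
    Check("cpu_pct", 95, 90, actionable=False),
    Check("queue_depth", 10, 100, actionable=True),
]
pages, trend_log = process(readings)
print("page on:", pages)
print("recorded:", len(trend_log), "data points")
```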

Data points to source these metrics from:
Application output, CLOG, Puppet, Jabber, Jira,
healthcheck, hardware, Eluna, Nagios... all collectible data

Handling Incident Response - MTTR
Detect a Problem
Communicate to Support/Community/Executives
Begin to take Action
Coordinate Troubleshooting/Diagnosis
Confirm Stability, Resolving Steps

Handling Incident Response - MTTR
Rule #24: Assign people to be point people for every bit
of technology.
Rule #25: Assign backup people to those people.
Rule #12: Know your bottlenecks, and how to spot them.
Rule #42: Create gigantic poster-size drawings of the
physical layouts of your data center.
Rule #43: Create gigantic poster-size drawings of the
logical flows of each part of your product.

XKCD #974:
I find that when someone is taking time to do something
right in the present, they're a perfectionist with no ability
to prioritize, whereas when someone took time to do
something right in the past, they're a master artisan of
great foresight.
