2. Safety in a Complex and Changing Environment
"...so safety isn't about the absence of something...that
you need to count errors or monitor violations. But
the presence of something. But the presence of what?
When we need to find that things go right under difficult
circumstances, it's mostly because of people's adaptive
capability; their ability to recognize, adapt to, and absorb
changes and disruptions, some of which might fall outside
of what the system is designed or trained to handle"
-Sidney Dekker
RESILIENCE
5. Vocabulary Lesson
Continuous Integration: The ability to quickly make sure
the system is ready for production.
Resilience: The intrinsic ability of a system to adjust its
functioning prior to, during, or following changes and
disturbances in order to sustain required operations.
Maintainability: A characteristic of design and installation
that determines the probability that a failed piece of
equipment, machine, or system can be restored to its normal
state within a given timeframe.
The SYSTEM includes all the
hardware and software, but
also all of the PEOPLE
involved.
10. Maintainability = Uptime Goodness
MTTR vs. MTBF
Low MTTR > High MTBF
Low MTTR = Better Uptime for most types of F (failure)
Low MTTR Requires:
• more useful metrics
• intelligent data analysis
• pre-planned, purposeful resilience
• cooperation between application and infrastructure
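The link between MTTR, MTBF, and uptime can be made concrete with a small calculation. The sketch below assumes the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR), which the slides do not state explicitly. The function names and the sample numbers are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_30_days(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes of downtime in a 30-day window."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * 30 * 24 * 60


if __name__ == "__main__":
    # Hypothetical system that fails roughly once every 200 hours.
    for mttr in (4.0, 1.0, 0.25):
        up = availability(200.0, mttr)
        down = downtime_minutes_per_30_days(200.0, mttr)
        print(f"MTTR={mttr:>5}h  availability={up:.4%}  downtime={down:.0f} min/30d")
```

With failure frequency held fixed, every reduction in MTTR shows up directly as fewer minutes of disruption, which is the point of the slide.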
16. Automation as a Default:
"One of the best ways to eliminate human problems is to
take the human out of the problem. Machines are very good
at doing things repeatedly and doing them the same way
every single time. Humans are not good at this. Let the
machines do it."
Rapid Recovery:
"Do we spend an unpredictable amount of time trying to
solve some obscure issue, or do we simply recreate the
instance providing the service from configuration
management?"
blog.lusis.org/blog/2011/10/18/deploy-all-the-things/
PUPPET + KICKSTART
+ Network Automation
ESPER + HEALTHCHECK + NAGIOS
+ SPLUNK + OHSHIT
18. Comfortable Changes
1) Are Small
• Many Small Changes = Fewer Incidents with lower MTTR
2) Are Reproducible
RPM:
• Really Peaceful Mornings
• Reduce Paging Monitors
• Reusable Provisioning Methods
Rule # 81: If you are logging into servers, you are doing it
wrong.
21. Comfortable Changes
3) Are easily understood by your most junior team members
Rule # 4: Keep it Simple, because you are smart. Do not
make it overly complex because you can.
4) Can be deployed to a subset of production systems
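One way to deploy to a subset of production systems is to pick a small, stable group of hosts that always receives a change first. The sketch below is only an illustration of that idea, not the method used in the talk. The function name `canary_subset`, the hash-based ordering, and the example hostnames are all assumptions.

```python
import hashlib
import math


def canary_subset(hosts: list[str], fraction: float) -> list[str]:
    """Pick a stable subset of hosts to receive a change first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not hosts:
        return []
    # Rank hosts by a hash of their name: the choice is the same on every
    # run, regardless of input order, and is not biased toward the
    # alphabetically first hosts.
    ranked = sorted(hosts, key=lambda h: hashlib.sha256(h.encode()).hexdigest())
    # Always select at least one host so a tiny fraction still tests something.
    count = max(1, math.ceil(len(hosts) * fraction))
    return ranked[:count]


if __name__ == "__main__":
    fleet = [f"web{i:02d}.example.com" for i in range(10)]  # hypothetical hosts
    print(canary_subset(fleet, 0.5))
```

Because the selection is deterministic, the same hosts act as the early-warning group for every change, which makes before/after comparisons easier.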
24. Comfortable Changes
5) Follow Process
Change control, deployment processes, peer review: all of
these things matter for a world-class OPS organization.
26. Comfortable Changes
6) Have been approved by a GO / NO-GO process with all
relevant parties checking in.
Ensure that all teams involved in a change have signed off,
including ON-CALL and CUSTOMER SERVICE
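A GO / NO-GO check can be reduced to a simple rule: the change proceeds only when every required party has explicitly said yes. The sketch below assumes a hypothetical set of team names; a missing answer counts as NO-GO.

```python
# Hypothetical list of parties that must check in before a change ships.
REQUIRED_SIGNOFFS = frozenset({"dev", "ops", "on-call", "customer-service"})


def go_no_go(
    signoffs: dict[str, bool],
    required: frozenset[str] = REQUIRED_SIGNOFFS,
) -> tuple[bool, list[str]]:
    """Return (go, blockers). GO only if every required party signed off."""
    # A team that has not answered is treated the same as a team that said no.
    blockers = sorted(team for team in required if not signoffs.get(team, False))
    return (not blockers, blockers)


if __name__ == "__main__":
    decision, blockers = go_no_go({"dev": True, "ops": True, "on-call": False})
    print("GO" if decision else f"NO-GO, waiting on: {', '.join(blockers)}")
```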
29. Small Changes
John Allspaw presented these graphs
of data gathered at Etsy.
More, Smaller Deployments
means
Lower MTTR
means
Fewer Minutes of Disruption
32. Operations Meta-Metrics
When in doubt, COLLECT DATA, Build a Timeline!
Things to Monitor:
• Changes (who/what/when/type)
• Incidents (type/severity/duration)
• Responses to Incidents (TTD/TTR)
Things to Collect:
• IRC/Jabber Logs
• Jira Logs
Search your Data: Use HBASE + PIG/HIVE, ESPER, SOLR, and SPLUNK.
Store everything, even stuff you don't yet know how to use.
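Building a timeline mostly means merging records from different sources and sorting them by time. The sketch below assumes each source has already been parsed into a common `Event` record. The record fields, function names, and sample data are hypothetical; real chat or ticket logs would need their own parsers.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    when: datetime
    kind: str  # for example "change", "incident", or "chat"
    who: str
    what: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.when)


def format_timeline(events: list[Event]) -> str:
    """Render one event per line for pasting into a postmortem."""
    return "\n".join(
        f"{e.when:%Y-%m-%d %H:%M} [{e.kind}] {e.who}: {e.what}" for e in events
    )


if __name__ == "__main__":
    changes = [Event(datetime(2012, 1, 1, 10, 5), "change", "alice", "deploy web")]
    incidents = [Event(datetime(2012, 1, 1, 10, 7), "incident", "nagios", "web 5xx")]
    chat = [Event(datetime(2012, 1, 1, 10, 6), "chat", "bob", "seeing errors?")]
    print(format_timeline(build_timeline(incidents, changes, chat)))
```

Putting changes and incidents on the same axis is what makes it possible to ask which change preceded which incident.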
33. Tracking Incidents - MTTD
1. Frequency
2. Severity
3. Root Cause: Five Whys Mentality
o Why was the website down? The CPU utilization on all our front-end servers
went to 100%.
o Why did the CPU usage spike? A new bit of code contained an infinite loop!
o Why did that code get written? So-and-so made a mistake.
o Why did his mistake get checked in? He didn't write a unit test for the
feature.
o Why didn't he write a unit test? He's a new employee, and he was not properly
trained.
4. Time-to-Detect
5. Time-to-Resolve
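Time-to-Detect and Time-to-Resolve can be computed per incident and then averaged into MTTD and MTTR. The sketch below makes two assumptions that the slides do not spell out: each incident has a known start, detection, and resolution timestamp, and time-to-resolve is measured from detection (some teams measure it from the start of the failure instead). All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from detection, not from the start of the failure.
        return self.resolved - self.detected


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    return mean_delta([i.time_to_detect for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    return mean_delta([i.time_to_resolve for i in incidents])
```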
34. Tracking Incidents - MTTD
Rule # 18: Monitor EVERYTHING, alert on actionable items
only, record the rest for trend information.
Rule # 20: Do not make the monitoring system so noisy it is
useless.
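Rules 18 and 20 can be read as a routing decision: every sample is recorded, but only samples that cross an actionable threshold page a human. The sketch below is a toy illustration of that split, not the behavior of any real monitoring tool. The `Monitor` class, the metric names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Monitor:
    pages: list[str] = field(default_factory=list)  # alerts that wake someone up
    trend_log: list[tuple[str, float]] = field(default_factory=list)  # everything

    def observe(self, metric: str, value: float,
                page_above: Optional[float] = None) -> None:
        """Record every sample; page only when an actionable threshold is crossed."""
        self.trend_log.append((metric, value))
        # Metrics without a threshold are trend data only and never page.
        if page_above is not None and value > page_above:
            self.pages.append(f"{metric}={value} exceeds {page_above}")


if __name__ == "__main__":
    monitor = Monitor()
    monitor.observe("disk_used_pct", 95.0, page_above=90.0)  # actionable
    monitor.observe("cpu_temp_c", 61.0)                      # trend only
    print(monitor.pages)
    print(monitor.trend_log)
```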
35. Tracking Incidents - MTTD
Data points to source these metrics from:
Application output, CLOG, Puppet, Jabber, Jira, healthcheck,
hardware, Eluna, Nagios... all collectible data.
36. Handling Incident Response - MTTR
Detect a Problem
Communicate to Support/Community/Executives
Begin to take Action
Communicate to Support/Community/Executives
Coordinate Troubleshooting/Diagnosis
Communicate to Support/Community/Executives
Confirm Stability, Resolving Steps
Communicate to Support/Community/Executives
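The pattern in the steps above is that every response step is followed by a round of communication. The sketch below expresses that as a loop. The `perform` and `notify` callbacks are hypothetical stand-ins for whatever actually does the work and sends the updates.

```python
from typing import Callable, Sequence

RESPONSE_STEPS = (
    "Detect a problem",
    "Begin to take action",
    "Coordinate troubleshooting/diagnosis",
    "Confirm stability, resolving steps",
)
AUDIENCES = ("Support", "Community", "Executives")


def run_response(
    steps: Sequence[str],
    perform: Callable[[str], None],
    notify: Callable[[str, str], None],
    audiences: Sequence[str] = AUDIENCES,
) -> None:
    """Run each step, then update every audience before moving on."""
    for step in steps:
        perform(step)
        for audience in audiences:
            notify(audience, f"Update after step: {step}")


if __name__ == "__main__":
    run_response(
        RESPONSE_STEPS,
        perform=lambda step: print(f"DO   {step}"),
        notify=lambda audience, message: print(f"TELL {audience}: {message}"),
    )
```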
37. Handling Incident Response - MTTR
Rule # 24: Assign people to be point people for every bit
of technology.
Rule # 25: Assign backup people to those people.
Rule # 12: Know your bottlenecks, and how to spot them.
Rule # 42: Create gigantic poster-size drawings of the
physical layouts of your data center.
Rule # 43: Create gigantic poster-size drawings of the
logical flows of each part of your product.
38. XKCD #974:
I find that when someone is taking time to do something
right in the present, they're a perfectionist with no ability
to prioritize, whereas when someone took time to do
something right in the past, they're a master artisan of
great foresight.