5. “A way to improve availability is
to install proven hardware and
software, and then leave it alone”
Jim Gray
Why Do Computers Stop and What Can Be Done About It?
6. Be reliable!
• Systems need to be reliable
• Nuclear weapon arsenal, heart rate monitoring,
World of Warcraft servers, streaming business
• Third-party dependencies (software and
hardware)
7. DynamoDB Outage US-East
• “… there was a brief network disruption that impacted a
portion of DynamoDB’s storage servers.”
• 2:19am until 7:10am PDT
• “There are several other AWS services that use
DynamoDB that experienced problems during the event.”
• SQS, EC2 auto scaling, CloudWatch
8. It’s not always the infrastructure
• Deployments themselves may cause issues
• Unpredicted behaviour after a change has been
rolled out
• Issues during rollback
• Change in client / user behaviour
10. Do the simplest thing first
• Prepare for your machines to die
• “Cattle, not pets” (Adrian Cockcroft)
• Resilience through redundancy
• Stateless machines
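The idea of redundant, stateless machines can be illustrated with a small simulation. This is only a sketch: `Instance`, `Pool` and the `i-0` style names are hypothetical and do not call any AWS API. It shows that when any instance can serve any request, killing one at random does not stop the pool from answering.

```python
import random


class Instance:
    """A stateless worker: any instance can serve any request."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """A redundant group of interchangeable instances (hypothetical)."""

    def __init__(self, size):
        self.instances = [Instance(f"i-{n}") for n in range(size)]

    def kill_random(self, rng):
        # Pick one of the instances that is still alive and take it down.
        victim = rng.choice([i for i in self.instances if i.alive])
        victim.alive = False
        return victim.name

    def handle(self, request):
        # Try each instance in turn and skip the dead ones.
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance left")


rng = random.Random(42)
pool = Pool(size=3)
killed = pool.kill_random(rng)
response = pool.handle("GET /")
print(f"killed {killed}, but: {response}")
```

Because no instance holds state of its own, the surviving machines can answer the request without any recovery step.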
11. Deal with infrastructure issues
• Latency between instances
• Packet loss
• Ports blocked
• or even outages of an entire AZ
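Latency and dropped requests can be rehearsed at the level of a function call before touching the real network. The wrapper below is a minimal sketch under that assumption: `with_faults` and `call_with_retry` are hypothetical helpers, and they simulate the symptoms inside the process rather than manipulating actual packets or ports.

```python
import random
import time


def with_faults(func, rng, latency_s=0.0, failure_rate=0.0, sleep=time.sleep):
    """Wrap func so that each call is delayed and may fail at random."""

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency
        if rng.random() < failure_rate:
            # Stand-in for a lost packet or a blocked port.
            raise ConnectionError("injected fault: request dropped")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3):
    """Call func up to `attempts` times, re-raising the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
    raise last_error


# Record the delays instead of really sleeping, so the demo runs instantly.
delays = []
slow = with_faults(lambda: "pong", random.Random(0),
                   latency_s=0.2, sleep=delays.append)
result = slow()
```

Passing the `sleep` function in as a parameter keeps the example fast and deterministic; in a real experiment the default `time.sleep` would apply the delay.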
12. Think big!
• Remember that DynamoDB failure?
• Outage of an entire AWS region!
• You’ll need more than one region in the first place
• Re-routing all traffic from one region to another
• Any region needs to be able to scale to take the load of
two regions
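The capacity requirement can be checked with simple arithmetic. The sketch below assumes made-up request rates and capacity limits; the helper names `load_after_failover` and `can_absorb` are hypothetical. It re-routes all traffic of a failed region to one target region and tests whether the target stays within its limit.

```python
def load_after_failover(loads, failed, target):
    """Return per-region load after all traffic of `failed` moves to `target`."""
    result = dict(loads)
    result[target] += result.pop(failed)
    return result


def can_absorb(loads, max_capacity, failed, target):
    """True if `target` can carry its own load plus the load of `failed`."""
    after = load_after_failover(loads, failed, target)
    return after[target] <= max_capacity[target]


# Hypothetical numbers, in requests per second.
loads = {"us-east-1": 1000, "us-west-2": 1000, "eu-west-1": 1000}
enough = {"us-east-1": 2000, "us-west-2": 2000, "eu-west-1": 2000}
too_small = {"us-east-1": 1500, "us-west-2": 1500, "eu-west-1": 1500}

ok = can_absorb(loads, enough, failed="us-east-1", target="us-west-2")
not_ok = can_absorb(loads, too_small, failed="us-east-1", target="us-west-2")
print(ok, not_ok)
```

With equal load per region, the target has to be able to scale to twice its normal traffic, which is the "load of two regions" from the slide.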
17. What’s in it?
• A compilation of scripts
• Scripts mess with your AWS account
• Thus, they are very AWS specific
• If not on AWS, get inspired and build your toolset around
these ideas
• Not a comprehensive toolset
18. Simian Army
• Latency Monkey
• Conformity Monkey
• Security Monkey
• Doctor Monkey
• 10-18 Monkey
20. Chaos Engineering
• Systematic approach to chaos testing
• Started by Netflix
• They talk about it a lot to attract talent
• Many other companies are doing similar things in this field
• They want to grow a community around it
21. “Experiment on a distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
Netflix
25. The “Happy Path”
• Trace through code where nothing bad happens
• Testing usually happens first on the happy path
• Bad things usually happen off the happy path
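A small example may make the distinction concrete. Everything here is hypothetical (`get_recommendations`, the two backends and the fallback value): the first call follows the happy path, the second exercises the branch that only runs when a dependency fails.

```python
def get_recommendations(fetch, fallback=("top-10",)):
    """Return personalised results, or a static fallback if the backend fails."""
    try:
        return list(fetch())
    except (ConnectionError, TimeoutError):
        # Off the happy path: degrade gracefully instead of failing the request.
        return list(fallback)


def healthy_backend():
    return ["a", "b"]


def failing_backend():
    raise TimeoutError("backend timed out")


happy = get_recommendations(healthy_backend)
unhappy = get_recommendations(failing_backend)
print(happy, unhappy)
```

A test suite that only ever calls `healthy_backend` would never execute the `except` branch, which is exactly where production failures end up.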
29. Four Principles of Chaos Engineering
1. Build a hypothesis around steady-state behaviour
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously
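The principles can be sketched as a small experiment loop. This is an illustration under stated assumptions, not a real tool: `ToyService` stands in for a production system, `run_experiment` is a hypothetical helper, and the steady-state metric is a simplified success rate.

```python
class ToyService:
    """Stand-in for a real system; it succeeds while any replica is healthy."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.healthy = replicas

    def success_rate(self):
        return 1.0 if self.healthy > 0 else 0.0

    def kill_one(self):
        self.healthy = max(0, self.healthy - 1)

    def restore(self):
        self.healthy = self.replicas


def run_experiment(measure, inject, rollback, tolerance):
    """Check that the steady-state metric survives an injected event."""
    baseline = measure()          # 1. steady-state behaviour
    if baseline == 0:
        raise ValueError("baseline metric must be non-zero")
    inject()                      # 2. vary a real-world event
    try:
        during = measure()
    finally:
        rollback()                # always undo the injected fault
    deviation = abs(during - baseline) / baseline
    return {
        "baseline": baseline,
        "during": during,
        "hypothesis_holds": deviation <= tolerance,
    }


redundant = ToyService(replicas=2)
single = ToyService(replicas=1)
report_redundant = run_experiment(
    redundant.success_rate, redundant.kill_one, redundant.restore, 0.01)
report_single = run_experiment(
    single.success_rate, single.kill_one, single.restore, 0.01)
print(report_redundant["hypothesis_holds"], report_single["hypothesis_holds"])
```

Principles 3 and 4 are outside this sketch: the same loop would be pointed at production metrics and triggered by a scheduler instead of being run by hand.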