Kolton Andrus from Netflix
Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.
At Netflix, we run failure exercises on a regular basis to ensure we are prepared. These efforts hardened our Edge services and helped us to have a quiet holiday season and a smooth global launch. Come and learn how to run an effective “Game Day” and safely test in production. Then sleep peacefully knowing you are ready!
Video available here: http://www.microservices.com/kolton-andrus-breaking-things-on-purpose
14. Experiment
Form a hypothesis - If we lose the Ratings service, members will get default ratings
Measurable Outcome - This will manifest as increased Hystrix Fallbacks
Success Criteria - But the overall success rate will remain constant
Abort Conditions - Halt immediately if members are unable to stream
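A minimal sketch of how the four parts of this experiment could be written down and checked against metrics. The metric names (`fallback_rate`, `success_rate`, `stream_start_rate`), the thresholds, and the `Experiment` class are hypothetical and chosen for illustration; they are not taken from the talk or from any Netflix tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class Experiment:
    """One failure experiment: hypothesis, expected signal, pass and abort rules."""
    hypothesis: str
    measurable_outcome: Callable[[Metrics, Metrics], bool]  # (baseline, observed)
    success_criteria: Callable[[Metrics, Metrics], bool]    # (baseline, observed)
    abort_condition: Callable[[Metrics], bool]              # (observed)

    def evaluate(self, baseline: Metrics, observed: Metrics) -> str:
        # The abort condition is checked first: it overrides everything else.
        if self.abort_condition(observed):
            return "abort"
        # If the expected signal never appeared, the failure was probably not injected.
        if not self.measurable_outcome(baseline, observed):
            return "inconclusive"
        return "pass" if self.success_criteria(baseline, observed) else "fail"


# Hypothetical metric names and thresholds, for illustration only.
ratings_experiment = Experiment(
    hypothesis="If we lose the Ratings service, members will get default ratings",
    measurable_outcome=lambda base, obs: obs["fallback_rate"] > base["fallback_rate"],
    success_criteria=lambda base, obs: abs(obs["success_rate"] - base["success_rate"]) <= 0.005,
    abort_condition=lambda obs: obs["stream_start_rate"] < 0.95,
)

baseline = {"fallback_rate": 0.01, "success_rate": 0.999, "stream_start_rate": 0.99}
observed = {"fallback_rate": 0.20, "success_rate": 0.998, "stream_start_rate": 0.99}
print(ratings_experiment.evaluate(baseline, observed))
```

Writing the abort condition down before the exercise starts means nobody has to decide under pressure when to stop.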
27. “Required Reading” and References
Antifragile: Things That Gain from Disorder by Nassim Nicholas Taleb
On Designing and Deploying Internet-Scale Services by James Hamilton
Drift into Failure by Sidney Dekker
Why failure testing is important, and why you should be running failure tests in production for your microservices.
About me:
Netflix - Edge Platform Engineer
Amazon - Retail Website Availability and Performance
“Call Leader”
Led Failure Test exercises at both companies
Lead into “Context on why failure testing is important”, even though it is counterintuitive
What is the opposite of fragile?
Robust/Resilient? Those are indifferent to change. We want something that improves with change.
Vaccine Analogy - Injecting a small amount of something bad can make us immune
The downside is the impact of the failure test
The upside is the prevention of future outages
Additionally the upside is in training our organizations to handle failure.
Prepare your organization for what could go wrong.
Run on your own terms. During the day, after the caffeine has kicked in.
Practice. Train. Answer questions. Know how to turn it off up front.
Failure Scenarios :: Threat Model
Analysis of past events - Start with low hanging fruit
We can’t prepare for everything - Black Swan
If we run only in one AWS region, and that region goes down, what will happen?
Cost/Benefit Analysis -> Prioritize the largest risks first
What is the downside
Lost revenue
two nines (3 days) for a $100M revenue company = $10M revenue lost
three nines (8 hrs) for a $1B revenue company = $10M revenue lost
Cost for Target being down on Black Friday?
An hour of downtime is estimated to cost Facebook $1.7M in lost advertising revenue
Brand Reputation
Customer Trust
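The availability figures above can be checked with a little arithmetic. The sketch below assumes revenue is earned at a uniform rate across the year, which is a simplification that the notes do not state; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def downtime_hours(availability: float) -> float:
    """Hours of downtime per year allowed by a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def proportional_revenue_loss(annual_revenue: float, availability: float) -> float:
    """Revenue lost if revenue accrues uniformly over the year (an assumption)."""
    return annual_revenue * (1.0 - availability)


cases = [("two nines", 0.99, 100e6), ("three nines", 0.999, 1e9)]
for label, availability, revenue in cases:
    hours = downtime_hours(availability)
    loss = proportional_revenue_loss(revenue, availability)
    print(f"{label}: {hours:.2f} hours/year of downtime, "
          f"about ${loss / 1e6:.1f}M lost at a uniform revenue rate")
```

Two nines allows roughly 3.65 days of downtime per year and three nines roughly 8.76 hours. Under the uniform-rate assumption both cases come to about $1M, so the $10M figures in the notes presumably rest on assumptions that are not captured here, such as outages landing on peak periods.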
Edge Service Failure Testing. Gateway for all the Netflix devices and website, talks to almost all of the streaming services.
Process:
Meet with the team to outline the exercise
Discuss what could go wrong
Common Points:
Network Bounds
Loss of a dependency
Set Up Communication
Let your team/dependencies/organization know you are running a test.
Invite anyone interested or impacted
Many eyes looking will spot errors faster
Share your pass/fail criteria
Command Center
Team Bullpen
Chat Room
Conference call
Smallest possible step
Run it locally
Run in test
Run it for a single instance
Validate the expected outcome
Example of a ‘successful’ CDN selection failure test
Small Scale - To find functional failures
Large Scale - Resource Constraints, Queuing, Cascading Failure
Emergent Behavior?
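The escalation described above (locally, in test, a single instance, then small and large scale) can be sketched as a loop that only moves to the next stage once the previous one behaved as expected. The stage names come from the notes; `run_staged`, the result dictionary, and the thresholds are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

STAGES = ["local", "test", "single instance", "small scale", "large scale"]


def run_staged(
    inject_failure: Callable[[str], Dict[str, float]],
    outcome_ok: Callable[[Dict[str, float]], bool],
    should_abort: Callable[[Dict[str, float]], bool],
) -> Tuple[str, List[str]]:
    """Run the failure at each stage in order; stop at the first surprise."""
    completed: List[str] = []
    for stage in STAGES:
        result = inject_failure(stage)
        if should_abort(result):
            return ("aborted at " + stage, completed)
        if not outcome_ok(result):
            return ("unexpected outcome at " + stage, completed)
        completed.append(stage)
    return ("completed", completed)


def fake_inject(stage: str) -> Dict[str, float]:
    # Stand-in for a real injection: only the large-scale run shows a problem,
    # mimicking queuing or cascading failure that small runs cannot reveal.
    error_rate = 0.20 if stage == "large scale" else 0.001
    return {"error_rate": error_rate, "members_can_stream": 1.0}


status, done = run_staged(
    fake_inject,
    outcome_ok=lambda r: r["error_rate"] < 0.01,
    should_abort=lambda r: r["members_can_stream"] < 1.0,
)
print(status, done)
```

In this made-up run the first four stages pass and the problem only shows up at large scale, which is the reason for not stopping at the single-instance test.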
Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon. - James Hamilton
Funny Anecdote - We did have an outage in Q3, and it came one day before the scheduled failure test. So run them early and often!
Used to deploy Netflix services to the cloud. Open-sourced, cloud-independent solution. Very critical piece of infrastructure. Automation is there to help prevent outages; doing it by hand isn’t ideal.
Low Hanging Fruit
Single Points of Failure - Instance, AZ
Lack of Monitoring - KPIs, Dashboards
Lack of Alerting - ‘Normal’ behavior
Brief Hystrix overview
Leveraging Hystrix for protection
Fallbacks for non-critical behavior
Circuit Breaker pattern
Resource isolation (Thread Pools)
Separating critical from non-critical
Configuration can be difficult
Happy case vs Worst case (Timeouts, ThreadPool usage)
Ensuring that fallbacks work on the client device
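To make the fallback and circuit breaker ideas concrete, here is a minimal sketch of the pattern. It is not the Hystrix API (Hystrix is a Java library); the `CircuitBreaker` class, its parameters, and the example functions are hypothetical, and thread-pool isolation and timeouts are left out.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down the breaker lets one trial call through ("half-open").
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, command: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuit: do not touch the failing dependency
        try:
            result = command()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def ratings_down() -> dict:
    raise ConnectionError("Ratings service unavailable")


def default_ratings() -> dict:
    return {"rating": 3.0, "source": "default"}


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
responses = [breaker.call(ratings_down, default_ratings) for _ in range(3)]
print(responses[-1]["source"], breaker.is_open())
```

The fallback keeps the non-critical feature degraded rather than broken, and the open breaker stops sending traffic to a dependency that is already struggling. The threshold and cool-down are exactly the kind of configuration that needs tuning for the happy case versus the worst case.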
Run by the traffic team, this is one of the best examples of the power of failure testing.
New learnings every few runs
Ready when called upon
AWS Outage - Q4 2015
AWS Outage - Q1 2016 - Jan 14th?
Counterpoint: Everything is a hammer
Comes up in every outage (should we shift traffic?)
Clear in some cases (AWS in one region is having problems)
Unclear in others (A service in a single region is having problems)
Bad in some (contaminate another region)
YoY from ’13 to ’14 our team was paged 21% less.
YoY from ’14 to ’15 our team was paged 20% less.
Perfect uptime over the Holidays (busiest period) - Great when you’re the on call over NYE