Embracing Failure - Fault Injection and Service Resilience at Netflix
November 13, 2014
A presentation given at AWS re:Invent on how Netflix induces failure to validate and harden production systems. Technologies discussed include the Simian Army (Chaos Monkey, Gorilla, Kong) and our next gen Failure Injection Test framework (FIT).
Netflix Ecosystem
• ~50 million members, ~50 countries
• > 1 billion hours per month
• > 1000 device types
• 3 AWS Regions, hundreds of services
• Hundreds of thousands of requests/second
• CDN serves petabytes of data at terabits/second
[Diagram: the Netflix ecosystem – static content served via Akamai, streaming via the Netflix CDN and its service partners, and the AWS/Netflix control plane, all reached over the Internet]
Failures can happen any time
• Disks fail
• Power outages
• Natural disasters
• Software bugs
• Human error
We design for failure
• Exception handling
• Fault tolerance and isolation
• Fall-backs and degraded experiences
• Auto-scaling clusters
• Redundancy
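One common way such fallbacks and isolation are expressed in code at Netflix is with Hystrix, open-sourced as part of the Netflix OSS stack (the deck itself does not show this). A minimal sketch with illustrative service and method names:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Wraps a remote call so that latency or errors trip a fallback
    // instead of cascading to the caller (names are illustrative).
    public class RecommendationsCommand extends HystrixCommand<String> {
        private final String memberId;

        public RecommendationsCommand(String memberId) {
            super(HystrixCommandGroupKey.Factory.asKey("RecommendationsService"));
            this.memberId = memberId;
        }

        @Override
        protected String run() throws Exception {
            // Normal path: call the downstream service.
            return remoteRecommendationsFor(memberId);
        }

        @Override
        protected String getFallback() {
            // Degraded experience: an unpersonalized default row.
            return "popular-titles";
        }

        private String remoteRecommendationsFor(String id) throws Exception {
            throw new UnsupportedOperationException("stand-in for the real client call");
        }
    }

Calling new RecommendationsCommand("123").execute() returns the fallback value whenever run() fails, times out, or is short-circuited, which is what makes degraded experiences possible.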
Testing for failure is hard
• Web-scale traffic
• Massive, changing data sets
• Complex interactions and request patterns
• Asynchronous, concurrent requests
• Complete and partial failure modes
Constant innovation and change
What if we regularly inject failures into our systems under controlled circumstances?
Blast Radius
• Unit of isolation
• Scope of an outage
• Scope of a chaos exercise
Instance → Zone → Region → Global
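As a rough illustration of the idea (not the Simian Army's actual configuration), each chaos exercise can declare the largest scope it is allowed to affect:

    // Hypothetical sketch: the deck's tools map onto increasing blast radii.
    enum BlastRadius { INSTANCE, ZONE, REGION, GLOBAL }

    final class ChaosExercise {
        final String name;
        final BlastRadius radius;

        ChaosExercise(String name, BlastRadius radius) {
            this.name = name;
            this.radius = radius;
        }

        public static void main(String[] args) {
            ChaosExercise monkey  = new ChaosExercise("Chaos Monkey",  BlastRadius.INSTANCE);
            ChaosExercise gorilla = new ChaosExercise("Chaos Gorilla", BlastRadius.ZONE);
            ChaosExercise kong    = new ChaosExercise("Chaos Kong",    BlastRadius.REGION);
            System.out.println(monkey.name + " -> " + monkey.radius);
            System.out.println(gorilla.name + " -> " + gorilla.radius);
            System.out.println(kong.name + " -> " + kong.radius);
        }
    }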
Chaos Monkey
• Monkey loose in your DC
• Run during business hours
• What we learned
– Auto-replacement works
– State is problematic
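At its core, a Chaos Monkey run picks a random instance in a cluster and terminates it. A minimal sketch using the AWS SDK for Java, with cluster discovery, scheduling, business-hours windows, and opt-outs omitted:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    import java.util.List;
    import java.util.Random;

    public class MiniChaosMonkey {
        private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        private final Random random = new Random();

        /** Terminates one randomly chosen instance from the given cluster. */
        public void unleash(List<String> clusterInstanceIds) {
            String victim = clusterInstanceIds.get(random.nextInt(clusterInstanceIds.size()));
            // Auto-replacement (e.g. via the auto-scaling group) should restore capacity.
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }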
A State of Xen - Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes
• 218 rebooted
• 22 did not reboot successfully
• Automation replaced failed nodes
• 0 downtime due to reboot
Chaos Gorilla
Simulate an availability zone outage
• 3-zone configuration
• Eliminate one zone
• Ensure that others can handle the load and nothing breaks
Chaos Gorilla
Challenges
• Rapidly shifting traffic
– LBs must expire connections quickly
– Lingering connections to caches must be addressed
• Service configuration
– Not all clusters auto-scaled or pinned
– Services not configured for cross-zone calls
– Mismatched timeouts – fallbacks prevented fail-over
[Diagram: geo-located regional load balancers feed Zuul (traffic shaping/routing), which distributes traffic across AZ1, AZ2, and AZ3, each with its own data stores]
Chaos Kong
[Diagram: customer devices reach geo-located regional load balancers and Zuul (traffic shaping/routing), which spread traffic across AZ1, AZ2, and AZ3 and their data stores; Chaos Kong exercises the loss of an entire region]
Challenges
● Rapidly shifting traffic
○ Auto-scaling configuration
○ Static configuration/pinning
○ Instance start time
○ Cache fill time
Challenges
● Service Configuration
○ Timeout configurations
○ Fallbacks fail or don’t provide the desired experience
● No minimal (critical) stack
○ Any service may be critical!
Services Slow Down and Fail
Simulate latent/failed service calls
• Inject arbitrary latency and errors at the service level
• Observe for effects
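As a rough sketch of service-level injection (not Latency Monkey's actual implementation), a server-side filter can delay or fail a configurable fraction of requests; the probabilities and delay below are made-up values:

    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;
    import java.io.IOException;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical filter: with some probability, delay the request or fail it outright.
    public class FaultInjectionFilter implements Filter {
        private static final double ERROR_PROBABILITY   = 0.01;  // 1% of requests failed
        private static final double LATENCY_PROBABILITY = 0.05;  // 5% of requests delayed
        private static final long   INJECTED_DELAY_MS   = 2_000;

        @Override public void init(FilterConfig filterConfig) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            double roll = ThreadLocalRandom.current().nextDouble();
            if (roll < ERROR_PROBABILITY) {
                ((HttpServletResponse) res).sendError(503, "injected failure");
                return;
            }
            if (roll < ERROR_PROBABILITY + LATENCY_PROBABILITY) {
                try {
                    Thread.sleep(INJECTED_DELAY_MS);  // injected latency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            chain.doFilter(req, res);  // otherwise proceed normally and observe downstream effects
        }
    }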
Latency Monkey
Challenges
• Startup resiliency is an issue
• Service owners don’t know all dependencies
• Fallbacks can fail too
• Second order effects not easily tested
• Dependencies are in constant flux
• Latency Monkey tests function and scale
– Not a staged approach
– Lots of opt-outs
FIT Details
● Common Simulation Syntax
● Single Simulation Interface
● Transported via an HTTP request header
Integrating Failure
[Diagram: a request flows through Service A’s Filter, Service, and Ribbon layers and on to Service B’s Filter, Service, and Ribbon layers, returning as the response; the failure context is propagated at the ServerRcv and ClientSend trace points]
The failure context travels as an HTTP request header, added in sendRequestHeader:
fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial",
"whitelist":false,
"injectionPoints":["SocialService"]},{}]],
{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
Failure Scenarios
● Set of injection points to fail
● Defined based on
○ Past outages
○ Specific dependency interactions
○ Whitelist of a set of critical services
○ Dynamic tracing of dependencies
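A failure scenario, in this framing, is essentially a named set of injection points; a hypothetical sketch (the service names are made up):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical: a scenario bundles injection points derived from past outages,
    // specific dependency interactions, a critical-service whitelist, or traced dependencies.
    final class FailureScenario {
        final String name;
        final Set<String> injectionPointsToFail;

        FailureScenario(String name, Set<String> injectionPointsToFail) {
            this.name = name;
            this.injectionPointsToFail = injectionPointsToFail;
        }

        // e.g. replay a past outage where two (hypothetical) dependencies failed together
        static final FailureScenario SOCIAL_OUTAGE = new FailureScenario(
            "social-outage",
            new HashSet<>(Arrays.asList("SocialService", "RatingsService")));
    }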
FIT Insights: Salp
● Distributed tracing inspired by the Dapper paper
● Provides insight into dependencies
● Helps define & visualize scenarios
Dialing Up Failure
Functional Validation
● Isolated synthetic transactions
○ Set of devices
Validation at Scale
● Dial up customer traffic - % based
● Simulation of full service failure
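The percentage dial can be as simple as hashing the customer Id into a bucket and comparing it with the current dial value; a sketch with illustrative names:

    // Hypothetical dial: route N% of real customer traffic through the failure simulation.
    final class FailureDial {
        private volatile int percentage;  // 0..100, raised gradually during the exercise

        FailureDial(int initialPercentage) {
            this.percentage = initialPercentage;
        }

        void setPercentage(int percentage) {
            this.percentage = percentage;
        }

        /** True if this customer's requests should see the simulated failure. */
        boolean inExperiment(String customerId) {
            int bucket = Math.floorMod(customerId.hashCode(), 100);
            return bucket < percentage;
        }
    }

The exercise starts with isolated synthetic transactions and a handful of devices, then the percentage is raised until the failure of the full service is being simulated against real traffic.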
Chaos!
Take-aways
• Don’t wait for random failures
– Cause failure to validate resiliency
– Remove uncertainty by forcing failures regularly
– Better to fail at 2pm than 2am
• Test design assumptions by stressing them
Embrace Failure
The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com
Netflix talks at re:Invent
Talk     Time                Title
BDT-403  Wednesday, 2:15pm   Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 3:30pm   Performance Tuning EC2
DEV-309  Wednesday, 3:30pm   From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm   Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm   Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm   Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am      Scheduling using Apache Mesos in the Cloud