Mais conteúdo relacionado

Apresentações para você(20)

Similar a Embracing Failure - Fault Injection and Service Resilience at Netflix(20)


Embracing Failure - Fault Injection and Service Resilience at Netflix

  1. Embracing Failure Fault Injection and Service Resilience at Netflix Josh Evans – Director of Operations Engineering Naresh Gopalani – Software Engineer and Architect © 2014, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc.
  2. Netflix Ecosystem • ~50 million members, ~50 countries • > 1 billion hours per month • > 1000 device types • 3 AWS Regions, hundreds of services • Hundreds of thousands of requests/second • CDN serves petabytes of data at terabits/second Static Content Akamai Netflix CDN Service Partners AWS/Netfli x Control Plane Internet
  3. Availability means that members can ● sign up ● activate a device ● browse ● watch
  4. What keeps us up at night
  5. Failures can happen any time • Disks fail • Power outages • Natural disasters • Software bugs • Human error
  6. We design for failure • Exception handling • Fault tolerance and isolation • Fall-backs and degraded experiences • Auto-scaling clusters • Redundancy
  7. Testing for failure is hard • Web-scale traffic • Massive, changing data sets • Complex interactions and request patterns • Asynchronous, concurrent requests • Complete and partial failure modes Constant innovation and change
  8. What if we regularly inject failures into our systems under controlled circumstances?
  9. Blast Radius • Unit of isolation • Scope of an outage • Scope a chaos exercise Instance Zone Region Global
  10. An Instance Fails Edge Cluster Cluster A Cluster B Cluster C Cluster D
  11. Chaos Monkey • Monkey loose in your DC • Run during business hours • What we learned – Auto-replacement works – State is problematic
  12. A State of Xen - Chaos Monkey & Cassandra Out of our 2700+ Cassandra nodes • 218 rebooted • 22 did not reboot successfully • Automation replaced failed nodes • 0 downtime due to reboot
  13. An Availability Zone Fails EU-West US-West US-East AZ1 AZ2
  14. Chaos Gorilla Simulate an availability zone outage • 3-zone configuration • Eliminate one zone • Ensure that others can handle the load and nothing breaks Chaos Gorilla
  15. Challenges • Rapidly shifting traffic – LBs must expire connections quickly – Lingering connections to caches must be addressed • Service configuration – Not all clusters auto-scaled or pinned – Services not configured for cross-zone calls – Mismatched timeouts – fallbacks prevented fail-over
  16. A Region Fails US-West US-East EU-West
  17. Regional Load Balancers Zuul – Traffic Shaping/Routing AZ1 AZ2 AZ3 Data Data Data Geo-located Chaos Kong Chaos Kong Regional Load Balancers Zuul – Traffic Shaping/Routing AZ1 AZ2 AZ3 Data Data Data Customer Device
  18. Challenges ● Rapidly shifting traffic ○ Auto-scaling configuration ○ Static configuration/pinning ○ Instance start time ○ Cache fill time
  19. Challenges ● Service Configuration ○ Timeout configurations ○ Fallbacks fail or don’t provide the desired experience ● No minimal (critical) stack ○ Any service may be critical!
  20. A Service Fails Service Zone Region Global
  21. Services Slow Down and Fail Simulate latent/failed service calls • Inject arbitrary latency and errors at the service level • Observe for effects Latency Monkey
  22. Latency Monkey Device ELB Zuul Edge Service B Service C Internet Service A
  23. Challenges • Startup resiliency is an issue • Services owners don’t know all dependencies • Fallbacks can fail too • Second order effects not easily tested • Dependencies are in constant flux • Latency Monkey tests function and scale – Not a staged approach – Lots of opt-outs
  24. More Precise and Continuous
  25. Service Failure Testing:FIT
  26. Distributed Systems Fail ● Complex interactions at scale ● Variability across services ● Byzantine failures ● Combinatorial complexity
  27. Any service can cause cascading failures ELB
  28. Fault Injection Testing (FIT) Device Service B Service C Internet Zuul Edge Device or Account Override Service A Request-level simulations ELB
  29. Failure Injection Points IPC Cassandra Client Memcached Client Service Container Fault Tolerance
  30. FIT Details ● Common Simulation Syntax ● Single Simulation Interface ● Transported via Http Request header
  31. Integrating Failure request [sendRequestHeader] >>fit.failure: 1|fit.Serializer| 2|[[{"name”:”failSocial, Filter Service Ribbon Filter Service Ribbon ServerRcv ClientSend ServerRcv Service A response Service B ”whitelist":false, "injectionPoints”: [“SocialService”]},{} ]], {"Id": "252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
  32. Failure Scenarios ● Set of injection points to fail ● Defined based on ○ Past outages ○ Specific dependency interactions ○ Whitelist of a set of critical services ○ Dynamic tracing of dependencies
  33. FIT Insights : Salp ● Distributed tracing inspired by Dapper paper ● Provides insight into dependencies ● Helps define & visualize scenarios
  34. Dialing Up Failure Functional Validation ● Isolated synthetic transactions ○ Set of devices Validation at Scale ● Dial up customer traffic - % based ● Simulation of full service failure Chaos!
  35. Continuous Validation Critical Services Non-critical Services Synthetic Transactions
  36. Don’t Fear The Monkeys
  37. Take-aways • Don’t wait for random failures – Cause failure to validate resiliency – Remove uncertainty by forcing failures regularly – Better to fail at 2pm than 2am • Test design assumptions by stressing them Embrace Failure
  38. The Simian Army is part of the Netflix open source cloud platform
  39. Netflix talks at re:Invent Talk Time Title BDT-403 Wednesday, 2:15pm Next Generation Big Data Platform at Netflix PFC-306 Wednesday, 3:30pm Performance Tuning EC2 DEV-309 Wednesday, 3:30pm From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services ARC-317 Wednesday, 4:30pm Maintaining a Resilient Front-Door at Massive Scale PFC-304 Wednesday, 4:30pm Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures ENT-209 Wednesday, 4:30pm Cloud Migration, Dev-Ops and Distributed Systems APP-310 Friday 9:00am Scheduling using Apache Mesos in the Cloud
  40. Please give us your feedback on this presentation Josh Evans @josh_evans_nflx Naresh Gopalani Join the conversation on Twitter with #reinvent © 2014, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc.