
Stranger Things: The Forces that Disrupt Netflix

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2h3bAvP.

Haley Tucker discusses how other systems can affect Netflix's services, and the strategies Netflix uses to protect its systems and keep them running even when things go wrong. Filmed at qconsf.com.

Haley Tucker works on the Playback Features team at Netflix, responsible for ensuring that customers receive the best possible viewing experience every time they click play. Her services fill a key role in enabling Netflix to stream amazing content to 65M+ members on 1000+ devices.

Published in: Technology

  1. The Forces That Disrupt Netflix, Nov. 7, 2016, Haley Tucker
  2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/netflix-resilience
  3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation. Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators. Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide. Presented at QCon San Francisco www.qconsf.com
  4. ACROBAT FLEA our world parallel world
  5. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." --Leslie Lamport
  6. ACROBAT FLEA our world parallel world computing ENGINEER
  7. PROLOGUE: DISTRIBUTED SYSTEMS
  8. DECOMPOSING THE MONOLITH (diagram: Devices, Proxy/Routing, Edge Service, Netflix Service, Netflix Playback Service, Playback Service, Traffic)
  9. Notes on Distributed Systems for Young Bloods: "Distributed systems are different because they fail often." --Jeff Hodges
  10. TABLE OF CONTENTS CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Metadata impacts on availability CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Crashing services and cascading failures CHAPTER 3: THE THROTTLE • Latency spikes and the impact of fallbacks FORCES AT WORK
  11. Whoops, something went wrong… Netflix Streaming Error: We’re having trouble playing this title right now. Please try again later or select a different title.
  12. CHAPTER ONE: THE WEIRD DATA IN THE CATALOG
  13. 45 MINUTES!! (image credit: Clock, by heyyobecky4lyfe, Tumblr)
  14. VIDEO METADATA ARCHITECTURE (diagram: Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic)
  15. Amazon S3, Netflix Playback Service: { String msg = "This should never happen!"; throw new IllegalStateException(msg); }
  16. MITIGATION: BLAST RADIUS (image credit: Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr)
  17. AWS Global Infrastructure
  18. AWS Global Infrastructure: STAGGERED ROLLOUT
  19. Pager. Diagnosis?
  20. PREVENTION: CANARIES (image credit: Canary, CC BY 2.0, Steve P2008 2014, Flickr)
  21. TRADITIONAL CANARY (diagram: Canary (New Code), Baseline (Old Code), Traffic, Source System, Video Metadata Service, Amazon S3, Netflix Services)
  22. CONSISTENCY: VALID STATE TRANSITIONS
  23. DATA CANARY (diagram: Source System, Video Metadata Service, Amazon S3, Netflix Data Canary Service, Data Tester, Netflix Services, Traffic)
  24. SEEING RETURNS (image credit: Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia)
  25. Verify consistency prior to applying state changes. …one tool is a data canary.
  26. CHAPTER TWO: THE VANISHING OF CRITICAL SERVICES
  27. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." --Leslie Lamport
  28. LOG DATA (diagram: Devices, Proxy/Routing, Edge Service, Playback Service, Netflix Playback Service, Log Data Service, Cassandra, Traffic)
  29. (diagram: Devices, Proxy/Routing, Edge Service, Playback Service, Netflix Playback Service, Proxy, Log Data Service, Cassandra, Traffic)
  30. CASCADING FAILURE (diagram: Devices, Proxy/Routing, Edge Service, Playback Service, Netflix Playback Service, Log Data Service, Cassandra, Traffic)
  31. { throw new OutOfMemoryError(); } (diagram: Log Data Service, Cassandra, Playback Service)
  32. PREVENTION: MANAGING RESOURCE CONSTRAINTS (image credit: Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr)
  33. REDUCE SURFACE AREA (image credit: Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr)
  34. 1: Keep Only Dependencies which are Necessary. SO MANY JARS!!
  35. 2: LIMIT “MAGIC” (image credit: Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr)
  36. 3: ADD KILL SWITCHES (image credit: Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr)
  37. 4: FAVOR IMMUTABILITY (diagram: DEV Playback Service, TEST Playback Service, PROD Playback Service)
  38. try { remoteService.call(); } catch (Throwable t) { /* Oops! */ System.exit(1); } (diagram: Log Data Service, Cassandra, Playback Service, Proxy/Routing, Traffic)
  39. MITIGATION: CIRCUIT BREAKERS (image credit: It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr)
  40. FAILURE TESTING (image credit: Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr)
  41. FAILURE TESTING (diagram: Devices, Proxy/Routing, Playback Service, Log Data Service, Cassandra, Traffic). "Automating Chaos Experiments in Production" by Ali Basiri; "Applying Failure Testing Research @Netflix" by Kolton Andrus and Peter Alvaro
  42. Manage resource constraints by reducing surface area. Leverage circuit breakers and rigorously test failures.
  43. CHAPTER THREE: THE THROTTLE
  44. PLAYBACK ARCHITECTURE (diagram: Devices, Proxy/Routing, Edge Service, Playback Service, URL Service, Traffic)
  45. NETFLIX CLIENT JARS (diagram: Playback Service, URL Service, URL Client; Circuit-breakers and Fallbacks, Metrics, Retries and Timeouts, RPC, Service Discovery)
  46. THROTTLING (diagram: Traffic, Playback Service, Concurrent Requests, Throttled Requests (HTTP 503))
  47. } System.gc(); } (diagram: URL Service, Playback Service, Edge Service, Proxy/Routing, Traffic)
  48. NETFLIX CLIENT JARS (diagram: Playback Service, URL Service, URL Client; Circuit-breakers, Metrics, Retries and Timeouts, RPC, Service Discovery, Heavy Fallback)
  49. FALLBACK TESTING: With 100% fallback, CPU held at 90%: 15 RPS. No fallback, CPU held at 90%: 58 RPS. Siege: https://github.com/JoeDog/siege
  50. SELECTING FALLBACKS: CACHE STATIC FALLBACK SERVICE
  51. } return Response.status(503).build(); } (diagram: URL Service, Playback Service, Edge Service, Proxy/Routing, Traffic)
  52. REQUEST BUCKETING: NON-CRITICAL (Experience or Performance Impact); CRITICAL (Customer Streaming Impact) (image credit: Fire Buckets at Oakworth Station, CC BY 2.0, Tim Greene 2015, Flickr)
  53. APPLICATION SHARDING (diagram: Devices, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic)
  54. CRITICAL (image credit: Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilssone 2014, Flickr)
  55. NON-CRITICAL (image credit: Traffic, CC BY-NC 2.0, jonbgeme 2008, Flickr)
  56. APPLICATION SHARDING (diagram: Devices, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic)
  57. No heavy fallbacks!! Fallbacks should be light and fast. Shard your application based on operational characteristics.
  58. EPILOGUE: KEY TAKEAWAYS
  59. KEY TAKEAWAYS CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Verify consistency prior to applying state changes. • One tool is a data canary. CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Manage resource constraints by reducing surface area. • Leverage circuit breakers and rigorously test failures. CHAPTER 3: THE THROTTLE • No heavy fallbacks!! Fallbacks should be light and fast. • Shard your application based on operational characteristics.
  60. The unexpected will happen. Plan to fail.
  61. PARTING THOUGHT DISTRIBUTED SYSTEMS SOCIAL
  62. Questions? Haley Tucker
  63. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/netflix-resilience
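
The Chapter 1 takeaway (slide 25: verify consistency prior to applying state changes) could be sketched roughly as below. This is only an illustration and not the Netflix data canary itself: the class `DataCanary`, its methods, and the use of a plain `Map` as the metadata snapshot are all hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: a gate that validates a candidate metadata snapshot
// before it replaces the snapshot consumers are reading.
final class DataCanary {
    private final List<Predicate<Map<String, String>>> checks;
    private Map<String, String> live; // last known-good snapshot

    DataCanary(Map<String, String> initial,
               List<Predicate<Map<String, String>>> checks) {
        this.live = Map.copyOf(initial);
        this.checks = List.copyOf(checks);
    }

    /** Publishes the candidate only if every check passes. */
    boolean tryPublish(Map<String, String> candidate) {
        for (Predicate<Map<String, String>> check : checks) {
            if (!check.test(candidate)) {
                return false; // reject: consumers keep the last known-good data
            }
        }
        live = Map.copyOf(candidate);
        return true;
    }

    /** Returns the snapshot currently visible to consumers. */
    Map<String, String> current() {
        return live;
    }
}
```

The point of the design is that a bad snapshot is rejected before any consumer sees it, so the failure stays in the publishing pipeline instead of reaching playback.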
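
Slides 38 to 42 argue for circuit breakers, and slide 57 adds that fallbacks should be light and fast. A minimal sketch of both ideas together follows, assuming a precomputed static fallback value. `SimpleCircuitBreaker` is a hypothetical name; this is not the client library shown on the slides, and the sketch is deliberately single-threaded.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, single-threaded sketch (no synchronization).
final class SimpleCircuitBreaker<T> {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the cool-down is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Calls the remote unless the circuit is open; the fallback is a ready value. */
    T call(Supplier<T> remote, T fallback) {
        if (openedAt >= 0 && clock.getAsLong() - openedAt < openMillis) {
            return fallback; // open: skip the remote call entirely
        }
        try {
            T result = remote.get(); // closed, or a trial call after the cool-down
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // (re)open the circuit
            }
            return fallback;
        }
    }
}
```

Because the fallback is passed in as an already-computed value rather than a second remote call, taking the fallback path costs almost nothing, which is the property slide 57 asks for.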
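
Slide 46 shows throttling as a cap on concurrent requests, with excess requests rejected as HTTP 503. One way this might look is sketched below. `ConcurrencyThrottle` is a hypothetical name, and plain integer status codes stand in for the `Response.status(503).build()` call seen on slide 51 so that the example stays self-contained.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: shed load once too many requests are in flight.
final class ConcurrencyThrottle {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    ConcurrencyThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the handler if a slot is free; otherwise rejects immediately with 503. */
    int handle(Runnable handler) {
        if (!permits.tryAcquire()) {
            return SERVICE_UNAVAILABLE; // no queuing: fail fast
        }
        try {
            handler.run();
            return OK;
        } finally {
            permits.release(); // free the slot even if the handler throws
        }
    }
}
```

`tryAcquire()` never blocks, so a rejected request returns at once instead of waiting in a queue while a latency spike upstream ties up the available slots.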
