Prepare for Failure 
Fail fast. Isolate. Shed load.
@robhruska
Webapp
Webapp · Mongo · SQL · Cache · Rabbit · Stripe · Twilio
Webapp · Mobile Push · Recruit · BBall
https://github.com/Netflix/Hystrix
https://github.com/hudl/Mjolnir
Timeouts · Bulkheads · Circuit Breakers
Timeout defaults: System.Net.Http.HttpClient 100s; java.net.HttpURLConnection ∞; org.apache.commons.httpclient.HttpClient ∞
Timeouts: set high (~15s), observe peak (99.5%), adjust down
Timeouts: 1250ms
Bulkheads
Thread Pools: 1/20, 20/20, 4/10, 4/20
Semaphores
Circuit Breakers: 34 operations, 1% error · 29 operations, 75% error
+/-
Timeouts Bulkheads Circuit Breakers
Webapp
users/get-user …
A · B · C · G
Resources 
github.com/Netflix/Hystrix 
github.com/hudl/Mjolnir 
michaelnygard.com 
@robhruska

Speaker Notes

  1. My name’s Rob. I currently work on Hudl’s Platform squad, and we’ve recently made some large structural changes to how we develop and deploy our web application. We’ve moved from a large, monolithic application to more of a distributed system - and I’m going to cover some of the coding tools and patterns we use to make that distributed system a bit more fault tolerant and resistant to failure. To start, a little background.
  2. The majority of our developers do a lot, if not all, of their work on hudl.com, our website. Coaches use Hudl to upload game and practice film, break it down, and share it out with athletes. Athletes watch and analyze game film to become better at their positions, and also do things like create highlights for awesome moments. Coaches can also do a lot of team management, build out and share playbooks, script their practices, stat games, etc. It’s become pretty feature rich, and these features get used by millions of people.
  3. And as of the beginning of this year, all of those features were crammed into one big project - the git repository, unpacked, was some 6GB. All that code got compiled into a single deployment that ran on each of our webservers.
  4. The webapp worked with a handful of other services - a number of our own databases and cache servers, and also some external services like Stripe for credit card processing.
  5. Each of these circles actually represents multiple machines - we currently have a couple hundred webservers up here, and most of these other services are actually clusters of several machines. Most of that’s for scalability, and with the exception of one or two of these gray circles, each node in a cluster is pretty independent and doesn’t communicate with the rest of the cluster or other services. This is still a pretty straightforward system. There are really only two layers of dependencies between applications. It’s fairly easy to reason about relationships between nodes and know what kinds of problems might happen. Additionally, these downstream nodes don’t change very frequently. Most are very stable, “core” components like databases. We aren’t deploying to them or upgrading them nearly as frequently as the webapp code we’re writing - and since they’re not changing very much, they’re a little bit less likely to fail. We could keep this system running pretty smoothly, and quickly identify and recover from most incidents.
  6. However, that simplicity came with costs. We’ve grown our product team pretty rapidly, and still have aggressive goals for where we’re going. Every new developer we hired and every new feature we introduced meant that we piled more and more code onto that one web application. That meant longer checkouts and longer app startup times for local development. It meant increased build and deploy times, which are important to us because we like to deploy changes frequently, sometimes 15-20 a day. It also meant that if one of our squads accidentally deployed some code with issues into production, it had a much higher chance to cripple or break the entire site, because everything runs in the same application.
  7. And that’s not a made-up scenario - we’ve seen all kinds of those things in production already: memory leaks, obscure circular references, super-aggressive loops that eat up CPU - all these things that are really difficult to find during development and testing, but become nasty things when you throw real production traffic at them.
  8. So we set out to solve these problems. We decided to move to a more distributed, service-oriented architecture. This meant taking that big webapp and splitting off services that individual squads could work with independently. We slowly rolled out separate applications for our college recruiting product, our basketball platform, mobile device push notifications, and a handful of others.
  9. Our intent is to have dozens of these in the medium term.
  10. And even hundreds of services in the long term.
  11. So look at how that changes the graph of how the components in the system interact. Even with just a few new services, we start adding a lot more dependencies between each of the nodes.
  12. Recruit and Basketball are going to need to send mobile push notifications out to users’ devices.
  13. Recruit will need to ask Basketball for game film on recruitable Basketball athletes.
  14. All three of these are still going to have to work with what we affectionately call “the monolith”, the original webapp that’s inevitably going to stick around for a while as we continue to move pieces off of it. Seems straightforward, but it gets more complex.
  15. Remember that the webapp had a bunch of other backend services it talked to. Databases, caches, queues.
  16. On top of that, each of those new services can have their own databases, caches, queues, and other internal and external dependencies. And just to really make a mess of things...
  17. each of these services is clustered, and has several of each of these nodes running in it. And this is with only four services. Imagine what it looks like with a hundred. Where are the failures going to happen here? Previously, we had a relatively linear, predictable set of places where things could go wrong. We knew what most of them were, and how to react. But here? Here, problems in one small part can cascade upward and outward in ways we’re not able to predict and handle.
  18. An error all the way down here in this database can cause problems...
  19. ...in its corresponding application...
  20. ...which can propagate upstream, crashing applications all the way back up. I’ll show you with a little more detail just how that can happen in a bit. These are all network calls, and the network is a hostile place; when it’s most inconvenient for you, it’s probably going to lose connectivity or become super slow. Network and server hardware can fail, packets can get dropped, there will be unscheduled maintenance, developers will push bad code. These and other failures are all but guaranteed to happen, especially if you’re running on commodity hardware in the cloud, where you’re not in as much control of your own systems. We can probably take a good crack at preventing these problems. It’d take a lot of money, time, and humans, but not a lot of our organizations have lots of money, time, and humans to dedicate, especially if we want to keep competing, innovating, and moving forward with the product that we actually deliver to our customers. So without those resources, your applications have to be tolerant of these failures and prevent them from branching out through your system. http://aphyr.com/posts/288-the-network-is-reliable
  21. So what does that mean? How do we anticipate those failures and become more tolerant? There are a number of patterns out there that help us solve these problems. The approach we took was inspired by a couple different sources. Cascading failure is a big topic in Michael Nygard’s book, Release It! - this is a really excellent book. Even if you don’t work in complex or distributed systems, it helps you think about how systems can fail, which is useful to keep in mind when you’re designing and coding. Last time I checked, the Kindle edition was around 16 or 17 bucks on Amazon, and that’s well worth it. And when we approach architectural problems, we have some engineering team role models that we look to to see what they’ve done in similar situations. One of those is Netflix. We both run on Amazon Web Services, so they tend to encounter situations similar to ours when it comes to how AWS works and behaves, or misbehaves. Netflix has a library called Hystrix, which takes and applies some concepts from Nygard’s book to solutions that help us with these problems. Hystrix is written in Java since Netflix works on the JVM. There’s not really a strong Hystrix equivalent for .NET, which is our primary platform - there’s a port of the library, but it doesn’t get much activity, and we also didn’t really think porting Hystrix directly to C# was the way we wanted to approach it. C# has some great asynchronous language features that don’t have 1:1 equivalents in Java, and we wanted to use them. We also wanted to understand the problems and solutions a little better ourselves, so we wrote a library similar to Hystrix for .NET called Mjolnir. So what are Hystrix and Mjolnir? Let’s check out some code.
  22. Both have an abstract class called Command that you can inherit from. Within these commands, you put some code that might be dangerous, that might be susceptible to those problems that we saw in the example. Here’s an example from Hystrix.
  23. You can see the run() method down here - whatever you put in here is what gets protected. In the majority of cases, this is probably code that does I/O over the network. Imagine that we’re grabbing an HTTP client in this run method and making a GET request. Inter-cluster service calls, database calls, calls to systems outside of your application. Those sorts of things. It *could* be code that’s heavily memory-bound, or code that interacts with something on disk, but those cases are a lot rarer than network communication.
  24. To execute the command, you just create a new one and call execute(). It’s fairly straightforward. Hystrix also has equivalent methods to execute() that let you get a Future or an Observable if you want to work with the result in a more asynchronous way.
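A minimal Java sketch in the spirit of the canonical Hystrix wiki example. The GetUserCommand name, the "Users" group, and the fake lookup inside run() are illustrative stand-ins; in practice run() would hold the network call being protected.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Illustrative command: anything thrown from run() (or a timeout) counts
// as a failure for the bulkhead and circuit breaker wrapped around it.
public class GetUserCommand extends HystrixCommand<String> {

    private final String userId;

    public GetUserCommand(String userId) {
        // Every command belongs to a group; the group name here is arbitrary.
        super(HystrixCommandGroupKey.Factory.asKey("Users"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // Imagine an HTTP GET to a user service here.
        return "user:" + userId;
    }
}

// Synchronous use; queue() and observe() are the asynchronous equivalents.
class Demo {
    public static void main(String[] args) {
        String user = new GetUserCommand("42").execute();
        System.out.println(user);
    }
}
```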
  25. Here’s the equivalent with Mjolnir in C#. It looks pretty similar. Some slight differences, but the structure is the same. Potentially dangerous code goes into the ExecuteAsync() method.
  26. To use the Command, you create a new one and call InvokeAsync(). That’s the async version - Mjolnir’s also got a synchronous equivalent, Invoke().
  27. What are these Commands? When you put code into that overridden method and run the Command, what actually happens? The Command applies a few protective patterns around it.
  28. One thing they do is enforce Timeouts. Timeouts are kind of an obvious thing, but it’s easy to forget about them.
  29. They use Bulkheads, which are a way to isolate pieces of your system from the rest of it, and keep failures contained.
  30. They also employ the idea of Circuit Breakers, which help track failure rates and fail fast when things seem to be going wrong. We’ll go into a little more detail about what each of these does.
  31. First, timeouts.
  32. Timeouts are pretty simple, right? You configure a timeout duration, and the operation will abort if it takes any longer. But when you’re deep in the bits, coding up a feature, you’re not typically thinking about how long it’s going to take to make a network call. You’re just trying to put things together and get the feature to work, right? It’s so easy to forget to set the timeout. When you start wrapping your code in Commands, they make timeouts required. They let you configure timeouts per-Command, but also have default global timeouts that get used if you don’t explicitly set one. They make you put some thought into what a reasonable execution time should be. Does anyone know what the default timeout for a .NET HttpClient call is? [ADVANCE] 100 seconds When was the last time you waited 100 seconds for a page to load? Maybe once in a while. Maybe. [ADVANCE] What about java.net.HttpURLConnection? Anyone know the default timeout? [ADVANCE] infinite [ADVANCE] apache commons HttpClient is also infinite. It’s pretty obvious that defaults aren’t going to cut it. You can’t have these multi-minute or infinite requests sitting around in your application, blocking and tying up threads. Plus, your users aren’t going to wait that long, anyway.
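A small sketch of setting explicit timeouts instead of trusting those defaults, using java.net.HttpURLConnection. The URL and the 2s/5s values are placeholders; the point is only that connect and read timeouts get set deliberately.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExplicitTimeouts {
    public static void main(String[] args) throws IOException {
        URL url = new URL("https://example.com/users/42"); // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2_000); // fail fast if we can't even connect
        conn.setReadTimeout(5_000);    // don't block forever waiting on a response
        int status = conn.getResponseCode();
        System.out.println("HTTP " + status);
    }
}
```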
  33. So what *should* your timeouts be? It depends, and you’ll have to experiment. Start with something reasonably high - that depends on how aggressive you want to be. Netflix starts with 1 second timeouts for new commands. We’ve been a bit more generous with ours. Given that we’ve recently introduced this into our system, we need to be sure we’re not being too aggressive and affecting our users’ experience. So we default to 10-15 seconds for our timeouts. After you get your Commands out there, you’ll want to Observe how long the command takes. This might be a day, a week, a month - ideally it’s whatever will take you through peak traffic. This is another reason we set ours pretty high - our peak cycle is different from Netflix. I assume theirs is relatively consistent week-to-week, but we see our peak loads in September, so what works for us right now may be too aggressive then. Once we go through that peak, we’ll tune them again more accurately, and try to get down to that one second mark. And when you tune them, you’ll want to adjust them to a timeout value that covers around 99.5% of their requests. https://github.com/Netflix/Hystrix/wiki/Operations#how-to-configure-and-tune-a-circuit
  34. For example, this is the elapsed time for one of our commands over one day. For this, we’d set our timeout around the 1250 to 1300 millisecond mark, which covers almost every single request. In this particular case I think I determined we’d end up rejecting just one or two requests out of around 85 thousand. It’s important to tune them as soon as you can and as low as you can. Our 15-second default is still pretty high. There’s likely more than one command execution on any given page request, so they’ll still stack up to something longer than 15 seconds, and the user’s probably not going to wait that long. As an aside, timeouts are something you can do even without integrating one of these libraries - it’s a smart thing to get used to setting them whenever the API you’re working with supports it. But timeouts alone aren’t enough. They’ll help you last a little bit longer in a failure situation, but we need more.
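As a sketch of what that tuning looks like in Hystrix, the per-command timeout can be set from the observed data, e.g. roughly the 1300ms figure above. The command and group names are illustrative, and older Hystrix versions call the property withExecutionIsolationThreadTimeoutInMilliseconds instead.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class TunedTimeoutCommand extends HystrixCommand<String> {

    public TunedTimeoutCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Users"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                // Covers ~99.5% of observed requests for this command.
                .withExecutionTimeoutInMilliseconds(1300)));
    }

    @Override
    protected String run() throws Exception {
        return "..."; // the protected network call goes here
    }
}
```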
  35. We need to isolate problems with Bulkheads.
  36. For those unfamiliar with how bulkheads work in naval vessels, here’s an awesome picture of a boat that I made.
  37. Some larger ships are divided up into compartments, with these upright walls in between called bulkheads. The bulkheads are watertight and fire-resistant, so ...
  38. If the hull gets punctured or a fire breaks out in one of the compartments, the bulkhead isolates the problem to that one part of the ship. The rest of the compartments aren’t affected, and the ship can stay afloat and continue operating. Hystrix and Mjolnir take this concept and apply it to code using thread pools and counting semaphores.
  39. Let’s look at an example. This is another look at that cascading failure I described earlier, but with a little more detail. On top here in gray is the application we’re trying to protect, and below are four applications it communicates with. Imagine it’s making frequent HTTP calls out to each of these systems.
  40. Visualize the application in terms of something like threads or sockets. There are a finite number it can handle, and that number is constrained. It might be by the amount of memory on the machine or the memory the app was allocated when it started up, or possibly by a capped thread pool size that’s managed by the application container.
  41. When everything’s working normally, those HTTP requests are firing off to other applications and responding pretty quickly - say, 10 milliseconds. So imagine that each of these squares turns a color for that 10 milliseconds and then frees up the resource, turning the square white again.
  42. But let’s say that one of those downstream applications starts responding really slowly. Maybe our basketball squad decided to run some batch data import, but forgot to throttle it, and cranked their application’s CPU up to 100%. Requests are still getting accepted, but now instead of 10 milliseconds, the application might take 10 or 20 seconds to respond depending on how overloaded that cluster is.
  43. This is where Timeouts alone won’t save you. If you’ve got a 15 second timeout, and requests are coming in faster than they’re timing out, the application’s going to have more and more threads tied up, blocking, waiting for those requests to get a response back. For a while, it’s fine - they’re probably just taking up other unused threads in the application container’s thread pool, and the pool might start resizing itself and adding more threads if it does that sort of thing.
  44. But eventually, those slow requests are going to start starving out the other calls.
  45. And if that continues, the application will max out its threads, and every one of them will be blocking on HTTP requests, leaving you completely hosed. If other things depend on you, you’ve also turned into another bad node in the system. This is the cascade we saw earlier, where a problem downstream can propagate upward and do damage to systems that depend on it, and systems that depend on those systems - all the way up. That’s where the bulkheads help out.
  46. Instead of letting Commands run wild through your application’s thread pool, they get passed through their own thread pool that has a fixed maximum number of threads, typically around 10. These individual thread pools are the bulkheads.
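A sketch of what that looks like in Hystrix: give the command its own named thread pool capped at 10 threads, so a misbehaving dependency can only ever tie up those 10. The key names and pool size here are illustrative.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class BulkheadedCommand extends HystrixCommand<String> {

    public BulkheadedCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Basketball"))
            .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("BasketballPool"))
            .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                .withCoreSize(10))); // the bulkhead: at most 10 concurrent executions
    }

    @Override
    protected String run() throws Exception {
        return "..."; // slow or failing HTTP call to the basketball service
    }
}
```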
  47. If one Command starts to act up, it can only use up as many threads as are in that pool. When it consumes them all, new executions of that Command will simply be rejected. The library will throw an exception or execute a predefined, safer fallback method. The rest of the Commands in the system continue working normally, and aren’t affected by the outage that’s going on with that one node, which is good, because there’s a good chance they don’t even need to interact with that service. The main job of the thread pools is to limit the number of concurrent calls for a single set of operations. We can also do something similar by using counting semaphores.
  48. We could replace the thread pool with a semaphore. The counting semaphore is just a lock that can have up to a specific number of concurrent lock holders. The lock acquisition is done in a try/acquire way, which means that if we can’t immediately access one of the spots in the lock, we don’t block, but instead return false. When the semaphore is at its maximum, new Commands won’t be able to acquire that lock and will get immediately rejected, just like the thread pool. Hystrix supports semaphores; Mjolnir doesn’t yet, but will at some point. So these thread pool and semaphore bulkheads are pretty important - they do a great job of mitigating that failure cascade and preventing it from spreading and taking over everything.
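Outside either library, the semaphore flavor of the bulkhead is small enough to sketch directly with java.util.concurrent.Semaphore. Names here are illustrative, not Hystrix or Mjolnir internals.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// At most `limit` callers in flight; anyone who can't get a permit is
// rejected immediately instead of blocking behind a slow dependency.
public class SemaphoreBulkhead {

    private final Semaphore permits;

    public SemaphoreBulkhead(int limit) {
        this.permits = new Semaphore(limit);
    }

    public <T> T execute(Callable<T> operation) throws Exception {
        if (!permits.tryAcquire()) {
            // Bulkhead full: fail fast.
            throw new IllegalStateException("bulkhead full, rejecting call");
        }
        try {
            return operation.call();
        } finally {
            permits.release();
        }
    }
}
```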
  49. On to our remaining protection layer. We’ve covered Timeouts and Bulkheads, let’s talk about Circuit Breakers. They’re pretty straightforward.
  50. Every Command gets passed through a circuit breaker.
  51. Each circuit breaker maintains a rolling count of the operations that have come through it, and whether or not each of those was a success or a failure.
  52. A traditional electrical circuit breaker trips and opens if too much current is drawn through it. One of our circuit breakers trips if it sees too many errors. Once a breaker is tripped, any Command that would have gone through it is instead immediately rejected. The breaker will stay tripped for a configured period. Maybe 10 or 30 seconds or so.
  53. Once the wait period ends, the breaker will allow a single operation through.
  54. If that operation succeeds, the breaker is considered fixed and closed, allowing all operations through again. But if the single test operation fails...
  55. ...the breaker remains tripped for another waiting period, and the process repeats.
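To make that state machine concrete, here's a deliberately tiny Java sketch. Real breakers in Hystrix and Mjolnir track rolling error percentages and request volumes; this toy version just trips after N consecutive failures and lets a test call through once the sleep window elapses.

```java
public class SimpleCircuitBreaker {

    private final int failureThreshold;
    private final long sleepWindowMillis;

    private int consecutiveFailures = 0;
    private long openedAtMillis = -1; // -1 means the breaker is closed

    public SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Closed: allow everything. Open: allow a test call only after the sleep window. */
    public synchronized boolean allowRequest() {
        if (openedAtMillis < 0) {
            return true;
        }
        return System.currentTimeMillis() - openedAtMillis >= sleepWindowMillis;
    }

    /** A success (including the test call) closes the breaker again. */
    public synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1;
    }

    /** Failures trip the breaker; a failed test call restarts the sleep window. */
    public synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = System.currentTimeMillis();
        }
    }
}
```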
  56. Circuit breakers serve two important purposes. They help us fail fast back to the caller. We already know that our operation is having problems, we might as well not make our users or client applications wait just to eventually tell them what we already know.
  57. They also help shed load from systems that are already under stress. When downstream applications are having problems, sending them more traffic certainly isn’t going to help them. And if you think about it, when you hit a site (say, GitHub) and it either goes really slowly or dumps an error back on you, what’s one of the things you usually do? You try to figure out if it’s you or them, so you hit F5 with a page reload, which sends off another request and is only going to burden the crippled system even more. We just need to back off until the application’s back up and working correctly. We also save a few threads and some processing on our calling side.
  58. One really nice thing is that most of timeout, thread pool, and circuit breaker behavior is all configurable at runtime. If you find that a timeout’s just a little too aggressive, you can bump it up without an application restart. Need to adjust the error percentage that a circuit breaker trips at? Done.
  59. Another thing that can be useful in practice is grouping several types of operations together so that they all use the same thread pool or circuit breaker. If you think about it, if you’ve got a service that lets you update user names, change user passwords, and delete users, there’s a good chance that if you start seeing elevated error rates or timeouts on the user name updates, you’ll start seeing the same problems with the password updates and the deletes. Not all of the time - sometimes there may be bugs in just one - but it’s likely that if problems are happening with the user database they all go to, they’ll affect everything user-related. So what we do, and what these libraries allow, is group similar operations together - all of our user service operations flow through the same circuit breaker, which means that they’re all going to hit a tripped breaker together and fail together. That gives us the opportunity to fail a little bit faster, instead of waiting for a circuit breaker for each different type of operation to trip. How you might group operations is kind of a judgement call, and you’ll probably have to try things out and adjust them if it doesn’t quite fit. We keep our groups of service methods pretty granular - we have focused groups of services for users, for roles, video clips, playbook plays, highlights. We kind of tie our groups to individual SQL tables or Mongo collections. Some of our groups might be a little wrong, and might cause some operations to run into a tripped breaker when they actually didn’t need to. We’re okay with that, and will recognize those situations and can move around our groupings if we find that we need to.
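In Hystrix terms, one way to get that grouping (a sketch, with illustrative names) is to give related commands the same HystrixCommandKey, which shares their metrics and circuit breaker, and the same HystrixThreadPoolKey, which shares their bulkhead:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixThreadPoolKey;

// All user-service operations share one breaker and one pool, so they trip
// and recover together. Names are illustrative.
abstract class UserServiceCommand<T> extends HystrixCommand<T> {
    protected UserServiceCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserService"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("UserService"))        // shared circuit breaker
            .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("UserService"))); // shared bulkhead
    }
}

class UpdateUserNameCommand extends UserServiceCommand<Boolean> {
    @Override
    protected Boolean run() throws Exception { return true; /* call the user service */ }
}

class DeleteUserCommand extends UserServiceCommand<Boolean> {
    @Override
    protected Boolean run() throws Exception { return true; /* call the user service */ }
}
```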
  60. So by using all three of these patterns, you can do a lot to improve your applications’ fault tolerance and stop that failure cascade.
  61. If you look at the cascading failure we saw earlier, but with these added protections, that database error only ever really affects...
  62. ...small parts of the apps that depend on it, leaving the rest of the app chugging along.
  63. Here’s an example of what these protections look like in our production environment. For this particular case, we had a small incident where a couple of our recruit servers had trouble connecting to our main webapp. This is a snapshot of one of our metrics charts of the elapsed round trip time between clusters for each group of Commands we have. You can see at the end there, about half past noon, a bunch of them jumped up to 15 seconds round trip, up from at most around 1 second.
  64. From our logs, this is what one set of Commands from one of our recruit servers looked like right when that started happening. In blue is the volume of canceled commands. A canceled command means it hit its timeout and we aborted the call. In yellow is the volume of rejected commands, which means the circuit breaker immediately prevented them from even happening. This bottom chart shows two individual events, the first is when the circuit breaker tripped, and the second is when it fixed itself. So you can see, for this particular set of Commands, there was a brief period up front where users would have been waiting 15 seconds for errors to occur, and then when the breaker tripped, we started immediately returning with an error instead of making them wait 15 seconds and *then* fail. This is a pretty small example with what was a fairly small incident. It didn’t hit any of our thread pool limits, but did help us fail faster and take a little load off while we waited for the problem to resolve.
  65. One thing that’s useful to keep in mind is that these bulkheads and circuit breakers are just patterns. They’re used by Hystrix and Mjolnir, but you can also employ them on your own, and we’ve done that ourselves as well. Within every application we have a transport layer that uses HTTP to send API requests from one cluster to another.
  66. That transport layer helps locate service endpoints - it holds onto a mapping of routes to other applications that we’ll round-robin through when sending these HTTP requests. Our transport layer watches every request that goes to each individual machine, and looks specifically for socket errors. We figure that if we encounter a socket error, it means that there’s something more fundamentally wrong with connectivity between us and that other machine.
  67. If we see more than three consecutive socket errors from us to a single machine, we’ll pull it from that internal mapping and mark it as unhealthy - note that we only do that within the application that observed it - we don’t broadcast that out to other nodes in our cluster or other applications - I’ll come back to that in a second. We’ll then, in the background, send a ping request to it every 5 seconds until we get a successful response, at which point we’ll put it back into that mapping as an eligible endpoint. This is just another example of the circuit breaker pattern, but in a more focused and specialized case. We’re monitoring operations for their failure rates, reacting by preventing them from happening altogether, and then self-healing when we see that things are better. Coming back to what I hinted at earlier, this brings us to an interesting observation.
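A hypothetical sketch of that endpoint health tracking, to make the idea concrete. The Pinger interface and every name here are invented for illustration - this is not Hudl's actual transport-layer code - but the behavior matches the description: more than three consecutive socket errors pull an endpoint from the rotation, and a background ping every 5 seconds puts it back once it answers.

```java
import java.net.SocketException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class EndpointHealth {

    public interface Pinger { boolean ping(String endpoint); } // hypothetical hook

    private static final int MAX_CONSECUTIVE_SOCKET_ERRORS = 3;

    private final Map<String, AtomicInteger> socketErrors = new ConcurrentHashMap<>();
    private final Map<String, Boolean> unhealthy = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Pinger pinger;

    public EndpointHealth(Pinger pinger) { this.pinger = pinger; }

    public boolean isHealthy(String endpoint) {
        return !unhealthy.containsKey(endpoint);
    }

    // Call after each request to the endpoint, passing null on success.
    public void record(String endpoint, Exception failureOrNull) {
        AtomicInteger errors = socketErrors.computeIfAbsent(endpoint, e -> new AtomicInteger());
        if (failureOrNull instanceof SocketException) {
            if (errors.incrementAndGet() > MAX_CONSECUTIVE_SOCKET_ERRORS
                    && unhealthy.putIfAbsent(endpoint, Boolean.TRUE) == null) {
                schedulePing(endpoint); // start probing in the background
            }
        } else {
            errors.set(0); // any non-socket-error outcome resets the streak
        }
    }

    private void schedulePing(String endpoint) {
        scheduler.schedule(() -> {
            if (pinger.ping(endpoint)) {
                socketErrors.get(endpoint).set(0);
                unhealthy.remove(endpoint); // back into the rotation
            } else {
                schedulePing(endpoint);     // try again in another 5 seconds
            }
        }, 5, TimeUnit.SECONDS);
    }
}
```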
  68. Note that all of the patterns and behavior that I’ve talked about here are happening right within one application on one machine. There’s not some global authority or monitoring application that watches the whole system and tells nodes about the state of the rest of the system. In fact, if you try to do that, it’s got the potential to be pretty inaccurate.
  69. Imagine this orange circle over here is some sort of global arbiter system that monitors all of our servers, and its responsibility is to detect when systems become unavailable and tell the rest of the systems about that. I’ll call it system G (for “global”).
  70. In some cases, this may work. If G can’t get to system C down here, it’s possible that system C dropped completely off the network or became unresponsive, in which case it’s fine to tell A and B that C is no longer available, and that they should stop sending traffic to it.
  71. But what’s also possible is that system G is able to connect to C just fine, but system B actually can’t. This might be because B goes through a different switch, and that switch is having problems. It might be that something’s wrong with the way system B is receiving DNS information. It could be for any number of reasons. In that case, G doesn’t really work well as an arbiter. It’s not going to be able to reliably tell B to stop sending traffic to C.
  72. It also doesn’t really work if G can’t connect to systems A or B to tell them about system C being gone, though A and B might still be able to communicate fine with each other and with this other, unmarked gray circle over on the right. This global arbiter model falls apart for a number of other similar situations.
  73. So all of our observations about the system are done by individual nodes. A, B, and C make decisions for themselves, and don’t really involve G very much. System A has a much better idea about the other applications that it can see and successfully communicate with. There can still be some need for a global state. For us, we do keep a global service registry that lists all of our servers and a little bit of information about them. Applications grab that state about every 15 seconds or so to refresh their worldview - bring new servers into their service registries, those sorts of things. That matters the most when an application starts up, because it gives the application the initial information it needs to build up its route mappings and those sorts of things. But outside of application startup, they don’t rely on that state to get things done. If they happen to lose connectivity to G for a bit, that’s fine. They’ll continue functioning. G just serves as a way to make sure things stay relatively in-sync. So the point here is that we give individual nodes the responsibility to make decisions about how they communicate.
  74. Another important takeaway here is that you need to design and write your user interfaces in a way that gracefully handles problems. It’s difficult to do, and we haven’t mastered it yet, but we’re getting better at it. It takes a bit of a mental shift. Here’s a screenshot of one of our pages where an athlete can manage his or her profile information. Part of that profile management involves uploading academic documents like transcripts for recruiters to view. This page itself is served up by one of our monolith servers, but all of the academic document data is managed by our recruit cluster. That means that when we load this page, we have to make a call out to a recruit server to grab the user’s academic documents. If recruit’s running slowly, there’s a chance that we’ll run into a tripped breaker when we grab this information. Instead of just blowing up and throwing a 500 back to the user, we just drop a small error message into the documents section - the rest of the page continues to render and behave just fine. So it takes a little bit of thought, because you need to figure out if you can still give your user a decent experience if parts of your underlying system are having troubles. There are going to be some obvious situations where a full-on error page is inevitable - if recruit’s down and we’re trying to serve up a very recruit-heavy page, we’re probably not going to be able to do much. But there are a lot of places like this, where that outside interaction is just a small part of the user’s experience, and you can let them continue to do other things that aren’t related to the failures. In some cases, you might even consider building in an automatic ajax retry or putting in a “try again” button. Maybe this was just a really quick, transient network blip that fixes itself in a few seconds. I know I scoffed at the fact that users are going to retry and cause you more problems, but that only causes problems if you’re not protected by circuit breakers. If your breaker is still tripped, it’s going to be a really quick request for your server to process, and it’s not going to push that request downstream to wherever the problem is.
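In Hystrix, that kind of graceful degradation hangs off getFallback(). A sketch with illustrative names: if the recruit call fails, times out, gets rejected by the bulkhead, or hits a tripped breaker, the fallback runs instead, and the page can render a small inline error in the documents section rather than a 500.

```java
import java.util.Collections;
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetAcademicDocumentsCommand extends HystrixCommand<List<String>> {

    private final String userId;

    public GetAcademicDocumentsCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("Recruit"));
        this.userId = userId;
    }

    @Override
    protected List<String> run() throws Exception {
        // The HTTP call to the recruit cluster for this user's documents goes here;
        // throwing simulates the recruit cluster having a bad day.
        throw new RuntimeException("recruit call failed for user " + userId);
    }

    @Override
    protected List<String> getFallback() {
        // Degrade gracefully: the documents section shows a small error message,
        // while the rest of the page renders normally.
        return Collections.emptyList();
    }
}
```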
  75. Finally, and this is solid advice regardless of what you’re doing: monitor your systems. Use logging and aggregators like Splunk, Kibana, LogStash. Hook up metrics services like Nagios, statsd, Riemann, or a host of other things. Send yourself alerts via email or something fancier like PagerDuty when breakers trip or error rates get high. You need to know how your systems behave and when they misbehave. The charts you saw earlier came from Splunk and Graphite, which are a couple tools we use, but you should be able to plug in logging and monitoring pretty easily.
  76. So I encourage you to give these frameworks a look. Hystrix is pretty far along, and has a number of other nifty features like request caching. Their wiki documentation is great, and they’ve got some good diagrams and dashboards that can give you more insight. I hope I’ve passed on a few valuable things to everyone today, even if it was just to build some awareness on how applications can interact and fail in unpredictable ways. These situations are more likely to happen under higher traffic volumes and in more distributed systems, but the patterns and ideas have applications within systems of any size and any load. Check out Michael Nygard’s book, Release It!, which sets the foundation for a lot of these ideas. I’ll get this slide deck pushed up and tweet it out and it’ll also probably end up on one of Mjolnir’s wiki pages. If you have any questions, I’d be happy to field them - or you can find me during one of the breaks if you think of something later.