one chaos experiment a day.
keep the outages away.
Yan Cui, @theburningmonk
Chaos
Engineering?
MUST KILL SERVERS!
RAWR!!
RAWR!!
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
microservices death stars circa 2015
resilience
/rɪˈzɪlɪəns/
noun
“the capacity to recover quickly from difficulties; toughness.”
it’s not about
preventing failures!
everything fails, all the time
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch
anything that can go wrong, will go wrong.
MURPHY’s LAW
GOAL
identify weaknesses before they manifest in system-wide, aberrant behaviors
HOW
learn about the system’s behavior by observing it during a controlled experiment
game days
failure injection
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
http://bit.ly/yubl-serverless
Developer Advocate @
Independent Consultant
advise, training, delivery
theburningmonk.com/courses
realworldserverless.com
“using serverless reduces the blast radius”
www.buzzsprout.com/877747/4615985
serverless improves resilience, as the platform takes care of infrastructure failures
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
Shared Responsibility Model
chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
there are no servers to kill!
SERVERLESS
improperly tuned timeouts
missing error handling
missing fallbacks
STEP 1.
define steady state
i.e. “what does normal look like”
STEP 2.
hypothesize that steady state continues in both the control and experimental groups
e.g. “the system stays up if a server dies”
STEP 3.
inject realistic failures
e.g. “slow response from 3rd-party service”
STEP 4.
try to disprove the hypothesis
i.e. “look for differences between the control and experimental groups”
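The four steps above can be sketched as a comparison between a control group and an experimental group. This is only an illustration: the `errorRate` metric, the `tolerance` threshold, and the sample data are assumptions, not taken from the slides.

```javascript
// Hypothetical steady-state metric: fraction of requests that failed.
function errorRate(results) {
  if (results.length === 0) return 0;
  const failures = results.filter((r) => !r.ok).length;
  return failures / results.length;
}

// The hypothesis holds if the experimental group's error rate stays
// within `tolerance` of the control group's error rate.
function hypothesisHolds(control, experiment, tolerance = 0.01) {
  return errorRate(experiment) - errorRate(control) <= tolerance;
}

// Illustrative data only: 100 requests per group.
const control = Array.from({ length: 100 }, (_, i) => ({ ok: i !== 0 }));    // 1 failure
const experiment = Array.from({ length: 100 }, (_, i) => ({ ok: i >= 20 })); // 20 failures

console.log(hypothesisHolds(control, experiment)); // false: steady state did not continue
```

In practice the steady-state metric would come from monitoring data rather than an in-memory array.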
latency: inject latency to function invocation
“what if service X has elevated latency?”
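Injecting latency into a function invocation can be sketched as a wrapper around the handler. The name `withInjectedLatency` and the config fields (`isEnabled`, `rate`, `minLatency`, `maxLatency`) are hypothetical and are not the API of any particular library.

```javascript
// Sketch only: a handler wrapper that sleeps before invoking the real handler.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function withInjectedLatency(handler, config, sleep = defaultSleep, random = Math.random) {
  return async (event, context) => {
    // Inject only when enabled, and only for a `rate` fraction of invocations.
    if (config.isEnabled && random() < config.rate) {
      const { minLatency, maxLatency } = config;
      const delayMs = Math.floor(minLatency + random() * (maxLatency - minLatency));
      await sleep(delayMs);
    }
    return handler(event, context);
  };
}

// Example: delay every invocation by 1-2 seconds before running the real handler.
const handler = withInjectedLatency(
  async () => ({ statusCode: 200, body: "ok" }),
  { isEnabled: true, rate: 1, minLatency: 1000, maxLatency: 2000 }
);
```

Passing `sleep` and `random` as parameters is a design choice that keeps the wrapper easy to test without real delays.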
API Gateway → Lambda → API Gateway → Lambda
hypothesis: the API call would time out, and our try-catch would handle it and return a default response
result: function times out after 6s
(hypothesis is disproved)
API Gateway → Lambda → API Gateway → Lambda
502, 200
API Gateway → Lambda → API Gateway → Lambda
3s timeout, 6s timeout
API Gateway → Lambda → API Gateway → Lambda
API Gateway: max 29s integration; Lambda: max 15 mins timeout
and then there’s cold starts…
TIL: most HTTP client libraries have a default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to a timeout of 3-6s.
Don’t forget about the cold starts!
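One common way to act on these numbers is to derive the outbound request timeout from the time left in the invocation, so the call fails early enough for the try-catch to return a default response. This is an assumption about a reasonable mitigation, not necessarily the one shown on the slides. `context.getRemainingTimeInMillis()` is the Lambda Node.js context method; `requestTimeoutMs`, `withTimeout`, `handle`, `callDownstream`, and the 500ms reserve are hypothetical.

```javascript
// Leave `reserveMs` for the fallback path and for returning the response.
function requestTimeoutMs(remainingMs, reserveMs = 500) {
  return Math.max(0, remainingMs - reserveMs);
}

// Reject if `promise` does not settle within `ms`, so that a surrounding
// try-catch gets a chance to return a default response.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `callDownstream` is a hypothetical function that calls the other API.
async function handle(event, context, callDownstream) {
  const budget = requestTimeoutMs(context.getRemainingTimeInMillis());
  try {
    return await withTimeout(callDownstream(event), budget);
  } catch (err) {
    // Default response when the downstream call fails or is too slow.
    return { statusCode: 200, body: JSON.stringify({ items: [] }) };
  }
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying request, so the HTTP client's own timeout should still be set.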
https://bit.ly/2Wvfort
outcome: a more resilient system
latency: inject latency to function invocation
exception: throw an exception
statuscode: return an HTTP status code
diskspace: fill up the /tmp directory
denylist: lose network connectivity
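Two of the failure modes listed above, exception and statuscode, can be sketched as a small dispatcher that runs before the real handler. The config shape (`isEnabled`, `failureMode`, `rate`, `exceptionMsg`, `statusCode`) and both function names are hypothetical. The latency, diskspace, and denylist modes are left out to keep the sketch short.

```javascript
// Sketch only: decide whether to inject a failure and, if so, which one.
function injectFailure(config, random = Math.random) {
  if (!config.isEnabled || random() >= config.rate) {
    return { injected: false };
  }
  switch (config.failureMode) {
    case "exception":
      throw new Error(config.exceptionMsg);
    case "statuscode":
      // Short-circuit the handler with a canned HTTP response.
      return { injected: true, response: { statusCode: config.statusCode } };
    default:
      throw new RangeError(`unsupported failure mode: ${config.failureMode}`);
  }
}

// Wrap a handler so that the chosen failure is applied before it runs.
// `getConfig` is called on every invocation, so failures can be toggled
// without redeploying the function.
function withFailureInjection(handler, getConfig) {
  return async (event, context) => {
    const outcome = injectFailure(getConfig());
    return outcome.injected ? outcome.response : handler(event, context);
  };
}
```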
“what if DynamoDB has an elevated error rate?”
API Gateway → Lambda → DynamoDB
hypothesis: the AWS SDK retries would handle it
result: function times out after 6s
(hypothesis is disproved)
TIL: the js DynamoDB client defaults to 10 retries with a base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s fav formula!
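A quick calculation shows why these defaults can outlast a 6s function timeout. The sketch below sums the upper bound of the formula over 10 retries with a 50ms base. It assumes `retryCount` starts at 0 and counts only the waiting between attempts, not the time spent in the requests themselves.

```javascript
// Upper bound of the jittered delay before retry number `retryCount`
// (Math.random() is always below 1).
function maxDelayMs(retryCount, base) {
  return Math.pow(2, retryCount) * base;
}

// Worst-case total time spent waiting between attempts.
function worstCaseTotalDelayMs(maxRetries, base) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += maxDelayMs(retryCount, base);
  }
  return total;
}

const worst = worstCaseTotalDelayMs(10, 50);
console.log(worst);     // 51150 (ms), i.e. about 51 seconds
console.log(worst / 2); // 25575 (ms), the expected total, since Math.random() averages 0.5
```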
action: set max retry count + fallback
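A capped retry count combined with a fallback can be sketched generically as follows. `retryWithFallback`, its options, and the default values are hypothetical; with the AWS SDK for JavaScript v2, the retry cap itself would normally be set through the client's `maxRetries` option rather than a hand-written loop.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try `operation` once, then retry at most `maxRetries` times with jittered
// exponential backoff. If every attempt fails, return `fallback()` instead.
async function retryWithFallback(operation, fallback, { maxRetries = 2, base = 50, sleep = pause } = {}) {
  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      return await operation();
    } catch (err) {
      if (retryCount === maxRetries) break; // out of retries: fall through to the fallback
      const delay = Math.random() * (Math.pow(2, retryCount) * base);
      await sleep(delay);
    }
  }
  return fallback();
}
```

With `maxRetries = 2` and `base = 50`, the waiting between attempts is bounded by 50 + 100 = 150ms, which leaves room for the fallback inside a short function timeout.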
outcome: a more resilient system
everything fails, all the time
https://theburningmonk.com/hire-me
Advise, Training, Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead
@theburningmonk
theburningmonk.com
github.com/theburningmonk

A chaos experiment a day, keeping the outage away