SlideShare uma empresa Scribd logo
1 de 47
A D A R K A N D S T O R M Y N I G H T
TA L E S O F O P E R A B I L I T Y A N T I - PA T T E R N S
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
operational-antipattern
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Kiran Bhattaram
@kiranb
B U LW E R - LY T T O N
It was a dark and stormy night; the rain fell in torrents — except at
occasional intervals, when it was checked by a violent gust of wind
which swept up the streets (for it is in London that our scene lies),
rattling along the housetops, and fiercely agitating the scanty flame
of the lamps that struggled against the darkness.
DEFINITIONS
What is operability?
▸ The ability to keep a system in a safe and reliable
functioning condition, according to pre-defined
operational requirements.
Characteristics of operability
▸ safety & reliability
▸ scalability
▸ grace under pressure
DEFINITIONS
▸ ease of upgrades
▸ observability
▸ usability
▸ cultural practices around incidents
▸ AND MORE
DEFINITIONS
Characteristics of an operable system
▸ Converge towards a stable state.
▸ Give operators visibility and tools.
▸ Designed to be usable and unsurprising.
DEFINITIONS
Agenda
Robustness Usability Review!Observability
1. ROBUSTNESS
THE TALE OF THE SYSTEM THAT
COULDN’T GIVE ANYTHING UP
STORY 1
ROBUSTNESS
Define your critical path.
ROBUSTNESS
Harvest, Yield and Scalable Tolerant Systems
Yield =
successful requests
total requests
!= uptime
Harvest =
data available
total data
* dropping requests
* degrading response
ROBUSTNESS
Controlling yield: load shedding upstream requests
▸ categories of load shedders:
▸ # of requests
▸ # of concurrent requests (protect against the long tail)
▸ overall fleet utilization (keep x% of workers for core
traffic)
ROBUSTNESS
Controlling harvest: circuit breakers
▸ stop calling a dependency if it seems down!
▸ what do you return?
▸ cached data
▸ nil
▸ or propagate the error upstream
ROBUSTNESS
Controlling harvest: circuit breakers & compartmentalization
http://idighardware.com/2013/10/fire-doors-everything-you-always-wanted-to-know-but-were-afraid-to-ask/
ROBUSTNESS
Putting it all together: giving things up
▸ Combine harvest/yield degradation in different ways to
protect the critical path
▸ Monitor any degradation!
▸ Dark launch your rate limiters to check what they’d block.
ROBUSTNESS
Robustness, in review
▸ know how the system sheds
load
▸ know how it reacts to
downstream failures
Converge to a stable state.
2. OBSERVABILITY
THE TALE OF THE
FRACTAL QUEUE
STORY 2
OBSERVABILITY
Instrument EVERYTHING
▸ especially with queues
▸ percentiles, not averages
▸ don’t intermingle logs (keep a searchable trace ID on
requests)
OBSERVABILITY
Over-collect data, but build dashboards carefully
▸ work metrics
▸ is the system doing the thing it’s supposed to?
▸ resource metrics
▸ how are the components of the system behaving?
▸ build your dashboard with work metrics first.
THE TALE OF THE 64
ALERT WEEK
STORY 4
OBSERVABILITY
Don’t normalize deviance
OBSERVABILITY
Knowing what to alert on
▸ Monitor the alert volume of your system!
▸ Pages should be actionable and represent user pain.
OBSERVABILITY
Observability: what we learned
▸ Kiran has a special vendetta against unmonitored queues.
▸ Building good dashboards: work metrics & resource
metrics.
▸ Monitor alert volume, too!
3. USABILITY
6. Recognition vs. recall
9. Help users recognize,
diagnose, and recover from
errors
USABILITY
A quick side note: Nielsen Heuristics
1. Visibility of system status
2. Match between system and the
real world
3. User control and freedom
4. Consistency and standards
5. Error prevention
6. Recognition vs. recall
7. Flexibility and efficiency of use
8. Aesthetic and minimalist design
9. Help users recognize, diagnose,
and recover from errors
10. Help and documentation
1. Visibility of system status
3. User control and freedom
5. Error prevention
Story 5: the tale of the special snowflake service
USABILITY
Heuristic 4. Consistency and Standards
▸ pattern-matching across
similar systems is really
valuable!
▸ Choose boring
technology: spend your
innovation tokens wisely!
OBSERVABILITY
Heuristic 3. User control and freedom
▸ Tooling is a part of the service!
▸ relatedly, deploy mechanisms are related to availability!
▸ Give operators the ability to change operational
parameters.
THE TALE OF THE OPS
SPELL BOOK
STORY 6
USABILITY
Heuristic 6. Recognition v. recall
▸ Keep checklists minimal and heavily automated.
▸ long flowcharts in a runbook are :(
▸ relatedly: scripting user communications is helpful.
USABILITY
Heuristic 1. Visibility of system status
▸ which of these are changes to production?
▸ config changes
▸ deploys
▸ utility script runs
▸ failovers
▸ adding/decreasing capacity
THE TALE OF THE
AMBIGUOUS ERROR
MESSAGE
STORY 7
USABILITY
Heuristic 9. Help users recognize, diagnose, and recover from errors
▸ error messages are a crucial part of your interface
▸ Writing a good alert message:
▸ expressed in plain language, precisely indicate the
problem, and constructively suggest a solution (runbooks!)
▸ (ex.) CRITICAL: Served 5% 5xx results in the last 5 minutes!
<link to runbook>
USABILITY
Usability, in review
▸ Operational experience matters! Consider:
▸ whether the system follows general conventions.
▸ how it alerts operators to errors clearly and
unambiguously.
▸ how minimal and usable the tooling is.
Review
▸ Robustness
▸ Does your system converge to a stable state?
▸ Observability
▸ Can you infer what the internal state of the system looks like?
▸ Usability
▸ Do your operators have control over the state of the system?
Do you adhere to general standards?
REVIEW
THE TALE OF THE SAD
QUEUE
STORY THE LAST
:(
A DARK AND STORMY
NIGHT
STORY THE LAST
Resources
▸ Harvest, Yield, and Scalable Tolerant Systems (Brewer & Fox)
▸ How Complex Systems Fail (Cook)
▸ "Going solid": a model of system dynamics and consequences for patient safety (Cook)
▸ Nielsen’s Usability Heuristics
▸ Choose Boring Technology (Dan McKinley)
▸ Site Reliability Engineering: How Google Runs Production Systems
▸ Stripe’s (upcoming) rate limiting blog post
▸ Collection of postmortems (Dan Luu)
REVIEW
REVIEW
On Designing and Deploying Internet-Scale Services, James Hamilton
▸ list of best practices, from design, to upgrades, to incident
response
T H A N K S !
Thanks to Ines Sombra, Charity Majors, Alyssa Frazee, Rachel Sanders, and
Andy Bonventre for review!
APPENDIX
STUFF I COULDN’T GET TO
OBSERVABILITY
decouple deploys from releases
▸ get a minimal version in dark-reads into production asap
▸ corollary: have good kill switches!
▸ Know what rollbacks look like
OBSERVABILITY
collect operational metrics in this shadow phase
▸ Gain historical knowledge of what the system’s healthy
state looks like.
▸ Tweak your alerts and SLAs.
▸ Gameday the system! Write runbooks!
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
operational-antipattern

Mais conteúdo relacionado

Mais de C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Mais de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Último

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

A Dark and Stormy Night: Operational Antipatterns

  • 1. A D A R K A N D S T O R M Y N I G H T TA L E S O F O P E R A B I L I T Y A N T I - PA T T E R N S
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ operational-antipattern
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 5. B U LW E R - LY T T O N It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
  • 6. DEFINITIONS What is operability? ▸ The ability to keep a system in a safe and reliable functioning condition, according to pre-defined operational requirements.
  • 7. Characteristics of operability ▸ safety & reliability ▸ scalability ▸ grace under pressure DEFINITIONS ▸ ease of upgrades ▸ observability ▸ usability ▸ cultural practices around incidents ▸ AND MORE
  • 8. DEFINITIONS Characteristics of an operable system ▸ Converge towards a stable state. ▸ Give operators visibility and tools. ▸ Designed to be usable and unsurprising.
  • 11. THE TALE OF THE SYSTEM THAT COULDN’T GIVE ANYTHING UP STORY 1
  • 13. ROBUSTNESS Harvest, Yield and Scalable Tolerant Systems Yield = successful requests total requests != uptime Harvest = data available total data * dropping requests * degrading response
  • 14. ROBUSTNESS Controlling yield: load shedding upstream requests ▸ categories of load shedders: ▸ # of requests ▸ # of concurrent requests (protect against the long tail) ▸ overall fleet utilization (keep x% of workers for core traffic)
  • 15. ROBUSTNESS Controlling harvest: circuit breakers ▸ stop calling a dependency if it seems down! ▸ what do you return? ▸ cached data ▸ nil ▸ or propagate the error upstream
  • 16. ROBUSTNESS Controlling harvest: circuit breakers & compartmentalization http://idighardware.com/2013/10/fire-doors-everything-you-always-wanted-to-know-but-were-afraid-to-ask/
  • 17. ROBUSTNESS Putting it all together: giving things up ▸ Combine harvest/yield degradation in different ways to protect the critical path ▸ Monitor any degradation! ▸ Dark launch your rate limiters to check what they’d block.
  • 18. ROBUSTNESS Robustness, in review ▸ know how the system sheds load ▸ know how it reacts to downstream failures Converge to a stable state.
  • 20. THE TALE OF THE FRACTAL QUEUE STORY 2
  • 21. OBSERVABILITY Instrument EVERYTHING ▸ especially with queues ▸ percentiles, not averages ▸ don’t intermingle logs (keep a searchable trace ID on requests)
  • 22. OBSERVABILITY Over-collect data, but build dashboards carefully ▸ work metrics ▸ is the system doing the thing it’s supposed to? ▸ resource metrics ▸ how are the components of the system behaving? ▸ build your dashboard with work metrics first.
  • 23. THE TALE OF THE 64 ALERT WEEK STORY 4
  • 25. OBSERVABILITY Knowing what to alert on ▸ Monitor the alert volume of your system! ▸ Pages should be actionable and represent user pain.
  • 26. OBSERVABILITY Observability: what we learned ▸ Kiran has a special vendetta against unmonitored queues. ▸ Building good dashboards: work metrics & resource metrics. ▸ Monitor alert volume, too!
  • 28. 6. Recognition vs. recall 9. Help users recognize, diagnose, and recover from errors USABILITY A quick side note: Nielsen Heuristics 1. Visibility of system status 2. Match between system and the real world 3. User control and freedom 4. Consistency and standards 5. Error prevention 6. Recognition vs. recall 7. Flexibility and efficiency of use 8. Aesthetic and minimalist design 9. Help users recognize, diagnose, and recover from errors 10. Help and documentation 1. Visibility of system status 3. User control and freedom 5. Error prevention
  • 29. Story 5: the tale of the special snowflake service
  • 30. USABILITY Heuristic 4. Consistency and Standards ▸ pattern-matching across similar systems is really valuable! ▸ Choose boring technology: spend your innovation tokens wisely!
  • 31. OBSERVABILITY Heuristic 3. User control and freedom ▸ Tooling is a part of the service! ▸ relatedly, deploy mechanisms are related to availability! ▸ Give operators the ability to change operational parameters.
  • 32. THE TALE OF THE OPS SPELL BOOK STORY 6
  • 33. USABILITY Heuristic 6. Recognition v. recall ▸ Keep checklists minimal and heavily automated. ▸ long flowcharts in a runbook are :( ▸ relatedly: scripting user communications is helpful.
  • 34. USABILITY Heuristic 1. Visibility of system status ▸ which of these are changes to production? ▸ config changes ▸ deploys ▸ utility script runs ▸ failovers ▸ adding/decreasing capacity
  • 35. THE TALE OF THE AMBIGUOUS ERROR MESSAGE STORY 7
  • 36. USABILITY Heuristic 9. Help users recognize, diagnose, and recover from errors ▸ error messages are a crucial part of your interface ▸ Writing a good alert message: ▸ expressed in plain language, precisely indicate the problem, and constructively suggest a solution (runbooks!) ▸ (ex.) CRITICAL: Served 5% 5xx results in the last 5 minutes! <link to runbook>
  • 37. USABILITY Usability, in review ▸ Operational experience matters! Consider: ▸ whether the system follows general conventions. ▸ how it alerts operators to errors clearly and unambiguously. ▸ how minimal and usable the tooling is.
  • 38. Review ▸ Robustness ▸ Does your system converge to a stable state? ▸ Observability ▸ Can you infer what the internal state of the system looks like? ▸ Usability ▸ Do your operators have control over the state of the system? Do you adhere to general standards? REVIEW
  • 39. THE TALE OF THE SAD QUEUE STORY THE LAST :(
  • 40. A DARK AND STORMY NIGHT STORY THE LAST
  • 41. Resources ▸ Harvest, Yield, and Scalable Tolerant Systems (Brewer & Fox) ▸ How Complex Systems Fail (Cook) ▸ "Going solid": a model of system dynamics and consequences for patient safety (Cook) ▸ Nielsen’s Usability Heuristics ▸ Choose Boring Technology (Dan McKinley) ▸ Site Reliability Engineering: How Google Runs Production Systems ▸ Stripe’s (upcoming) rate limiting blog post ▸ Collection of postmortems (Dan Luu) REVIEW
  • 42. REVIEW On Designing and Deploying Internet-Scale Services, James Hamilton ▸ list of best practices, from design, to upgrades, to incident response
  • 43. T H A N K S ! Thanks to Ines Sombra, Charity Majors, Alyssa Frazee, Rachel Sanders, and Andy Bonventre for review!
  • 45. OBSERVABILITY decouple deploys from releases ▸ get a minimal version in dark-reads into production asap ▸ corollary: have good kill switches! ▸ Know what rollbacks look like
  • 46. OBSERVABILITY collect operational metrics in this shadow phase ▸ Gain historical knowledge of what the system’s healthy state looks like. ▸ Tweak your alerts and SLAs. ▸ Gameday the system! Write runbooks!
  • 47. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ operational-antipattern