Anatomy of Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
App Engine Outage - Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identified 8-10 themes
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
•  Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
Stitch Fix – Oct / Nov 2016
• (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
• (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
• (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/21/2016) All systems unavailable for ~3 ½ hours [DDoS attack]
• (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
• (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
• (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
• (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
• (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
• (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
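As a rough illustration of how a list like this can be turned into a problem statement, the sketch below tallies the incidents by cause. The durations are the approximate figures from the list converted to minutes; the tuple layout and the function name are assumptions made for this example, not something shown in the talk.

```python
from collections import defaultdict

# (date, affected system, approximate outage minutes, cause),
# transcribed from the incident list above.
INCIDENTS = [
    ("2016-11-08", "Spectre", 3, "Shared Database"),
    ("2016-11-05", "Spectre", 5, "Shared Database"),
    ("2016-10-25", "All systems", 5, "Shared Database"),
    ("2016-10-24", "All systems", 5, "Shared Database"),
    ("2016-10-21", "All systems", 210, "DDoS attack"),
    ("2016-10-18", "All systems", 3, "Shared Database"),
    ("2016-10-17", "All systems", 20, "Shared Database"),
    ("2016-10-13", "Minx escalation", 120, "Zendesk outage"),
    ("2016-10-11", "Label printing", 10, "FedEx outage"),
    ("2016-10-10", "Label printing", 15, "FedEx outage"),
    ("2016-10-10", "All systems", 10, "Shared Database"),
]


def summarize_by_cause(incidents):
    """Return {cause: (incident_count, total_minutes)}."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for _date, _system, duration, cause in incidents:
        counts[cause] += 1
        minutes[cause] += duration
    return {cause: (counts[cause], minutes[cause]) for cause in counts}


if __name__ == "__main__":
    summary = summarize_by_cause(INCIDENTS)
    # Most frequent cause first.
    for cause, (count, total) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
        print(f"{cause}: {count} incidents, ~{total} minutes")
```

Counted this way, the shared database is behind 7 of the 11 incidents (about 51 minutes in total), even though the single DDoS attack caused the longest outage.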
Database Stability Problems
• 1. Applications contended on common tables
• 2. Scalability limited by database connections
• 3. One application could take down entire company
Stability Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
•  Results
Stability Solutions
• 1. Focus on expensive queries
o Log
o Eliminate
o Rewrite
o Reduce
• 2. Manage database connections via connection concentrator
• 3. Stability and Scalability Program
o Ongoing 25% investment in services migration
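The "Log" step for expensive queries could look something like the sketch below: a thin wrapper that times each query and logs the ones over a threshold. This is a minimal sketch using Python's built-in sqlite3 module so it runs anywhere; the threshold value, the logger name, and the timed_query helper are assumptions for illustration, and the talk does not say how the logging was actually done.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("slow_queries")

# Hypothetical threshold; tune it to the workload.
SLOW_QUERY_SECONDS = 0.5


def timed_query(conn, sql, params=(), threshold=SLOW_QUERY_SECONDS):
    """Run a query and log it if it takes at least `threshold` seconds.

    Returns (rows, elapsed_seconds).
    """
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed >= threshold:
        logger.warning("slow query (%.3fs): %s", elapsed, sql)
    return rows, elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)", [("open",), ("shipped",)]
    )
    # threshold=0.0 forces the log line so the output is visible in a demo.
    rows, elapsed = timed_query(
        conn, "SELECT status, COUNT(*) FROM orders GROUP BY status", threshold=0.0
    )
    print(rows, f"{elapsed:.6f}s")
```

Once the slow queries are visible in a log, the remaining steps (eliminate, rewrite, reduce) can be prioritized by how often and how slowly each query runs.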
Login Issues - 2019
• Problem: Some members unable to log in
• Inconsistent representations across different services in the system
• Over time, simple system interactions grew increasingly complex and convoluted
• Not enough graceful degradation or automated repair
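To make "inconsistent representations across different services" concrete, the sketch below compares two services' views of the same users and reports where they disagree, along with a simple inconsistency rate. The service names, fields, and sample records are all hypothetical; the talk does not describe the actual data model.

```python
# Hypothetical snapshots of the same users as seen by two services.
# Service names and fields are illustrative, not taken from the talk.
identity_service = {
    "u1": {"email": "a@example.com", "status": "active"},
    "u2": {"email": "b@example.com", "status": "active"},
    "u3": {"email": "c@example.com", "status": "suspended"},
}
billing_service = {
    "u1": {"email": "a@example.com", "status": "active"},
    "u2": {"email": "b@example.com", "status": "cancelled"},
    "u4": {"email": "d@example.com", "status": "active"},
}


def find_inconsistencies(a, b, fields=("email", "status")):
    """Return {user_id: [reasons]} for users whose records disagree."""
    problems = {}
    for user_id in sorted(set(a) | set(b)):
        if user_id not in a or user_id not in b:
            problems[user_id] = ["missing in one service"]
            continue
        mismatched = [f for f in fields if a[user_id].get(f) != b[user_id].get(f)]
        if mismatched:
            problems[user_id] = [f"{f} differs" for f in mismatched]
    return problems


def inconsistency_rate(a, b):
    """Fraction of all known users with at least one inconsistency."""
    total = len(set(a) | set(b))
    return len(find_inconsistencies(a, b)) / total if total else 0.0


if __name__ == "__main__":
    for user_id, reasons in find_inconsistencies(identity_service, billing_service).items():
        print(user_id, reasons)
    print(f"inconsistency rate: {inconsistency_rate(identity_service, billing_service):.0%}")
```

A rate like this, tracked over time, is one way to tell whether cleanup and repair work is actually reducing the problem.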
Login Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
Login Solutions
• 1. Clean up user data
o Find inconsistencies
o Track inconsistency metrics
o Identify and fix contributing processes and applications
• 2. User state machines
o Define user journeys as explicit state machines
o Refine and correct via cross-functional feedback
o Implement state machines in code
• 3. “Pandora” Program
o Rewrite core identity system into set of user capabilities
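An explicit user state machine, as described in item 2, might be sketched as follows. The states and allowed transitions here are invented for illustration, since the talk does not list the real ones; the point is only that every legal transition is written down in one place and anything else is rejected.

```python
from enum import Enum


class UserState(Enum):
    # Hypothetical states; the talk does not list the actual ones.
    REGISTERED = "registered"
    VERIFIED = "verified"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    CLOSED = "closed"


# Allowed transitions, written down explicitly so they can be reviewed
# and corrected with cross-functional feedback.
TRANSITIONS = {
    UserState.REGISTERED: {UserState.VERIFIED, UserState.CLOSED},
    UserState.VERIFIED: {UserState.ACTIVE, UserState.CLOSED},
    UserState.ACTIVE: {UserState.SUSPENDED, UserState.CLOSED},
    UserState.SUSPENDED: {UserState.ACTIVE, UserState.CLOSED},
    UserState.CLOSED: set(),
}


class InvalidTransition(Exception):
    """Raised when a transition is not in the TRANSITIONS table."""


class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = UserState.REGISTERED

    def transition(self, new_state):
        """Move to `new_state`, or raise if the table does not allow it."""
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state


if __name__ == "__main__":
    user = User("u1")
    user.transition(UserState.VERIFIED)
    user.transition(UserState.ACTIVE)
    print(user.state)
```

Keeping the table as plain data means non-engineers can read and challenge it, which fits the "refine and correct via cross-functional feedback" step.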
Common Elements
• Unintentional, long-term accumulation of small, individually reasonable decisions
• “Compelling event” catalyzes long-term change
• Blameless culture makes learning and improvement possible
• Structured post-incident approach
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Vicious Cycle of Technical Debt
Technical Debt → “No time to do it right” → Quick-and-dirty → Technical Debt
“Do you have time to do it twice?”
“We don’t have time to do it right!”
The more constrained you are on time or resources, the more important it is to get it done the first time.
Negotiating Tradeoffs
Scope
Time
Quality
Virtuous Cycle of Investment
Quality Investment → Solid Foundation → Confidence → Faster and Better → Quality Investment
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
After the Incident
• Blameless postmortem
• Identify and understand the contributing factors
• Action items and Learnings
• Follow Up!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Psychological Safety
• Team is safe for interpersonal risk-taking
• “Being able to show and employ one’s self without fear of negative consequences”
• More important than any other factor in team success
“Finally we can prioritize fixing that broken system!”
Inclusive Decisionmaking
• Make better business decisions 87% of the time
• Make decisions 2x faster with 1/2 the meetings
• Deliver 60% better business results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Frame the Problem: Quality and reliability are business concerns
Use Common Currency
Time, Money, People
15 Million
“Never let a good crisis go to waste.”
“Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.”
-- John Allspaw, Adaptive Capacity Labs
Improvement Budget
• Explicit resource investment
o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts)
• Retain autonomy, Provide transparency
o Making these decisions is exactly why they hired you
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Incident Response Patterns
• Incident Roles
• Incident Triggers
• On-Call Rotation and Onboarding
• Incident Command Training
• Incident Communication Plan
• Periodic Incident Updates
• Shared Incident State Doc
• Incident Call Recording
• Incident Swarming
• Local / Global Incident Reviews
• Post-Review Improvement Items
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Thank you!
@randyshoup
linkedin.com/in/randyshoup
medium.com/@randyshoup

 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 

Anatomy of Three Incidents -- Commonalities and Lessons

  • 1. Anatomy of Three Incidents Randy Shoup @randyshoup linkedin.com/in/randyshoup
  • 4. App Engine Outage - Oct 2012
  • 6. App Engine Reliability Fixit
      • Step 1: Identify the Problem
          o All team leads and senior engineers met in a room with a whiteboard
          o Enumerated all known and suspected reliability issues
          o Too much technical debt had accumulated
          o Reliability issues had not been prioritized
          o Identified 8-10 themes
  • 7. App Engine Reliability Fixit
      • Step 2: Understand the Problem
          o Each theme assigned to a senior engineer to investigate
          o Timeboxed for 1 week
          o After 1 week, all leads came back with
              • Detailed list of issues
              • Recommended steps to address them
              • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
  • 8. App Engine Reliability Fixit
      • Step 3: Consensus and Prioritization
          o Leads discussed themes and prioritized work
          o Assigned engineers to tasks
  • 9. App Engine Reliability Fixit
      • Step 4: Implementation and Follow-up
          o Engineers worked on assigned tasks
          o Simple spreadsheet of task status, which engineers updated weekly
          o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
  • 10. App Engine Reliability Fixit
      • Results
          o 10x reduction in reliability issues
          o Improved team cohesion and camaraderie
          o Broader participation and ownership of the future health of the platform
          o Still remembered several years later
  • 12. Stitch Fix – Oct / Nov 2016
      • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
      • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
      • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
      • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
      • (10/21/2016) All systems unavailable for ~3 ½ hours [DDoS attack]
      • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
      • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
      • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
      • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
      • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
      • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
  • 13. Database Stability Problems
      • 1. Applications contended on common tables
      • 2. Scalability limited by database connections
      • 3. One application could take down entire company
  • 14. Stability Retrospective
      • Step 1: Identify the Problem
      • Step 2: Understand the Problem
      • Step 3: Consensus and Prioritization
      • Step 4: Implementation and Follow-Up
      • Results
  • 15. Stability Solutions
      • 1. Focus on expensive queries
          o Log
          o Eliminate
          o Rewrite
          o Reduce
      • 2. Manage database connections via connection concentrator
      • 3. Stability and Scalability Program
          o Ongoing 25% investment in services migration
  • 17. Login Issues - 2019
      • Problem: Some members unable to log in
      • Inconsistent representations across different services in the system
      • Over time, simple system interactions grew increasingly complex and convoluted
      • Not enough graceful degradation or automated repair
  • 18. Login Retrospective
      • Step 1: Identify the Problem
      • Step 2: Understand the Problem
      • Step 3: Consensus and Prioritization
      • Step 4: Implementation and Follow-Up
  • 19. Login Solutions
      • 1. Clean up user data
          o Find inconsistencies
          o Track inconsistency metrics
          o Identify and fix contributing processes and applications
      • 2. User state machines
          o Define user journeys as explicit state machines
          o Refine and correct via cross-functional feedback
          o Implement state machines in code
      • 3. “Pandora” Program
          o Rewrite core identity system into set of user capabilities
  • 20. Common Elements
      • Unintentional, long-term accumulation of small, individually reasonable decisions
      • “Compelling event” catalyzes long-term change
      • Blameless culture makes learning and improvement possible
      • Structured post-incident approach
  • 23. Vicious Cycle of Technical Debt (cycle diagram; labels: Technical Debt, “No time to do it right”, Quick-and-dirty)
  • 24. “Do you have time to do it twice?” “We don’t have time to do it right!”
  • 25. The more constrained you are on time or resources, the more important it is to get it done the first time.
  • 27. Virtuous Cycle of Investment (cycle diagram; labels: Solid Foundation, Confidence, Faster and Better, Quality Investment)
  • 29. During the Incident
      • Focus on restoring service
          o Everything else is secondary, and should wait
      • Shield the team
      • Clear, structured communication
          o Even when there is nothing to report!
      https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 30. After the Incident
      • Blameless postmortem
      • Identify and understand the contributing factors
      • Action items and Learnings
      • Follow Up!
      https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 31. Psychological Safety
      • Team is safe for interpersonal risk-taking
      • “Being able to show and employ one’s self without fear of negative consequences”
      • More important than any other factor in team success
  • 32. “Finally we can prioritize fixing that broken system!”
  • 33. Inclusive Decisionmaking
      • Make better business decisions 87% of the time
      • Make decisions 2x faster with 1/2 the meetings
      • Deliver 60% better business results
      Source: Cloverpop Inclusive Decisionmaking study, 2016
      As we improve diversity, decisionmaking improves
  • 35. Frame the Problem: Quality and reliability are business concerns
  • 36. Use Common Currency: Time, Money, People
  • 37. 15 Million “Never let a good crisis go to waste.”
  • 38. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.” -- John Allspaw, Adaptive Capacity Labs
  • 39. Improvement Budget
      • Explicit resource investment
          o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts)
      • Retain autonomy, Provide transparency
          o Making these decisions is exactly why they hired you
  • 41. Incident Response Patterns
      • Incident Roles
      • Incident Triggers
      • On-Call Rotation and Onboarding
      • Incident Command Training
      • Incident Communication Plan
      • Periodic Incident Updates
      • Shared Incident State Doc
      • Incident Call Recording
      • Incident Swarming
      • Local / Global Incident Reviews
      • Post-Review Improvement Items
      https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
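
Slide 19 recommends defining user journeys as explicit state machines and then implementing those state machines in code. The following is a minimal sketch of what that could look like, not the system described in the talk: the state names, event names, and the helpers `next_state` and `can_log_in` are all hypothetical, since the slides do not list the real ones.

```python
from enum import Enum


class UserState(Enum):
    # Hypothetical states; the slides do not enumerate the real ones.
    REGISTERED = "registered"
    ACTIVE = "active"
    LOCKED = "locked"
    CLOSED = "closed"


# Every allowed transition is listed here, keyed by (current state, event).
# Anything not in this table is rejected instead of silently accepted.
TRANSITIONS = {
    (UserState.REGISTERED, "verify_email"): UserState.ACTIVE,
    (UserState.ACTIVE, "too_many_failed_logins"): UserState.LOCKED,
    (UserState.LOCKED, "reset_password"): UserState.ACTIVE,
    (UserState.ACTIVE, "close_account"): UserState.CLOSED,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not allowed from {state.name}"
        ) from None


def can_log_in(state):
    """In this sketch, only ACTIVE users may log in."""
    return state is UserState.ACTIVE
```

Keeping the transitions in one table gives cross-functional reviewers a single place to read and correct the journey, and an unexpected event surfaces as an `InvalidTransition` error rather than as inconsistent user data.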