Nothing Good Ever Happens After 2am
Reversim 2019
Daniel Korn
Engineering Team Lead at BigPanda 
korndaniel1
BigPanda’s Outage Procedure
Roles and responsibilities
• On-call
• Incident Manager On-Call (IMOC)
• Tech Lead On-Call (TLOC)
• Support On-Call (SOC)
Incident Priority Definitions
Priority | Affects | Outage Resolution
P1 | Core feature, multiple customers | 24/7
P2 | Core feature, single customer | 24/7
P3 | Secondary feature, no workaround | Next business day
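As a rough illustration, the priority definitions above could be encoded as a small function. This is only a sketch: the names `Priority`, `classify`, `resolution_window` and their parameters are hypothetical, and returning `None` for cases the slide does not list (for example, a secondary feature that has a workaround) is an assumption.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    P1 = "P1"  # core feature, multiple customers
    P2 = "P2"  # core feature, single customer
    P3 = "P3"  # secondary feature, no workaround


def classify(core_feature: bool, customers_affected: int,
             has_workaround: bool) -> Optional[Priority]:
    """Map observed impact onto the slide's priorities.

    Returns None when the combination is not covered by the slide
    (an assumption made for this sketch).
    """
    if core_feature:
        return Priority.P1 if customers_affected > 1 else Priority.P2
    if not has_workaround:
        return Priority.P3
    return None


def resolution_window(priority: Priority) -> str:
    """Outage resolution column: P1 and P2 are handled around the clock."""
    return "next business day" if priority is Priority.P3 else "24/7"
```

For example, `classify(True, 5, False)` yields `Priority.P1`, which `resolution_window` maps to round-the-clock handling.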
Tools
• Alerting
• Communication
• Observability
Alert/Support notifies On-call
IMOC assesses impact, determines P1/P2/P3
On-call performs simple mitigation
On-call escalates to IMOC
IMOC escalates to TLOC and SOC
On-call: if (P1) { StatusPage; dedicated channel; }
SOC updates customers
R&D mitigates until solved, updates StatusPage
IMOC verifies resolved, posts summary in channel
IMOC runs postmortem, shares with stakeholders
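The procedure can be sketched as data, with the `If (P1)` step as the only branch. This is a hedged illustration, not BigPanda's tooling: `Step` and `runbook` are hypothetical names, and the order of the first five steps is an assumption (mitigation and escalation are placed before the IMOC's impact assessment), since the slide's step numbers did not survive as an ordered list.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    owner: str   # role responsible for the step
    action: str  # what the slide says that role does


def runbook(priority: str) -> List[Step]:
    """Return the outage procedure for a given priority ("P1", "P2" or "P3")."""
    steps = [
        Step("Alert/Support", "notify On-call"),
        Step("On-call", "perform simple mitigation"),
        Step("On-call", "escalate to IMOC"),
        Step("IMOC", "assess impact, determine P1/P2/P3"),
        Step("IMOC", "escalate to TLOC and SOC"),
    ]
    # The slide's `If (P1) { StatusPage; dedicated channel; }` step.
    if priority == "P1":
        steps.append(Step("On-call", "open StatusPage"))
        steps.append(Step("On-call", "open dedicated channel"))
    steps += [
        Step("SOC", "update customers"),
        Step("R&D", "mitigate till solved, update StatusPage"),
        Step("IMOC", "verify resolved, post summary in channel"),
        Step("IMOC", "run postmortem, share with stakeholders"),
    ]
    return steps
```

Printing `runbook("P1")` line by line gives a checklist that an on-call engineer could follow without deciding mid-incident whether a status page is warranted.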
The Long Night
THIS IS A TRUE STORY.
The events depicted in this postmortem took place in Tel Aviv and San Francisco in 2018.
Despite the request of the survivors, the names have not been changed.
Out of respect for our customers, the story has been told exactly as it occurred.
Michal: On-call
Almog & Pini: TLOCs
Daniel (Me): TLOC
Shmeff Andru: SOC Support
Julio: Support
Background
• REMINDER: BigPanda’s SLA
• New Access Control (RBAC) service
• Not all customers migrated
• Sunday: Multi-service deployment
[MON 05:03 PM] SOC: Multiple tickets: “cannot update environments”
[05:05 PM] On-call: Asks SOC for details, opens a dedicated Slack channel
[05:08 PM] On-call: Identifies the issue as Auth-related, notifies TLOCs
[05:33 PM] SOC: Considers opening a status page, but “might be a P3”
[05:35 PM] On-call: “we think it’s related to a deploy, working on a fix”
[06:16 PM] SOC: Opens status page
TAKEAWAY: Stick to the Plan
[06:50-07:30 PM] TLOCs: Fix is tested, issue not reproduced; debate whether to fix or revert
[07:41 PM] TLOCs: Deploy fix to production
[07:45-08:05 PM] SOC: Verifies together with TLOCs that the issue is resolved
[08:10 PM] SOC: Closes status page. On-call and TLOCs leave
TAKEAWAY: Rule of Thumb: REVERT FIRST
[12:45 AM] Support: “Some customers can’t see roles in the env editor”
[12:57 AM] SOC: “So it appears to be just a UI issue”. Notifies On-call
[12:59 AM] On-call: Notifies TLOC
[01:01 AM] TLOC: Starts investigating the issue
“If it looks like an outage, and (support) sounds like an outage, then it might be just a bug”
– Someone smart
TAKEAWAY: Do not Assume an Outage
[01:20 AM] TLOCs: Identify the cause, start working on a fix
[01:54 AM] TLOCs: Deploy fix to production, ask SOC to verify with customers
“If you think this has a happy ending, you haven’t been paying attention.”
– Ramsay Bolton
[01:57 AM] Support: Customers report the initial issue: “cannot update environments”
[02:00 AM] SOC + Support: Debate re-opening the StatusPage
[02:03 AM] TLOCs: Start investigating the issue
[02:10 AM] TLOCs: Identify the cause: lack of permissions (migration)
[02:15-02:51 AM] TLOC: Manually adds missing permissions to customers DB
TAKEAWAY: Time to Call it a Night
[02:52 AM] TLOC: Having problems with a specific customer
[02:56 AM] SOC: Verifies this customer is facing the issue
[02:56-03:25 AM] TLOCs: Identify the problem: edge case involving FT and manual customizations
[03:25 AM] SOC: Asks TLOC to discuss the situation on a phone call
[03:29- AM] SOC + TLOC: Sensitive customer, no changes, issue remains
[-04:07 AM] SOC + TLOC: SOC asks TLOC to commit to a fix by EOD
[09:30 AM - 05:12 PM] TLOCs: Implement a fix, deploy to production, ask SOC to verify
[05:25 PM] SOC: Verifies issue resolved
TAKEAWAY: Do not Commit to Action Items
[07:00 PM] CS + R&D + PM: Joint postmortem, preparing customer updates
[WED 11:00 AM] R&D: Conduct a postmortem, share with R&D and CS
“Chaos isn’t a pit. Chaos is a ladder.”
– Petyr “Littlefinger” Baelish
Recap
• Stick to the plan
• Rule of thumb: REVERT FIRST
• Do not assume an outage
• Time to call it a night
• Do not commit to action items