How we improved our monitoring so that
everyone likes to be on-call
Page 1 / 31
What you can expect
Why on-call can be disrespectful
A real-world on-call transformation example
That on-call can mean a lot of different things
Observability
Engineers who like to be on-call
Page 2 / 31
About me
Daniel Uhlmann
T-Systems Multimedia Solutions
GmbH
passion for Linux and
OpenSource
Twitter: @xfuturecs
Blog: xfuture-blog.com
Page 3 / 31
What we are working on
we maintain several customer services and applications
our monitoring is highly distributed across various services and environments
which means we have to context-switch often and adapt quickly
Page 4 / 31
"Why should I take the on-call duty. I thought someone
else will do this for us."
"If you haven't debugged the live database system at
3:00 in the morning, you're not a real developer."
"I didn't sign up for this."
"I sacrificed so much sleep and lost my mental health
being on-call. But this is okay because it is for my/our
product"
Page 5 / 31
This is not acceptable - so what can we
learn from this?
there are a lot of toxic patterns around being on-call
being on-call can be disrespectful
no sleep
impact on personal lives
flapping alerts will drive you crazy
maybe no training
if you don't take care, every check will alert you
Page 6 / 31
Where we came from
...well we had nearly the same problems:
a lot of false-positive checks
lack of detailed monitoring
sleepless nights and scared junior engineers with a resting pulse rate of 180
beats per minute: been there, done that
Page 7 / 31
But we managed to change it
Page 8 / 31
Page 9 / 31
Keep in mind that
The ultimate goal is not to never get notified again!
Page 10 / 31
Every check alarmed us
we set up a team meeting to figure out which checks are truly
business critical
implemented 2 "hotlines" to separate 24/7 and business-hour calls
resulted in fewer calls at night
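
The split into two hotlines can be sketched as a small routing rule. This is only an illustration, not the team's actual setup: the business-hours window, the destination names, and the idea of queueing non-critical alerts until the next business day are all assumptions.

```python
from datetime import datetime, time

# Hypothetical business hours; the slides do not state the actual window.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alert(business_critical: bool, now: datetime) -> str:
    """Pick a destination for an alert.

    Returns one of three illustrative names: "hotline_24x7",
    "hotline_business_hours", or "queue" (held until business hours).
    """
    if business_critical:
        return "hotline_24x7"            # only these may wake someone up
    if is_business_hours(now):
        return "hotline_business_hours"  # handled during the working day
    return "queue"                       # nobody is paged at night for this
```

With a rule like this, the team decision about which checks are business critical becomes the single input that determines whether anyone can be woken up.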
Page 11 / 31
Our learnings
delete every check that gives you no meaningful information
not all checks are really business critical: set the bar high for waking
people up at 2 AM
Page 12 / 31
Lack of detailed monitoring
check more than just the end-to-end connection of your application
figuring out the business-critical components for your customers is a good
first step
Page 13 / 31
Our Learnings
think from the customer's perspective first
even better: talk with your customers about what is crucial for their business
Page 14 / 31
Missing experience with real outages
most uncertainty arises from a lack of preparation
use the expertise of experienced colleagues
new colleagues get an experienced backup colleague for their first on-call
duties
simulate a real outage à la chaos engineering
Page 15 / 31
Our Learnings
remember to breathe
check whether the alert has linked documentation
the biggest obstacle is fear
Page 16 / 31
Chaos Engineering
experiment on a distributed system to build confidence
discover new issues that could impact your services by injecting failures
and errors
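
Failure injection can be illustrated with a tiny wrapper that makes a call fail some of the time. This is a minimal sketch and not a chaos engineering tool: `InjectedFault`, `with_faults`, and the `failure_rate` parameter are hypothetical names invented for this example.

```python
import random


class InjectedFault(Exception):
    """Raised instead of calling the wrapped function (illustrative name)."""


def with_faults(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of all calls raise InjectedFault.

    failure_rate is a probability between 0.0 (never fail) and 1.0 (always
    fail). Pass a seeded random.Random as rng for reproducible experiments.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a value in [0.0, 1.0), so 1.0 always injects
        # a fault and 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency call this way in a controlled experiment shows whether retries, fallbacks, and alerts behave the way the team expects.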
Page 17 / 31
What is the difference between chaos
engineering and failure testing?
Page 18 / 31
Test in production
don't overinvest in staging systems and underinvest in your production
system
most bugs will only ever be found with enough user interactions
Page 19 / 31
Fix bugs at 2pm and not 2am!
failure testing and chaos engineering can help you to fix some of them
if you can't track down what's happening within a few minutes, you need
better observability
Page 20 / 31
Measure your paging alerts
collect statistics on incoming calls, especially out-of-hours ones
track, graph and talk about your paging alerts
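
Collecting these statistics can start very small. The sketch below counts pages per ISO week and splits them into in-hours and out-of-hours buckets, which is enough to graph a trend. The 08:00 to 17:00 weekday window and the function names are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime


def is_out_of_hours(ts, start_hour=8, end_hour=17):
    """True on weekends or outside the assumed start_hour..end_hour window."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def weekly_page_counts(pages):
    """Count pages per (ISO year, ISO week, bucket).

    pages is an iterable of datetime objects, one per page received.
    bucket is either "in_hours" or "out_of_hours".
    """
    counts = Counter()
    for ts in pages:
        year, week, _ = ts.isocalendar()
        bucket = "out_of_hours" if is_out_of_hours(ts) else "in_hours"
        counts[(year, week, bucket)] += 1
    return counts
```

The resulting counts can be plotted week over week and brought to team meetings, so the conversation about paging load is based on numbers.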
Page 21 / 31
Qualitative Tracking
success is not about "not having incidents"
it's about how confident people feel while being on-call
Page 22 / 31
Ask your engineers
qualitative feedback plays an important role in success
it ensures that you are on the right track
Page 23 / 31
Page 24 / 31
Page 25 / 31
Predictive alarming
for example: checks that alert you when a disk is slowly filling up
alerting only when users feel real pain reduced our alert frequency even further
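
One way to sketch the disk example is a straight-line fit over recent usage samples, extrapolated to the point where the disk is full. This is an assumption about how such a check could work, not the check used in the talk; the function names and the 24-hour horizon are made up for illustration.

```python
def hours_until_full(samples, capacity=1.0):
    """Estimate hours from the last sample until usage reaches capacity.

    samples is a list of (hours, used_fraction) pairs in time order.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope: growth in used fraction per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(t_full - samples[-1][0], 0.0)


def should_alert(samples, horizon_hours=24.0):
    """Alert only if the disk is predicted to fill within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours
```

Compared with a fixed threshold such as "90% full", a prediction like this can warn during business hours about a disk that would otherwise fill up at night.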
Page 26 / 31
Assign a role to your monitoring...
to keep your monitoring clean
to create tickets for occurring events
to fix quick wins
to update your colleagues about the current state
Page 27 / 31
What happens at on-call rotation
define a process for the transfer
clean up your monitoring
Page 28 / 31
Align engineering pain with user pain
migrate to SLO based monitoring
adopt alerting best practices
profit from tracking down your pain and paying it down
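
SLO-based alerting is often expressed as an error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a request-based SLO; the default target of 99.9% and the paging threshold of 14.4 are illustrative values, not figures from the talk.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means errors arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return (errors / total) / budget


def should_page(errors, total, slo_target=0.999, threshold=14.4):
    """Page only when the budget burns much faster than allowed."""
    return burn_rate(errors, total, slo_target) >= threshold
```

Because the alert fires on user-visible errors rather than on individual component checks, the engineer is paged only when users are actually feeling pain.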
Page 29 / 31
Remember our initial situation?
Page 30 / 31
Thank you for listening!
Page 31 / 31

OSMC 2022 | How we improved our monitoring so that everyone likes to be on-call by Daniel Uhlmann