Ever wonder why your engineers don't necessarily like being on call? There can be many reasons for this, and one of them can be a poorly configured monitoring system. In this talk I would like to share the different stages we went through as a team to get from inadequate monitoring to a solution that provides real value, not only for the customer but also for us as a team.
OSMC 2022 | How we improved our monitoring so that everyone likes to be on-call by Daniel Uhlmann
1. How we improved our monitoring so that
everyone likes to be on-call
Page 1 / 31
2. What you can expect
Why on-call can feel disrespectful
A real world on-call transformation example
That on-call can mean a lot of different things
Observability
Engineers who like to be on-call
Page 2 / 31
3. About me
Daniel Uhlmann
T-Systems Multimedia Solutions
GmbH
passion for Linux and open source
Twitter: @xfuturecs
Blog: xfuture-blog.com
Page 3 / 31
4. What we are working on
we maintain several customer services and applications
our monitoring is very distributed with various services and environments
meaning that we often need to context switch and adapt quickly
Page 4 / 31
5. "Why should I take on-call duty? I thought someone
else would do this for us."
"If you haven't debugged the live database system at
3:00 in the morning, you're not a real developer."
"I didn't sign up for this."
"I sacrificed so much sleep and lost my mental health
being on-call. But this is okay because it is for my/our
product."
Page 5 / 31
6. This is not acceptable - so what can we
learn from this?
there are a lot of toxic patterns around being on-call
being on-call can feel disrespectful
no sleep
impacting personal lives
flapping alerts will drive you crazy
maybe no training
if you don't take care, every check will alert you
Page 6 / 31
7. Where we came from
...well we had nearly the same problems:
a lot of false-positive checks
lack of detailed monitoring
wakeful nights and scared junior engineers with a resting pulse rate of 180
beats per minute. Been there, done that
Page 7 / 31
10. Keep in mind that
The ultimate goal is not to never get notified again!
Page 10 / 31
11. Every check alarmed us
we set up an appointment as a team to figure out which checks are truly
business critical
implemented two "hotlines" to separate 24/7 and business-hour calls
resulted in fewer calls during night time
Page 11 / 31
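The two-hotline split described above can be sketched as a simple routing rule. This is an illustrative sketch, not the team's actual implementation; the function and hotline names are assumptions.

```python
from datetime import time

# Assumed business-hours window; adjust to your team's schedule.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

def route_alert(is_business_critical: bool, now: time) -> str:
    """Decide which 'hotline' an alert reaches (illustrative names)."""
    if is_business_critical:
        return "24/7"              # truly critical: always pages, day or night
    if BUSINESS_START <= now < BUSINESS_END:
        return "business-hours"    # handled during the workday
    return "queued"                # waits until the next morning
```

With a rule like this, a non-critical warning at 03:00 lands in the queue instead of waking someone up, which is where the reduction in night-time calls comes from.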
12. Our learnings
delete every check that provides no meaningful information for you
not all checks are really business critical; set the bar high for waking
people up at 2 AM
Page 12 / 31
13. Lack of detailed monitoring
check more than just the end-to-end connection of your application
figuring out the business-critical components for your customers is a good
first step
Page 13 / 31
14. Our Learnings
think from a customer's perspective first
even better: talk with your customers about what is crucial for their business
Page 14 / 31
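Checking individual components rather than one end-to-end connection can be sketched as a per-component health report. The component names and probe functions here are purely illustrative assumptions.

```python
# One green/red light hides which component broke; a per-component
# report names the failing piece directly.
def health_report(checks):
    """checks: mapping of component name -> zero-arg probe returning bool."""
    results = {name: probe() for name, probe in checks.items()}
    return {"healthy": all(results.values()), "components": results}

# Hypothetical probes standing in for real database/cache/queue checks:
report = health_report({
    "database": lambda: True,
    "cache": lambda: False,   # a failing cache is visible by name,
    "queue": lambda: True,    # not hidden behind one end-to-end check
})
```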
15. Missing experience on a real outage
most uncertainties arise from a lack of preparation
utilize the expertise of experienced colleagues
new colleagues get an experienced backup colleague for their first on-call
duties
simulate a real outage à la chaos engineering
Page 15 / 31
16. Our Learnings
remember to breathe
check whether the alert has linked documentation
the biggest obstacle is fear
Page 16 / 31
17. Chaos Engineering
experiment on a distributed system to build confidence
discover new issues that could impact your services by injecting failures
and errors
Page 17 / 31
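Injecting failures to build confidence can be illustrated with a minimal wrapper that makes a dependency call fail for a fraction of requests, letting you verify the caller degrades gracefully. This is a toy sketch, not a real chaos tool; all names are assumptions.

```python
import random

def flaky(func, failure_rate, rng=random.random):
    """Wrap `func` so that `failure_rate` of its calls raise an error."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper

def fetch_price(item):  # stand-in for a real downstream call
    return {"item": item, "price": 42}

def fetch_price_with_fallback(item, fetch):
    """The behaviour under test: degrade instead of crashing."""
    try:
        return fetch(item)
    except ConnectionError:
        return {"item": item, "price": None}  # degraded but alive
```

Running the caller against `flaky(fetch_price, 1.0)` (every call fails) shows whether the fallback actually holds before a real outage does.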
18. What is the difference between chaos
engineering and failure testing?
Page 18 / 31
19. Test in production
don't over-invest in staging systems while under-investing in your production
system
most bugs will only ever be found with enough user interactions
Page 19 / 31
20. Fix bugs at 2pm and not 2am!
failure testing and chaos engineering can help you fix some of them
if you can't track down what's happening within a few minutes, you need
better observability
Page 20 / 31
21. Measure your paging alerts
collect statistics for incoming calls, especially out-of-hours
track, graph and talk about your paging alerts
Page 21 / 31
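Tracking and graphing paging alerts can start as simply as counting pages per check and per time of day. This sketch assumes a list of (check name, timestamp) records; adapt it to whatever your paging provider exports.

```python
from collections import Counter
from datetime import datetime

def summarize_pages(pages, day_start=8, day_end=18):
    """pages: list of (check_name, datetime) tuples (assumed format)."""
    by_check = Counter(name for name, _ in pages)
    out_of_hours = sum(
        1 for _, ts in pages if ts.hour < day_start or ts.hour >= day_end
    )
    return {"by_check": by_check, "out_of_hours": out_of_hours}

# Hypothetical paging history for illustration:
history = [
    ("disk-full", datetime(2022, 11, 2, 3, 12)),
    ("disk-full", datetime(2022, 11, 3, 2, 40)),
    ("http-5xx", datetime(2022, 11, 3, 14, 5)),
]
stats = summarize_pages(history)
```

A summary like this makes the conversation concrete: which checks dominate the pager, and how many pages arrive out of hours.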
22. Qualitative Tracking
success is not about "not having incidents"
it's about how confident people feel while being on-call
Page 22 / 31
23. Ask your engineers
qualitative feedback plays an important role for success
ensures that you are on the right track
Page 23 / 31
26. Predictive alarming
for example: checks that alarm you before a slowly filling disk runs out of
space
alerting only when users feel real pain reduced our alert frequency even more
Page 26 / 31
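The disk example above can be sketched as a linear extrapolation: instead of paging at a fixed usage threshold, predict when the disk will fill and page only if that falls within a chosen horizon. A minimal sketch with assumed function names:

```python
def hours_until_full(samples, capacity_gb):
    """samples: [(hour, used_gb), ...] oldest first; linear extrapolation."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)      # growth in GB per hour
    if rate <= 0:
        return float("inf")           # not growing: would never page
    return (capacity_gb - u1) / rate

def should_page(samples, capacity_gb, horizon_hours=24):
    """Page only if the disk is predicted to fill within the horizon."""
    return hours_until_full(samples, capacity_gb) < horizon_hours
```

A disk at 80 of 100 GB growing 1 GB/hour fills in about 20 hours, so it pages with a 24-hour horizon but stays quiet with a 10-hour one; a full-but-stable disk never pages at all. (Prometheus users get the same idea built in via `predict_linear`.)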
27. Assign a role to your monitoring...
to keep your monitoring clean
to create tickets for occurring events
to fix quickwins
to update your colleagues about the current state
Page 27 / 31
28. What happens at on-call rotation
define a process for the handover
clean up your monitoring
Page 28 / 31
29. Align engineering pain with user pain
migrate to SLO-based monitoring
adopt alerting best practices
profit by tracking down your pain and paying it down
Page 29 / 31
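SLO-based monitoring, the first bullet above, typically means alerting on error-budget burn rate rather than on raw errors. A minimal sketch of the multiwindow burn-rate idea from the Google SRE Workbook; the threshold and window choices here are illustrative assumptions, not the speaker's configuration:

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is burning.

    An SLO of 0.999 leaves a budget of 0.001; an error ratio of 0.001
    then burns exactly at the allowed rate (burn rate 1.0).
    """
    budget = 1.0 - slo_target
    return error_ratio / budget

def should_page(short_window_errors, long_window_errors,
                slo_target=0.999, threshold=14.4):
    """Page only if BOTH windows burn fast: the long window proves the
    problem is sustained, the short one proves it is still happening."""
    return (burn_rate(short_window_errors, slo_target) >= threshold and
            burn_rate(long_window_errors, slo_target) >= threshold)
```

This is the mechanism that aligns engineering pain with user pain: a brief error spike that barely dents the budget never pages, while a sustained burn that threatens the SLO does.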