Our goal is to eliminate alerts, and a common mantra is “alert only if absolutely needed”. But that mantra seems to hold only for the people who work on monitoring; for everyone else it appears to be “whatever”. Excessive alerts are like Stephen King’s Langoliers – they eat our time. Think decommissioned servers, flapping alerts, mysterious emails.
Let’s bring on “Operation Forest” and see what – and even more importantly, how – can be done about that. The monitoring tool you choose is just a part of the solution. What good is your tool if nobody pays attention to it?
We’ll talk about using weapons like Terminology, Persistence and Involvement, as well as CommonSense and Policy, as potential problem-solvers. Case studies (aka anecdotal examples) will illustrate how alerts can become useless, be ignored and lead to serious outages.
Also find out what “Operation Forest” is.
24. oh, those are just alerts about my systems
=
oh, that’s just my trash in the forest
25. 16:30 – Failures multiply
Disk full →
logrotate creates an incomplete file →
never rotates the source file again
https://bugzilla.redhat.com/show_bug.cgi?id=1374550
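One way to catch this chain before it starts is to notice the disk filling up while there is still room to rotate. A minimal sketch, assuming a 10% free-space threshold; the function names and the threshold are illustrative and not from the talk:

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the fraction of the filesystem that is still free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def disk_nearly_full(path: str = "/", min_free: float = 0.10) -> bool:
    """True if the filesystem holding `path` has less than `min_free` free.

    The 10% default is an assumption; pick a value that leaves enough
    room for a full log rotation on the host in question.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return free_fraction(usage.total, usage.free) < min_free
```

Running such a check ahead of rotation gives a chance to act before an incomplete file is created.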
26. 19:00 – What causes alerts?
Technical challenges
Lack of time
Humans
●Attitude?
●Culture?
●Organisation?
28. 00:13 - Let’s change the tools
Let’s use a new alerting solution, new...
If emails can be ignored, so can other alerts.
It’s culture, not the tool.
29. 01:00 – What’s this thing?
I must understand what I am running -
and monitoring
I must know my monitoring solution
●So I don’t spend two days writing a script to replicate a feature
●“Haha, a week of training”
30. 01:20 – Do we know our tools?
Next version of grep rumoured to include graphs
Monitoring solution matters, but it is my decision
...I’ll try to pick one
31.
32. From: somebody@example.com
Subject: Re: High: oldappserver.example.com Server unreachable
Logrotate is configured but files are not removed.
I deleted them.
--
Somebody
Starting to see a pattern here
37. 02:40 – Rock rock till you drop
I will tirelessly work on making alerts:
●Less frequent
●More useful and meaningful
38. 02:50 – What’s a useful alert?
“System x down” is mostly clear, but others are not
I’ll make no typos in alert instructions
write stupid, think stupid
39. 03:10 – Auto-resolve that crap
I estimate this will happen 3 more times
I add automatic action
Because I know it will happen 30 times
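The idea of attaching an automatic action can be sketched as "try the fix first, involve a human only if it fails". This is an illustration under assumptions: `handle_event`, `remediate` and `notify` are hypothetical names, and a real monitoring tool will have its own mechanism for this.

```python
from typing import Callable


def handle_event(
    name: str,
    remediate: Callable[[], bool],
    notify: Callable[[str], None],
) -> str:
    """Run the automatic action; escalate to a human only on failure.

    `remediate` returns True when the problem is fixed.
    `notify` stands in for whatever pages or emails a person.
    """
    try:
        fixed = remediate()
    except Exception as exc:  # a broken fix must not hide the problem
        notify(f"{name}: automatic action raised {exc!r}")
        return "escalated"
    if fixed:
        return "auto-resolved"
    notify(f"{name}: automatic action did not help")
    return "escalated"
```

With this shape, the thirtieth occurrence costs nobody any time unless the automatic action stops working.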
40. 03:30 – Monitoring is not a tasklist
These occurrences are useful to know about
So I’ll have them logged
If it really is useful, I can look at the log *
* usually nobody does
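The split between "needs a person now" and "useful to know about" can be sketched as a small routing function. The names `route_event`, `actionable` and `page` are assumptions for illustration, not features of any particular monitoring tool.

```python
import logging
from typing import Callable

log = logging.getLogger("monitoring.events")


def route_event(
    message: str,
    actionable: bool,
    page: Callable[[str], None],
) -> str:
    """Page a human only for actionable events; log everything else."""
    if actionable:
        page(message)
        return "paged"
    # Informational only: kept in the log, never put on someone's task list.
    log.info(message)
    return "logged"
```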
41. 03:50 – What’s that noise?
Do we really have
a problem with alerts?
42. 05:30 – We so trendy
Agile = “I won’t have to maintain it” *
Devops = “production is free-for-all” **
●Bring developers in on my monitoring
* Yes, it’s all misappropriated and misunderstood
** Still the same
43. Problems in /etc/motd * = “in your face”
* Credit for the idea to Ilya Ableev (Badoo)
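A rough sketch of the idea: render the host's open problems as text and write that to the message of the day, so anyone logging in sees them. The function names and the wording of the banner are assumptions, and this version overwrites the whole file, which may not suit hosts that already keep other content there.

```python
from pathlib import Path
from typing import List


def render_motd(problems: List[str]) -> str:
    """Build the login banner text from a list of open problems."""
    if not problems:
        return "No open monitoring problems for this host.\n"
    lines = ["*** OPEN MONITORING PROBLEMS ON THIS HOST ***"]
    lines += [f"  - {problem}" for problem in problems]
    return "\n".join(lines) + "\n"


def write_motd(problems: List[str], path: str = "/etc/motd") -> None:
    """Overwrite `path` with the rendered banner (needs write access)."""
    Path(path).write_text(render_motd(problems), encoding="utf-8")
```

Where the list of problems comes from is left open; it would have to be fetched from the monitoring solution in use.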
44. 05:52 – Mind the language
False -
from Latin falsus (“counterfeit, false; falsehood”)
Never say “false positive”! (Thank you, Aaron)
Misconfiguration != false positive
45.
46. "oh, it's a false positive" = oh, all is good here
"oh, it's misconfiguration" = how do we fix it?
47. 06:14 – Minding the language
Is "this is bad" fine?
In many countries, but maybe not in others
“Hate useless alerts” - Europe vs USA?
48. 06:20 – Tact filter?
Normal people with outgoing tact filter
Geeks with incoming tact filter
http://www.mit.edu/~jcb/tact.html
54. When I get mad
And I get pissed
I grab my pen
And I write out a list
55. □Oncall is alerted during maintenance
□Old systems are not cleaned up from monitoring
□Alert comments/instructions not too useful
□Alert comments/instructions with typos
□“false positive” is thrown around frequently
□Many monitoring solutions in place
□Problems are delayed, not fixed
□Problems are fixed in parts of the environment,