OSMC 2018 | Eliminating Alerts or Operation Forest by Rihards Olups

A day in oncall life
or
Operation Forest

No views are endorsed by employers,
past or present
Any resemblance to real-life entities is
coincidental

The storyteller
Involved with monitoring for 20 years
Worked at Zabbix for 6 years
OpenStreetMap mapper, tea drinker

Chapter 1: The Morning
on how I understood
that alerts are bad

10:00 – This I know on a Friday
Alerts are good, because they inform us
Alerts are always useful

From: somebody@example.com
Subject: Re: High: appserver.example.com Important service down
thought it was short restart and it won't alert
--
Somebody

Subject: Re: High: oldappserver.example.com Server unreachable
This server has not been used for a year.
--
Somebody

Subject: Re: Average: zomgserver.example.com ZOMG workflow issue
ASDF and QWER workflow counts differ for last
ZOMG cycle. ASDF: 6762 QWER: 6760
Not much difference, hence ignoring
--
Somebody
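
If a difference of 2 out of roughly 6760 really is acceptable, the check itself could say so instead of relying on a human to ignore the email. A minimal sketch, assuming a relative tolerance is the right measure; the function name and the 1% default are hypothetical and not taken from the talk:

```python
def counts_diverge(asdf: int, qwer: int, rel_tolerance: float = 0.01) -> bool:
    """Return True only when the two workflow counts differ by more
    than rel_tolerance, relative to the larger count.

    The 1% default is an assumption for illustration; the real
    threshold has to come from whoever owns the workflow.
    """
    largest = max(asdf, qwer)
    if largest == 0:
        # Both counts are zero, so there is nothing to compare.
        return False
    return abs(asdf - qwer) / largest > rel_tolerance


# The counts from the email above would no longer alert.
print(counts_diverge(6762, 6760))
```

The point is where the decision lives: in the alert definition, visible to everybody, not in one person's inbox.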

Uncommon habits can bring
benefits in certain situations
We have to be careful with
declaring habits as bad

13:00 – Habits = risks, time wasters
chmod 777
sudo su
"I'll keep an eye on this alert"

14:13 – Alert fatigue
Imagine frequent fire alarms
Imagine frequent monitoring alerts

14:15 – All alerts are bad
Two possibilities:
a) something bad has happened or will happen
b) it’s a useless alert

Chapter 2: Afternoon
on how I understood
why bad alerts happen

15:03 – A minute of reading
Afternoon /ɑːftəˈnuːn/ – noun
The time from noon or lunchtime to evening

currently i don’t see a disk space issue

Currently I don’t
see a road issue
...

Trash attracts trash
Alerts attract alerts

oh, those are just alerts about my systems
=
oh, that’s just my trash in the forest

16:30 – Failures multiply
Disk full →
logrotate creates an incomplete file →
never rotates the source file again
https://bugzilla.redhat.com/show_bug.cgi?id=1374550

19:00 – What causes alerts?
Technical challenges
Lack of time
Humans
●Attitude?
●Culture?
●Organisation?

Chapter 3: Evening
on how I tried to
battle the alerts (and lost)

00:13 – Let’s change the tools
Let’s use a new alerting solution, new...
If emails can be ignored, so can other alerts.
It’s culture, not the tool.

01:00 – What’s this thing?
I must understand what I am running -
and monitoring
I must know my monitoring solution
●So I don’t spend two days writing a script to replicate a feature
●“Haha, a week of training”

01:20 – Do we know our tools?
Next version of grep rumoured to include graphs
Monitoring solution matters, but it is my decision
...I’ll try to pick one

Subject: Re: High: oldappserver.example.com Server unreachable
Logrotate is configured but files are not removed.
I deleted them.
--
Somebody
Starting to see a pattern here

Fix, not delay
(unless in a speeding bus)
I like that

Fix it for everybody
don’t hack “your server” only

Email
is not
“the worst notification method”

02:40 – Rock rock till you drop
I will tirelessly work on making alerts:
●Less frequent
●More useful and meaningful

02:50 – What’s a useful alert?
“System x down” is mostly clear, but others are not
I’ll make no typos in alert instructions
write stupid, think stupid

03:10 – Auto-resolve that crap
I estimate this will happen 3 more times
I add an automatic action
Because I know it will happen 30 times
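
One way to picture an automatic action is a small registry that maps a known alert to a remediation and only bothers oncall when no remediation exists or it fails. This is a sketch under assumptions: the alert name, the registry and the placeholder cleanup are all hypothetical, and most monitoring solutions already ship their own mechanism for this.

```python
from collections import Counter
from typing import Callable, Dict

# Hypothetical registry: alert name -> remediation callable.
REMEDIATIONS: Dict[str, Callable[[str], bool]] = {}
# How often each (alert, host) pair has fired, to check the estimate later.
occurrences: Counter = Counter()


def remediation(alert_name: str):
    """Decorator that registers a function as the fix for an alert."""
    def register(func: Callable[[str], bool]) -> Callable[[str], bool]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("Disk space low")
def clean_tmp(host: str) -> bool:
    # Placeholder only: a real action would run a cleanup on the host.
    print(f"would clean temporary files on {host}")
    return True


def handle_alert(alert_name: str, host: str) -> str:
    """Try the registered remediation; fall back to notifying oncall."""
    occurrences[(alert_name, host)] += 1
    action = REMEDIATIONS.get(alert_name)
    if action is not None and action(host):
        return "auto-resolved"
    return "notify oncall"
```

Counting the occurrences keeps the "3 more times" estimate honest.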

03:30 – Monitoring is not a tasklist
These occurrences are useful to know about
So I’ll have them logged
If it really is useful, I can look at the log *
* usually nobody does
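
The split between "alert" and "log" can be sketched as a tiny router. The event fields used here (`host`, `message`, `actionable`) are assumptions made for illustration, not the format of any particular monitoring solution:

```python
import logging

logger = logging.getLogger("monitoring.events")


def route(event: dict) -> str:
    """Send actionable events to oncall; write everything else to the log.

    The "actionable" flag is a hypothetical field: somebody has to
    decide, per check, whether a human needs to do anything about it.
    """
    if event.get("actionable"):
        return "alert"
    logger.info("%s: %s", event["host"], event["message"])
    return "log"
```

An event without the flag defaults to the log, so adding a new check does not page anybody by accident.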

03:50 – What’s that noise?
Do we really have
a problem with alerts?

05:30 – We so trendy
Agile = “I won’t have to maintain it” *
Devops = “production is free-for-all” **
●Bring developers in on my monitoring
* Yes, it’s all misappropriated and misunderstood
** Still the same

Problems in /etc/motd * = “in your face”
* Credit for the idea to Ilya Ableev (Badoo)

05:52 – Mind the language
False -
from Latin falsus (“counterfeit, false; falsehood”)
Never say “false positive”! (Thank you, Aaron)
Misconfiguration != false positive

"oh, it's a false positive" = oh, all is good here
"oh, it's misconfiguration" = how do we fix it?

06:14 – Minding the language
Is "this is bad" fine?
In many countries, but maybe not in others
“Hate useless alerts” - Europe vs USA?

06:20 – Tact filter?
Normal people with outgoing tact filter
Geeks with incoming tact filter
http://www.mit.edu/~jcb/tact.html

Most teams slip like cows on ice
if "cleanliness" rules are not
constantly refreshed

Make Alerts Meaningful Again
MAMA

Respect your oncall
Respect yourself

When I get mad
And I get pissed
I grab my pen
And I write out a list

□Oncall is alerted during maintenance
□Old systems are not cleaned up from monitoring
□Alert comments/instructions not too useful
□Alert comments/instructions with typos
□“false positive” is thrown around frequently
□Many monitoring solutions in place
□Problems are delayed, not fixed
□Problems are fixed in parts of the environment,
