A day in oncall life
or
Operation Forest
No views are endorsed by employers,
past or present
Any resemblance to real-life entities is
coincidental
The storyteller
Involved with monitoring for 20 years
Worked at Zabbix for 6 years
OpenStreetMap mapper, tea drinker
Chapter 1: The Morning
on how I understood
that alerts are bad
10:00 - This I know on a Friday
Alerts are good, because they inform us
Alerts are always useful
From: somebody@example.com
Subject: Re: High: appserver.example.com Important service down
thought it was short restart and it won't alert
--
Somebody
From: somebody@example.com
Subject: Re: High: oldappserver.example.com Server unreachable
This server has not been used for a year.
--
Somebody
From: somebody@example.com
Subject: Re: Average: zomgserver.example.com ZOMG workflow issue
ASDF and QWER workflow counts differ for last
ZOMG cycle. ASDF: 6762 QWER: 6760
Not much difference, hence ignoring
--
Somebody
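
A difference of 2 in roughly 6760 is what triggered the alert above. One possible way to stop paging on such noise is to compare the counts with a relative tolerance. This is only a sketch: the function name and the 0.1% default are assumptions, and the right tolerance depends on what the workflow actually guarantees.

```python
def counts_differ(a: int, b: int, rel_tolerance: float = 0.001) -> bool:
    """Return True only when two counts differ by more than rel_tolerance.

    rel_tolerance is a hypothetical default (0.1%); tune it per workflow.
    """
    largest = max(abs(a), abs(b))
    if largest == 0:
        # Both counts are zero, so there is nothing to compare.
        return False
    return abs(a - b) / largest > rel_tolerance


if __name__ == "__main__":
    # The values from the email: a difference of 2 stays below the tolerance.
    print(counts_differ(6762, 6760))
```

If the workflow requires exact equality, the alert is fine as it is and the reply "hence ignoring" is the real problem.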
Uncommon habits can bring
benefits in certain situations
We have to be careful with
declaring habits as bad
13:00 – Habits = risks, time wasters
chmod 777
sudo su
"I'll keep an eye on this alert"
14:13 – Alert fatigue
Imagine frequent fire alarms
Imagine frequent monitoring alerts
14:15 – All alerts are bad
Two possibilities:
a) something bad has happened or will happen
b) it’s a useless alert
Chapter 2: Afternoon
on how I understood
why bad alerts happen
15:03 – A minute of reading
Afternoon /ɑːftəˈnuːn/ – noun
The time from noon or lunchtime to evening
currently i don’t see a disk space issue
Currently I don’t
see a road issue
...
Trash attracts trash
Alerts attract alerts
oh, those are just alerts about my systems
=
oh, that’s just my trash in the forest
16:30 – Failures multiply
Disk full →
logrotate creates an incomplete file →
never rotates the source file again
https://bugzilla.redhat.com/show_bug.cgi?id=1374550
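
A failure chain like this one can stay invisible until the disk fills again. A minimal sketch of a check that might catch it earlier, assuming the logs sit in one directory and end in .log; the function name and the size limit are hypothetical and not part of logrotate.

```python
from pathlib import Path


def oversized_logs(log_dir: str, max_bytes: int) -> list[str]:
    """List *.log files in log_dir that are larger than max_bytes, sorted by path.

    A log that keeps growing past its usual size may mean rotation has stopped.
    """
    return sorted(
        str(p)
        for p in Path(log_dir).glob("*.log")
        if p.is_file() and p.stat().st_size > max_bytes
    )
```

The result is a list to review or feed into monitoring, not something that deletes files on its own.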
19:00 – What causes alerts?
Technical challenges
Lack of time
Humans
●Attitude?
●Culture?
●Organisation?
Chapter 3: Evening
on how I tried to
battle the alerts (and lost)
00:13 - Let’s change the tools
Let’s use new alerting solution, new...
If emails can be ignored, so can other alerts.
It’s culture, not the tool.
01:00 – What’s this thing?
I must understand what I am running -
and monitoring
I must know my monitoring solution
●So I don’t spend two days writing a script to
replicate a feature
●“Haha, a week of training”
01:20 – Do we know our tools?
Next version of grep rumoured to include graphs
Monitoring solution matters, but it is my decision
...I’ll try to pick one
From: somebody@example.com
Subject: Re: High: oldappserver.example.com Server unreachable
Logrotate is configured but files are not removed.
I deleted them.
--
Somebody
Starting to see a pattern here
Fix, not delay
(unless in a speeding bus)
I like that
Fix it for everybody
don’t hack “your server” only
Email
is not
“the worst notification method”
02:40 – Rock rock till you drop
I will tirelessly work on making alerts:
●Less frequent
●More useful and meaningful
02:50 – What’s a useful alert?
“System x down” is mostly clear, but others not
I’ll make no typos in alert instructions
write stupid, think stupid
03:10 – Auto-resolve that crap
I estimate this will happen 3 more times
I add automatic action
Because I know it will happen 30 times
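
The idea of an automatic action can be sketched as a small dispatcher: try a known fix first and involve a human only when there is no fix or the fix fails. Everything here (the registry, the alert names, the return strings) is hypothetical and not tied to any particular monitoring solution.

```python
from typing import Callable, Dict

# Hypothetical type: a remediation takes a host name and returns True on success.
Remediation = Callable[[str], bool]


def handle_alert(name: str, host: str, remediations: Dict[str, Remediation]) -> str:
    """Run a known automatic action for the alert; escalate if none exists or it fails."""
    action = remediations.get(name)
    if action is None:
        return "escalate"  # no known fix, a human has to look
    try:
        fixed = action(host)
    except Exception:
        # A crashing remediation counts as a failed one.
        fixed = False
    return "auto-resolved" if fixed else "escalate"
```

Logging each automatic run keeps the count visible, so the guess of 3 can be compared with the real 30.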
03:30 – Monitoring is not a tasklist
These occurrences are useful to know about
So I’ll make it logged
If it really is useful, I can look at the log *
* usually nobody does
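
A minimal sketch of the split between "alert" and "log", assuming each event already carries a flag saying whether somebody has to act on it. The logger name, the flag and the return values are assumptions for illustration.

```python
import logging

# Hypothetical logger name for events that are recorded but never notified.
logger = logging.getLogger("monitoring.events")


def route_event(message: str, actionable: bool) -> str:
    """Notify only for actionable events; write everything else to the log."""
    if actionable:
        return "notify"  # the caller passes this on to the real notification path
    logger.info(message)
    return "logged"
```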
03:50 – What’s that noise?
Do we really have
a problem with alerts?
05:30 – We so trendy
Agile = “I won’t have to maintain it” *
Devops = “production is free-for-all” **
●Bring developers in on my monitoring
* Yes, it’s all misappropriated and misunderstood
** Still the same
Problems in /etc/motd * = “in your face”
* Credit for the idea to Ilya Ableev (Badoo)
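
A sketch of how open problems could be rendered into a message-of-the-day file. Where the problem list comes from is left open, the function names are hypothetical, and writing to /etc/motd normally needs root, so the path is a parameter.

```python
from pathlib import Path
from typing import Iterable


def render_motd(problems: Iterable[str]) -> str:
    """Format open problems as plain text suitable for a motd file."""
    items = list(problems)
    if not items:
        return "No open monitoring problems.\n"
    lines = [f"{len(items)} open monitoring problem(s):"]
    lines += [f"  * {p}" for p in items]
    return "\n".join(lines) + "\n"


def write_motd(problems: Iterable[str], path: str = "/etc/motd") -> None:
    """Overwrite the motd file at path with the rendered problem list."""
    Path(path).write_text(render_motd(problems), encoding="utf-8")
```

Run from cron or from a monitoring action, this puts the problems of a host in front of everybody who logs in to it.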
05:52 – Mind the language
False -
from Latin falsus (“counterfeit, false; falsehood”)
Never say “false positive”! (Thank you, Aaron)
Misconfiguration != false positive
"oh, it's a false positive" = oh, all is good here
"oh, it's misconfiguration" = how do we fix it?
06:14 – Minding the language
Is "this is bad" fine?
In many countries, but maybe not in others
“Hate useless alerts” - Europe vs USA?
06:20 – Tact filter?
Normal people with outgoing tact filter
Geeks with incoming tact filter
http://www.mit.edu/~jcb/tact.html
just a test
Most teams slip like cows on ice
if "cleanliness" rules are not
constantly refreshed
Make Alerts Meaningful Again
MAMA
Respect your oncall
Respect yourself
When I get mad
And I get pissed
I grab my pen
And I write out a list
□Oncall is alerted during maintenance
□Old systems are not cleaned up from monitoring
□Alert comments/instructions not too useful
□Alert comments/instructions with typos
□“false positive” is thrown around frequently
□Many monitoring solutions in place
□Problems are delayed, not fixed
□Problems are fixed in parts of the environment,

OSMC 2018 | Eliminating Alerts or Operation Forest by Rihards Olups
