What’s going on?
@ablythe
Huh?
• Does anyone know what movie that was?
@ablythe
@ablythe
World Record
• Highest Profit to Cost Ratio Ever
• But before that…
@ablythe
@ablythe
Zabbix: Beyond Thunderdome
Aaron Blythe
This presentation is about…
@ablythe
Past
Now
Future
@ablythe
What is Zabbix?
@ablythe
What is Mad Max?
@ablythe
Why Zabbix?
@ablythe
Why Zabbix?
Necessity
@ablythe
Why Zabbix?
@ablythe
Why Zabbix?
Open Source
Linus’s Law
Given enough eyeballs, all bugs are shallow
Community Based
@ablythe
Why Zabbix?
@ablythe
Why Zabbix?
@ablythe
Why Zabbix?
Mission Statement
To contribute to the systemic improvement of
health care delivery and the health of
communities.
@ablythe
@ablythe
Zabbix Linux Template - Cost
• Connect Host as Agent to Zabbix Server (Via
Chef)
• Download Template from Zabbix
• Upload Template to Zabbix Server
• Apply Template to Host
____________________
• Cost = 4 steps
2 Steps 1 Step
@ablythe
Zabbix Linux Template - Return
• ~ 11 applications
• ~ 90 items
• ~ 120 triggers
• ~ 20 graphs
@ablythe
Profit to Cost Ratio
• Mad Max
– $100 million worldwide/A$400,000
• Zabbix Linux Template
– 120 Triggers/2 Steps
@ablythe
Benefit
• 80% full alerts
– Disk space/inodes
– RAM
• Make better decisions on size needed
Decision
Find file or
process
Extend LVM
@ablythe
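A minimal sketch of the decision above, assuming a hypothetical helper that receives used and total capacity (disk blocks, inodes, or RAM). The method names and the decision labels are illustrative and not part of any Zabbix template; only the 80% level comes from the slide.

```ruby
# Hypothetical helper: names and decision labels are illustrative only.
WARN_THRESHOLD = 0.80 # the "80% full" level from the slide

# Returns the fraction of a resource (disk blocks, inodes, RAM) in use.
def usage_ratio(used, total)
  raise ArgumentError, "total must be positive" unless total.positive?
  used.to_f / total
end

# Maps a usage ratio to the options described on the slide.
def capacity_decision(used, total, threshold: WARN_THRESHOLD)
  ratio = usage_ratio(used, total)
  if ratio >= 1.0
    :extend_lvm              # volume is full: extending is the only choice left
  elsif ratio >= threshold
    :investigate_or_extend   # still time to find the file/process or extend the LVM
  else
    :ok
  end
end
```

The point of alerting at 80% rather than 100% is that the middle branch still leaves both options open.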
Chase Scenes and Crashes
@ablythe
Creators
Byron Kennedy
George Miller Alexei Vladishev
Zabbix (Latvia)
Mad Max(Australia)
@ablythe
Past
Now
Future
@ablythe
Mad Max 2 – The Road Warrior
@ablythe
@ablythe
Scale
@ablythe
Highly Available Deployments
Proxy Layer
Service Layer
@ablythe
Highly Available Deployments
@ablythe
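The alerting rule for one layer of an N+1 deployment can be sketched as follows. This is an illustration under assumptions, not the actual template logic: the method name and the up/down hash are hypothetical, and only the "high" versus "disaster" severities come from the talk.

```ruby
# node_states maps a node name to true (up) or false (down); names are hypothetical.
# Any single node down is a "high" alert; the whole layer down is a "disaster".
def layer_severity(node_states)
  raise ArgumentError, "layer has no nodes" if node_states.empty?
  down = node_states.count { |_name, up| !up }
  return :ok if down.zero?
  down == node_states.size ? :disaster : :high
end
```

For the three-node service layer, one node down yields `:high` while the system as a whole keeps working; all three down yields `:disaster`.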
Email Alerts to uCern Discussions
@ablythe
Screens/Graphs – ack rates
@ablythe
Screens/Graphs
@ablythe
Brahe Hubble
{
"{INDEX_MACRO}"=>"name]}",
"{VERSION_MACRO}"=>" version",
“{ERROR_MACRO}"=>"#{error}"
}
@ablythe
Zabbix Low Level Discovery
@ablythe
Zabbix Host
Zabbix Agent
UserParameter
Shell Script or
RubyGem
Zabbix Server
json
Document Template
w/ Macro
Zabbix Low Level Discovery
@ablythe
Zabbix Low Level Discovery
@ablythe
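A rough sketch of the script side of low level discovery, assuming a hypothetical list of indexes as input. Zabbix discovery rules expect a JSON document with a top-level `data` array whose entries map `{#MACRO}` names to values; the macro names, the item key, and the path below are made up for illustration and are not taken from the Brahe Hubble gem.

```ruby
require "json"

# Agent side (zabbix_agentd.conf), hypothetical key and path:
#   UserParameter=brahe.discovery,/usr/local/bin/brahe_discovery
#
# Builds the JSON document a Zabbix low level discovery rule consumes:
# a "data" array of hashes keyed by {#MACRO} names.
def discovery_document(indexes)
  rows = indexes.map do |index|
    {
      "{#INDEX_NAME}"    => index.fetch(:name),
      "{#INDEX_VERSION}" => index.fetch(:version)
    }
  end
  JSON.generate({ "data" => rows })
end
```

The template's item and trigger prototypes can then refer to the same macro names, and Zabbix creates one item and trigger per row.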
@ablythe
Who?
Kalin Hicks – Set up original GCL VM – countless
explanations and whiteboard sessions
Brian Cook – Set up original Sepsis Zabbix VMs
John Breese – Set up 2.0 templates spanning hosts
Brad Beam – Many dashboards, alerts and triggers
Chris Rooney – Brahe-hubble gem
Nidhi Bhargava – Low level discovery on 2.0
Dev – White Ops - Yellow
@ablythe
@ablythe
It’s not all dogs…
@ablythe
…and Gyrocopters
@ablythe
Sometimes my email inbox…
@ablythe
Has me feeling like
@ablythe
Bus Factor
@ablythe
Bus Factor
Dystopian Future Where The Survival of Many is
in the Hands of One Man
@ablythe
The Information Model
@ablythe
Host Group Host Group
Host
Template
Template (0..n)
Item Trigger Graph
Applications
0..n
Action
email command
Items
1..n
… has a learning curve
Mad Max 2: The Road Warrior
@ablythe
Past
Now
Future
@ablythe
We Want Tina Turner!
@ablythe
Beyond Thunderdome
@ablythe
Virtualization thru Skybox Labs
@ablythe
Dashboards
chapters
divided by
types of
data rather
than types
of display
chapters on
multi-variables,
correlation and
proportions
Honestly a
little too
textbook-
ish for me
from more
than two
dozen experts,
real world case
studies,
beautiful
layers, how to’s
@ablythe
Pull Data External?
@ablythe
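As a starting point for pulling data out, here is a sketch that only builds request bodies for the Zabbix JSON-RPC API and performs no network call. `user.login` and `host.get` are real API methods; the helper name, credentials, and token are placeholders, and an actual client would POST these bodies to the frontend's `api_jsonrpc.php` with content type `application/json-rpc`.

```ruby
require "json"

# Builds a JSON-RPC 2.0 request body for the Zabbix API.
# `auth` is the session token returned by an earlier user.login call.
def zabbix_request(method, params, auth: nil, id: 1)
  body = { "jsonrpc" => "2.0", "method" => method, "params" => params, "id" => id }
  body["auth"] = auth if auth
  JSON.generate(body)
end

# Example payloads (credentials and token are placeholders):
login_body = zabbix_request("user.login", { "user" => "dashboard", "password" => "secret" })
hosts_body = zabbix_request("host.get", { "output" => "extend" }, auth: "token-from-login")
```

A dashboard job could send `login_body` once, keep the returned token, and reuse it for later calls such as `hosts_body`.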
Zabbix Maps
http://workaround.org/zabbix/maps
@ablythe
Alert Exhaustion
Ain’t Nobody Got
@ablythe
Two Men Enter, One Man Leaves
@ablythe
Correlation of Alerts
Proxy Layer
Service Layer
@ablythe
Trigger Dependencies
• Sometimes the availability of one host
depends on another. A server that is behind
some router will become unreachable if the
router goes down. With triggers configured for
both, you might get notifications about two
hosts down - while only the router was the
guilty party.
@ablythe
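The effect of trigger dependencies can be illustrated outside Zabbix with a small filter. This is a sketch of the idea only, not how the server implements it; the method and trigger names are hypothetical.

```ruby
# depends_on maps a trigger name to the triggers it depends on (hypothetical names).
# A firing trigger is suppressed when anything it depends on is also firing,
# so only the likely root cause is left to notify on.
def root_cause_alerts(firing, depends_on)
  firing.reject do |trigger|
    depends_on.fetch(trigger, []).any? { |parent| firing.include?(parent) }
  end
end
```

With the router example from the slide, a firing list of "router down" and "server unreachable" collapses to just "router down" when the server's trigger depends on the router's.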
“Flap Detection” and a Grace Period
Nagios uses "flap detection" to prevent many
ERROR's and OK's being sent right after each
other.
Zabbix calls this "hysteresis".
@ablythe
Hysteresis
Hysteresis is the dependence of a system not
only on its current environment but also on its
past environment
@ablythe
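Hysteresis in alerting usually means using two thresholds, so the current state depends on past values as well as the latest one. The sketch below assumes an 80/70 pair purely for illustration; it is plain Ruby, not Zabbix trigger syntax, and the method names are hypothetical.

```ruby
# Raise the problem at or above `high`; clear it only once the value drops below `low`.
def next_state(problem, value, high: 80.0, low: 70.0)
  if problem
    value >= low   # already in problem: stay there until we fall below the low mark
  else
    value >= high  # currently OK: only a value at or above the high mark raises it
  end
end

# Returns the list of state changes (:problem / :ok) for a series of values.
def alert_transitions(values, **thresholds)
  problem = false
  values.each_with_object([]) do |value, events|
    updated = next_state(problem, value, **thresholds)
    events << (updated ? :problem : :ok) if updated != problem
    problem = updated
  end
end
```

For a series hovering around 80, the two-threshold version reports one problem and one recovery, while setting `high` and `low` to the same value makes it flap on every crossing.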
Delaying Notifications
@ablythe
Correlation of Alerts
We need to get to the point where:
100’s of Related Alerts Enter,
One Causal Alert Leaves
@ablythe
What if someone misses something?
With 100+ alert emails per day, they are almost
guaranteed to miss something.
@ablythe
“Why on earth was I not notified?!”
On http://blog.zabbix.com/
Trends of Flakiness
These should not be dealt with by alerts/alarms.
Rather by daily/weekly reports.
Unfortunately Zabbix is not strong in this area yet.
There is a thread:
https://www.zabbix.com/forum/showthread.php?t=18901
@ablythe
False Alarms Due to Chef Restarts
Current – Manual
Maintenance Periods
Potentially – Automated
Automate the Maintenance Periods
Delaying Notifications
Hysteresis
Promise Theory
@ablythe
Highly Available Deployments
Delayed Notifications/Hysteresis
Proxy Layer
Service Layer
Delay Alert
120 seconds
Works!! @ablythe
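The single-node delay can be sketched as a rule that notifies only once a failure has persisted for the whole grace period. The 120 seconds comes from the slide; the sample format and the method name are assumptions made for illustration.

```ruby
DELAY_SECONDS = 120 # the grace period shown on the slide

# samples: array of [timestamp_in_seconds, ok?] pairs, oldest first.
# Notify only when the most recent unbroken run of failures has lasted
# at least `delay` seconds as of `now`.
def notify?(samples, now, delay: DELAY_SECONDS)
  return false if samples.empty? || samples.last[1]
  failing_since = nil
  samples.reverse_each do |time, ok|
    break if ok
    failing_since = time
  end
  now - failing_since >= delay
end
```

A short outage, such as a service bouncing during a Chef run, recovers before the delay elapses and never produces an email under this rule.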
Highly Available Deployments
Delayed Notifications/Hysteresis
Proxy Layer
Service Layer
Delay Alert
120 seconds
Delay Alert
120 seconds
Delay Alert
120 seconds
No Delay
Doesn’t Work @ablythe
Beyond Thunderdome
@ablythe
Promise Theory
@ablythe
Deconstructing Promises
@ablythe
Promise Theory
+data
a1
a2
My Service
Zabbix
@ablythe
Leveraging Init.d to Manage State
…
case "$1" in
start)
touch /var/<service>/start
…
rm -f /var/<service>/start
;;
stop)
touch /var/<service>/stop
…
rm -f /var/<service>/stop
;;
restart)
touch /var/<service>/restart
$0 stop
$0 start
rm -f /var/<service>/restart
;;
…
This of course is messy if the service
ever hangs during a restart.
More discussion needs to be had in this
area.
@ablythe
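One way a monitoring check might consume these marker files is sketched below. It assumes the init script writes `start`, `stop`, and `restart` files into a per-service state directory as shown above; the method names are hypothetical, and how the result would be wired into Zabbix is left open.

```ruby
MARKERS = %w[start stop restart].freeze

# Returns the names of the marker files currently present in state_dir.
def transitions_in_progress(state_dir)
  MARKERS.select { |name| File.exist?(File.join(state_dir, name)) }
end

# A planned start/stop/restart is under way, so a "service down" alert
# raised right now is probably a false alarm.
def suppress_alert?(state_dir)
  !transitions_in_progress(state_dir).empty?
end
```

As the slide warns, a restart that hangs would leave its marker behind, so a real check would likely also want to look at the marker's age.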
Mark Burgess – Book of Promises
http://cfengine.com/markburgess/BookOfPromises.pdf
Draft published on January 21st 2013
@ablythe
For the Project Managers
Nobody
PLANS TO FAIL
Some just
FAIL TO PLAN
@ablythe
For the Project Managers
Everybody should
PLAN TO FAIL
PRACTICE LOCALIZED FAILURE
And
MINIMIZE RECOVERY TIME
@ablythe
The Phoenix Project: A Novel About
IT, DevOps, and Helping Your Business
Win
@ablythe
The Brent Effect
Brent is the one person who understands how
the entire system fits together.
Brent is the one person who fixes most of the
issues.
Being spread so thin, Brent is also the one
person who causes most of the issues.
@ablythe
Dystopian Future Where The Survival of Many is
in the Hands of One Man
The system or crucial parts of the system
Man or Woman
@ablythe
What is OpsInfra?
A team built on enablement of DevOps.
@ablythe
Other tools
As needed
Build an Ecosystem
Tool Virtualization
Repeatable Deployment
Documentation
Discussion
Auxiliary Tooling
Education
The Success of:
Population Health
Millennium+
Project Go
Incubator
• https://wiki.ucern.com/display/OPIT/Incubator
• 4 steps
– Log a Jira with the intent to research a tool
– Write a wiki article on how to use it
– Write a blog on how it is awesome
– Record a demo of the tool
@ablythe
For the Architects
Monitoring is only “technical debt” if you
choose to carry it that way.
Depending on when you invest, it easily can be
“technical capital”
@ablythe
Beyond Thunderdome
@ablythe
Past – Hackers - Craft
Now – SysAdmin - Trade
Future – Devops - Science
@ablythe
The Tell
The years travel fast
And time after time, I've done the tell
But this ain't one body’s tell
It's the tell of us all
And you gotta listen it and 'member
Cuz what you hears today
You gotta tell the newborn tomorrow
@ablythe
What’d ya think?
@ablythe


Editor's Notes

  1. CLICK PLAY
  2. READ THE SLIDE
  3. That was the Blair Witch Project
  4. Blair Witch at one point held the record for the highest profit to cost ratio ever. <enter> But before that…
  5. Mad Max held that record for a couple decades.
  6. My name is Aaron Blythe, and this presentation is called Zabbix: Beyond Thunderdome.
  7. READ SLIDE
  8. Mad Max
  9. READ SLIDE
  10. Zabbix. By show of hands, who has logged into a Zabbix instance? And who has received email alerts from Zabbix?
  11. We will go through where we have been, where we are, and where we can go with Zabbix. I will try not to give too many spoilers on the Mad Max series of films, and merely lay down the story line.
  12. First I want to go through how we got here with Zabbix so far, using the original Mad Max as a guide.
  13. Zabbix is an open source monitoring tool. The website claims up to 100,000 monitored devices and up to 1,000,000 metrics.
  14. Mad Max is set in Australia in a dystopian future where Earth’s oil supply has been nearly exhausted. Max Rockatansky is the top driver in the Main Force Patrol (basically the police). Gangs have taken over the highway. In a car chase, Max kills one of the gang members, so they want revenge. Honestly, the story is sort of disjointed. The movie was edited in the home of one of the producers on a homemade editing machine created by his father (an engineer).
  15. Brian Cook told me a story of when they were first working on one of our cloud applications. It was memory bound. When a lot of data was being pumped through in batches it would actually clobber the machine. He would have to call someone in the data center at 2 in the morning to physically reboot the machine. Oh, and after doing this a few times he would always make sure to tell them to bring a pencil so they could actually get to the button.
  16. Kalin Hicks and Brian Cook told me: Zabbix was originally installed to bridge the gap in our monitoring for the Sepsis project while we waited for a permanent solution; we just chose to use another monitoring tool instead of a bunch of scripts. It was a Skunkworks project that went viral and was certainly never intended to become such a big project.
  17. Necessity helps us create or adapt great, fun things. David Eggby, responsible for much of the footage for Mad Max, had this to say about filming: “… [Shooting from the back of the Goose bike] I couldn't have a helmet on because you can't operate a camera, it gets in the way… They put a seat belt strap around us and we went for it, and you can see on the speedo that it's cracking 180kph.” From: http://sideburnmag.blogspot.com/2012/06/mad-max.html Speedo is ’Strayan for speedometer…
  18. Unlike the proprietary monitoring tools that we use now or have used in the past, we don’t have to worry about paying for a license for every stakeholder who has a business need to see the data. <enter> <enter> Fixes on the 2.0 line have so far been decently timely. With a community of hundreds of contributors, Linus’s Law applies: given enough eyeballs, all bugs are shallow. <enter> Zabbix is community based.
  19. Community based means there are forums where we can ask questions and get answers ourselves, or see the answers to others’ questions. <enter> Yes, that is almost 40,000 posts to over 10,000 threads. We could never expect this level of interaction and support for an internally developed monitoring tool.
  20. The number of users in the freenode IRC channel continues to grow, to nearly 200 people on average. This is a place to ask advanced questions in real time of users around the world. Oh, and this graph was created and gathered in Zabbix over 7 years.
  21. We provide health care solutions; if we can integrate tools that solve software and hardware problems, that gets us to our goal faster.
  22. For those of you who now want to see the movie because of this talk, I don’t want to ruin it for you. But some bad things happen to people Max knows in this movie. This causes Max to quit the force, but he is talked into just taking a holiday instead. At this point Max is just a regular guy. He is trying to keep the peace and lead a good life with his girlfriend.
  23. There are 4 steps to get your host connected to the Zabbix Server and use the Linux OS Template. <enter> However, 2 of them have likely been done for you on the Zabbix Server already. <enter> And soon we plan to automate application of the Template to the Host using auto-discovery of Linux nodes. So we are left with one step.
  24. For those couple of steps you get (roughly, depending on the layout of the host): 11 applications, 90 items, 120 triggers, and 20 graphs.
  25. As I said at the beginning, Mad Max made a ton of money for the amount of money spent. About 500 to 1000 dollars for every dollar spent. With the Zabbix Linux Template, we are talking about a couple of hours of work for 120 Triggers. Once you’ve set this up before, it is really only about 10 minutes of work to set it up for future nodes.
  26. The 80% full alerts have been extremely beneficial. In the case of disk space and inodes, these alerts give us the time and ability to troubleshoot the issue and decide whether to extend the Logical Volume or find the offending large file or process. If the volume reaches 100%, the only choice is to extend the LVM. In the case I spoke of before that Brian Cook ran into with RAM, we can make better decisions on the size and number of nodes we need for MapReduce.
  27. The entire Mad Max series is built on car chases, which are awesome to watch. So far it has been awesome to watch Zabbix grow so prolifically throughout Cerner.
  28. What impresses me most about Zabbix and Mad Max is that something so simple and easy could gain so much mindshare. The creators of each poured time and effort into something that has universal, worldwide appeal. We are adopters of their work and I want to thank them.
  29. So that is where we have been and how we got started. Now let’s talk about where we are now, using Mad Max 2: The Road Warrior.
  30. Mad Max 2: The Road Warrior picks up a few years later. Max is older and hardened from the tragedy at the end of the first movie. Oil is still scarce. There are still street gangs. Max is now a lone wolf. He is looking for more ammunition for his sawed-off.
  31. Oh, and the villains have slightly better costumes… more budget.
  32. We currently have well over 2000 nodes in the production Zabbix 2.0 instance, and we believe we can scale much higher with our current deployment structure.
  33. A common setup for a highly available (or HA) system is to have N+1 nodes. Here we see 2 proxy layer nodes fronting 3 service layer nodes.
  34. If one of the service layer nodes goes down, that is a problem that needs to be addressed, and likely quickly. However, the system as a whole is still functioning.
  35. However, if all 3 nodes go down, that is a disaster that needs to be addressed immediately, and someone needs to be paged to fix it.
  36. John Breese was able to set this up for us on Semantic Solutions using templates. We receive high alerts in the event that any single node goes down. We receive disaster alerts in the event that all of the servers or proxies are down.
  37. The alerts go to a uCern Space set up specifically for monitoring our system. Associates are free to subscribe to or unsubscribe from this space as they need. The discussion can occur in the open, and the URL can quickly be pasted into other discussions or Jiras on related issues.
  38. Brad Beam created these graphs, which anyone who can access the production Zabbix system can see. Meaning if you have the need to see this, you only have to log an issue in Jira. This graph is monitoring the real-time processing of data through Storm. The Storm acknowledgement rates (or ack rates) are a way to gauge system health. A low ack rate combined with a sufficient backlog in notifications is indicative of an issue. I’ll be honest, I am not sure exactly how these graphs were created, nor do I know many details about them specifically. What I do know is that many people have been watching this information to understand the system behavior and improve it over the last couple of months.
  39. Another dashboard created by Brad Beam. We currently have a bug in the JVM reuse for the M/R jobs: the resources for the finished JVMs aren't reclaimed, which eventually exhausts the resources on the box. So with this graph we can identify whether a server has bogus JVMs out there that need to be addressed. Development of basic monitoring features can now be measured in hours or days, as opposed to months. We need the freedom to change these metrics daily/weekly as we learn more.
  40. Brahe Hubble is a Ruby Gem created by Chris Rooney here at Cerner. <enter> Not to steal any thunder from Ben Brown and Kartik Vishwanath, who are presenting on Brahe later in this conference: Brahe is named after the astronomer Tycho Brahe (similar to the project Kepler, which many of you may be more familiar with). Brahe Solr is a cloud-based indexing application also created here at Cerner <enter> that presents at least 2 replicas <enter> that are fronted by Brahe REST services <enter> to manage and query their state. <enter> Brahe Hubble uses these REST services <enter> to present a JSON document <enter> to be used by a Zabbix Template. So why not have Zabbix call the REST interface directly? Basically, the logic done by Brahe Hubble is too complicated for Zabbix to complete on its own.
  41. With the help of Kalin and Brad Beam, Nidhi Bhargava worked through this for our Brahe Hubble deployment. You have your Host or Node and a Zabbix Server. <enter> First you have to get the Zabbix Agent installed (preferably through Chef). <enter> Then you need a script (or, in the case of Brahe Hubble, a RubyGem) that gathers the information and outputs a JSON document. But how will the Zabbix Agent know about the script or command line? <enter> Easy: you configure the UserParameter for the Zabbix Agent (simple to do if you are using the zabbix_agent_chef cookbook). <enter> This allows you to present a JSON document to the Zabbix Server. <enter> The Zabbix Server then uses this JSON document in a Template with a Macro.
  42. In Templates <enter> The important part is that this is created under “discovery”. <enter> In Discovery we created an item and a trigger. <enter> The item <enter>
  43. It is here where you can use the name value pairs presented in json from the script or RubyGem.
  44. Let me stop for a minute and tell you about my 2 favorite characters in Mad Max 2. Max meets this guy that we refer to as the “Gyro Captain”, because no one says his name in the movie and Max never asks. Oh, and probably because he flies a gyrocopter. Character development is starting to become part of the Mad Max movies this time around, even if names are not. I personally like names and would love to celebrate the things you do with Zabbix, as I just did with the cool stuff I have seen done with Zabbix.
  45. Names I have already said so far. <enter> There are many more, but notice that there are 3 dev and 3 ops. Each of us has learned a lot from one another.
  46. There is also The Feral Kid, named for similar reasons. Max gives the feral kid a music box. Max’s heart is starting to soften some and he decides to help this village of people protecting their oil try to get away from the road gang.Max has become more invested in the village. Over the past couple years Zabbix has moved from that side project, or Skunkworks project to an investment in the health of our system.
  47. Max tries to leave the village once, but does not make it. He comes back after a pretty severe beating.
  48. Remember that Max was the best driver on the Main Force Patrol. Max is the only one who is going to be able to drive the tanker of oil out of the protected village. Oh, and there is an epic oil tanker chase scene. It goes on for like 20 minutes. In software we often refer to situations where only one or a few people can do something critical as having a low “Bus Factor”, which put simply is the total number of key developers who would need to be hit by a bus (or tanker) before the project would not be able to proceed.
  49. I would describe Mad Max 2 as a… READ SLIDE
  50. The Zabbix Information Model has a rather steep learning curve, but I believe it is one worth climbing. From https://www.zabbix.com/forum/showthread.php?t=21030
  51. As I often do, I asked Kalin to talk to me like I'm a 3rd grader, and he boiled it down to this for me: A Host can be part of many Host Groups. A Host can have many Templates applied to it. A Template can have Graphs, Items, and Triggers. You can define Actions for Triggers. Kyle McGovern and Ben Hemphill mentioned yesterday that they are using Zabbix to restart Hadoop Region Servers. So, the self-healing system of the future? We have that now.
  52. The Road Warrior won critical acclaim, and is an incredibly better movie than the first. The story line is cohesive and somewhat compelling. Max truly comes out a hero. By putting in more work, we have a better story and have done some awesome stuff with Zabbix so far…
  53. Let’s talk about where we want to go with Zabbix in the next couple years.
  54. We want Tina Turner level success…In the third installment of the saga, Mad Max: Beyond Thunderdome, Tina Turner is the leader of Bartertown. She plays Aunty Entity.
  55. Bartertown has regained some technology through the use of methane.Years have past and an aging Max has some of his supplies stolen and becomes involved in the local political power struggle.
  56. Recently Nimesh Subramanian created a Skybox Labs virtual cluster with a Chef Server and a Zabbix Server. You can check this out, upload the cookbook for your app or service, and start playing around with Zabbix without affecting a shared domain where others are working. When you are finished you can just throw the image away.
  57. Dashboards are an area that could use a lot of work. Each of these titles is available on Safari Online. The way people read books is a personal decision. I personally use my library card, and since all 4 are available on Safari Online I can read them on my iPad. How do we convey the most information in the least amount of space so that only the real problems gain attention?
  58. Zabbix has a full API. Many have already been pulling Jira and Splunk data into Dashing (from Shopify), which can be optimized. It should be rather trivial to do the same with Zabbix.
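To give a feel for what talking to the API involves, here is a minimal sketch that only builds a JSON-RPC 2.0 request body, which is the format the Zabbix API uses. The helper name `build_request`, the placeholder token, and the chosen filter are assumptions for illustration. How the token is passed (an `auth` field in the body, as shown, or a header) depends on the Zabbix version, so check the API reference for the release in use.

```python
import json


def build_request(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as a string (no network call)."""
    body = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    if auth is not None:
        # Assumption: token carried in the body, as in older API versions.
        body["auth"] = auth
    return json.dumps(body)


# Hypothetical example: ask for triggers currently in the PROBLEM state.
payload = build_request(
    "trigger.get",
    {"output": "extend", "filter": {"value": 1}},
    auth="PLACEHOLDER_TOKEN",
)
```

A dashboard job would POST a body like this to the server's `api_jsonrpc.php` endpoint and render the result, which is roughly what the existing Jira and Splunk widgets do with their own sources.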
  59. Zabbix does have some interesting features. A couple weeks ago, in the workaround.org blog, Zabbix Maps were explained fairly well. We have not made heavy use of this; however, it could potentially give us a graphical, relational way to reason about the data that Zabbix is gathering.
  60. Seriously… http://serverfault.com/questions/327472/zabbix-server-sends-too-many-notifications
  61. In Mad Max Beyond Thunderdome there is a cage match between Max and a huge opponent named Blaster. The crowd chants “Two men enter, one man leaves”.
  62. Remember back to my example of High Alerts vs. Disaster for the Service Layer? In the disaster scenario I get 4 alerts: one for each of the 3 hosts, and one for the disaster. However, this is likely all from one cause, meaning those alerts are correlated. But how do I get the system to only email me once? Sometimes a single cause can result in hundreds of emails from Zabbix. I heard one system engineer recently refer to this as “Getting Zabbixed”.
  63. Straight from the Zabbix Documentation: https://www.zabbix.com/documentation/2.0/manual/config/triggers/dependencies
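The idea behind trigger dependencies can be sketched in a few lines of Python. This is not how Zabbix implements it. The function name and the trigger names are made up, and the sketch only shows the effect: a trigger stays quiet while something it depends on is already in a problem state.

```python
def alerts_to_send(problems, depends_on):
    """Return the triggers worth alerting on.

    problems:   set of trigger names currently in a problem state
    depends_on: dict mapping a trigger name to the triggers it depends on
    """
    return sorted(
        t for t in problems
        # Suppress a trigger if any trigger it depends on is also failing.
        if not any(parent in problems for parent in depends_on.get(t, ()))
    )


# Hypothetical example: two hosts sit behind one router.
deps = {
    "web1 unreachable": ["router down"],
    "web2 unreachable": ["router down"],
}
failing = {"router down", "web1 unreachable", "web2 unreachable"}
to_send = alerts_to_send(failing, deps)
```

With the router down, only the router alert survives. If just `web1 unreachable` were failing, it would be sent as usual.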
  64. http://meinit.nl/zabbix-triggers-flap-detection-and-grace-period Systems can get into states where they send an Error and then immediately send an OK. A different monitoring system, Nagios, calls this “Flap detection”. In these cases real-time alerts are not of much value, because one of two things is going on: either the system is correcting itself somehow faster than a human can intervene, or the alerts are just the downstream effect of the network or another factor (that we should be using the previously mentioned trigger dependency for). Zabbix calls this Hysteresis, pronounced “Historee Sis”.
  65. Hysteresis is the dependence of a system not only on its current environment but also on its past environment. <Enter> For alerts such as this we can use the pipe character (|), which acts as a logical OR in Zabbix 2.0 trigger expressions, to chain the two conditions. <enter> Problem: being less than 10 GB for 5 minutes. <enter> Notice this uses the max over 5 minutes. <enter> Recovery: being more than 40 GB in the last 10 minutes. <enter> Notice the min over 10 minutes. <enter>
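The problem and recovery rules above can be sketched as a small Python function. This is an illustration of the logic, not Zabbix trigger syntax. The function name, the sample lists, and the assumption that the metric is something like free disk space in bytes are all mine.

```python
GB = 1024 ** 3


def in_problem_state(was_problem, last_5m, last_10m, low=10 * GB, high=40 * GB):
    """Return True if the trigger should be in the PROBLEM state.

    was_problem: the trigger's current state (the "past environment")
    last_5m:     samples collected over the last 5 minutes
    last_10m:    samples collected over the last 10 minutes
    """
    if not was_problem:
        # Fire only if even the largest recent sample is below the low mark,
        # i.e. the value stayed under 10 GB for the whole 5 minutes.
        return max(last_5m) < low
    # Already firing: stay in PROBLEM while the smallest sample in the
    # last 10 minutes is still under the high mark (40 GB).
    return min(last_10m) < high
```

The gap between the 10 GB and 40 GB marks is what stops the flapping: a value hovering around 10 GB fires once and then stays in PROBLEM until there is sustained headroom.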
  66. https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications From the Zabbix documentation (I have not fully tested this myself). First, check the box to Schedule Actions; this allows the actions on the right side. Next, set a period (maybe 120 seconds). Enable a recovery message. Make sure Trigger value = “PROBLEM” or you will delay the recovery message. Step 1 is not defined, so nothing happens at first; step 2 happens after 120 seconds.
  67. We need Thunderdome for our alerts: 100s of related alerts enter, one causal alert leaves.
  68. In discussing these methods of correlation, suppression, and delaying messages, I often get asked, “What if someone misses something?” <enter> A monitoring system that cries wolf too often is almost guaranteed not to get listened to. When I hear a car alarm these days I unfortunately almost never think that someone is trying to steal a car. While this is a valid question, it is not the most interesting question to me. It seems like a question that could stunt progress. The Zabbix community is working through an Action Simulator that may be part of a future release of Zabbix. Look for the blog entry entitled: “Why on earth was I not notified?!”
  69. Trends of flapping are better dealt with in a holistic manner. Zabbix is not yet great at daily/weekly reports, but it appears that the community has made a lot of headway and it will be in a near future release.
  70. So let’s return to my previous example. <enter> If I delay the notification by 120 seconds and the node recovers in time, then I get no notification. This is good, as it will cut down on the number of notifications. If the node does not recover in that time, the system as a whole is still up and I can deal with the problematic node individually. <enter>
  71. If all 3 nodes are down at the same time, however, I would not delay the notification of the Disaster. In this case, the system is not likely to recover in 2 minutes, so I would just be delaying the other 3 emails. <enter> I may be able to set up a trigger dependency; however, that would sort of be circular in my current opinion. Remember, trigger dependency was for a separate host. <enter>
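The policy in this example (delay per-node alerts, never delay the Disaster) boils down to a small decision function. The sketch below is illustrative Python, not Zabbix configuration. The severity labels and the function name are assumptions.

```python
DELAY_SECONDS = 120  # the grace period used in the example above


def should_notify(severity, seconds_in_problem, delay=DELAY_SECONDS):
    """Decide whether to send a notification for a trigger still in PROBLEM."""
    if severity == "disaster":
        # The whole service is down: tell someone right away.
        return True
    # A single node gets a grace period to recover on its own.
    return seconds_in_problem >= delay
```

A node that recovers within the 120 seconds never reaches the second branch with a large enough age, so no email goes out. The Disaster case skips the wait entirely.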
  72. In Beyond Thunderdome, Max is banished from Bartertown. He is found by a tribe of children who have a “tell” that prophesies his arrival. Again Max becomes a reluctant hero to this tribe of people.
  73. When Adam Jacob from Opscode was visiting our campus he walked through an example that we had been working through with proxies. He mentioned Promise Theory. <enter> I am going to use an example I lifted from John Willis of the DevOps Café Podcast. A promise of B from agent 1 to agent 2. http://www.socallinuxexpo.org/sites/default/files/presentations/scale11x-historyofmgmt-130222175623-phpapp01.pdf
  74. There are promises to give and promises to receive. Let’s use + for give and – for receive. I (a1) promise to feed my neighbor’s cat (a2). My neighbor (a2) promises to grant me access to his house. Trust comes in: that my neighbor gave me the correct code and I will not get arrested; that I will not drink his 25-year-old scotch.
  75. My Service promises to publish state.
  76. If you think this subject is interesting, Mark Burgess (who wrote CFEngine – a precursor to Chef – well before its time) recently published a 303-page draft of his book on the subject.
  77. I have had the opportunity to read many books and take classes on project management. We see this quote many times: “Nobody plans to fail, some just fail to plan.” <Enter> This is cute. <Enter> But it is wrong.
  78. Read the slide. Schedule strategic iteration time to work through monitoring… so you are not scheduling weekend war rooms.
  79. The Phoenix Project is a novel about IT and DevOps. It is about a company on the brink of complete failure.
  80. Beyond Thunderdome is yet again a dystopian future where the survival of many is in the hands of one man. <enter> It makes a great action movie, but not a great way to do business.
  81. Our team is built on enablement. We are structured around understanding, harnessing and providing the capabilities needed to deliver software in the Big Data world. There are many tools already in use by a large number of teams. Each of the tools used has a large open community outside of Cerner. We are focused on building an ecosystem within Cerner to solve the large scale problems we are facing with these large scale deployments.
  82. I have been asked many times in the past couple months, “Have you seen monitoring tool X? It is awesome.” I am sure that it is. Please show me why it is awesome. We have set up a way that you can do this. Visit our Incubator link on the uCern wiki. We would like to collect the awesome DevOps tools you are looking into, in a place where you can compare the capabilities to make the best decisions on which ones should be applied to your team.
  83. I had an architect recently refer to working on a monitoring solution as “technical debt” when his system was not yet in production. READ SLIDE
  84. The third installment closes with yet another epic chase in all sorts of vehicles and epic explosions. Max again comes out a hero…
  85. So to relate this back to Chris Brown’s Keynote yesterday?