This document discusses the benefits of a top-down approach to monitoring systems compared to a bottom-up approach. It provides examples of companies like Netflix and GitHub that use key performance indicators (KPIs) and high-level metrics to monitor overall system health from the top-down rather than monitoring individual components from the bottom-up. The document also discusses how BigPanda uses a pipeline latency metric as a KPI to monitor the reliability and performance of its unified monitoring dashboard.
2. 1996
• Tivoli Software acquired by IBM
• Patrol Software acquired by BMC
• Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell NetWare servers
“HOW to monitor?” is the primary question
15. Top-Down vs Bottom-Up
Top-Down: KPIs, UX → Overall System Health
• Selective
• Proactive
Bottom-Up: Network, Servers, Apps → Overall System Health
• Exhaustive
• Reactive
16. Definition of KPI
A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.
17. Let’s play a game!
This is Sam
• CPU Utilization
• # Clicks on a button
• Temperature
What does Sam’s company do?
18. The Pulse of Netflix
“We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created.”
We named it “SPS” for “starts per second”.
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
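A minimal sketch of the idea behind a "starts per second" style metric, not Netflix's actual implementation: count start events per one-second bucket and flag a bucket that strays too far from the expected value. The function names, the sample timestamps, and the 30% tolerance are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def starts_per_second(start_times: Iterable[float]) -> Dict[int, int]:
    """Count playback-start events in each one-second bucket."""
    return dict(Counter(int(t) for t in start_times))


def deviates(observed: float, expected: float, tolerance: float = 0.3) -> bool:
    """Return True when observed differs from expected by more than
    the given fraction (hypothetical default: 30%)."""
    if expected <= 0:
        return observed > 0
    return abs(observed - expected) / expected > tolerance


# Hypothetical event timestamps (seconds); in practice the expected value
# would come from the metric's own history (e.g. same time last week).
events = [100.1, 100.4, 100.9, 101.2, 101.8, 102.5]
sps = starts_per_second(events)
print(sps)                                      # {100: 3, 101: 2, 102: 1}
print(deviates(observed=sps[102], expected=3))  # True
```

The point of the sketch is that a single counter with a predictable shape is enough to raise the question "is something wrong?", regardless of which component caused the drop.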
21. What’s so special about SPS?
• SPS is easy to understand by all stakeholders
• One metric that covers different points of failure: server problems, device problems, etc.
• Most important: it’s a clear KPI that indicates when user experience is compromised
22. But what about root cause analysis?
[Diagram: KPIs and UX at the top, Overall System Health in the middle, Network, Servers and Apps at the bottom]
23. GitHub: need for speed
“The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated.”
https://github.com/blog/1252-how-we-keep-github-fast
24. Start from the Top: Response Times Dashboard
• Each row represents a different major component
• Clicking one of the rows allows you to dive in and see the mean, 98th percentile, and 99.9th percentile response times
https://github.com/blog/1252-how-we-keep-github-fast
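To make the three numbers on such a dashboard concrete, here is a small sketch that computes the mean, 98th percentile, and 99.9th percentile of a list of response times. It uses the nearest-rank percentile definition, which is an assumption: the source does not say how GitHub computes its percentiles, and the sample data is invented.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Mean plus the two tail percentiles named on the slide."""
    return {
        "mean": mean(samples),
        "p98": percentile(samples, 98),
        "p99.9": percentile(samples, 99.9),
    }


# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [120, 95, 110, 105, 2400, 100, 98, 102, 115, 108]
print(summarize(response_times_ms))
```

With so few samples both tail percentiles land on the slow outlier, while the mean is pulled far above the typical request. That gap is why a dashboard shows the tail next to the mean.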
25. Digging Deeper: Mission Control Bar
• Total Time
• Render Time
• Cache & Database
• JS & CSS Size
https://github.com/blog/1252-how-we-keep-github-fast
28. BigPanda
Because...
• We’re not Netflix or GitHub: growing startup (7 devs, 1 full-time Ops)
• We feel the pain!
• Our KPIs are easy to describe and understand (especially if you’re an Ops person)
29. BigPanda
As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.
KPI: Low data pipeline latency
30. Pipeline Latency Metric
• Metrics are sent from within the apps
• Stored in Graphite
• Sum of all the average latencies of all alerts that went through the pipeline
• Monitored by Nagios
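The bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not BigPanda's code: the stage names, the metric path `bigpanda.pipeline.latency`, and the reading of "sum of average latencies" as "average latency per pipeline stage, summed over stages" are all hypothetical. The line format and default port 2003 are those of Graphite's plaintext protocol.

```python
import socket
import time
from statistics import mean
from typing import Dict, List, Optional


def pipeline_latency(stage_latencies: Dict[str, List[float]]) -> float:
    """Sum the average latency (seconds) of every stage that has samples."""
    return sum(mean(s) for s in stage_latencies.values() if s)


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one metric as '<path> <value> <unix timestamp>' plus newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one plaintext-protocol line to a Carbon listener."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical per-stage latency samples, in seconds.
stages = {"ingest": [0.25, 0.75], "correlate": [1.0, 2.0], "notify": [0.5]}
latency = pipeline_latency(stages)
print(latency)  # 2.5
print(graphite_line("bigpanda.pipeline.latency", latency, 1700000000), end="")
# send_to_graphite(...) would be called here when a Carbon listener is reachable.
```

Emitting the metric from inside the application is what makes this a top-down measurement: it reports what an alert actually experienced end to end, not what each host believes about itself.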
31. Pipeline Latency Metric
• Very good indicator of possible service outage
• Must-have for detection of SLA violation
• Very good indicator of performance bottlenecks (can be broken down by sub-pipeline, specific organization, etc.)
• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
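One way to turn such a latency value into an alert is a Nagios-style check, sketched below. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention; the thresholds, the sub-pipeline names, and the idea of alerting on the slowest sub-pipeline are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_latency(latency_s: Optional[float],
                  warn_s: float = 30.0,
                  crit_s: float = 60.0) -> Tuple[int, str]:
    """Map a latency in seconds to a (exit code, status line) pair.
    The thresholds are hypothetical; real ones would come from the SLA."""
    if latency_s is None:
        return UNKNOWN, "UNKNOWN - no latency data"
    perfdata = f"latency={latency_s:.1f}s;{warn_s:.0f};{crit_s:.0f}"
    if latency_s >= crit_s:
        return CRITICAL, f"CRITICAL - pipeline latency {latency_s:.1f}s | {perfdata}"
    if latency_s >= warn_s:
        return WARNING, f"WARNING - pipeline latency {latency_s:.1f}s | {perfdata}"
    return OK, f"OK - pipeline latency {latency_s:.1f}s | {perfdata}"


def slowest(latencies: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, latency) of the slowest sub-pipeline."""
    name = max(latencies, key=latencies.get)
    return name, latencies[name]


# Hypothetical breakdown by sub-pipeline, in seconds.
by_sub_pipeline = {"ingest": 4.0, "correlate": 45.0, "notify": 2.0}
name, value = slowest(by_sub_pipeline)
code, message = check_latency(value)
print(name, code, message)
```

In a real plugin the script would print the status line and call `sys.exit(code)` so that Nagios picks up the state.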
32. TL;DR
• Bottom-up approach (“monitor all the things”) is easier to start with, but soon enough leads to alert fatigue and disorientation.
• Top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate to you what you need to measure.
• Monitoring is, first of all, about your business.