Top-Down Approach to Monitoring

July 30, 2015

1996
• Tivoli Software acquired by IBM
• Patrol Software acquired by BMC
• Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell Netware servers
“HOW to monitor?” is the primary question

2015
https://www.bigpanda.io/monitoringscape/

Shifting from “How?” to “What?”

Bottom-Up Approach
Network, Servers, Apps → Overall System Health

Problem #1: Inflation of Tools

Problem #2: Inflation of “Whats”

Problem #3: Inflation of Alerts

We’re trying to answer a simple question:
Is our system in a healthy state?

No Alerts ≠ Healthy System
Many Alerts ≠ Unhealthy System

Healthy System = A system that continuously generates value for its users under a well-known set of KPIs

Top-Down Approach
KPIs, UX

Top-Down vs Bottom-Up
Top-Down (KPIs, UX):
• Selective
• Proactive
Bottom-Up (Network, Servers, Apps, Overall System Health):
• Exhaustive
• Reactive

Definition of KPI
A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.

Let’s play a game!
This is Sam
• CPU Utilization
• # Clicks on a button
• Temperature
What does Sam’s company do?

The Pulse of Netflix
We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html

Healthy SPS Pattern

Unhealthy SPS Pattern

What’s so special about SPS?
• SPS is easy to understand by all stakeholders
• One metric that covers different points of failure: server problems, device problems, etc.
• Most important: it’s a clear KPI that indicates when user experience is compromised
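
As a rough illustration of watching a KPI with a predictable pattern, the sketch below compares the current value against a baseline taken from comparable past periods and flags large deviations. This is not how Netflix detects SPS anomalies (the slides do not describe that); the function name `is_unhealthy`, the `max_sigma` threshold, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_unhealthy(current_sps, baseline_sps, max_sigma=3.0):
    """Flag the KPI when it strays too far from its usual pattern.

    baseline_sps holds values observed at comparable times in the past
    (for example, the same minute on previous weeks). The 3-sigma default
    is an illustrative assumption, not a recommended setting.
    """
    expected = mean(baseline_sps)
    spread = stdev(baseline_sps)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current_sps != expected
    return abs(current_sps - expected) / spread > max_sigma


# Hypothetical starts-per-second readings for the same time slot.
baseline = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
print(is_unhealthy(1005.0, baseline))  # close to the usual pattern
print(is_unhealthy(600.0, baseline))   # a sharp drop in playback starts
```

A single check like this covers many failure modes at once, which is the point of the slide: a drop in the KPI says users are affected, whatever the underlying cause.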

But what about root cause analysis?
KPIs, UX → Network, Servers, Apps

GitHub: need for speed
https://github.com/blog/1252-how-we-keep-github-fast
The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated.

Start from the Top: Response Times Dashboard
• Each row represents a different major component
• Clicking one of the rows allows you to dive in and see the mean, 98th percentile, and 99.9th percentile response times
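
To make the three figures on that dashboard concrete, here is a minimal sketch that computes the mean, 98th percentile, and 99.9th percentile of a set of response times. It uses the nearest-rank percentile definition, which is an assumption; the slides do not say how GitHub computes its percentiles, and the sample data is invented.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the three statistics shown per component on the dashboard."""
    return {
        "mean": mean(samples),
        "p98": percentile(samples, 98),
        "p99.9": percentile(samples, 99.9),
    }


# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [10, 12, 11, 13, 12, 11, 10, 14, 12, 500]
print(summarize(response_times_ms))
```

The high percentiles expose the slow tail that the mean partly hides, which is why a dashboard would show them side by side.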

Digging Deeper: Mission Control Bar
Total Time, Render Time, Cache & Database, JS & CSS Size

And Deeper
Render Breakdown
SQL Query Viewer

Why talk about BigPanda?
Because Pandas are awesome!

BigPanda
Because...
• We’re not Netflix or GitHub: growing startup (7 devs, 1 full-time Ops)
• We feel the pain!
• Our KPIs are easy to describe and understand (especially if you’re an Ops person)

BigPanda
As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.
KPI: Low data pipeline latency

Pipeline Latency Metric
• Metrics are sent from within the apps
• Stored in Graphite
• Sum of all the average latencies of all alerts that went through the pipeline
• Monitored by Nagios
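
The flow above can be sketched roughly as follows: sum per-stage average latencies, format the result for Graphite's plaintext protocol (one `path value timestamp` line, by default on port 2003), and map the value to the conventional Nagios plugin exit codes. Reading the metric as a sum of per-stage averages is an interpretation of the slide, and the stage names, metric path, and thresholds are hypothetical.

```python
import socket

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def pipeline_latency(stage_averages):
    """Sum the average latency (in seconds) of every pipeline stage."""
    return sum(stage_averages.values())


def graphite_line(path, value, timestamp):
    """Format one sample as 'path value timestamp' followed by a newline."""
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host, port=2003):
    """Send one plaintext sample to a Graphite (carbon) listener."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


def nagios_status(latency, warn, crit):
    """Map a latency value to a Nagios-style status code."""
    if latency >= crit:
        return CRITICAL
    if latency >= warn:
        return WARNING
    return OK


# Hypothetical stage names and averages, in seconds.
stages = {"ingest": 0.25, "normalize": 0.5, "correlate": 1.25}
latency = pipeline_latency(stages)
line = graphite_line("bigpanda.pipeline.latency", latency, 1438214400)
status = nagios_status(latency, warn=5.0, crit=10.0)
```

`send_to_graphite` is left uncalled here because it needs a reachable Graphite host. In a real check, the status code would be passed to `sys.exit` so Nagios can read it.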

Pipeline Latency Metric
• Very good indicator of possible service outage
• Must-have for detection of SLA violation
• Very good indicator of performance bottlenecks (can be broken down to sub-pipelines / specific organizations, etc.)
• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
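
Breaking the metric down by sub-pipeline or organization might look like the sketch below, which groups latency samples by a chosen key and reports the slowest group. The sample records, field names (`org`, `stage`, `latency`), and helper names are assumptions for illustration; the slides do not show how BigPanda stores or tags these samples.

```python
from collections import defaultdict
from statistics import mean


def latency_by(samples, key):
    """Average latency per group, where key is a field such as 'org' or 'stage'."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["latency"])
    return {name: mean(values) for name, values in groups.items()}


def slowest(samples, key):
    """Name of the group with the highest average latency."""
    averages = latency_by(samples, key)
    return max(averages, key=averages.get)


# Hypothetical latency samples, in seconds.
samples = [
    {"org": "acme", "stage": "ingest", "latency": 0.5},
    {"org": "acme", "stage": "correlate", "latency": 1.5},
    {"org": "globex", "stage": "ingest", "latency": 0.5},
    {"org": "globex", "stage": "correlate", "latency": 4.5},
]
print(latency_by(samples, "org"))
print(slowest(samples, "stage"))
```

The same high-level number stays the alerting signal; the breakdown is only consulted once it moves, to find where the time is going.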

TL;DR
• The bottom-up approach (“monitor all the things”) is easier to start with, but soon enough leads to alert fatigue and disorientation.
• The top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.
• Monitoring is, first of all, about your business.
