Top-Down Approach to Monitoring
July 30, 2015
1996
Tivoli Software acquired by IBM
Patrol Software acquired by BMC
Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell NetWare servers
“HOW to monitor?” is the primary question
2015
https://www.bigpanda.io/monitoringscape/
Shifting from “How?” to “What?”
Bottom-Up Approach
Network, Servers, Apps → Overall System Health
Problem #1: Inflation of Tools
Problem #2: Inflation of “Whats”
Problem #3: Inflation of Alerts
We’re trying to answer a simple question:
Is our system in a healthy state?
No Alerts ≠ Healthy System
Many Alerts ≠ Unhealthy System
Healthy System = A system that continuously generates value for its users under a well-known set of KPIs
Top-Down Approach
KPIs, UX → Overall System Health

Top-Down vs Bottom-Up
Top-Down: KPIs, UX → Overall System Health
• Selective
• Proactive
Bottom-Up: Network, Servers, Apps → Overall System Health
• Exhaustive
• Reactive
Definition of KPI
A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.
Let’s play a game!
This is Sam
• CPU Utilization
• # Clicks on a button
• Temperature
What does Sam’s company do?
The Pulse of Netflix
We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
Healthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
Unhealthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
What’s so special about SPS?
• SPS is easy for all stakeholders to understand
• One metric that covers different points of failure: server problems, device problems, etc.
• Most important: it’s a clear KPI that indicates when user experience is compromised
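The quoted description says SPS follows a predictable pattern and fluctuates significantly when something is wrong. A minimal sketch of how such a check could look, assuming the baseline is a handful of samples taken at the same time of day in earlier weeks. The function names, the sample values, and the 20% tolerance are illustrative assumptions, not Netflix's actual method.

```python
from statistics import mean


def sps_deviation(current_sps, baseline_samples):
    """Relative deviation of the current value from the baseline mean.

    baseline_samples is assumed to hold SPS values observed at the same
    time of day in previous weeks (a hypothetical choice of baseline).
    """
    expected = mean(baseline_samples)
    if expected == 0:
        raise ValueError("baseline mean is zero; deviation is undefined")
    return (current_sps - expected) / expected


def is_unhealthy(current_sps, baseline_samples, tolerance=0.2):
    """Flag the KPI when it strays more than `tolerance` from the baseline."""
    return abs(sps_deviation(current_sps, baseline_samples)) > tolerance


# Hypothetical numbers, for illustration only.
baseline = [1000.0, 1040.0, 960.0]
print(is_unhealthy(980.0, baseline))  # small dip, within tolerance
print(is_unhealthy(600.0, baseline))  # large drop, outside tolerance
```

A single relative threshold keeps the check easy to explain to non-technical stakeholders, which matches the point of the slide. A production check would likely need a smarter baseline.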
But what about root cause analysis?
KPIs UX
Overall System Health
Network Servers Apps
GitHub: need for speed
https://github.com/blog/1252-how-we-keep-github-fast
The most important factor in web application
design is responsiveness. And the first step
toward responsiveness is speed. But speed
within a web application is complicated.
Start from the Top: Response Times Dashboard
https://github.com/blog/1252-how-we-keep-github-fast
• Each row represented a different major component
• Clicking one of the rows allows you to dive in and see the mean, 98th percentile, and 99.9th percentile response times
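The three figures named above (mean, 98th percentile, 99.9th percentile) can be computed from raw response times with a few lines of code. This is a sketch using the nearest-rank percentile definition, which is an assumption: the source does not say how GitHub computes its percentiles, and the function names are hypothetical.

```python
import math
from fractions import Fraction
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list.

    The rank is computed with exact fractions so that values such as
    99.9 are not distorted by floating-point rounding.
    """
    if not sorted_values:
        raise ValueError("need at least one sample")
    rank = math.ceil(Fraction(str(pct)) * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(response_times_ms):
    """Return the three figures shown per component on the dashboard."""
    ordered = sorted(response_times_ms)
    return {
        "mean": mean(ordered),
        "p98": percentile(ordered, 98),
        "p99.9": percentile(ordered, 99.9),
    }


# Synthetic samples: 1 ms, 2 ms, ..., 1000 ms.
summary = summarize(range(1, 1001))
print(summary)
```

The high percentiles matter because the mean hides the slow tail: a component can have a good average while a small share of requests is very slow.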
Digging Deeper: Mission Control Bar
https://github.com/blog/1252-how-we-keep-github-fast
Total Time, Render Time, Cache & Database, JS & CSS Size
And Deeper
https://github.com/blog/1252-how-we-keep-github-fast
Render Breakdown
SQL Query Viewer
Why talk about BigPanda?
Because Pandas are awesome!
BigPanda
Because...
• We’re not Netflix or GitHub: we’re a growing startup (7 devs, 1 full-time Ops)
• We feel the pain!
• Our KPIs are easy to describe and understand (especially if you’re an Ops person)
BigPanda
As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.
KPI: Low data pipeline latency
Pipeline Latency Metric
• Metrics are sent from within the apps
• Stored in Graphite
• Sum of all the average latencies of all alerts that went through the pipeline
• Monitored by Nagios
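As an illustration of "sent from within the apps" and "stored in Graphite", here is a minimal sketch that formats one latency sample for Graphite's plaintext protocol (one `path value timestamp` line per sample, by default received on TCP port 2003). The metric path, the helper names, and the timestamps are assumptions made for the example. The slides do not show BigPanda's actual instrumentation.

```python
import socket


def alert_latency_seconds(received_at, processed_at):
    """Time an alert spent in the pipeline, in seconds."""
    return processed_at - received_at


def format_graphite_line(path, value, timestamp):
    """One sample in Graphite's plaintext format: 'path value timestamp'."""
    return f"{path} {value} {timestamp}\n"


def send_to_graphite(line, host="localhost", port=2003):
    """Send a formatted line to a Graphite plaintext listener.

    host and port are assumptions; 2003 is the usual default port.
    """
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


# Hypothetical metric name and timestamps; nothing is sent here.
latency = alert_latency_seconds(received_at=100.0, processed_at=101.5)
line = format_graphite_line("pipeline.latency.seconds", latency, 1438214400)
print(line, end="")
```

Calling `send_to_graphite(line)` from the application would ship the sample. It is left out of the example so the snippet runs without a Graphite server.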
Pipeline Latency Metric
• Very good indicator of a possible service outage
• Must-have for detecting SLA violations
• Very good indicator of performance bottlenecks (can be broken down by sub-pipeline, specific organization, etc.)
• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
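Monitoring the latency KPI with Nagios comes down to a check that maps the current value to a status. A minimal sketch of such a check, using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the 5-second and 30-second thresholds are hypothetical, and fetching the value from Graphite is left out.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_latency(latency_seconds, warn=5.0, crit=30.0):
    """Map a pipeline latency to a Nagios status code and message.

    warn and crit are illustrative thresholds, not values from the talk.
    """
    if latency_seconds is None:
        return UNKNOWN, "UNKNOWN - no latency data available"
    if latency_seconds >= crit:
        return CRITICAL, f"CRITICAL - pipeline latency {latency_seconds:.1f}s"
    if latency_seconds >= warn:
        return WARNING, f"WARNING - pipeline latency {latency_seconds:.1f}s"
    return OK, f"OK - pipeline latency {latency_seconds:.1f}s"


code, message = check_latency(12.0)
print(message)
# A real plugin would finish with sys.exit(code) so Nagios sees the status.
```

Because the check is driven by one business-level number, a WARNING or CRITICAL state means users are already affected, which is what separates it from a pile of low-level alerts.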
TL;DR
• The bottom-up approach (“monitor all the things”) is easier to start with, but soon enough leads to alert fatigue and disorientation.
• The top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.
• Monitoring is, first of all, about your business.
Questions?
Thanks!
