SlideShare a Scribd company logo
1 of 76
Download to read offline
Brought to you
by
Misery Metrics and
Consequences
Gil Tene
CTO & co-Founder, Azul
Systems @giltene
Misery Metrics
and
Consequences
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
Agenda
Some intro stuff
How much of monitoring is simply broken
How it’s so bad that there is no fixing it
lights and tunnels
appreciating misery
Understanding Latency
and
Application Responsiveness
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
(Micro?)services
and
Response Time
Behavior
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How NOT to
Measure Latency
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
The “Oh S@%#!”
talk
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How I learned
to
stop worrying
and
love Misery
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How I learned
to
stop worrying
and
love Misery
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
(Micro)services
Source: “Visualizing and Tracking your Microservices”, AppDynamics blog
Source: “Architecting for Speed: How agile innovators accelerate growth through microservices”, Jesper Nordstrom & Tomas Betzholtzblog, 3gamma.com
Source: “Microservices at Amazon”, Alan Ho (@karlunho), apigee developer blog, image from circa 2009
This is a Microservice
This is a Microservices
Architecture
?
?
?
Source: “Netflix OSS, Spring Cloud, or Kubernetes? How About All of Them!”, Christian Posta, DZone
Remember: all I'm offering is the truth.
We like to look at pretty charts…
95%’lie
The “We only want to show good things” chart
What you find when you dig deeper into the 9s
What you find when you dig deeper into the 9s
What (TF) does the Average
of the 95%’lie mean?
What (TF) does the Average
of the 95%’lie mean?
Lets do the same with 100%’ile; Suppose we have a set of
100%’ile values for each minute:
[1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]
“The average 100%’ile over the past 15 minutes was 42”
Same nonsense applies to averaging any other %’lie
#LatencyTipOfTheDay:
You can't average percentiles.
Period.
What percentile levels do YOU monitor?
99%’lie: a good indicator, right?
What are the chances of a single web page view
experiencing >99%’lie latency of:
- A single search engine node?
- A single Key/Value store node?
- A single Database node?
- A single CDN request?
- A single (micro)service call?
#LatencyTipOfTheDay:
MOST page loads will experience worse
than the 99%'lie server response
So why are we measuring it?
Gauging user experience
Example: If a typical user session involves 5 page loads,
averaging 40 resources per page.
-How many of our users will NOT experience something
worse than the 95%’lie of http requests?
Answer: ~0.003%
-How many of our users will NOT experience something
worse than the 99%’lie of http requests?
Answer: ~13%
Gauging user experience
Example: If a typical user session involves 5 page loads,
averaging 40 resources per page.
-What http response percentile will be experienced by the
95%’ile of users?
Answer: ~99.97%
-What http response percentile will be experienced by the
99%’ile of users
Answer: ~99.995%
Why don’t we have response time
or latency stats with multiple 9s
in them???
You can’t average percentiles…
And you also can’t get an hour’s
99.999%’lie out of lots
of 10 second interval 99%’lie
reports…
Why don’t we have response time
or latency stats with multiple 9s
in them???
You can’t average percentiles…
And you also can’t get an hour’s 99.999%’lie
out of lots
of 10 second interval 99%’lie reports…
Check out HdrHistogram
It lets you have nice things….
Why don’t we have response time or
latency stats with multiple 9s in
them???
0% 90% 99% 99.99% 99.999% 99.9999%
140
120
100
80
60
40
20
0
Hiccup
Duration
(msec)
99.9%
Percentile
Hiccups by Percentile Distribution
Hiccups by Percentile SLA
The coordinated omission
problem
An accidental conspiracy.
.
The lie in the 99%’lies
Coordinated Omission in Monitoring Code
Long operations only get measured once
delays outside of timing window do not get measured at all
How bad can this get?
Avg. is 1
msec over 1st
100 sec
System
Stalled for
100 Sec
System easily
handles 100
requests/sec
Responds to
each in
1msec
How would you characterize this
system?
Elapsed Time
~75%‘ile is 50
sec
~50%‘ile is 1
msec
99.99%‘ile is
~100sec
Avg. is 50 sec.
over next 100
sec
Overall Average response time is ~25
sec.
Measurement in practice
System
Stalled for
100 Sec
System easily
handles 100
requests/sec
Responds to
each in
1msec
What actually gets
measured?
Elapsed Time
75%‘lie is 1
msec
99.99%‘lie is 1
msec
50%‘ile is 1
msec
10,000
measurements
@ 1 msec each
1
measurement
@ 100 sec
Overall Average is 10.9 msec
(!!!)
Proper measurement
System
Stalled for
100 Sec
Elapsed
Time
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000
results
Varying
linearly from
100 sec
to 10 msec
10,000
results @ 1
msec each
~50%‘ile is 1
msec
~75%‘ile is 50
sec
99.99%‘ile is
~100sec
Proper measurement
System
Stalled for
100 Sec
Elapsed
Time
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000
results
Varying
linearly from
100 sec
to 10 msec
10,000
results @ 1
msec each
~50%‘ile is 1
msec
~75%‘ile is 50
sec
99.99%‘ile is
~100sec
Coordinated
Omission
1 msec 1 msec
1 result
@ 100
sec
“Better” can look “Worse”
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000 @
1msec
10,000 @ 5
msec
System
Slowed for
100 Sec
Still easily
handles 100
requests/sec
50%‘ile is 1
msec
99.99%‘lie is
~5msec
Responds to
each in 5
msec
Elapsed Time
75%‘lie is
2.5msec
Service Time vs. Response
Time
Response Time vs. Service Time @60K/sec
How “real” people react
How does the world
keep working ?!?!?
Luckily, I think that there is
a rainbow at the end of the
tunnel…
How does the world
even keep working
?!?!?
Something we are
doing actually works…
Something we are
doing actually works…
3000 BC
Graduated with honors
Witch Doctor Academy
Timeout
s
Retries Misses
Abandoned
Shopping Carts
Loss of
Engagemen
t
Failed
Queries
Angry
Phone Calls
It is nonsense to ask:
“was my 99%’lie a timeout?”
Instead Ask:
“How many timeouts
did I have?”
&
“What % of my
requests timed out?”
TransactionLoad
Success Rate
Looking at misery
Focus on measuring Misery
@gilten
Love
Misery
Love
Misery
Don’t aim to make a specific %’ile acceptable
Instead, aim for an acceptable %
of misery
final #LatencyTipOfTheDay:
Brought to you
by
Gil Tene
@gilten
e

More Related Content

What's hot

Cloud computing.pptx
Cloud computing.pptxCloud computing.pptx
Cloud computing.pptx
andrewbourget
 
R2서버정진욱
R2서버정진욱R2서버정진욱
R2서버정진욱
jungjinwouk
 

What's hot (20)

VMware vSphere 6.0 - Troubleshooting Training - Day 4
VMware vSphere 6.0 - Troubleshooting Training - Day 4VMware vSphere 6.0 - Troubleshooting Training - Day 4
VMware vSphere 6.0 - Troubleshooting Training - Day 4
 
Virtualization Uses - Server Consolidation
Virtualization Uses - Server Consolidation Virtualization Uses - Server Consolidation
Virtualization Uses - Server Consolidation
 
JET ENGINE PPT BY SANDEEP YADAV
JET ENGINE PPT BY SANDEEP YADAVJET ENGINE PPT BY SANDEEP YADAV
JET ENGINE PPT BY SANDEEP YADAV
 
IP Address Management Best Practices
IP Address Management Best PracticesIP Address Management Best Practices
IP Address Management Best Practices
 
six stroke engine powerpoint presentation
six stroke engine powerpoint presentationsix stroke engine powerpoint presentation
six stroke engine powerpoint presentation
 
DevJam 2019 - Introduction to Kubernetes
DevJam 2019 - Introduction to KubernetesDevJam 2019 - Introduction to Kubernetes
DevJam 2019 - Introduction to Kubernetes
 
Cloud computing.pptx
Cloud computing.pptxCloud computing.pptx
Cloud computing.pptx
 
HPE-Synergy-12000-Frame-Setup-and-Installation-Guide.pdf
HPE-Synergy-12000-Frame-Setup-and-Installation-Guide.pdfHPE-Synergy-12000-Frame-Setup-and-Installation-Guide.pdf
HPE-Synergy-12000-Frame-Setup-and-Installation-Guide.pdf
 
SolarWinds IP Address Manager New Version 3.1 is here!
SolarWinds IP Address Manager New Version 3.1 is here!SolarWinds IP Address Manager New Version 3.1 is here!
SolarWinds IP Address Manager New Version 3.1 is here!
 
Learning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNILearning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNI
 
Halvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromisedHalvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromised
 
R2서버정진욱
R2서버정진욱R2서버정진욱
R2서버정진욱
 
VCR Engine
VCR EngineVCR Engine
VCR Engine
 
GCGC- CGCII 서버 엔진에 적용된 기술 (1)
GCGC- CGCII 서버 엔진에 적용된 기술 (1)GCGC- CGCII 서버 엔진에 적용된 기술 (1)
GCGC- CGCII 서버 엔진에 적용된 기술 (1)
 
Docker Advanced registry usage
Docker Advanced registry usageDocker Advanced registry usage
Docker Advanced registry usage
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Server training
Server trainingServer training
Server training
 
Understanding MicroK8s as a Beginner.pdf
Understanding MicroK8s as a Beginner.pdfUnderstanding MicroK8s as a Beginner.pdf
Understanding MicroK8s as a Beginner.pdf
 
3500BB(1).ppt
3500BB(1).ppt3500BB(1).ppt
3500BB(1).ppt
 
hosting.ppt
hosting.ppthosting.ppt
hosting.ppt
 

Similar to Misery Metrics & Consequences

The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
Alois Reitbauer
 
10 impact of uncertainty in matching
10 impact of uncertainty in matching10 impact of uncertainty in matching
10 impact of uncertainty in matching
Rishi Mathur
 

Similar to Misery Metrics & Consequences (20)

How NOT to Measure Latency
How NOT to Measure LatencyHow NOT to Measure Latency
How NOT to Measure Latency
 
MeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World RumMeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World Rum
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on AzureVSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Coradiant
CoradiantCoradiant
Coradiant
 
When down is not good enough. SRE On Azure
When down is not good enough. SRE On AzureWhen down is not good enough. SRE On Azure
When down is not good enough. SRE On Azure
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
Performance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsPerformance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and Apps
 
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday seasonMeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
 
Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is Dead
 
MeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPIMeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPI
 
Understanding Latency Response Time and Behavior
Understanding Latency Response Time and BehaviorUnderstanding Latency Response Time and Behavior
Understanding Latency Response Time and Behavior
 
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and ToolsIntelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
10 impact of uncertainty in matching
10 impact of uncertainty in matching10 impact of uncertainty in matching
10 impact of uncertainty in matching
 
Web Performance BootCamp 2013
Web Performance BootCamp 2013Web Performance BootCamp 2013
Web Performance BootCamp 2013
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 

More from ScyllaDB

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Misery Metrics & Consequences

  • 1. Brought to you by Misery Metrics and Consequences Gil Tene CTO & co-Founder, Azul Systems @giltene
  • 2. Misery Metrics and Consequences Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 3. Agenda Some intro stuff How much of monitoring is simply broken How it’s so bad that there is no fixing it lights and tunnels appreciating misery
  • 4. Understanding Latency and Application Responsiveness Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 5. (Micro?)services and Response Time Behavior Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 6. How NOT to Measure Latency Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 7. The “Oh S@%#!” talk Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 8. How I learned to stop worrying and love Misery Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 9. How I learned to stop worrying and love Misery Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 11. Source: “Visualizing and Tracking your Microservices”, AppDynamics blog
  • 12. Source: “Architecting for Speed: How agile innovators accelerate growth through microservices”, Jesper Nordstrom & Tomas Betzholtzblog, 3gamma.com
  • 13. Source: “Microservices at Amazon”, Alan Ho (@karlunho), apigee developer blog, image from circa 2009
  • 14. This is a Microservice
  • 15. This is a Microservices Architecture
  • 16.
  • 17.
  • 18.
  • 19.
  • 20. ? ? ?
  • 21. Source: “Netflix OSS, Spring Cloud, or Kubernetes? How About All of Them!”, Christian Posta, DZone
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. Remember: all I'm offering is the truth.
  • 27. We like to look at pretty charts… 95%’lie The “We only want to show good things” chart
  • 28.
  • 29. What you find when you dig deeper into the 9s
  • 30. What you find when you dig deeper into the 9s
  • 31.
  • 32.
  • 33.
  • 34. What (TF) does the Average of the 95%’lie mean?
  • 35. What (TF) does the Average of the 95%’lie mean? Lets do the same with 100%’ile; Suppose we have a set of 100%’ile values for each minute: [1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2] “The average 100%’ile over the past 15 minutes was 42” Same nonsense applies to averaging any other %’lie
  • 37. What percentile levels do YOU monitor?
  • 38. 99%’lie: a good indicator, right? What are the chances of a single web page view experiencing >99%’lie latency of: - A single search engine node? - A single Key/Value store node? - A single Database node? - A single CDN request? - A single (micro)service call?
  • 39.
  • 40.
  • 41. #LatencyTipOfTheDay: MOST page loads will experience worse than the 99%'lie server response So why are we measuring it?
  • 42. Gauging user experience Example: If a typical user session involves 5 page loads, averaging 40 resources per page. -How many of our users will NOT experience something worse than the 95%’lie of http requests? Answer: ~0.003% -How many of our users will NOT experience something worse than the 99%’lie of http requests? Answer: ~13%
  • 43. Gauging user experience Example: If a typical user session involves 5 page loads, averaging 40 resources per page. -What http response percentile will be experienced by the 95%’ile of users? Answer: ~99.97% -What http response percentile will be experienced by the 99%’ile of users Answer: ~99.995%
  • 44. Why don’t we have response time or latency stats with multiple 9s in them???
  • 45. You can’t average percentiles… And you also can’t get an hour’s 99.999%’lie out of lots of 10 second interval 99%’lie reports… Why don’t we have response time or latency stats with multiple 9s in them???
  • 46. You can’t average percentiles… And you also can’t get an hour’s 99.999%’lie out of lots of 10 second interval 99%’lie reports… Check out HdrHistogram It lets you have nice things…. Why don’t we have response time or latency stats with multiple 9s in them??? 0% 90% 99% 99.99% 99.999% 99.9999% 140 120 100 80 60 40 20 0 Hiccup Duration (msec) 99.9% Percentile Hiccups by Percentile Distribution Hiccups by Percentile SLA
  • 47. The coordinated omission problem An accidental conspiracy. . The lie in the 99%’lies
  • 48. Coordinated Omission in Monitoring Code Long operations only get measured once delays outside of timing window do not get measured at all
  • 49. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? Elapsed Time ~75%‘ile is 50 sec ~50%‘ile is 1 msec 99.99%‘ile is ~100sec Avg. is 50 sec. over next 100 sec Overall Average response time is ~25 sec.
  • 50. Measurement in practice System Stalled for 100 Sec System easily handles 100 requests/sec Responds to each in 1msec What actually gets measured? Elapsed Time 75%‘lie is 1 msec 99.99%‘lie is 1 msec 50%‘ile is 1 msec 10,000 measurements @ 1 msec each 1 measurement @ 100 sec Overall Average is 10.9 msec (!!!)
  • 51. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
  • 52. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec Coordinated Omission 1 msec 1 msec 1 result @ 100 sec
  • 53. “Better” can look “Worse” System easily handles 100 requests/sec Responds to each in 1msec 10,000 @ 1msec 10,000 @ 5 msec System Slowed for 100 Sec Still easily handles 100 requests/sec 50%‘ile is 1 msec 99.99%‘lie is ~5msec Responds to each in 5 msec Elapsed Time 75%‘lie is 2.5msec
  • 54. Service Time vs. Response Time
  • 55. Response Time vs. Service Time @60K/sec
  • 57.
  • 58. How does the world keep working ?!?!?
  • 59. Luckily, I think that there is a rainbow at the end of the tunnel…
  • 60. How does the world even keep working ?!?!?
  • 61. Something we are doing actually works…
  • 62. Something we are doing actually works…
  • 63.
  • 64. 3000 BC Graduated with honors Witch Doctor Academy
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70. Timeout s Retries Misses Abandoned Shopping Carts Loss of Engagemen t Failed Queries Angry Phone Calls
  • 71. It is nonsense to ask: “was my 99%’lie a timeout?” Instead Ask: “How many timeouts did I have?” & “What % of my requests timed out?”
  • 75. Love Misery Don’t aim to make a specific %’ile acceptable Instead, aim for an acceptable % of misery final #LatencyTipOfTheDay:
  • 76. Brought to you by Gil Tene @gilten e