Confidential + Proprietary
Finding (and Fixing!) Performance Anomalies
in Large Scale Distributed Systems
Victor Marmol
vmarmol@google.com
Today
App
? ? ?
Containers Infrastructure
Manage containers @ Google
Everything runs in a container
2B+ containers started per week
Images by Connie Zhou
You May Know Some of Our OSS Work
Let Me Contain That For You (lmctfy)
What about at Google?
Borg
What is Borg?
Large-scale cluster management at Google with Borg
Borglet
Google’s node agent
Borglet = init + Docker + a few other things
Primary goals
➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)
How do we get to task performance management?
Dremel: Interactive Analysis of Web-Scale Datasets
Task Performance Analysis (TPA)
Our system for container-based black-box application performance analysis
Containers are the main enabler
Manage, monitor, and improve application performance
Today’s Talk
➔ How does it work
➔ User stories: stories from the front-lines!
[Diagram: App inside a Container]
How does it work?
Overall Flow
Collection → Aggregation → Baselines → SLOs → Enforcement
Low-Level Performance Metrics
Key: collect lots of container-based low-level metrics from the kernel
Custom kernel patches to give us even more stats and metrics
Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g., netlink, ioctls)
[Diagram: a Container running an App, emitting low-level performance metrics and telemetry]
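As a rough illustration of container-level collection from the kernel, the sketch below parses text in the layout of a cgroup v2 cpu.stat file (one "key value" pair per line). The helper names parse_flat_keyed and throttled_fraction and the sample values are assumptions made for this example; the talk does not say how Borglet reads these sources, and Google's custom kernel patches are not modeled here.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse a cgroup 'flat keyed' file: one 'key value' pair per line."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: Dict[str, int]) -> float:
    """Fraction of CPU bandwidth periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Sample text in the cgroup v2 cpu.stat layout (the values are made up).
SAMPLE_CPU_STAT = """\
usage_usec 1200000
user_usec 900000
system_usec 300000
nr_periods 200
nr_throttled 10
throttled_usec 45000
"""

stats = parse_flat_keyed(SAMPLE_CPU_STAT)
print(stats["usage_usec"], throttled_fraction(stats))
```

On a real machine the same parser could be fed the contents of a container's cpu.stat file under the cgroup filesystem; the path depends on how the host mounts cgroups.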
Low-Level Performance Metrics
Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
Collection is very low-overhead
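To make the histogram idea concrete, here is a minimal sketch of a latency histogram with power-of-two buckets that can report the number of operations, their breakdown, and a tail bound. The class name LatencyHistogram, the bucket layout, and the sample values are assumptions for illustration; the slides do not describe the actual bucket scheme.

```python
class LatencyHistogram:
    """Power-of-two buckets: bucket i counts latencies in [2**i, 2**(i+1)) microseconds.

    Latencies below 1 microsecond also land in bucket 0, and anything past
    the last bucket is clamped into it.
    """

    def __init__(self, num_buckets: int = 32) -> None:
        self.counts = [0] * num_buckets

    def record(self, latency_us: float) -> None:
        # bit_length() - 1 is floor(log2(n)) for positive integers.
        index = 0 if latency_us < 1 else int(latency_us).bit_length() - 1
        self.counts[min(index, len(self.counts) - 1)] += 1

    def total(self) -> int:
        return sum(self.counts)

    def percentile_upper_bound(self, q: float) -> int:
        """Upper edge of the first bucket at which fraction q of samples is covered."""
        total = self.total()
        if total == 0:
            raise ValueError("histogram is empty")
        target = q * total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 2 ** (i + 1)
        return 2 ** len(self.counts)


hist = LatencyHistogram()
for _ in range(98):
    hist.record(100)    # typical operation, about 100 microseconds
for _ in range(2):
    hist.record(5000)   # rare slow operation, about 5 milliseconds

print(hist.total(), hist.percentile_upper_bound(0.5), hist.percentile_upper_bound(0.99))
```

A fixed array of counters keeps recording cheap, which fits the low-overhead goal, and the median alone would hide the slow operations that the 99th percentile bound exposes.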
Cluster-Wide Aggregation
Cluster service that collects all metrics and exports them to Dremel
Push data for all tasks on all machines, keep them for a while
Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
Best news: You can use it too! Google BigQuery
[Diagram labels: Performance Data, DB, BigQuery]
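The kind of SQL slicing described above might look like the following sketch. It uses Python's built-in sqlite3 module as a local stand-in for Dremel or BigQuery, and the task_metrics table, its columns, and the rows are hypothetical; the real schema is not given in the talk.

```python
import sqlite3

# In-memory database standing in for the cluster-wide metrics store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_metrics ("
    "cell TEXT, job TEXT, task INTEGER, wakeup_latency_ms REAL)"
)
rows = [
    ("cell-a", "websearch", 0, 1.0),
    ("cell-a", "websearch", 1, 3.0),
    ("cell-b", "websearch", 0, 8.0),
    ("cell-b", "websearch", 1, 10.0),
    ("cell-b", "batch", 0, 40.0),
]
conn.executemany("INSERT INTO task_metrics VALUES (?, ?, ?, ?)", rows)

# Slice by cell and job, worst average wakeup latency first.
QUERY = """
SELECT cell, job,
       COUNT(*) AS samples,
       AVG(wakeup_latency_ms) AS avg_ms,
       MAX(wakeup_latency_ms) AS max_ms
FROM task_metrics
GROUP BY cell, job
ORDER BY avg_ms DESC
"""
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

The same GROUP BY shape carries over to other dimensions (machine type, time window, replica index) by swapping the grouping columns.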
Performance Baselines
Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends
Gives us insights into performance trends and helps us develop performance
baselines
Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
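One way to turn sliced data into a baseline is to take a high percentile of a metric per group. The sketch below does this per (cell, task type) pair with a nearest-rank percentile. The grouping keys, the choice of the 90th percentile, and the sample numbers are assumptions for illustration, not the method used by TPA.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nearest_rank(values: List[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least fraction q at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def build_baselines(
    samples: Iterable[Tuple[str, str, float]], q: float = 0.9
) -> Dict[Tuple[str, str], float]:
    """Group (cell, task_type, latency_ms) samples and take the q-th percentile of each group."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for cell, task_type, latency_ms in samples:
        grouped[(cell, task_type)].append(latency_ms)
    return {key: nearest_rank(vals, q) for key, vals in grouped.items()}


samples = [("cell-a", "serving", float(ms)) for ms in range(1, 11)]
samples += [("cell-b", "serving", ms) for ms in (2.0, 4.0, 6.0, 8.0, 50.0)]

baselines = build_baselines(samples)
for key, value in sorted(baselines.items()):
    print(key, value)
```

Comparing the two cells shows why per-cell slicing matters: a single slow replica in the smaller group dominates its 90th percentile.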
Baselines → SLOs
From baselines we provide performance SLOs: a promise to the user
You promise to do X
➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue fewer than X I/Os per second
We promise to give you Y performance
➔ CPU: You will get scheduled on a CPU within Y ms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Y ms
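The two-sided nature of these SLOs can be sketched as a small check: the infrastructure's promise (Y) only applies while the task keeps its own promise (X). The names CpuSlo, Verdict, and evaluate are hypothetical, and using a tail wakeup latency as the measured value is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    MET = "met"
    VIOLATED = "violated"
    NOT_APPLICABLE = "not applicable"  # the task broke its side of the deal


@dataclass(frozen=True)
class CpuSlo:
    requested_cpus: float         # X: the CPU the task asked for
    max_wakeup_latency_ms: float  # Y: the wakeup latency the infrastructure promises


def evaluate(slo: CpuSlo, used_cpus: float, wakeup_latency_ms: float) -> Verdict:
    """Check the promise only for well-behaved tasks."""
    if used_cpus > slo.requested_cpus:
        return Verdict.NOT_APPLICABLE
    if wakeup_latency_ms <= slo.max_wakeup_latency_ms:
        return Verdict.MET
    return Verdict.VIOLATED


slo = CpuSlo(requested_cpus=2.0, max_wakeup_latency_ms=5.0)
print(evaluate(slo, used_cpus=1.5, wakeup_latency_ms=3.0))
print(evaluate(slo, used_cpus=1.5, wakeup_latency_ms=9.0))
print(evaluate(slo, used_cpus=2.5, wakeup_latency_ms=9.0))
```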
Enacting SLOs
Monitor SLOs closely and aggressively ensure they are met
Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)
Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
[Diagram: two Containers, each running an App]
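The enforcement options above can be read as an escalation ladder: try per-node remedies first, then ask the cluster for help. The sketch below encodes one such ordering. The function choose_action, its parameters, and the exact order of preference are assumptions for illustration; the talk lists the options but not the decision logic.

```python
from typing import Optional


def choose_action(
    slo_met: bool,
    node_has_spare_resources: bool,
    antagonist: Optional[str],
    antagonist_throttleable: bool,
) -> str:
    """Pick one remedy, preferring cheap per-node actions over cluster-wide moves."""
    if slo_met:
        return "none"
    # Per-node remedies.
    if node_has_spare_resources:
        return "give the task more resources"
    if antagonist is not None and antagonist_throttleable:
        return "throttle " + antagonist
    # Cluster-wide remedies: ask for help.
    if antagonist is not None:
        return "ask cluster to move " + antagonist
    return "ask cluster to move the task"


print(choose_action(False, True, None, False))
print(choose_action(False, False, "batch-job", True))
print(choose_action(False, False, "batch-job", False))
print(choose_action(False, False, None, False))
```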
Metrics
➔ CPU
➔ NUMA
➔ Disk I/O
CPU
Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g., sleep, wait, run, queue)
CPU
SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved
NUMA
Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)
SLOs
➔ NUMA score of 0.85 or above given certain job shapes
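A minimal sketch of the score, assuming that "resource-product" means the plain product of the CPU-locality and memory-locality fractions. That reading is an assumption; the slide does not define the formula. The 0.85 threshold comes from the SLO above, and the function names are hypothetical.

```python
NUMA_SLO_THRESHOLD = 0.85  # from the slide: score of 0.85 or above


def numa_score(cpu_local_fraction: float, memory_local_fraction: float) -> float:
    """Product of two locality fractions, each in [0.0, 1.0] (assumed formula)."""
    for value in (cpu_local_fraction, memory_local_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("locality fractions must be within [0.0, 1.0]")
    return cpu_local_fraction * memory_local_fraction


def meets_numa_slo(cpu_local_fraction: float, memory_local_fraction: float) -> bool:
    return numa_score(cpu_local_fraction, memory_local_fraction) >= NUMA_SLO_THRESHOLD


print(meets_numa_slo(0.95, 0.92))  # both resources mostly local
print(meets_numa_slo(0.90, 0.90))  # each looks fine alone, the product does not
```

Under this reading, a product penalizes a task whose CPU and memory are each only moderately remote, since the losses compound.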
The NUMA Experience
Disk I/O
Low-level metrics
➔ Service time latency: time it took the kernel to service a request to disk
➔ Wait time latency: time it took the kernel to queue and service a request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did
SLOs
➔ Small amount of disk time when well-behaved
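Since wait time covers both queueing and servicing while service time covers servicing alone, the time a request spent queued can be derived as their difference. The sketch below shows that arithmetic on made-up per-request numbers; the function names are hypothetical.

```python
from typing import List, Tuple

# Each request is (wait_ms, service_ms); wait includes the service time.
Request = Tuple[float, float]


def queue_times(requests: List[Request]) -> List[float]:
    """Time each request spent queued before the disk started servicing it."""
    return [wait_ms - service_ms for wait_ms, service_ms in requests]


def queued_share(requests: List[Request]) -> float:
    """Fraction of total wait time that was spent queued rather than being serviced."""
    total_wait = sum(wait_ms for wait_ms, _ in requests)
    if total_wait == 0:
        return 0.0
    return sum(queue_times(requests)) / total_wait


requests = [(12.0, 4.0), (5.0, 5.0), (23.0, 3.0)]
print(queue_times(requests))
print(queued_share(requests))
```

A high queued share with normal service times points at contention from other I/O on the same disk rather than at a slow device.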
User Stories
Performance Regression
User: VM environment
User Problem: … silence ...
SLO not met: CPU
Signal: CPU queue other
Root cause: Subtle, but expensive, new periodic operation
Make it better: Give the application more debug information
Performance Variation #1
User: Flight search
User Problem: QPS variation on some tasks
SLO not met: NUMA
Signal: CPU and memory locality
Root cause: Bad NUMA allocation by infrastructure
Make it better: Improve NUMA allocation
Performance Variation #2
User: Web search
User Problem: Latency variation on some tasks
SLO not met: CPI variation
Signal: CPI from perf_events
Root cause: Bad actors co-scheduled on the machine
Make it better: Throttle or move these bad actors
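CPI (cycles per instruction) is the ratio of two hardware counters that perf_events can report, and tasks of the same job can be compared against each other to spot the ones suffering interference. The sketch below flags tasks whose CPI is more than two standard deviations above the job's mean. The threshold, the function names, and the counter values are assumptions for illustration, not the detection rule described in the talk.

```python
import statistics
from typing import Dict, List


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction from two hardware counter readings."""
    return cycles / instructions


def cpi_outliers(cpi_by_task: Dict[str, float], num_sigmas: float = 2.0) -> List[str]:
    """Tasks whose CPI exceeds mean + num_sigmas * stddev across the job's tasks."""
    values = list(cpi_by_task.values())
    threshold = statistics.mean(values) + num_sigmas * statistics.pstdev(values)
    return sorted(task for task, value in cpi_by_task.items() if value > threshold)


# Nine healthy replicas and one that needs three times as many cycles per instruction.
cpi_by_task = {"task-%d" % i: cpi(1_000_000, 1_000_000) for i in range(9)}
cpi_by_task["task-9"] = cpi(3_000_000, 1_000_000)

print(cpi_outliers(cpi_by_task))
```

A flagged task is only a symptom: the next step in the story above is to find which co-scheduled container is the antagonist and throttle or move it.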
Performance Degradation Under Load
User: Borglet
User Problem: Stuckness under heavy load
SLO not met: Disk access
Signal: Disk I/O wait time latencies
Root cause: Heavy disk operations blocking other operations
Make it better: Move disk operations away from latency-sensitive operations
Future Work
➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components
Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves
Takeaways
➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On-by-default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper
You can do this too!
Questions?
Victor Marmol
vmarmol@google.com
Join our Microservices Customer Roundtable
● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA
ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems