Borg provides a common runtime layer for containers at Google. We aim to guarantee a performance baseline for each class of tasks without inspecting a task's runtime details or any application-level metrics. This talk covers the methodology we use to collect black-box performance monitoring data from containers and presents case studies of interesting performance problems we detected, along with ways to mitigate them.
ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems
1. Confidential + Proprietary
Finding (and Fixing!) Performance Anomalies
in Large Scale Distributed Systems
Victor Marmol
vmarmol@google.com
3.
Containers Infrastructure
Manage containers @ Google
Everything runs in a container
2B+ containers started per week
Images by Connie Zhou
8.
Borglet
Google’s node agent
Borglet = init + Docker + a few other things
Primary goals
➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)
9.
How do we get to task performance management?
Dremel: Interactive Analysis of Web-Scale Datasets
10.
Task Performance Analysis (TPA)
Our system for container-based black-box application performance analysis
Containers are the main enabler
Manage, monitor, and improve application performance
Today’s Talk
➔ How it works
➔ User stories: stories from the front-lines!
13.
Low-Level Performance Metrics
Key: collect lots of container-based low-level metrics from the kernel
Custom kernel patches to give us even more stats and metrics
Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g.: netlink, ioctls, etc)
Collection → Aggregation → Baselines → SLOs → Enforcement
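These sources are mostly plain-text kernel files, which keeps collection cheap. A minimal sketch of reading one of them, assuming the cgroup v2 `cpu.stat` layout (Borg's actual collectors and its custom kernel stats are not public):

```python
def parse_cpu_stat(text):
    """Parse a cgroup v2 cpu.stat file: one 'key value' pair per line."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

# In practice this would read /sys/fs/cgroup/<container>/cpu.stat.
sample = "usage_usec 1500000\nuser_usec 900000\nsystem_usec 600000\nnr_throttled 3\n"
```

The same pattern applies to most of the `/proc` and cgroup sources above: open, split, convert, no heavy parsing.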
14.
Low-Level Performance Metrics
Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
Collection is very low-overhead
Collection → Aggregation → Baselines → SLOs → Enforcement
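A tail-friendly histogram of this kind can be sketched in a few lines; the bucket bounds here are hypothetical, and the percentile uses the simple nearest-rank rule:

```python
import bisect

# Hypothetical latency buckets in microseconds (upper bounds).
BUCKETS_US = [100, 500, 1000, 5000, 10000, 50000]

def histogram(samples_us):
    """Count samples per bucket; the last slot catches the over-50ms tail."""
    counts = [0] * (len(BUCKETS_US) + 1)
    for s in samples_us:
        counts[bisect.bisect_left(BUCKETS_US, s)] += 1
    return counts

def nearest_rank(samples_us, p):
    """Nearest-rank percentile, e.g. p=99 for the tail of operations."""
    s = sorted(samples_us)
    return s[max(0, -(-p * len(s) // 100) - 1)]
```

Bucket counts give the number and breakdown of operations; the percentile gives the tail, which is what the SLOs later key off.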
15.
Cluster-Wide Aggregation
Cluster service that collects all metrics and exports them to Dremel
Push data for all tasks on all machines, keep them for a while
Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
Best news: You can use it too! Google BigQuery
Collection → Aggregation → Baselines → SLOs → Enforcement
16.
Performance Baselines
Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends
Gives us insights into performance trends and helps us develop performance baselines
Performance baseline: performance we can achieve given different parameters
➔ CPU: how quickly we can schedule you on the CPU
➔ Disk I/O: what disk I/O latency we can achieve
Collection → Aggregation → Baselines → SLOs → Enforcement
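One plausible way to turn the sliced historical data into a baseline, with an illustrative headroom factor (this is a sketch, not Borg's actual policy):

```python
def performance_baseline(daily_p99s_us, headroom=1.2):
    """A baseline for one slice (task type x cell): the median of
    historical daily p99 latencies, padded with headroom for normal
    variation. The 1.2 factor is illustrative only."""
    s = sorted(daily_p99s_us)
    return s[len(s) // 2] * headroom
```

A per-slice baseline like this is what gets promoted into an SLO on the next slide.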
17.
Baselines → SLOs
From baselines we provide performance SLOs: a promise to the user
You promise to do X
➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue less than X I/Os per second
We promise to give you Y performance
➔ CPU: You will get scheduled on a CPU within Yms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Yms
Collection → Aggregation → Baselines → SLOs → Enforcement
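The two-sided promise can be sketched as a check; the names and the rule that our promise only binds while the user keeps theirs are illustrative:

```python
def check_slos(cpu_used, cpu_requested, wakeup_p99_ms, wakeup_slo_ms):
    """Two-sided SLO: the user stays within their CPU request; in
    exchange we schedule them within the promised wakeup latency.
    (Illustrative names, not Borg's actual interface.)"""
    user_kept = cpu_used <= cpu_requested
    we_kept = wakeup_p99_ms <= wakeup_slo_ms
    # If the user broke their side, a missed latency target is not our
    # violation to fix.
    return {"user_kept": user_kept, "slo_violated": user_kept and not we_kept}
```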
18.
Enacting SLOs
Monitor SLOs closely and aggressively ensure they are met
Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)
Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
Collection → Aggregation → Baselines → SLOs → Enforcement
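The escalation order above, node-local remedies first, cluster moves as a last resort, can be sketched as a decision function (action names are hypothetical):

```python
def enforcement_action(slo_violated, antagonists, node_can_throttle):
    """Pick a remedy for an SLO violation: prefer cheap node-local
    actions, escalate to the cluster scheduler only when the node
    cannot fix it itself. (Illustrative sketch.)"""
    if not slo_violated:
        return "none"
    if antagonists and node_can_throttle:
        return "throttle_antagonist"   # per-node: rein in the bad actor
    if antagonists:
        return "move_antagonist"       # cluster: evict the bad actor
    return "move_task"                 # cluster: migrate the victim
```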
20.
CPU
Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g.: sleep, wait, run, queue)
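Some of these signals are exposed by the stock kernel; for instance `/proc/<pid>/schedstat` reports time on CPU, time waiting on a runqueue, and the number of timeslices run. A sketch of parsing it:

```python
def parse_schedstat(text):
    """Parse /proc/<pid>/schedstat: run time (ns), runqueue wait
    time (ns), and timeslice count. The runqueue wait time is one
    ingredient of a wakeup-latency signal like the one above."""
    on_cpu_ns, wait_ns, timeslices = (int(x) for x in text.split())
    return {"on_cpu_ns": on_cpu_ns,
            "runqueue_wait_ns": wait_ns,
            "timeslices": timeslices}
```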
22.
NUMA
Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)
SLOs
➔ NUMA score of 0.85 or above given certain job shapes
The NUMA Experience
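As defined on this slide, the score is just a product of two locality fractions; a sketch (assumes nonzero CPU and memory totals):

```python
def numa_score(local_cpu, remote_cpu, local_mem, remote_mem):
    """Product of CPU and memory locality fractions, in 0.0..1.0:
    1.0 means fully local, 0.0 means everything on remote nodes."""
    cpu_locality = local_cpu / (local_cpu + remote_cpu)
    mem_locality = local_mem / (local_mem + remote_mem)
    return cpu_locality * mem_locality

def meets_numa_slo(score, threshold=0.85):
    """The 0.85 threshold mirrors the SLO stated on this slide."""
    return score >= threshold
```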
23.
Disk I/O
Low-level metrics
➔ Service time latency: time it took the kernel to service a request to disk
➔ Wait time latency: time it took the kernel to queue and service a request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did
SLOs
➔ Small amount of disk time when well-behaved
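Since wait time covers queueing plus service, the queueing component falls out as a difference; a sketch over per-request samples:

```python
def queueing_delays_us(wait_us, service_us):
    """Wait time = time queued + service time, so the per-request
    queueing delay is the difference between the two metrics above."""
    return [w - s for w, s in zip(wait_us, service_us)]
```

A task whose wait times grow while service times stay flat is queueing behind someone else's I/O, which is exactly the antagonist signal used in the case studies that follow.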
25.
Performance Regression
User: VM environment
User Problem: … silence ...
SLO not met: CPU
Signal: CPU queue other
Root cause: Subtle, but expensive, new periodic operation
Make it better: Give the application more debug information
26.
Performance Variation #1
User: Flight search
User Problem: QPS variation on some tasks
SLO not met: NUMA
Signal: CPU and memory locality
Root cause: Bad NUMA allocation by infrastructure
Make it better: Improve NUMA allocation
27.
Performance Variation #2
User: Web search
User Problem: Latency variation on some tasks
SLO not met: CPI variation
Signal: CPI from perf_events
Root cause: Bad actors co-scheduled on the machine
Make it better: Throttle or move these bad actors
28.
Performance Degradation Under Load
User: Borglet
User Problem: Stuckness under heavy load
SLO not met: Disk access
Signal: Disk I/O wait time latencies
Root cause: Heavy disk operations blocking other operations
Make it better: Move disk operations away from latency sensitive operations
29.
Future Work
➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components
Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves
30.
Takeaways
➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On by default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper
32.
Questions?
You can do this too!
Victor Marmol
vmarmol@google.com
33. ● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management
from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA
Join our Microservices Customer Roundtable