If you have tried Docker but are unsure about how to run it at scale, you will benefit from this session. Like virtualization before it, containerization (à la Docker) is increasing the elastic nature of cloud infrastructure by an order of magnitude. But maybe you still have questions: How many containers can you run on a given Amazon EC2 instance type? Which metric should you look at to measure contention? How do you manage fleets of containers at scale?
Datadog is a monitoring service for IT, operations, and development teams who write and run applications at scale. In this session, the cofounder of Datadog presents the challenges and benefits of running containers at scale and how to use quantitative performance patterns to monitor your infrastructure at this magnitude and complexity. Sponsored by Datadog.
7. Containers in a nutshell
•Been around for a long time
–jails, zones, cgroups
•No full-virtualization overhead
•Used for runtime isolation (e.g., jails)
•Docker: escape from dependency hell
11. (Some) Docker use cases
•Continuous integration
–eliminate dependency variance
–same code from dev laptop to production
–Git-like workflow
•Continuous delivery
–(quasi) stateless components
–web workers, video encoders, etc.
–not for data stores (Amazon RDS a better fit)
12. Instance types
•c3.2xl: 20%
•m3.medium: 20%
•m3.large: 19%
•m3.xlarge: 13%
•m1.large: 8%
•the rest: 21%
Source: Datadog, October 2014
13. Containers per instance
•Average: 5 (October 2014)
•Highly dependent on the workload
•This is just the beginning…
•Expect higher container density going forward
Source: Datadog, October 2014
16. Memory
Name: Why it matters
•pgmajfault: Paging to/from disk is slow
•pgfault: Context switches hurt application performance
•resident set size (rss): Too much RSS causes paging and swapping
•swap: Swapping in/out is slow
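As a rough sketch of how the four memory counters above could be read, the snippet below parses text in the "key value" layout of a cgroup v1 memory.stat file. The file location (often something like /sys/fs/cgroup/memory/docker/<container id>/memory.stat) varies by distribution and Docker version and is an assumption here, as are the function and constant names. The swap line only appears when swap accounting is enabled.

```python
from typing import Dict

# The counters named on the slide; every other line is ignored.
MEMORY_KEYS = ("pgmajfault", "pgfault", "rss", "swap")


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines (cgroup v1 memory.stat layout) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in MEMORY_KEYS:
            stats[parts[0]] = int(parts[1])
    return stats


# Hypothetical sample content, for illustration only.
SAMPLE_MEMORY_STAT = """cache 1048576
rss 524288000
swap 0
pgfault 120034
pgmajfault 12
"""

if __name__ == "__main__":
    print(parse_memory_stat(SAMPLE_MEMORY_STAT))
```

pgfault and pgmajfault are cumulative counters, so they are usually more informative as a rate between two samples than as raw values.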
17. CPU
Name: Why it matters
•user: Measures work being done
•system: System calls, a necessary evil
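The user and system counters are cumulative, so a usable number comes from the difference between two samples. The sketch below assumes the cgroup v1 cpuacct.stat layout ("user <ticks>" and "system <ticks>") and a tick rate of 100 per second, which is common on Linux but should be checked with os.sysconf("SC_CLK_TCK"). Function names are illustrative.

```python
from typing import Dict


def parse_cpuacct_stat(text: str) -> Dict[str, int]:
    """Return {'user': ticks, 'system': ticks} from cpuacct.stat-style text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("user", "system"):
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_usage(before: Dict[str, int], after: Dict[str, int],
              interval_s: float, ticks_per_second: int = 100) -> Dict[str, float]:
    """CPU-seconds consumed per wall-clock second, split into user and system."""
    return {
        key: (after[key] - before[key]) / ticks_per_second / interval_s
        for key in ("user", "system")
    }


if __name__ == "__main__":
    # Two hypothetical samples taken 10 seconds apart.
    first = parse_cpuacct_stat("user 1000\nsystem 200\n")
    second = parse_cpuacct_stat("user 1500\nsystem 250\n")
    print(cpu_usage(first, second, interval_s=10.0))
```

A value of 1.0 for a key corresponds to one full core kept busy in that mode over the interval.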
18. Block I/O
Name: Why it matters
•blkio.io_service_bytes: I/O is (often) the bottleneck
•blkio.io_queued: Measures saturation
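A minimal sketch of reading blkio.io_service_bytes, assuming the cgroup v1 layout in which each device contributes lines of the form "<major>:<minor> <Operation> <bytes>" followed by a final "Total <bytes>" line. The function name and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def parse_io_service_bytes(text: str) -> Dict[str, Dict[str, int]]:
    """Sum Read and Write bytes per device from blkio.io_service_bytes-style text."""
    totals = defaultdict(lambda: {"Read": 0, "Write": 0})
    for line in text.splitlines():
        parts = line.split()
        # Per-device lines have three fields; the trailing 'Total <bytes>'
        # line has two fields and is skipped, as are Sync/Async/Total rows.
        if len(parts) == 3 and parts[1] in ("Read", "Write"):
            totals[parts[0]][parts[1]] += int(parts[2])
    return dict(totals)


# Hypothetical sample content, for illustration only.
SAMPLE_IO_SERVICE_BYTES = """8:0 Read 4096
8:0 Write 8192
8:0 Sync 8192
8:0 Async 4096
8:0 Total 12288
Total 12288
"""

if __name__ == "__main__":
    print(parse_io_service_bytes(SAMPLE_IO_SERVICE_BYTES))
```

Like the CPU counters, these byte counts are cumulative, so throughput is the difference between two samples divided by the sampling interval.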
19. Network
Name: Why it matters
•tx/rx_errors: Because…errors are bad
•tx/rx_dropped: Measures contention
•tx/rx_bytes: Measures traffic
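One way to obtain these counters is to parse the Linux /proc/net/dev table as seen from inside the container's network namespace. That this is where a given collector reads them from is an assumption. The sketch below extracts only the six fields named above, and the function and key names are illustrative.

```python
from typing import Dict


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Extract bytes, errors and drops per interface from /proc/net/dev-style text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        iface, rest = line.split(":", 1)
        fields = [int(value) for value in rest.split()]
        if len(fields) < 16:
            continue
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[iface.strip()] = {
            "rx_bytes": fields[0], "rx_errors": fields[2], "rx_dropped": fields[3],
            "tx_bytes": fields[8], "tx_errors": fields[10], "tx_dropped": fields[11],
        }
    return stats


# Hypothetical sample content, for illustration only.
SAMPLE_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 0 1 0 0 0 0 2000 20 2 0 0 0 0 0
"""

if __name__ == "__main__":
    print(parse_net_dev(SAMPLE_NET_DEV))
```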
20. How to collect metrics
•https://github.com/google/cadvisor
22. Combinatorial multiplication
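[Diagram: three stacks side by side. Bare metal: Hardware, OS, off-the-shelf software, your application. Virtualized: Hardware, Hypervisor, several OSes, each with off-the-shelf software and an app. Containerized: Hardware, Hypervisor, several OSes, each running several containers (O for off-the-shelf, A for app).]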
23. Operational complexity
•Average containers per instance: N (N=5, 10/2014)
•N-times as many “hosts” to manage
•Affects
–provisioning: prep’ing & building containers
–configuration: passing config to containers
–orchestration: deciding where/when containers run
–monitoring: making sure containers run properly
37. Layers of monitoring
[Diagram: three monitoring layers (CloudWatch, Infrastructure Monitoring, APM) shown alongside the stack of Hypervisor, OSes, and containers (O and A).]
38. Layers of monitoring
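[Diagram: three monitoring layers shown alongside the stack of Hypervisor, OSes, and containers (O and A), with example metrics for each layer. CloudWatch: cpu/net/io. Infrastructure Monitoring: filesystem, docker mem, docker cpu. APM: db queries, web requests, app throughput.]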
39. Layers of monitoring
•Access to metrics from all the layers
•Amazon CloudWatch, OS metrics, Docker metrics, app metrics in 1 place
•Shared timeline
42. Tags
•Monitoring is like Auto-Scaling Groups
•Monitoring is like Docker orchestration
•From imperative to declarative
•Query-based
•Queries operate on tags
43. Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… and make sure resident set size < 1GB on c3.xl”
45. Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… that use more than 1.5x the average on c3.xl”
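The queries above can be approximated in a few lines of plain Python to show the idea of selecting by tags instead of by host name. This is only an illustration under stated assumptions, not Datadog's query language: the tag keys (image, region, instance_type), the rss_mb field, and the sample data are all hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical inventory: each container carries tags and one metric value.
CONTAINERS = [
    {"name": "web-1", "rss_mb": 400,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-2", "rss_mb": 500,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-3", "rss_mb": 1500,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-4", "rss_mb": 2000,
     "tags": {"image": "web", "region": "us-east-1", "instance_type": "c3.xl"}},
    {"name": "db-1", "rss_mb": 3000,
     "tags": {"image": "db", "region": "us-west-2", "instance_type": "c3.xl"}},
]


def select(containers: List[Dict], **wanted: str) -> List[Dict]:
    """Keep the containers whose tags match every requested key=value pair."""
    return [c for c in containers
            if all(c["tags"].get(key) == value for key, value in wanted.items())]


def above_average(group: List[Dict], metric: str, factor: float = 1.5) -> List[Dict]:
    """Return the members whose metric exceeds factor times the group average."""
    average = mean(c[metric] for c in group)
    return [c for c in group if c[metric] > factor * average]


if __name__ == "__main__":
    web = select(CONTAINERS, image="web", region="us-west-2", instance_type="c3.xl")
    print([c["name"] for c in web if c["rss_mb"] >= 1024])    # over the 1 GB limit
    print([c["name"] for c in above_average(web, "rss_mb")])  # over 1.5x the average
```

Because the selection is declarative, containers that start or stop are picked up by the next evaluation of the query without any change to the monitor itself.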
50. Take-aways
1. Docker increases operational complexity by an order of magnitude unless…
2. You have layered monitoring, from the instance to the container and to the application, and…
3. You monitor using tags and queries