In this presentation I offer an overview of host monitoring over the years, then explain how Docker and container schedulers truly improve the state of the art for failure monitoring and recovery.
3. Health Monitoring
circa 2012
• AMI
• Chef / Ansible
• ELB / Health Check
• Protocol: HTTP (or HTTPS, TCP, SSL)
• Port: 80
• Path: /index.html
• Timeout / Interval: 5s / 30s
• Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
• Loss of network
• Loss of power
• Host software problems
• Host hardware problems
• ASG
photo credit:
http://aws.amazon.com/architecture/
http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
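The ELB health check settings on this slide imply a detection delay. A rough sketch of the arithmetic, assuming an instance changes state after the threshold number of consecutive checks spaced one interval apart (ignoring the timeout and where in the interval the failure happens). The `HealthCheck` type and its methods are hypothetical names for illustration, not an AWS SDK type.

```go
package main

import (
	"fmt"
	"time"
)

// HealthCheck mirrors the ELB health check settings listed on the slide.
// Illustrative type only; this is not part of any AWS SDK.
type HealthCheck struct {
	Timeout            time.Duration
	Interval           time.Duration
	UnhealthyThreshold int
	HealthyThreshold   int
}

// slideSettings holds the values shown on the slide.
var slideSettings = HealthCheck{
	Timeout:            5 * time.Second,
	Interval:           30 * time.Second,
	UnhealthyThreshold: 2,
	HealthyThreshold:   10,
}

// TimeToUnhealthy estimates how long consecutive failures take to mark
// an instance unhealthy (assumption: threshold x interval).
func (h HealthCheck) TimeToUnhealthy() time.Duration {
	return time.Duration(h.UnhealthyThreshold) * h.Interval
}

// TimeToHealthy estimates how long consecutive successes take to mark
// an instance healthy again (assumption: threshold x interval).
func (h HealthCheck) TimeToHealthy() time.Duration {
	return time.Duration(h.HealthyThreshold) * h.Interval
}

func main() {
	fmt.Println("unhealthy after about", slideSettings.TimeToUnhealthy()) // 1m0s
	fmt.Println("healthy after about", slideSettings.TimeToHealthy())     // 5m0s
}
```

With these values a failed instance keeps receiving traffic for roughly a minute, and a recovered one waits roughly five minutes before rejoining.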
4. But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit:
http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
5. Health Monitoring
circa 2016, the age of containers
• Generic AMI
• Docker
• ECS
• Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks
• Simple monitoring container
photo credit:
https://github.com/docker/swarm
9. Container Schedulers are the new watchmen
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
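The three scheduler duties above can be pictured as one reconcile step. This is a toy sketch, not how ECS is implemented: the `Task` type and the `Reconcile` function are hypothetical, and it assumes a running but unhealthy container is stopped and replaced.

```go
package main

import "fmt"

// Task is a hypothetical view of one scheduled container.
type Task struct {
	Name    string
	Running bool // container process is alive
	Healthy bool // service health check passes
}

// Reconcile returns the names of running but unhealthy tasks to stop, and
// how many new tasks to start so that `desired` healthy tasks exist.
func Reconcile(desired int, tasks []Task) (stop []string, start int) {
	healthy := 0
	for _, t := range tasks {
		switch {
		case t.Running && t.Healthy:
			healthy++
		case t.Running:
			// Process is up but the health check fails: replace it.
			stop = append(stop, t.Name)
		}
		// Tasks that are not running only need a replacement.
	}
	if healthy < desired {
		start = desired - healthy
	}
	return stop, start
}

func main() {
	tasks := []Task{
		{"web.1", true, true},
		{"web.2", false, false}, // container crashed
		{"web.3", true, false},  // port unresponsive
	}
	stop, start := Reconcile(3, tasks)
	fmt.Println(stop, start) // [web.3] 2
}
```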
10. Failure Scenarios
Diagram: three container instances in an ASG registered with ECS, each running ecs-agent and dockerd. Containers: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), rails worker.1 to worker.4 (256 MB each), data worker.1 and worker.2 (512 MB each). An api ELB and a rails ELB sit in front.
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need to configure an ASG to maintain capacity…
11. Failure Scenarios
Diagram: three container instances in an ASG registered with ECS, each running ecs-agent and dockerd. Containers: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), rails worker.1 to worker.4 (256 MB each), data worker.1 and worker.2 (512 MB each). An api ELB and a rails ELB sit in front.
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need a monitor…
12. Health Monitoring
circa 2016, the age of containers
Diagram: three container instances in an ASG registered with ECS, each running ecs-agent and dockerd. Containers: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), rails worker.1 to worker.4 (256 MB each), data worker.1 and worker.2 (512 MB each). An api ELB and a rails ELB sit in front.
• Schedule a monitor process in the container cluster
• Describe ASG and ECS membership
• Mark all instances unregistered with ECS unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker unhealthy
• Mark instances that fail the user space health check unhealthy
No Nagios server + plugins!
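The monitor steps above boil down to a small decision function. The sketch below is not the Convox monitor; the `Probe` type, the `Unhealthy` function and the instance IDs are hypothetical. It also assumes an instance with no probe result counts as unreachable. A real monitor would gather the inputs from the ASG, ECS and Docker APIs and report each result with ASG SetInstanceHealth.

```go
package main

import (
	"fmt"
	"sort"
)

// Probe is a hypothetical result of checking one instance.
type Probe struct {
	DockerReachable bool // monitor could connect to the instance's Docker daemon
	CheckPassed     bool // the `docker run` user space health check succeeded
}

// Unhealthy maps each failing ASG instance ID to the reason it failed.
// ecs holds the instance IDs currently registered with the ECS cluster.
func Unhealthy(asg []string, ecs map[string]bool, probes map[string]Probe) map[string]string {
	reasons := map[string]string{}
	for _, id := range asg {
		p, probed := probes[id]
		switch {
		case !ecs[id]:
			reasons[id] = "not registered with ECS"
		case !probed || !p.DockerReachable:
			// Assumption: a missing probe is treated like a failed connection.
			reasons[id] = "cannot connect to Docker"
		case !p.CheckPassed:
			reasons[id] = "user space health check failed"
		}
	}
	return reasons
}

func main() {
	asg := []string{"i-a", "i-b", "i-c", "i-d"} // made-up instance IDs
	ecs := map[string]bool{"i-a": true, "i-b": true, "i-c": true}
	probes := map[string]Probe{
		"i-a": {true, true},
		"i-b": {false, false},
		"i-c": {true, false},
	}
	reasons := Unhealthy(asg, ecs, probes)
	ids := make([]string, 0, len(reasons))
	for id := range reasons {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		// A real monitor would call ASG SetInstanceHealth here.
		fmt.Printf("%s: Unhealthy (%s)\n", id, reasons[id])
	}
}
```

Checking ECS registration first means a single reason is reported per instance, in the same order as the slide's steps.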
13. Partial Failure Scenarios
battle scars
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
14. User Space Health Check
$ docker run busybox sh -c 'dmesg | grep "Remounting filesystem read-only"'
# why not:
$ docker run health-check
A health-check image can package, distribute and run common binaries and scripts (top, netstat, smartmontools, etc.)
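As one idea of what a packaged `health-check` image might contain, here is a minimal Go sketch of the same read-only filesystem test as the busybox one-liner. The function name and the sample log lines are made up for illustration. A real check would read actual `dmesg` output and would likely exit non-zero when the marker is found, the inverse of `grep`.

```go
package main

import (
	"fmt"
	"strings"
)

// readOnlyMarker is the kernel log text the slide's grep looks for.
const readOnlyMarker = "Remounting filesystem read-only"

// HasReadOnlyRemount reports whether the kernel log text contains the marker.
func HasReadOnlyRemount(kernelLog string) bool {
	return strings.Contains(kernelLog, readOnlyMarker)
}

func main() {
	// Illustrative sample lines, not captured from a real host.
	healthyLog := "EXT4-fs (xvda1): mounted filesystem with ordered data mode\n"
	failedLog := healthyLog + "EXT4-fs (xvda1): Remounting filesystem read-only\n"

	fmt.Println(HasReadOnlyRemount(healthyLog)) // false
	fmt.Println(HasReadOnlyRemount(failedLog))  // true
}
```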
15. Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or noah@convox.com