Host Health Monitoring with `docker run`
Noah Zoschke
@nzoschke
noah@convox.com
10 / 28 / 2015
Health Monitoring
circa 1999
• Nagios Core
• Event scheduler
• Event processor
• Alert manager
• Host groups config
• Ping
• HTTP
• SSH
• Nagios Remote Plugin Executor
• SNMP
• load
• disk
photo credit:
https://en.wikipedia.org/wiki/Nagios
Health Monitoring
circa 2012
• AMI
• Chef / Ansible
• ELB / Health Check
• Protocol: HTTP (or HTTPS, TCP, SSL)
• Port: 80
• Path: /index.html
• Timeout / Interval: 5s / 30s
• Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
• Loss of network
• Loss of power
• Host software problems
• Host hardware problems
• ASG
photo credit:
http://aws.amazon.com/architecture/
http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
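For reference, the ELB health check values listed on this slide could be applied with the AWS CLI roughly as follows. This is a minimal sketch: the function names and the `AWS_CLI` override are assumptions made for illustration, while `aws elb configure-health-check` with its `Target=...` shorthand is the Classic Load Balancer command.

```shell
# Build the --health-check argument for a Classic ELB.
# Arguments: protocol port path interval timeout unhealthy healthy
elb_health_check_spec() {
  printf 'Target=%s:%s%s,Interval=%s,Timeout=%s,UnhealthyThreshold=%s,HealthyThreshold=%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Apply the slide's values to one load balancer.
# AWS_CLI is a hypothetical override, e.g. AWS_CLI=echo for a dry run.
configure_elb_health_check() {
  lb_name=$1
  ${AWS_CLI:-aws} elb configure-health-check \
    --load-balancer-name "$lb_name" \
    --health-check "$(elb_health_check_spec HTTP 80 /index.html 30 5 2 10)"
}
```

Calling `configure_elb_health_check rails` would target a load balancer named `rails`; that name is only an example.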
But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit:
http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
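The feedback path named in the last bullet, ASG SetInstanceHealth, can be sketched with the AWS CLI. The wrapper name, its argument validation, and the `AWS_CLI` override are assumptions for illustration; `aws autoscaling set-instance-health` is the CLI form of the API call.

```shell
# Report an external monitor's verdict back to the Auto Scaling group.
# status must be Healthy or Unhealthy; the ASG replaces Unhealthy instances.
set_instance_health() {
  instance_id=$1
  status=$2
  case "$status" in
    Healthy|Unhealthy) ;;
    *) echo "invalid status: $status" >&2; return 2 ;;
  esac
  # AWS_CLI is a hypothetical override, e.g. AWS_CLI=echo for a dry run.
  ${AWS_CLI:-aws} autoscaling set-instance-health \
    --instance-id "$instance_id" \
    --health-status "$status"
}
```

Whatever monitoring system detects the problem (Nagios, Sensu, Datadog, and so on) would call something like this so that the ASG, not a human, replaces the instance.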
Health Monitoring
circa 2016, the age of containers
• Generic AMI
• Docker
• ECS
• Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks
• Simple monitoring container
photo credit:
https://github.com/docker/swarm
[Diagram: an ASG of three instances, each running ecs-agent and dockerd and registered with ECS, with an api ELB and a rails ELB. Containers scheduled across the instances:]
• api (128 MB)
• registry (256 MB)
• rails web.1, web.2, web.3 (1024 MB each)
• rails worker.1, worker.2, worker.3, worker.4 (256 MB each)
• data worker.1, worker.2 (512 MB each)
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
photo credit:
http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task
Container Schedulers are the new watchmen
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need to configure an ASG to maintain capacity…
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need a monitor…
Health Monitoring
circa 2016, the age of containers
• Schedule a monitor process in the container cluster
• Describe ASG and ECS membership
• Mark all instances not registered with ECS as unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker as unhealthy
• Mark instances that fail the user space health check as unhealthy
No Nagios server + plugins!
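One pass of such a monitor can be sketched in shell. This is not the Go monitor linked at the end of the deck, only an illustration of the steps listed above. The function names, the `DOCKER` override, and the convention that the `health-check` image exits non-zero on failure are all assumptions.

```shell
# Hypothetical override so the Docker client can be swapped out.
DOCKER=${DOCKER:-docker}

# Print each instance ID from the first list that is absent from the second.
# Both arguments are whitespace-separated lists of EC2 instance IDs.
unregistered_instances() {
  asg_ids=$1
  ecs_ids=" $(echo $2) "   # normalize tabs and newlines to single spaces
  for id in $asg_ids; do
    case "$ecs_ids" in
      *" $id "*) ;;          # registered with ECS
      *) echo "$id" ;;       # in the ASG but unknown to ECS
    esac
  done
}

# Classify one instance by talking to its Docker daemon.
# host is the daemon address, e.g. tcp://10.0.1.5:2376 (hypothetical).
check_instance() {
  host=$1
  if ! $DOCKER -H "$host" info >/dev/null 2>&1; then
    echo "unhealthy: cannot connect to Docker"
  elif ! $DOCKER -H "$host" run --rm health-check >/dev/null 2>&1; then
    echo "unhealthy: user space health check failed"
  else
    echo "healthy"
  fi
}
```

The two ID lists could come from `aws autoscaling describe-auto-scaling-groups` and `aws ecs describe-container-instances`. Any instance reported as unhealthy would then be passed to the ASG SetInstanceHealth call so the group replaces it.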
Partial Failure Scenarios
battle scars
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
User Space Health Check
$ docker run busybox sh -c 'dmesg | grep "Remounting filesystem read-only"'
# why not:
$ docker run health-check
To package, distribute and run common top, netstat,
smartmontools, etc. binaries and scripts
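The contents of the `health-check` image are not shown on the slide. As a minimal sketch, its entrypoint script might bundle a few checks like the ones below. The function names and the 90% disk threshold are assumptions, and unlike the `grep` one-liner above, this sketch exits non-zero when a problem is found.

```shell
# Fail when the kernel log on stdin reports a read-only remount.
check_kernel_log() {
  if grep -q "Remounting filesystem read-only"; then
    echo "unhealthy: filesystem remounted read-only"
    return 1
  fi
}

# Fail when any filesystem in `df -P` output on stdin is at or above the limit.
check_disk_usage() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5; sub(/%/, "", use)
      if (use + 0 >= limit + 0) { print "unhealthy: " $6 " at " $5; bad = 1 }
    }
    END { exit bad ? 1 : 0 }'
}

# Run every check; a non-zero return marks the host unhealthy.
run_health_checks() {
  status=0
  dmesg | check_kernel_log || status=1
  df -P | check_disk_usage 90 || status=1
  return "$status"
}
```

A script like this would end by calling `run_health_checks`, so that `docker run health-check` exits non-zero on an unhealthy host and the monitor can act on the exit status.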
Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or noah@convox.com
