This document summarizes five lessons learned from running large Docker environments:
1. Dependencies between services can break architecture if not properly versioned.
2. A hardware defect in a single network card caused retransmissions under heavy load, affecting inter-container communication.
3. Logs from containers consumed all disk space when log management was not configured, preventing new containers from running.
4. Slowdowns occurred when an orchestration system (Marathon) kept every version of every service because its version limit was not configured.
5. Massive load testing exposed about 820 billion dependencies between components, requiring automation to analyze problems at scale.
8. [Diagram: App #1 and App #2]
App #1 depends on App #2
Where is this specified?
Unwanted dependencies break architecture
#1 – The Death Star of Service Dependencies
9. Use proper versioning for services, APIs, and images
#1 – The Death Star of Service Dependencies
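The versioning advice can be made checkable. The following is a minimal sketch, not the speakers' tooling: the manifest layout, the service names, and the rule "major versions must match" are all assumptions made for illustration.

```python
# Hypothetical manifest: every service states its own version and the
# major version it expects from each service it depends on.
SERVICES = {
    "app1": {"version": "1.4.0", "requires": {"app2": 2}},
    "app2": {"version": "2.1.3", "requires": {}},
}


def major(version):
    """Return the major component of a 'MAJOR.MINOR.PATCH' string."""
    return int(version.split(".")[0])


def broken_dependencies(services):
    """List (consumer, provider, reason) for every unmet requirement."""
    problems = []
    for name, spec in services.items():
        for dep, wanted_major in spec["requires"].items():
            if dep not in services:
                problems.append((name, dep, "not deployed"))
                continue
            found = services[dep]["version"]
            if major(found) != wanted_major:
                reason = f"needs major {wanted_major}, found {found}"
                problems.append((name, dep, reason))
    return problems
```

Declaring dependencies in one place also answers the question "Where is this specified?": an undeclared dependency simply does not appear in the manifest and can be flagged during review.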
10. Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
12. • Hardware defect in a single network interface card
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communication with other machines in the datacenter
• The exact defect on the NIC is still unknown
What was the problem?
#2 – The Network Retransmission Episode
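A defect that only shows under heavy load is easier to spot when the retransmission ratio is tracked per host. The sketch below assumes a Linux host and parses the text of /proc/net/snmp, whose "Tcp:" rows carry the OutSegs and RetransSegs counters; the function names are illustrative, and what counts as an alarming ratio is left to the operator.

```python
def tcp_counters(snmp_text):
    """Parse the two 'Tcp:' rows (header, values) of /proc/net/snmp text."""
    rows = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header row and a Tcp value row")
    header, values = rows[0], rows[1]
    return dict(zip(header, (int(v) for v in values)))


def retransmission_ratio(before, after):
    """Share of segments retransmitted between two counter snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0
```

Taking one snapshot before and one during a load test, then comparing the ratio across hosts, would single out a machine whose NIC misbehaves only under pressure.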
15. Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
16. #3 – The Hungry Container Breakdown
Low disk space
17. • Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition on a single host ran out of space
What was the problem?
#3 – The Hungry Container Breakdown
18. • Container health checks failed
• Marathon terminated the task and scheduled a new one
• Still no free space on /logs
• Termination and rescheduling repeated
• /var/lib/docker ran out of space
• Mesos slave was unable to run Docker tasks
How the problem evolved over time
#3 – The Hungry Container Breakdown
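The chain of events above starts with a partition quietly filling up, so a basic free-space check on the two paths from the story would have raised the alarm early. This is a minimal sketch using only the Python standard library; the 10% threshold is an assumption, not a value from the talk.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


def low_on_space(path, min_free_fraction=0.10):
    """True when less than `min_free_fraction` of the filesystem is free."""
    return free_fraction(path) < min_free_fraction


def partitions_at_risk(paths=("/logs", "/var/lib/docker"),
                       min_free_fraction=0.10):
    """Return the paths that exist and are below the free-space threshold."""
    at_risk = []
    for path in paths:
        try:
            if low_on_space(path, min_free_fraction):
                at_risk.append(path)
        except OSError:
            # Path is missing or unreadable on this host; skip it.
            continue
    return at_risk
```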
19. • Log management tools for app logs, e.g. Fluentd and Logstash
• Docker log driver: --log-driver=none|syslog
• Remove container: --rm=true
• Run Mesos slave with --docker_remove_delay=VALUE
How the problem could have been avoided
#3 – The Hungry Container Breakdown
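The two Docker options from the list above can be applied consistently by generating the command line instead of typing it by hand. The helper below is a hypothetical sketch: its name and the image name in the usage note are invented, and it deliberately accepts only the two log drivers the slide mentions, although Docker supports others.

```python
# Log drivers named on the slide; Docker itself offers more.
ALLOWED_LOG_DRIVERS = ("none", "syslog")


def docker_run_args(image, log_driver="syslog", remove_on_exit=True):
    """Build an argument list for `docker run` with log and cleanup options."""
    if log_driver not in ALLOWED_LOG_DRIVERS:
        raise ValueError(f"unsupported log driver: {log_driver!r}")
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        # --rm deletes the container's filesystem once the container exits.
        args.append("--rm")
    args.append(image)
    return args
```

For example, `docker_run_args("example/app:1.0")` yields a list that can be passed to `subprocess.run`. The Mesos slave option `--docker_remove_delay` is set on the slave process itself and is not covered by this sketch.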
20. Use log management tools
Clean up /var/lib/docker
#3 – The Hungry Container Breakdown
21. Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
22. #4 – The Day Orchestration Stood Still
Queue and deployment methods are slow
23. • Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservice deployments
• Slowdown caused by ZooKeeper overload
What was the problem?
#4 – The Day Orchestration Stood Still
24. • The relevant parameter (zk_max_versions) was not set to a proper limit
• Setting it caps the number of stored versions: --zk_max_versions=20
How the problem could have been avoided
#4 – The Day Orchestration Stood Still
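Why a cap helps can be shown with a toy model. The class below is not Marathon code; it is an illustrative sketch in which a bounded queue stands in for the per-application version history, using the limit of 20 from the slide as its default.

```python
from collections import deque


class VersionHistory:
    """Toy model of a version store that keeps only the newest entries."""

    def __init__(self, max_versions=20):
        # A deque with maxlen silently drops the oldest entry when full.
        self._versions = deque(maxlen=max_versions)

    def record(self, version_id):
        """Store a new version, evicting the oldest one if at the limit."""
        self._versions.append(version_id)

    def stored(self):
        """Return the retained versions, oldest first."""
        return list(self._versions)
```

With frequent deployments an unbounded history grows with every release, while the bounded one stays at a fixed size no matter how often a service is redeployed.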
25. Track orchestration layer performance
Separate Mesos clusters
#4 – The Day Orchestration Stood Still
26. Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
#5 – The Mushroom Cloud Effect
27. #5 – The Mushroom Cloud Effect
Way too many components involved
820 BILLION dependencies!
28. • Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact to real users, only backend services affected
• Many components to take into account
What was the problem?
[Diagram: Service, Container, and Host entities linked with multiplicities 1, 1..*, *, 1; counts shown: 174 / 3.4k and 22 / 13.3k]
#5 – The Mushroom Cloud Effect
30. Automation needed for problem analysis in large environments
#5 – The Mushroom Cloud Effect
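One building block of such automation is answering "what is affected if this component fails?" without a human reading the dependency map. The sketch below assumes the dependencies are already available as a simple mapping from each service to the services it calls; the service names in the usage note are invented.

```python
from collections import defaultdict, deque


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps a service name to the names of the services it calls.
    """
    # Invert the mapping: for each service, who depends on it?
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk upwards from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents[node]:
            if parent != failed and parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

Given `{"web": ["cart", "search"], "cart": ["db"], "search": ["index"]}`, a failure of `db` affects `cart` and `web` but not `search`. The walk visits each service and each dependency edge at most once, so it stays practical as the environment grows.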
32. Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?