Lessons learned running large real-world Docker environments


This document summarizes lessons learned from running large Docker environments: 1. Undeclared dependencies between services can break the architecture, so services, APIs, and images need proper versioning. 2. A hardware defect in a single network interface card caused retransmissions under heavy load, affecting communication with other machines in the datacenter. 3. Application logs filled the disk when no log management was configured, which eventually prevented new containers from running. 4. The orchestration layer slowed down because Marathon kept every version of each application by default. 5. Massive load testing involved so many components (820 billion dependencies) that automation was needed to analyze problems at scale.


Lessons learned running large
real-world Docker environments
Oct 27th 2015
Alois Mayr
@mayralois
alois.mayr@ruxit.com
Dec 3rd 2015

Campfire stories
#1 – The Death Star of Service Dependencies

Load-balanced service
System-wide service dependencies

Reverse proxies are essential

Unwanted dependencies break architecture
• App #1 depends on App #2
• Where is this specified?

Use proper versioning for services, APIs, and images
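One way to make the question "Where is this specified?" answerable is to keep an explicit manifest of allowed dependencies with pinned API versions and compare it against the calls observed at runtime. The following is a minimal sketch in Python under that assumption; the manifest layout, the service names, and the `undeclared_dependencies` helper are hypothetical and not taken from the talk.

```python
from typing import Dict, Set, Tuple

# Hypothetical manifest: each service lists the services it may call,
# together with the API version it was built against.
DECLARED: Dict[str, Dict[str, str]] = {
    "app1": {"app2": "v2"},
    "app2": {},
}


def undeclared_dependencies(
    declared: Dict[str, Dict[str, str]],
    observed: Set[Tuple[str, str, str]],
) -> Set[Tuple[str, str, str]]:
    """Return observed (caller, callee, version) calls the manifest does not allow."""
    violations = set()
    for caller, callee, version in observed:
        # A call is allowed only if the callee is declared with the same version.
        if declared.get(caller, {}).get(callee) != version:
            violations.add((caller, callee, version))
    return violations


# Illustrative calls seen at runtime (for example, gathered from proxy logs).
observed_calls = {
    ("app1", "app2", "v2"),  # declared, version matches
    ("app2", "app1", "v1"),  # not declared anywhere
}
print(undeclared_dependencies(DECLARED, observed_calls))
```

Running such a check as part of a deployment pipeline would surface an unwanted dependency before it becomes part of the architecture.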

#2 – The Network Retransmission Episode

[Figure labels: Retransmissions]

What was the problem?
• Hardware defect in a single network interface card
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communications to other machines in datacenter
• Still not sure about exact defect on NIC

Co-locate related containers.
Check network infrastructure.
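Checking the network can start with the kernel's own TCP counters. On Linux, `/proc/net/snmp` exposes `OutSegs` and `RetransSegs`; the sketch below (Python, not part of the talk) parses two samples and computes the ratio of retransmitted to sent segments between them. The sample values and the helper names are illustrative assumptions.

```python
from typing import Dict


def parse_tcp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the two 'Tcp:' lines (header, values) of /proc/net/snmp into a dict."""
    tcp_lines = [
        line.split()[1:] for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    names, values = tcp_lines
    return {name: int(value) for name, value in zip(names, values)}


def retransmission_ratio(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Ratio of retransmitted to sent TCP segments between two samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


# Illustrative samples, shortened to the relevant columns. On a Linux host the
# text would come from reading /proc/net/snmp at two points in time.
SAMPLE_T0 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1000 2000 10\n"
SAMPLE_T1 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1500 3000 110\n"

ratio = retransmission_ratio(parse_tcp_counters(SAMPLE_T0), parse_tcp_counters(SAMPLE_T1))
print(f"retransmission ratio: {ratio:.1%}")
```

Because the defect in the story only showed up under heavy load, a ratio like this is most informative when sampled during load tests rather than at idle.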

#3 – The Hungry Container Breakdown

[Figure labels: Low disk space]

What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition on a single host ran out of space

How the problem evolved over time
• Container health checks failed
• Marathon terminated task and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Mesos slave unable to run Docker tasks
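The chain of events above starts with a full partition, so a periodic free-space check on the affected paths is one way to notice it early. The sketch below uses Python's standard `shutil.disk_usage`; the 10% threshold and the helper names are assumptions for illustration, not values from the talk.

```python
import shutil
from typing import Dict, Iterable


def free_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


def low_on_space(paths: Iterable[str], min_free: float = 0.10) -> Dict[str, float]:
    """Return the paths whose free fraction is below `min_free`."""
    result = {}
    for path in paths:
        fraction = free_fraction(path)
        if fraction < min_free:
            result[path] = fraction
    return result


# On the hosts from the story the interesting paths would be "/logs" and
# "/var/lib/docker"; "." is used here so the sketch runs anywhere.
print(low_on_space(["."], min_free=0.10))
```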

How the problem could have been avoided
• Log management tools for app logs, e.g. Fluentd and Logstash
  --log-driver=none|syslog
• Remove container
  --rm=true
• Run Mesos slave with
  --docker_remove_delay=VALUE
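The first two options above are flags of `docker run`. As a minimal sketch, the helper below assembles such a command line in Python; the `docker_run_args` function and the image name are hypothetical, and whether `syslog` or `none` is the right driver depends on the log management in place.

```python
from typing import List


def docker_run_args(
    image: str, log_driver: str = "syslog", remove_on_exit: bool = True
) -> List[str]:
    """Build a `docker run` argument list using the flags named on the slide."""
    if log_driver not in ("none", "syslog"):
        raise ValueError("this sketch only covers the two drivers from the slide")
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        args.append("--rm=true")  # remove the container when it exits
    args.append(image)
    return args


# "example/app:1.0" is a placeholder image name.
print(" ".join(docker_run_args("example/app:1.0")))
```

The resulting list can be passed to `subprocess.run` on a host where Docker is installed.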

Use log management tools
Empty /var/lib/docker

#4 – The Day Orchestration Stood Still
Queue and deployment methods are slow

What was the problem?
• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservices deployments
• Slowdown through ZooKeeper (zk) overload

How the problem could have been avoided
• Respective parameter (zk_max_versions) was not set to proper limit
  --zk_max_versions=20
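To illustrate why an upper bound matters, the sketch below models a per-application version history in Python, once unbounded and once capped at 20 entries. It is only an analogy for the effect of `--zk_max_versions=20`; the `VersionStore` class is hypothetical and does not reflect how Marathon actually stores data in ZooKeeper.

```python
from collections import deque
from typing import Deque, Dict, Optional


class VersionStore:
    """Keeps deployed versions per application, optionally capped."""

    def __init__(self, max_versions: Optional[int] = None) -> None:
        self.max_versions = max_versions
        self.versions: Dict[str, Deque[str]] = {}

    def deploy(self, app: str, version: str) -> None:
        # deque(maxlen=N) drops the oldest entry once N entries are stored;
        # maxlen=None means the history grows without limit.
        history = self.versions.setdefault(app, deque(maxlen=self.max_versions))
        history.append(version)

    def stored(self) -> int:
        """Total number of versions currently kept across all applications."""
        return sum(len(history) for history in self.versions.values())


unbounded = VersionStore()
capped = VersionStore(max_versions=20)
for n in range(500):  # 500 deployments of one microservice
    unbounded.deploy("app1", f"v{n}")
    capped.deploy("app1", f"v{n}")

print(unbounded.stored(), capped.stored())
```

With frequent deployments across many microservices, the unbounded history grows linearly with every deployment, while the capped one stays constant per application.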

Track orchestration layer performance
Separate Mesos clusters

#5 – The Mushroom Cloud Effect
Way too many components involved
820 BILLION dependencies!

What was the problem?
• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact to real users, only backend services affected
• Many components to take into account
[Diagram: Service, Container, Host, with multiplicities 1, 1..*, *, 1; counts shown: 174 / 3.4k and 22 / 13.3k]

Automation needed for problem analysis in large environments
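A small example of what such automation can look like: given an inventory of which service runs in which container on which host, a script can answer "which services are affected if this host misbehaves?" without manual digging. The sketch below is in Python; the inventory data and function names are hypothetical and only mirror the Service, Container, Host relationship from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical inventory of (service, container, host) placements.
# A service has one or more containers; each container runs on one host.
PLACEMENTS: List[Tuple[str, str, str]] = [
    ("checkout", "c1", "host-a"),
    ("checkout", "c2", "host-b"),
    ("search", "c3", "host-a"),
    ("catalog", "c4", "host-c"),
]


def services_by_host(placements: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Index which services have at least one container on each host."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for service, _container, host in placements:
        index[host].add(service)
    return index


def affected_services(
    placements: List[Tuple[str, str, str]], failing_host: str
) -> Set[str]:
    """Services that run at least one container on the failing host."""
    return set(services_by_host(placements).get(failing_host, set()))


print(sorted(affected_services(PLACEMENTS, "host-a")))
```

At the scale described in the talk, the inventory itself would have to be discovered automatically, since it changes with every rescheduled container.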

Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?



2. Source: http://www.schoonoart.de/

3. What is a “large” environment?
