Presentation at DockerCon on how the Aurea Docker team achieved a 50%+ cost reduction and grew overall infrastructure utilization from 5% to 72% by moving to Docker. It also describes technical solutions for noisy neighbours, configuration compliance, and networking for software using legacy protocols such as FTP, SMTP and SIP.
Authors: Matias Lespiau and Lukasz Piatkowski
4. Dockerization at Aurea
Goals
● #1 - Decrease computing expenses by
consolidation and simplification
● #2 - Improve Ops team productivity through
standardization
5. Goal #1 - decrease computing expenses
● 1 app, 1 host, 2 CPUs, 1 GB RAM: 4x average-to-peak utilization
● 500 containers, 1 host, 128 CPUs, 2 TB RAM: 1.2x average-to-peak utilization
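The consolidation effect above (many independent bursty apps sharing one big host flatten each other's peaks) can be sketched with a toy simulation. The workload numbers below are invented for illustration, not Aurea's actual data:

```python
import random

random.seed(42)

def sample():
    """One minute of CPU use for one bursty app: mostly idle, rare spikes."""
    return 100 if random.random() < 0.05 else 10

def peak_to_avg(series):
    return max(series) / (sum(series) / len(series))

minutes = 24 * 60
one_app = [sample() for _ in range(minutes)]            # a single VM
many_apps = [sum(sample() for _ in range(500))          # 500 containers
             for _ in range(minutes)]                   # sharing one host

# A single bursty app has a large peak-to-average ratio, so its VM must be
# sized for the peak. The aggregate of many independent apps is much flatter,
# so the shared host can be sized close to the average load.
print(round(peak_to_avg(one_app), 1))
print(round(peak_to_avg(many_apps), 2))
```

This is just statistical multiplexing: independent spikes rarely line up, so the sum's peak grows much more slowly than the sum's average.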
7. Dockerization at Aurea
● 1-year results
○ Replaced 2000+ VMs with 1900+ containers
○ Decreased infrastructure costs from 13M to 6M (53%)
○ Increased utilization from 5% to 72%
9. Dockerization at Aurea
●Focus on the basics
○ Tried out Swarm, ECS, and plain Docker with an EE Basic license.
○ Teams using plain Docker reached our main goal faster
■ Simpler to onboard Ops and Eng
■ Avoids re-engineering apps
10. Dense consolidation
● 2000 instances -> 7 docker hosts
● Benefits:
○ Higher utilization
○ Simpler to manage
● Don’t do this at home:
○ Issues when running more than 100 containers per node
○ Interested? Hallway track!
14. Performance & host sharing
●The biggest enemy: a noisy neighbour
●Fight it with resource limits:
○ CPU: --cpu-period, --cpu-quota, --cpus
○ Memory: --memory, --memory-swap (turn on accounting!)
○ IO: --device-[read|write]-bps, --device-[read|write]-iops
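The shorthand `--cpus` flag is documented by Docker as equivalent to setting the CFS period/quota pair directly; a quick sketch of that arithmetic (the 100 ms default period is Docker's documented default):

```python
# --cpus=N is shorthand for --cpu-period=100000 --cpu-quota=N*100000
DEFAULT_PERIOD_US = 100_000  # Docker's default CFS scheduling period: 100 ms

def cpu_quota(cpus: float, period_us: int = DEFAULT_PERIOD_US) -> int:
    """Microseconds of CPU time the container may use per period."""
    return int(cpus * period_us)

# e.g. limiting a container to 1.5 CPUs:
#   docker run --cpus=1.5 ...                              is equivalent to
#   docker run --cpu-period=100000 --cpu-quota=150000 ...
print(cpu_quota(1.5))  # 150000
```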
15. Performance - lessons learned
●Containers are not Virtual Machines
○ cgroups are not a hypervisor
○ Remember the JVM: by default it sizes its heap from the host’s memory, not the container’s limit
●Always set container’s resource limits
●Always label your containers
○ Owners info and container’s importance
●But how to make users comply?
17. Docker enforcer
●A tool to run validation rules against
containers and stop ‘bad ones’
● https://github.com/piontec/docker-enforcer
18. Docker enforcer - rules
“Dear users,
We have created a nice big disk for your containers’
data at /opt/big. Please use this location for any docker
volumes.
Admins”
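With docker-enforcer, the policy from that email can become an executable rule. The project's rules are Python lambdas evaluated against each container; the exact attribute layout used below (`params["HostConfig"]["Binds"]`, following `docker inspect` output) is an assumption to verify against the project's README, so treat this as a sketch rather than drop-in config:

```python
# Sketch of a docker-enforcer-style rule: the check returns True when the
# container violates the policy (every bind-mounted host path must be
# under /opt/big), which is the signal to stop it.
def bad_volume(container) -> bool:
    binds = (container.params.get("HostConfig") or {}).get("Binds") or []
    return any(not b.split(":", 1)[0].startswith("/opt/big") for b in binds)

rules = [
    {"name": "volumes must be under /opt/big", "rule": bad_volume},
]

# Quick check against fake container objects (stand-ins for the real API):
class FakeContainer:
    def __init__(self, params):
        self.params = params

ok = FakeContainer({"HostConfig": {"Binds": ["/opt/big/db:/var/lib/db"]}})
bad = FakeContainer({"HostConfig": {"Binds": ["/home/user/db:/var/lib/db"]}})
print(bad_volume(ok), bad_volume(bad))  # False True
```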
23. Networking - legacy
●Not HTTP+JSON microservices
○ Old friends: FTP, SMTP, SIP, …
●New requirements
○ Individual IPs for containers, but from different subnets and preserving the external (AWS VPC) IP
○ Exposing a massive number of ports (SIP)
29. Networking - per container IPs
# host interface: give eth1 an address and its own routing table (101)
ip addr add 10.10.0.2/24 dev eth1
ip route add default via 10.10.0.1 dev eth1 table 101
ip route add 10.10.0.0/24 dev eth1 src 10.10.0.2 table 101
ip rule add from 10.10.0.2 lookup 101
# container: route 172.17.0.20's traffic via table 101 and SNAT it to eth1's IP
ip route add 172.17.0.0/16 dev docker0 table 101
ip rule add from 172.17.0.20 lookup 101
iptables -t nat -I POSTROUTING -s 172.17.0.20 -j SNAT --to-source 10.10.0.2
30. Recap
●Outcomes
○ Increased utilization from 5% to 72%
○ Decreased infrastructure costs from 13M to 6M
● Main challenges
○ Noisy neighbours
○ Configuration compliance
○ Networking
31. Roadmap
● Dockerize everything - our goal is to have 0 VMs outside our CaaS platform
● New platform for stateless containers
○ Orchestration
○ Multi AZ on AWS Spot
● Invest in re-engineering non-dockerizable apps to make them dockerizable, and in dockerized apps to make them cloud-enabled.
“Wookash Piokowski”
For the past year, we’ve been running a Platform Engineering team in charge of Docker infrastructure for Aurea products.
Our plan for today is to share our motivation to use Docker at Aurea,
What goals we’ve set for the team
What we were able to achieve in the past year
Technical challenges we had to solve - not the only ones we had - but the ones that most teams have to address when moving their apps to docker
Aurea owns a portfolio of 15 to 100 enterprise products for multiple domains: energy, telecommunication, marketing, pharmaceutics, IT, etc.
It’s a global, remote-first company. We are about 500 Software Engineers who all work from home.
Actually yesterday was the first time I saw Lukasz in person.
We have two graphs. On each graph, the green line shows the CPU utilization and the yellow line shows the memory utilization footprint.
On the top we have the footprint of a standard application running in a single VM.
We can see that the application has a 21% average utilization, but that utilization peaks to 50%, 75% or even 100% during the course of the day.
The peak-to-average utilization ratio is about 4x.
Ops team has to deal with multiple products and multiple backend and front end services which are part of the infrastructure.
These systems are built in different tech stacks - Node, Java, C#, PHP, Python, etc. - and they might have different middleware that must be operated as well (Apache, nginx, Tomcat, etc.)
Each service requires a playbook
OK, in terms of results, this might be similar to what other companies have achieved - but why is our case different?
Legacy apps
Walk before you run approach to learn the technology
So far, all seems like a fairy tale, but let’s see what our Docker journey looked like.
Intro about the challenges we faced and how we solved them.
Selection of the top challenges we had to fix.
Check next slide
Monitoring
Performance and resource sharing
Networking
Daniel Stori
When you run a shared Docker host with high load, you will hit performance problems sooner or later.
Why is it so important to set limits?
What’s different between VM and a container? [Image]
Let me show you some real life problems we ran into and how we solved them. Like this one:
Lukasz: So we got one of our users here! Maybe you would like to run a container? Mind the latest email from admins!
Matias: Sure! (Command with typo, container runs)
Lukasz: And now we will run into problems - the disk will probably run out of space, as this is the wrong place to store data. Let’s try to enforce a correct config with the enforcer (show rule, apply it). OK, Matias, can you please retry?
Matias: Sure (run the previous command it fails, show the message you get, fix the command and run it again)
Lukasz: (show API endpoints)
Lukasz: Now the admins have noticed the “noisy neighbour” problem; they want all users to set CPU and memory limits. But not all of them comply…
[A scene like previously]
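The limits policy from this scene can be expressed in the same rule style. Again a hedged sketch: the `HostConfig` field names follow `docker inspect` output, where 0 conventionally means "no limit", and should be verified against docker-enforcer's README:

```python
# Sketch of a rule that flags containers started without CPU or memory
# limits. In docker's HostConfig, Memory, NanoCpus and CpuQuota are 0
# when no limit was set; field names here are assumptions to verify.
def missing_limits(container) -> bool:
    hc = container.params.get("HostConfig") or {}
    no_mem = not hc.get("Memory")
    no_cpu = not (hc.get("NanoCpus") or hc.get("CpuQuota"))
    return no_mem or no_cpu

rules = [
    {"name": "CPU and memory limits are mandatory", "rule": missing_limits},
]

# Quick check against fake container objects:
class FakeContainer:
    def __init__(self, params):
        self.params = params

limited = FakeContainer({"HostConfig": {"Memory": 512 * 1024 * 1024,
                                        "NanoCpus": 1_500_000_000}})
unlimited = FakeContainer({"HostConfig": {}})
print(missing_limits(limited), missing_limits(unlimited))  # False True
```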
Big issues with legacy and networking
Why such requirements? One good reason is SMTP: SMTP servers need their outgoing traffic to come from very specific IP addresses - whitelists and such. Additionally, in AWS there’s a limit of about 30 IPs per interface. So let us dig a little deeper into an example where you want to run a container with an SMTP server that needs to use an IP from the 2nd interface of your host.
Lukasz: This is what the default Docker host setup looks like
Lukasz: And this is the final packet path for a container bound to an IP address from the default eth0 interface
Why do we need a 2nd interface
Matias: OK, but if I want to run a new container with an IP from the 2nd interface, eth1, I can just do it like this, right?