An examination of a journey to simplify Amazon ECS network environments through the use of Weave networking: regaining environment parity with local docker-compose configurations, and exploring some possibilities for transparent hybrid-cloud and multi-region network configurations.
Who am I, and why are you here?
“Specialization is for insects” - Robert A. Heinlein
Hear a tale of a journey from innocence to experience
Learn a bit about containers, networking, and how we dug our way out of a tangly mess so that you don’t have to
A quick summary
Docker in 60 seconds (and why)
ECS in 60 seconds
A story of a journey through complexity
Why this is a reasonable thing to do
Power-ups, bells and whistles
Caveats
And live demos (demo gods willing)
Tangled in string (how did we get there???)
Migrating legacy
No re-architecting
Many containers
ECS constraints
Photo by Khara Woods on Unsplash
What other options do we have?
docker-compose
swarm
Kubernetes, Mesos, Rancher, …
rewrite ecs-task-kite
DNS-based discovery
urk …
Photo by Ishan @seefromthesky on Unsplash
Environment parity
Principle of least surprise
Reduce risk of promotion
Support custom or legacy protocols
Hides complex network shenanigans
OK, this is all lovely … but why?
Photo by Emily Morter on Unsplash
Just because you can …
Network and config complexity beyond MVP
Well-decoupled systems might not need it
Existing solutions may be sufficient
Magic ... but not all unicorns and rainbows
Photo by Annie Spratt on Unsplash
Key takeaways
Useful for dev-test environment parity as well as solving some key legacy issues
Pretty simple to set up and use
Free! (speech and beer)
Photo by Clem Onojeghuo on Unsplash
Thank you - and some references
https://www.weave.works/
https://github.com/weaveworks/weave
https://aws.amazon.com/ecs/
https://unsplash.com/
Editor’s Notes
I’m Colin! I work for Cevo, and we help companies develop and use DevOps mindsets, infrastructure automation, and software delivery process improvements
I’ve done many things; I don’t claim to be excellent at any of them, but I’m ok at some of them, and solving interesting problems in novel ways is one thing I’m not bad at.
I hope you’re here to hear a bit of a story of discovery, and I hope that when you leave, you’ve either got another tool in your belt for dealing with network complexity in containerised applications, or you’ve at least had a pleasant time listening to me tell everyone else about it.
Let’s get cracking
So, no pressure
In the old days, we had physical machines like factories; everything in the one box. If you needed another factory, you had to build one (or buy one).
As system power and technology grew, we developed the ability to virtualise factories. One physical box could now host multiple virtual machines, but each VM still runs a complete operating system, with the associated boot times, configuration management challenges, etc.
Containers are the next logical step; instead of providing whole-(virtual)-operating system isolation, they provide per-process isolation
The step beyond that, Functions-as-a-service (or “serverless”) is not the subject of this talk, but I mention them for completeness
No process runs independently of a filesystem, libraries, etc, and that brings versioning and config management challenges; containers provide a fully-resolved artifact that encapsulates versioned dependencies, and give a reasonably consistent interface for config management
They simplify immutable infrastructure, development-to-deployment consistency
Consistency -> confidence to release -> increase throughput, reduce time-to-value
Containers require orchestration -- where and when to run, connections to other containers, resource constraints, how many to run, restarts, etc …
There are many container orchestration frameworks -- Kubernetes, Mesos, Marathon, Rancher, Docker Swarm … ECS is one of them, from Amazon
Integrated into the AWS ecosystem, though you could use it outside AWS if you were sufficiently perverse
Ok, so now you have the background
Iterative migration of a legacy application; components into containers, containers into ECS, with a goal of automating test environment provisioning, for inclusion in build pipelines. The process of discovery (no one person actually knew how the application worked) resulted in more and more containers, with more coupling between them
No ability to re-architect the application
One working “stack” of application components -> 35 containers (eeek!)
ECS imposes some constraints: “Service” -> multiple “Task Definitions” -> 10 containers per Task Definition
Communication between containers in different task definitions can’t use container-name as a hostname!
Therefore: communication between task definitions requires some kind of routing or proxying capability.
Used the “ambassador” pattern, as described by the Docker documentation: basically, a proxy container
AWS has an “example” one that we tried, called “ecs-task-kite” -- fine for proof of concept, but not supported for prod. Can also only proxy to one service at a time.
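The ambassador itself is nothing magical; stripped right down, it is just a TCP forwarder. A minimal sketch of the pattern (using socat purely for illustration rather than ecs-task-kite’s actual mechanics; the container names, image, address and port are made up):

```bash
# Illustrative ambassador: a socat container that forwards local connections
# to the "real" service running somewhere else (address and port are made up).
docker run -d --name billing-ambassador \
  alpine/socat TCP-LISTEN:8080,fork,reuseaddr TCP:10.0.3.17:8080

# The consuming container links to the ambassador and talks to it as if it
# were the billing service itself ("my-web-image" is a placeholder).
docker run -d --name web --link billing-ambassador:billing my-web-image
```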
For smaller, low-complexity systems, this is fine. However, every task definition that had a container that needed to connect to a container in a different task definition, needed an ambassador container; and with the limit of 10 containers per task definition, we saw a rapid explosion in the number of task definitions, and the number of ambassador containers
At the point where we had only 23 of our 35 final application components containerised, we were running over 52 containers in ECS
Because we didn’t have this limitation on the local system (we were using docker-compose), we had introduced differences between local and ECS-based environments. Exactly the opposite of what we set out to do!
So that was clearly inadequate ...
Docker-compose on EC2 would have worked, but changing or updating an already-running system with a different version of a container would have required fragile cleverness
Swarm was quite immature, and undergoing some fairly rapid change which meant that we would have been tracking a moving target
Organisation had no experience of Kubernetes, Mesos … and poor prior experiences with Rancher … so choosing a different orchestration framework would add complexity, not reduce it
We investigated rewriting or extending ecs-task-kite, but limitations of how the application was written (specific ports, etc) would have made it futile
We looked at using DNS for service discovery, but poor infrastructure (out of our control or influence) would have introduced unpredictable failures
So … were we screwed?
The ambassador container, in this case, ecs-task-kite, first queries the AWS APIs in order to find out the IP address and port of the “actual” container
Connections using the local namespace are proxied via the ambassador container to the “actual” destination
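Under the hood, the discovery step is a chain of AWS API calls per backend -- roughly the sequence below (a hedged aws-cli sketch of the kind of lookups involved, not ecs-task-kite’s actual code; the cluster and service names are placeholders):

```bash
CLUSTER=demo-cluster    # placeholder names
SERVICE=billing

# 1. Which task is currently running for the service?
TASK_ARN=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
  --query 'taskArns[0]' --output text)

# 2. Which container instance is it on, and which host port got mapped?
read -r CI_ARN HOST_PORT <<< "$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
  --query 'tasks[0].[containerInstanceArn, containers[0].networkBindings[0].hostPort]' --output text)"

# 3. Which EC2 instance is that, and what is its private IP?
EC2_ID=$(aws ecs describe-container-instances --cluster "$CLUSTER" --container-instances "$CI_ARN" \
  --query 'containerInstances[0].ec2InstanceId' --output text)
aws ec2 describe-instances --instance-ids "$EC2_ID" \
  --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text
```

Multiply those lookups by every ambassador, polling regularly, and the account-wide API rate limits start to hurt.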
Some advantages:
Destination container could be in another ECS cluster, even potentially in another region;
Multiple destination containers could be load-balanced
Significant drawbacks:
Number of containers goes up, number of ambassadors grows as well
API rate limits from AWS become a problem -- and because they are account-wide, one stack can impact others
This slide is about 1 month into the migration to ECS -- the blue bits are load balancer components, the green bits are actual application components, and the red bits are ambassador containers. Ick!
Remembered this thing called “weave” -- encountered it in passing a few years ago, remembered it was some kind of inter-container network thing for Docker
It’s Open Source (they have a paid offering based on support, monitoring, and more advanced UI)
It’s transparent to containers -- they don’t know it’s there, they can just “magically” connect to another container by name, which means that service discovery is taken care of
It’s pretty fast, and it’s resilient to failures
No need for additional “ambassador” containers
AWS API rate limits are no longer a problem
Same “load balancing” capability via DNS lookups (multiple destination containers of the same name would just return as multiple IP addresses to a single DNS lookup)
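To make that concrete, here is a hedged sketch of the day-to-day mechanics (the image and container names are placeholders): point the Docker CLI at the Weave proxy, start same-named containers on different hosts, and weaveDNS hands back multiple addresses for a single lookup.

```bash
# On each host, point the docker client at the Weave proxy so that new
# containers automatically join the overlay and register in weaveDNS.
eval $(weave env)

# Run this on host-1 and host-2: both containers register under the name
# "api" ("my-org/api-image" is a placeholder image).
docker run -d --name api my-org/api-image

# From any container on the overlay, one lookup returns both instances --
# round-robin style load balancing without a load balancer.
docker run --rm alpine nslookup api
```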
… if we didn’t have Weave!
There are zero ambassador containers in this diagram
It does clever things to route traffic, so you don’t need hub-and-spoke configurations -- the network is a mesh
If there’s a path from one node to another, traffic can flow along it
All traffic _within_ a node between containers never goes outside the host -- so it’s fast
It’s not! At least, not for the minimal case using Amazon AutoScaling groups
Weave provide an AMI (Amazon Machine Image) free of charge, which is the basic Amazon ECS AMI plus the weave setup steps already configured (it’s about 7 lines of shell script)
Uses tags applied by AutoScaling to find other instances in the same ASG, and creates the overlay network between them
If you want to connect instances between different autoscaling groups, just use the weave:peerGroupName tag
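I’m not reproducing the AMI’s actual script here, but the shape of it is roughly: discover the peers via the AutoScaling tags, then hand their IPs to weave launch. A hedged sketch:

```bash
# Rough shape of the instance bootstrap (a sketch, not the shipped AMI script).
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Which AutoScaling group is this instance in?
ASG=$(aws autoscaling describe-auto-scaling-instances --instance-ids "$INSTANCE_ID" \
  --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)

# Private IPs of the running instances carrying the same ASG tag.
PEERS=$(aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=$ASG" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

# Start Weave and peer with the rest of the group.
weave launch $PEERS
```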
It made our ECS configuration look exactly like our docker-compose configuration and we get scale-out DNS-based load balancing for free
Keeping local dev and in-cloud dev/test environments (and, eventually, prod) looking the same means that we’re testing the same kind of config that’s being deployed -- risk management
Some applications can’t be migrated to AWS because they require custom IP protocols; overlay network enables that
Security, encryption, route discovery, NAT traversal -- all taken care of
Everyone loves a live demo! In this one, I’ll show you how two containers, running on two different EC2 instances, can communicate with each other with no special dockery configuration, if they’re on a weave cluster.
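In case the demo gods are not willing, this is roughly what it looks like (instance addresses and container names are placeholders):

```bash
# On instance A (say 10.0.1.10): start Weave and a target container.
weave launch
eval $(weave env)
docker run -d --name pingme alpine sleep 3600

# On instance B: peer with A, then reach "pingme" by name -- no port
# mappings, links, or ambassador containers involved.
weave launch 10.0.1.10
eval $(weave env)
docker run --rm alpine ping -c 3 pingme
```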
Weave allows you to configure mesh networks across more than just instances in the same AWS region. You can:
Create networks that span regions, transparent to the containers (with the exception of latency, of course)
You can configure networks that span multiple cloud providers, or are even hybrid-cloud, with a little more work
Implications are potentially significant if you want to have cross-cloud data flows for resilience, performance, or any other reason
Simple to connect legacy on-prem applications to containerised systems in cloud
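Mechanically, stretching the mesh beyond a single region or cloud is just peering with a reachable address and turning encryption on; a hedged sketch (the address and the password handling are placeholders):

```bash
# On the on-prem or other-region host: start Weave with encryption enabled
# and peer with a publicly reachable node in the cloud (address is made up).
weave launch --password "$WEAVE_PASSWORD" 203.0.113.42

# An already-running mesh can be pointed at a new remote peer after the fact.
weave connect 203.0.113.42
```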
‘Scope’ allows you to monitor the containers in a topology, and interact with them
You can monitor connections, workloads, and resource utilisation
Scope also allows you to interact with the container in the browser!
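Getting Scope running is a couple of commands per host; the launcher below is the documented quick-start as I remember it, so check the Weave Scope docs for the current version (the UI then listens on port 4040):

```bash
# Fetch the scope wrapper script and start the probe + web UI on this host.
sudo curl -L git.io/scope -o /usr/local/bin/scope
sudo chmod a+x /usr/local/bin/scope
scope launch
```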
Just because you can, doesn’t mean you should
Enabler of bad patterns if not watched carefully -- excessive coupling and poor architectures become easy
There’s complexity beyond the MVP -- network configuration, cross-cloud and cross-region discovery, and so forth; security and access control require key distribution and credential management systems. This is not unique to Weave; any complex system has these challenges
I would not use this:
If you don’t need a transparent overlay network -- eg if your system architecture is well-decoupled
If your container environment is simple, or you already have a service discovery setup that you like or that is well-established within your team or organisation (the cost of changing must be factored in)