In this session, we discuss some of the core architectural principles underlying Amazon ECS, a highly scalable, high performance service to run and manage distributed applications using the Docker container engine. We walk through a number of patterns used by our customers to run their microservices platforms, to run batch jobs, and for deployments and continuous integration. We explore the advanced scheduling capabilities of Amazon ECS and dive deep into the Amazon ECS Service Scheduler, which optimizes for long-running applications by monitoring container health, restarting failed containers, and load balancing across containers.
13. [Diagram: the application split into an Order UI, User UI, and Shipping UI, each backed by its own Order Service, User Service, and Shipping Service]
14. [Diagram: the same UIs and services, with each service scaled out independently across multiple copies]
39. https://aws.amazon.com/blogs/compute/how-to-create-a-custom-scheduler-for-amazon-ecs/
import logging

import boto3

ecs = boto3.client('ecs')
clusterName = 'default'       # example cluster name
taskDefinition = 'web-app:1'  # example task definition (family:revision)

# Describe all instances in the ECS cluster
containerInstanceArns = ecs.list_container_instances(
    cluster=clusterName
)['containerInstanceArns']
response = ecs.describe_container_instances(
    cluster=clusterName,
    containerInstances=containerInstanceArns
)
containerInstances = response['containerInstances']

# Sort instances by number of running tasks
sortedContainerInstances = sorted(
    containerInstances,
    key=lambda instance: instance['runningTasksCount']
)

# Get the instance with the least number of tasks
startOn = [sortedContainerInstances[0]['containerInstanceArn']]
logging.info('Starting task on instance %s...', startOn)

# Start a new task on that instance
response = ecs.start_task(
    cluster=clusterName,
    taskDefinition=taskDefinition,
    containerInstances=startOn,
    startedBy='LeastTasksScheduler'
)
62. “Moving to Amazon ECS significantly improved our service performance. We reduced service response times in the 99th percentile by 50%.”
Jason Fischl, VP of Engineering, Remind
63. “I have managed the orchestration service at Heroku, and experimented with configuring and running numerous open-source orchestration systems, and I am relieved that Amazon's world-class engineering is tackling this problem and offers it as a service.”
Noah Zoschke, Founder, Convox
64. “Out of the box ECS lets us run thousands of containers across multiple availability zones. It's let our development team focus on building the Meteor-specific services needed for our customers to build amazing Connected Client apps. Because ECS pairs well with other technologies like CloudFormation and auto scaling, it dramatically simplified our own devops compared to other options. It's made it possible to run multiple Galaxies and to bring up faithful development environments for each person on the core team in a fraction of the time previously possible.”
Matt DeBergalis, Co-founder and VP of Product, Meteor
66. Amazon EC2 Container Registry
• Fully managed Docker container registry
• Integrated with Amazon ECS
• Encrypted in transit and at rest
• IAM users and roles
• Highly available and scalable
• Available in multiple regions
• $0.10/GB/month + AWS data transfer costs
aws.amazon.com/ecr
67. ECS CLI
• First version
• Simplify local development
• Easily set up ECS clusters
• Supports Docker Compose
• Open source
github.com/aws/amazon-ecs-cli
$ ecs-cli configure -i
$ ecs-cli up
$ ecs-cli compose up
$ ecs-cli compose ps
68. New: Improved Docker Container Configuration Options
• More Docker options supported in ECS task definitions
• Ideal for advanced Docker users
• New additions
– Hostname
– Docker labels
– Working directory
– Privileged execution
– Log configuration
– …and more (see the Amazon ECS docs)
69. aws.amazon.com/ecr
A system that's designed to run container-enabled applications in production, without worrying about scalability, performance, or IAM.
In this talk: why we built the EC2 Container Service, some of the design decisions we made, and what sort of applications our customers are writing that drive these decisions.
For years the core primitive of computing has been the server, and in more recent years, the virtual server. A server has a processor with cores. You run a full operating system. Typically you log into the server to debug it, etc.
For years, if you needed your app to do more, you bought a bigger server, and if your app got even bigger, you bought an even bigger one. At some point you can't find a bigger server (or it's too expensive).
That was the common approach, but some people had started writing applications that were better suited to scaling horizontally, where you wouldn’t have to worry about hitting the limits of your individual machine, because to scale you just needed to add another one.
Then along came Amazon EC2. EC2 was announced just over 9 years ago.
EC2 completely changed the way you interact with your primitive. Getting a new server became an API call.
Because adding servers was as easy as just another run-instances call
Your scaling model wasn’t about moving to a bigger server, but about scaling horizontally. This wasn’t a pattern that was invented by EC2, but programmatic instance provisioning and autoscaling definitively made this the pattern that people started using extensively. Your scale was now determined by your traffic
But applications, for the most part, were still scaling in a monolithic fashion. What is a monolith? A monolith is deployed as a single artifact or hierarchy (a WAR file, a Rails app).
To scale the application, you replicated it on many servers
Split each component up into its own service
Scale each service independently on its own fleet
Well this is where containers come into the picture
BSD, Solaris, Linux
Containers allow you to manage your resources via cgroups, you get process-level isolation via namespaces, and because containers share an OS kernel, they are lightweight and really fast.
So containers are all well and good, but they really hit the mainstream with the launch of Docker in 2013. So what did Docker get right, apart from the really cute logo?
So we now have a mechanism to package, deploy, and run our application components. Over the next few minutes you may hear me talk a little bit about how this relates to High Performance Computing, but that's only because the HPC community has been doing cluster computing for a very long time and we can learn a lot of lessons from them.
We need to think about the problem differently. We need a different primitive
Our old primitive was the server. Our new primitive is a job or a task. A task is essentially your "micro" application: something made up of one or more containers that need a certain amount of resources.
Tasks can be stitched up from Docker Containers and described declaratively
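As a rough sketch, registering such a declarative task definition with boto3 might look like the following (the family name, images, and resource values here are illustrative assumptions, not from the talk):

import boto3

ecs = boto3.client('ecs')

# Register a task definition: two containers that are always placed
# together on the same instance. Names and values are examples only.
response = ecs.register_task_definition(
    family='web-app',
    containerDefinitions=[
        {
            'name': 'web',
            'image': 'nginx:latest',
            'cpu': 256,
            'memory': 256,
            'portMappings': [{'containerPort': 80, 'hostPort': 80}],
            'essential': True,
        },
        {
            'name': 'log-collector',
            'image': 'busybox:latest',
            'cpu': 64,
            'memory': 64,
            'essential': False,
        },
    ],
)
print(response['taskDefinition']['taskDefinitionArn'])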
You can think of your hosts (or EC2 instances in this case) as a pool of resources across which you want to distribute your tasks based on your requirements.
This pool of resources is your cluster and you need a way to manage your cluster.
I like to think of cluster management as the air traffic control problem (or a train system). You need a control tower, which is the leader and is responsible for making decisions.
Your planes are the tasks that are supposed to be in a particular place at a particular time. Your control tower is responsible for coordinating take-off, landing, and where planes are at a particular point in time, based on the knowledge it has from GPS and other "state management" systems. Essentially, the control tower has the ability to act like a scheduler.
There are a number of cluster and scheduling architectures, well described in this paper. Most HPC schedulers are monolithic but containers are well suited for two-level or shared state systems.
We talked to our customers to try and understand what they wanted out of a cluster management system, and it became very clear to us that they needed a system that allowed them to run multiple distributed applications efficiently across a cluster of EC2 instances. They wanted the ability to use multiple schedulers concurrently, to make decisions on job priorities, and a system that scaled well not just for individual applications but for an entire organization. To do this reliably, you need to build a shared state, optimistic concurrency system.
And that’s what we built. This is a high level architectural diagram of Amazon ECS. Let’s look at the core components and concepts and how Amazon ECS is designed to help you run distributed applications.
The first thing you do with ECS is launch a "cluster" of Amazon EC2 instances. The only requirement for these instances is that they run a modern Linux kernel with Docker and the ECS agent.
The ECS agent is written in Go, is open source (under an Apache license), and is essentially the interface to the ECS service on instances that you own.
The ECS service is built upon one of Amazon's key distributed systems primitives: a highly available, scalable, Paxos-based transaction journal. It captures a history of state transitions over time. Transitions are offered through a shared state, optimistic concurrency control (OCC) system; accepted offers are replicated, and because the journal is consistent, you know exactly which transition you're working with when you make decisions.
Here is a list of all the ECS APIs
But really the key ones are the ones that allow you to List and Describe your key resources: clusters, container instances, and tasks. Together, these APIs help you get the complete state of your system at any point in time.
You can then use the start and stop APIs to place these tasks on the instances that you care about, e.g. instances with sufficient resources to run your tasks.
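As a sketch (the cluster name is an assumption), the List and Describe calls compose like this to snapshot the state of a cluster:

import boto3

ecs = boto3.client('ecs')

# Snapshot the state of one cluster: its container instances and tasks.
# Assumes the cluster has at least one instance and one running task.
instanceArns = ecs.list_container_instances(cluster='default')['containerInstanceArns']
instances = ecs.describe_container_instances(
    cluster='default',
    containerInstances=instanceArns
)['containerInstances']

taskArns = ecs.list_tasks(cluster='default')['taskArns']
tasks = ecs.describe_tasks(cluster='default', tasks=taskArns)['tasks']

for instance in instances:
    print(instance['containerInstanceArn'], instance['runningTasksCount'])
for task in tasks:
    print(task['taskArn'], task['lastStatus'])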
The numbers represent end-to-end latencies: from the time the client submits a request to start a task until the Docker daemon sends an acknowledgement back to the client that it's running.
So, most of what we spent our time on over the past year and a half or so was building a very robust system that you can pretty much throw anything at. We can take very large clusters, in multiple zones, and you don't have to worry about how we scale.
You might be saying at this point: we just spent all this time talking about schedulers, so why are we placing tasks manually? That is a great point. You can think of scheduling in ECS as something you build on top of the core ECS APIs.
You can write your own scheduler!
Here is an example of some Python code that uses the ECS APIs to build a simple scheduler that places tasks on the instance with the least number of running tasks.
You can also take a scheduler system you're already using and have it run on ECS (e.g., a scheduler driver that runs Marathon by converting an ECS state response into a Mesos offer).
ECS provides two schedulers that are available to all customers. And yes, they have their own APIs. But the one we are interested in today is the one designed for distributed applications: the ECS Service Scheduler.
The ECS Service Scheduler allows you to create a “service” and you can optionally put this service behind an ELB.
The primitive that the service launches and manages (and this is the core ECS primitive or unit of work) is the Task. Tasks are essentially a group of containers that you want to place together on the same host or container instance. In most cases, tasks consist of containers that do something together.
One of the key things that the service scheduler does is maintain the desired state of the system, which today really means maintaining the desired number of tasks and making sure they are healthy behind an ELB. So if you lose a task, it will be restarted.
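A minimal create_service call might look like this sketch (the cluster, service, task definition, ELB, and role names are all assumptions):

import boto3

ecs = boto3.client('ecs')

# Create a service that keeps one copy of the task running behind an ELB.
# The IAM role lets ECS register and deregister tasks with the ELB.
ecs.create_service(
    cluster='default',
    serviceName='web-app',
    taskDefinition='web-app:1',
    desiredCount=1,
    loadBalancers=[
        {
            'loadBalancerName': 'web-app-elb',
            'containerName': 'web',
            'containerPort': 80,
        }
    ],
    role='ecsServiceRole',
)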
As you can see in the previous case our desired task count was 1 and the service scheduler focuses on making sure the running count is also 1.
And if you hit the DNS record of the ELB, you see we have a functioning app. OK, this one is super simple, but you get the picture.
You can update services. For example, you may want to scale a service to 3 tasks to take more traffic.
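Scaling is a single call to the UpdateService API; a sketch with the same assumed names as above:

import boto3

ecs = boto3.client('ecs')

# Raise the desired count from 1 to 3; the service scheduler converges
# the running count toward the new desired count.
ecs.update_service(cluster='default', service='web-app', desiredCount=3)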
Cluster and service metrics at 1 minute granularity. These are aggregate metrics, they give CPU and memory utilization across the entire cluster, or the services running on the cluster.
You can use those metrics to scale the fleet.
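To act on those metrics programmatically, you can read them back from CloudWatch; a sketch, assuming the same cluster and service names as above:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')

# Average CPU utilization for one service over the last hour, at the
# 1-minute granularity ECS publishes.
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/ECS',
    MetricName='CPUUtilization',
    Dimensions=[
        {'Name': 'ClusterName', 'Value': 'default'},
        {'Name': 'ServiceName', 'Value': 'web-app'},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=['Average'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])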
Task definitions in ECS are immutable. When you make a change, it gets versioned, so you're going from one revision of the task definition to another.
So we have the concept of deployments: create new tasks from the new revision, register them in the ELB, drain connections from the old tasks, and delete the old ones.
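Because a deployment is just pointing the service at a new task definition revision, it is also a single UpdateService call; a sketch with assumed names:

import boto3

ecs = boto3.client('ecs')

# Point the service at a new task definition revision. ECS starts new
# tasks, registers them with the ELB, drains and stops the old ones.
ecs.update_service(cluster='default', service='web-app', taskDefinition='web-app:2')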
There is a third element of running distributed applications, especially in this kind of dynamic environment: service discovery.
The way I think about service discovery is in this futuristic world where everything is coordinated by machines, so there are no people in this control tower. Planes that enter the system check themselves in and any other planes that have dependencies on those planes (e.g. flightpath) automatically know they are there, etc. In modern distributed architectures service discovery is a critical component of your system.
One way you can do service discovery in ECS is to look for all the services behind the same ELB.
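As a sketch of the ELB approach (the load balancer name is an assumption), you can ask the ELB API which instances are currently healthy behind a service's load balancer:

import boto3

elb = boto3.client('elb')

# Discover the healthy endpoints registered behind a classic ELB.
health = elb.describe_instance_health(LoadBalancerName='web-app-elb')
for state in health['InstanceStates']:
    print(state['InstanceId'], state['State'])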
Or you can use a system like etcd or Consul. We have a blog post that shows you how you can use Consul to do so
Our partners also provide some great tools in this space. A great example is Weave, who have done some great work on integrating their networking and visualization stack with ECS. They have a great set of blog posts on this integration, including one on how you can use Weave to do service discovery on ECS.
We announced ECS 11 months ago at re:Invent. In the months leading up to that and in the time since we’ve learnt a lot from our customers.
Earlier today you saw Remind speak at Werner’s keynote. They were an early customer and wrote a blog post earlier this year when they open sourced their platform, Empire. Empire is an open source platform for deploying and running distributed applications and allows teams to focus on what’s important. Remind runs Empire on top of ECS and perhaps my favorite part in their blog post was the bit where they talked about their response times. We work hard to make sure our service scales well, and more importantly provides good, low jitter, outlier performance. And the best part is that you don’t have to manage anything to do that. It’s just a function of our platform.
Another customer we have worked closely with is Convox, who have developed a super awesome platform and CLI for deploying distributed applications using Docker containers on top of ECS. Similarly to Remind, they can focus on the usability of Convox and how applications are deployed and run, while ECS is responsible for scheduling and cluster management.
And last, but certainly not least, Meteor, who have built their Galaxy platform on top of ECS. They value our scalability, the default multi-AZ setup, and integration with other AWS services. These are just examples of customers that are running distributed apps on top of ECS. The key is that they can focus on what's important to them. They don't need to install or manage anything, they get highly reliable state management, and they have the ability to run thousands of containers across multiple AZs with fast API response times.
I’d like to end this talk with a few announcements that will make it even easier for you to run and manage your ECS services.
The Amazon EC2 Container Registry is a new Docker registry service that is designed to be tightly integrated with ECS and other AWS services. You don't have to use ECS (or even EC2) to use this service.