6. Services evolve to microservices
[Diagram: a monolithic application with Order, User, and Shipping UIs and services sharing one data access layer, contrasted with services A through D distributed across Hosts 1 through 4]
7. Containers are natural for microservices
• Simple to model
• Any app, any language
• Image is the version
• Test and deploy same artifact
• Stateless servers decrease change risk
10. Scheduling a cluster is hard
[Diagram: a large grid of servers, each running a guest OS, illustrating the scale of the fleet a scheduler must manage]
11. Amazon ECS
[Diagram: Amazon ECS architecture; a user or scheduler calls the API, backed by the cluster management engine, a key/value store, and the agent communication service; container instances across AZ 1 and AZ 2 each run Docker and the ECS agent and host tasks with containers, with ELBs routing traffic from the Internet]
12. Amazon ECS under the hood
[Diagram: the transactional journal as an ordered log of entries IDN-1 through IDN+5; a READ returns the snapshot at IDN+5 and a WRITE commits as IDN+6]
13. Amazon ECS under the hood
[Diagram: two schedulers reading and writing the journal concurrently; each reads its own snapshot (IDN+2, IDN+5) and proposes the next transaction (IDN+3, IDN+6)]
22. Designed for use with other AWS services
• Elastic Load Balancing
• Amazon Elastic Block Store
• Amazon Virtual Private Cloud
• Amazon CloudWatch
• AWS Identity and Access Management
• AWS CloudTrail
31. Create service
• Load balance traffic across containers
• Automatically recover unhealthy containers
• Discover services
[Diagram: Elastic Load Balancing routing traffic to containers on three instances, each with a shared data volume]
32. Scale service
• Scale up
• Scale down
[Diagram: Elastic Load Balancing in front of four sets of containers with shared data volumes after scaling]
33. Update service
• Deploy new version
• Drain connections
[Diagram: new-version containers launching alongside old-version containers, all behind Elastic Load Balancing]
34. Update service (cont.)
• Deploy new version
• Drain connections
[Diagram: connections draining from the old containers while the new containers serve traffic behind Elastic Load Balancing]
35. Update service (cont.)
• Deploy new version
• Drain connections
[Diagram: only the new-version containers remain behind Elastic Load Balancing]
58. Anatomy of Task Placement
Cluster Constraints: satisfy CPU, memory, and port requirements
Custom Constraints: filter for location, instance-type, AMI, or custom attribute constraints
Placement Strategies: identify instances that meet the spread or binpack placement strategy
Apply Filter: select final container instances for placement
59. Three Steps to Getting Started with Events
Step 1: Create a CWE rule
Step 2: Create an SNS topic
Step 3: Put events to SNS
74. Contributing to Blox
• Blox is licensed under Apache 2.0
• Open an issue or pull request
• Watch our roadmap on GitHub
• Check out our Gitter channel
We are going to briefly recap why containers, the challenges you may face in production, and some of the usage patterns
We will then talk about cluster management and how Amazon ECS fits into all of this
Then we will close with a demo from my colleague _____
Containers are similar to hardware virtualization (like EC2); however, instead of partitioning a machine, containers isolate the processes running on a single operating system
This is a useful concept that lets you use the OS kernel to create multiple isolated user-space processes with constraints on them like CPU and memory.
The Docker CLI makes using containers easy, with commands like docker run.
Docker images make it easy to define what runs in a container and to version the entire app
These concepts enable automation – you can define your app, build & share the image, and deploy that image.
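As a minimal sketch of that workflow (image and repository names are hypothetical):

  # Build an image from a Dockerfile and tag it with a version
  docker build -t myrepo/myapp:1.0 .
  # Run it locally in the background, mapping container port 80 to host port 8080
  docker run -d -p 8080:80 myrepo/myapp:1.0
  # Push the image to a registry so the same artifact can be pulled anywhere
  docker push myrepo/myapp:1.0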
You may be thinking – this sounds interesting but why would I want to use containers? There are 4 key benefits to using containers.
1.) The first is that containers are portable.
The image is consistent and immutable: no matter where I run it, or when I start it, it's the same.
This makes the dev lifecycle simpler: an image works the same on the developer's desktop and in prod, whether I start it today or scale my environment tomorrow, so there are no surprises.
The entire application is self-contained. The image is the version, which makes deployments and scaling easier because the image includes the dependencies.
Images are small, usually tens of megabytes, and very shareable.
2.) Containers are flexible.
You can create clean, reproducible, and modular environments.
Whereas in the past multiple processes would run on the same OS (e.g., Ruby, caching, log pushing), containers now make it easy to decompose an app into smaller chunks, like microservices, reducing complexity and letting teams move faster while still running the processes on the same host (e.g., no library conflicts)
This streamlines both code deployment and infrastructure management
3.) Simply stating that Docker containers start fast sells the technology short: speed is apparent both in performance characteristics and in application lifecycle and deployment benefits
So yes, containers start quickly because the operating system is already running, but
Every container can be a single-threaded dev stream, with fewer interdependencies
There are also ops benefits. For example, IT updates the base image and I just do a new docker build; I can focus on my app, meaning it's faster for me to build and release.
4.) Finally, containers are efficient. You can allocate exactly the resources you want: specific CPU, RAM, disk, and network
Since they share the same OS kernel and libraries, containers use fewer resources than running the same processes on different virtual machines (a different way to get isolation)
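For example, docker run can cap a container's resources directly; a sketch with illustrative values:

  # Run with a relative CPU share weight and a hard 256 MB memory limit
  docker run -d --cpu-shares 512 --memory 256m myrepo/myapp:1.0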
So I want to tell you a story about Amazon.com and the evolution of its architecture.
Over 10 years ago, Amazon had a large monolithic application running its website. Everything from the UI to the ordering systems, recommendation engine, and shopping cart was one big application with one large code base. The problem with that was there were a lot of code interdependencies that had to be resolved. Another problem Amazon experienced was that it was hard to scale the website: if one service was memory intensive and another CPU intensive, the servers had to be provisioned with enough memory and CPU to handle that baseline load. So if the CPU-intensive service received a heavy load, you had to provision a large machine and carry a lot of underutilized resources
In order to scale better, Amazon decomposed its architecture into individual services that could be deployed separately. This allowed each service to scale independently, with smaller teams that each worked on and controlled a service's codebase. The website could then evolve faster because updates could be delivered independently of other teams. This architecture is what is now known as microservices.
So let's talk about scheduling
The Docker CLI is great if you want to run a container on your laptop for example “docker run myimage”.
But it's challenging to scale to hundreds of containers. Suddenly you're managing a cluster, and cluster management is hard.
You need a way to intelligently place your containers on the instances that have the resources and that means you need to know the state of everything in your system. For example…
What instances have available resources like memory and ports?
How do I know if a container dies?
How do I hook into other resources like ELB?
Can I extend whatever system I use, e.g., with a CD pipeline or third-party schedulers?
Do I need to operate another piece of software?
These are the questions and challenges that our customers had which led us to build Amazon ECS
Summing up, ECS reduces the amount of code you need to go from idea to implementation when building distributed systems.
So, rather than running Mesos or other cluster management software to manage a set of machines directly, you let ECS manage your instances.
Much of the undifferentiated heavy lifting and housekeeping has been abstracted behind a set of APIs.
The ability to run multiple tasks on a shared pool of resources can also lead to higher utilization and faster task completion than if compute resources are statically partitioned.
Let's talk a bit about how we achieve this concurrency control under the hood
We implemented Amazon ECS using one of Amazon's core distributed systems primitives:
a Paxos-based transactional journal data store that keeps a record of every change made to a data entry.
Any write to the data store is committed as a transaction in the journal with a specific order-based ID.
The current value in a data store is the sum of all transactions made as recorded by the journal.
Any read from the data store is only a snapshot in time of the journal.
For a write to succeed, the proposed write must be based on the latest transaction since the last read.
So if a user made a read, a few writes happened after that, and the user then tried to write based on the last seen ID, the write wouldn't succeed. For example, a scheduler that read a snapshot at IDN+2 cannot commit a write once another scheduler has committed IDN+3; it has to re-read and retry.
This primitive allows Amazon ECS to store its cluster state information with optimistic concurrency,
which is ideal in environments where constantly changing data is shared.
This architecture affords Amazon ECS high availability, low latency, and high throughput because the data store is never pessimistically locked.
Here's how it works: each scheduler periodically queries the current cluster state to check resource availability
To schedule a task, the scheduler makes a claim for any available resources
The scheduler then updates the cluster state with the newly claimed resources in an atomic transaction.
If a resource is already claimed, ECS will reject the transaction because it maintains concurrency control
So what ECS enables is called “shared state optimistic scheduling” where all schedulers can see the current state of the cluster at all times.
The reason we developed ECS was customers had been running containers and Docker on EC2 for quite some time.
What customers told us about was the difficulty of running these containers at scale, which generally involved installing and managing cluster management software
• Eliminates cluster management software
• Manages cluster state
• Manages containers
• Control and monitoring
• Scale from one to tens of thousands of containers
Earlier this year we ran a load test
Over a 3-day period we scaled our cluster from 200 to over 1,000 instances, as represented by the purple line
The green and red lines show the p99 and p50 latencies
As you can see, they are relatively flat, demonstrating that ECS is stable and will scale regardless of your cluster size
Amazon ECS is built to work with the AWS services you value. You can set up each cluster in its own Virtual Private Cloud and use security groups to control network access to your EC2 instances. You can store persistent information using EBS, and you can route traffic to containers using ELB. CloudTrail integration captures every API access for security analysis, resource change tracking, and compliance auditing
You can model your app using a file called a Task Definition
This file defines the containers you want to run together.
A task definition also lets you specify Docker concepts like links to establish network channels between the containers and the volumes your containers need.
Task definitions are tracked by name and revision, just like source code
To create a task definition, you can use the console to specify the Docker image to use for the containers
You can specify resources like CPU and memory, ports and volumes for each container.
You can specify what command to run when the container starts.
And the essential flag specifies whether the task should fail if the container stops running.
You can also type everything as JSON if you want
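As a sketch, a minimal two-container task definition in JSON might look like this (the family, container names, images, and values are illustrative); you would register it with aws ecs register-task-definition --cli-input-json file://myapp.json:

  {
    "family": "myapp",
    "containerDefinitions": [
      {
        "name": "web",
        "image": "myrepo/myapp:1.0",
        "cpu": 256,
        "memory": 512,
        "portMappings": [{"containerPort": 80, "hostPort": 0}],
        "links": ["redis"],
        "essential": true
      },
      {
        "name": "redis",
        "image": "redis:3",
        "cpu": 128,
        "memory": 256,
        "essential": false
      }
    ]
  }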
Once your task definition is created, scheduling it onto an instance with available resources creates a task
A task is an instantiation of a task definition.
You can have a task with just one container, or up to ten that work together on a single machine: maybe nginx in front of Rails, or Redis behind Rails.
You can run as many tasks on an instance as will fit.
Often people wonder about cross-host links. Those don't go in your task; put the pieces behind an ELB or a discovery system and make multiple tasks.
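A minimal example of launching tasks from the CLI (cluster and task definition names are illustrative):

  # Let the ECS scheduler place two copies on instances with available resources
  aws ecs run-task --cluster default --task-definition myapp:1 --count 2
  # Or target a specific instance yourself with start-task
  aws ecs start-task --cluster default --task-definition myapp:1 \
      --container-instances <container-instance-id>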
ECS has a scheduler that is good for long-running applications called the service scheduler
You reference a task definition and the number of tasks you want to run and then can optionally place it behind an ELB.
The scheduler will then launch the number of tasks that you requested
The scheduler will maintain the number of tasks you want running and automatically load balance them
Scaling up and down is simple. You just tell the scheduler how many tasks you need and the scheduler will automatically launch more tasks or terminate tasks
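A sketch of creating and scaling a service (service, ELB, and role names are illustrative):

  # Keep 3 copies of the task running behind an ELB
  aws ecs create-service --cluster default --service-name myapp \
      --task-definition myapp:1 --desired-count 3 \
      --load-balancers loadBalancerName=myapp-elb,containerName=web,containerPort=80 \
      --role ecsServiceRole
  # Scale up or down by changing the desired count
  aws ecs update-service --cluster default --service myapp --desired-count 6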
Updating a service is easy
You deploy the new version, and the scheduler will launch tasks with the new application version
It will drain the connection from the old containers and remove the containers
Leaving the newest containers running
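A sketch of that rollout, assuming a new task definition revision has been registered (names illustrative):

  # Point the service at the new revision; the scheduler starts new tasks,
  # drains connections from the old ones, and removes them
  aws ecs update-service --cluster default --service myapp --task-definition myapp:2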
Prior to today we had support for three placement strategies:
targeted instances through start-task
random placement through run-task
spread across AZs and instances through create-service
Now we have support for:
spread with placement groups (constraints)
Bin packing
Distinct instances
Affinity / Anti-Affinity
These offer customers greater control and choice over how to run their applications.
We are also introducing support for strategy chaining, which allows more complex algorithms for selecting the set of instances to run tasks and services.
For example, you can choose to spread a service across availability zones, but binpack on container instances within a zone.
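A sketch of that chained strategy (names illustrative):

  # Spread tasks across availability zones, then binpack on memory within each zone
  aws ecs create-service --cluster default --service-name myapp \
      --task-definition myapp:1 --desired-count 6 \
      --placement-strategy type=spread,field=attribute:ecs.availability-zone \
                           type=binpack,field=memory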
A great way to start interacting with these APIs is through the new cluster query language that the Placement Engine supports. This query language also allows you to build powerful runtime arguments to give you direct control over the placement of your application.
Let’s look at a few examples.
Here you can filter on an instance family or specific instance type. Useful for making placement decisions optimized for a specific set of resources.
Filter down to container instances in a specific availability zone.
This matches the t2 family in a specific availability zone.
This matches the specific instance types t2.small and t2.medium, or the g2 family, excluding us-east-1d.
There is also support for new custom attributes. Here is how to set a custom attribute and make placement decisions based on it.
Show an example of setting a custom attribute, querying via list-container-instances, then running a task.
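Here is a sketch of those steps (the attribute name, value, and ARN placeholder are illustrative):

  # Set a custom attribute on a container instance
  aws ecs put-attributes --cluster default \
      --attributes name=stack,value=prod,targetId=<container-instance-arn>
  # Query instances with the cluster query language
  aws ecs list-container-instances --cluster default \
      --filter "attribute:stack == prod and attribute:ecs.instance-type =~ t2.*"
  # Run a task constrained to the matching instances
  aws ecs run-task --cluster default --task-definition myapp:1 --count 2 \
      --placement-constraints type=memberOf,expression="attribute:stack == prod"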
Previously, the only way to do this was to use the same task definition, but that restricted you from scaling the frontend and the DB separately. This use case is now supported with affinity.
[Diagram: CloudWatch Events targets: Lambda, Kinesis, SQS, SNS]
Delivery to CWE is at least once, and CWE delivers at least once to targets
Let’s look at a scenario where you had 10 container instances. To start you’ll make a request to run some tasks or create a service. As part of that request you’ll specify CPU, memory, or port requirements.
In addition, you'll now also provide other constraints, such as a specific Availability Zone, AMI, or instance type.
And then last you’ll tell us the strategy you prefer for us to use when starting the tasks, which could range from spread for availability, binpack to optimize for utilization, place together (affinity) or place apart (anti-affinity), etc.
At the end of that process we have identified a set of instances that satisfies the requirements for the task you want to run and we place (or run) those tasks across your cluster based on the requirements specified.
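Putting the pieces together, one request can carry the constraint and the chained strategy (values illustrative):

  aws ecs run-task --cluster default --task-definition myapp:1 --count 4 \
      --placement-constraints type=memberOf,expression="attribute:ecs.instance-type =~ t2.*" \
      --placement-strategy type=spread,field=attribute:ecs.availability-zone \
                           type=binpack,field=memory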
This function only looks at tasks moving into the running state, but could be written to handle start and stop as well.
CWE to Lambda, and the Lambda function updates R53.
CloudWatch rule to filter on Task Running state changes
Lambda function to filter for Task Start / Stop events and update R53
R53 stores the active private IPs for the containers
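A sketch of wiring that up (the rule name and function ARN placeholder are illustrative):

  # Match ECS task state changes that move into RUNNING
  aws events put-rule --name ecs-task-running --event-pattern \
      '{"source":["aws.ecs"],"detail-type":["ECS Task State Change"],"detail":{"lastStatus":["RUNNING"]}}'
  # Send matching events to the Lambda function that updates Route 53
  # (the function must also grant CloudWatch Events permission to invoke it)
  aws events put-targets --rule ecs-task-running \
      --targets Id=1,Arn=<lambda-function-arn>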
Blox is a collection of open source projects for container management and orchestration. Blox gives customers control and choice over how your containerized applications run on Amazon ECS. With Blox, you can build custom schedulers or easily integrate existing scheduler logic to manage production applications – while still leveraging the ECS cluster manager and integrating with AWS platform features for load balancing, auth, and auto-scaling.
1. Choice: customers can now choose to use the built-in schedulers of Amazon ECS, run applications using the existing ECS APIs and placement engine, or integrate existing third-party schedulers.
2. Control: all of the code is open sourced, and we have provided reference architectures so customers can see a model for how to leverage the ECS cluster state. This is a customer-run open source project. We have had great discussions about the future of this project with customers and partners like Netflix and Yelp, and we invite the entire community to help decide the roadmap and what we should build.
3. Developer experience: we will continue to invest in end-to-end tooling that makes it easy for customers to build, test, and run applications on Amazon ECS. One example is the ecs-cli, and our vision is to continue building the tooling our customers tell us is useful and necessary to test locally and deploy predictably on Amazon ECS.
Start with a high-level overview of where Blox fits with Amazon ECS and the other features we've discussed today.
Set up CloudWatch Events to send task and container instance state changes from ECS to CWE.
With Blox we configure an SQS queue to store state transition messages.
The Blox cluster state service reads events from the queue and stores them in a local db.
Why did we build the cluster state service?
• Consume the event stream
• Persist state locally
• Handle state reconciliation
Rather than polling, you can now respond in real time to task and instance state changes across your ECS clusters
The CloudFormation template sets up the ECS event stream and SQS queue.
Provide an open and transparent roadmap to the tools that we are building to enable customers to run applications on Amazon ECS.
Focus on tooling that enables extensibility and end-to-end testing of applications.
Continue to add features that let customers use their scheduling frameworks of choice (e.g., Mesos) while also leveraging the managed schedulers, task placement capabilities, and cluster management of Amazon ECS.
Hear from customers about what's important, to help us make decisions around features, roadmap, and prioritization.