Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts form the foundation of every modern orchestration system, and each technical explanation is illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice and walk away with more insight into your production applications.
7. Orchestration
A control system for your cluster
[Diagram: control loop — the Orchestrator (O) compares Desired State (D) with the Cluster’s State (S) and applies operations (Δ)]
D = Desired State
O = Orchestrator
C = Cluster
S = State
Δ = Operations to converge S to D
https://en.wikipedia.org/wiki/Control_theory
8. Convergence
A functional view

f(D, S_{t-1}, C) → S_t | min(S_t − D)

D = Desired State
O = Orchestrator
C = Cluster
S_t = State at time t
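A minimal sketch of this convergence step in Go, under illustrative assumptions (the `State` type and `converge` function are hypothetical, not SwarmKit’s API): each step compares the desired state D against the last observed state S_{t-1} and emits the operations Δ that minimize the difference.

```go
package main

import "fmt"

// State maps a service name to its running replica count.
type State map[string]int

// converge computes the next state and the operations (Δ) needed to
// move the current state toward the desired state.
func converge(desired, current State) (State, []string) {
	next := State{}
	var ops []string
	for svc, want := range desired {
		have := current[svc]
		for have < want {
			ops = append(ops, fmt.Sprintf("start %s", svc))
			have++
		}
		for have > want {
			ops = append(ops, fmt.Sprintf("stop %s", svc))
			have--
		}
		next[svc] = have
	}
	return next, ops
}

func main() {
	desired := State{"redis": 3}
	current := State{"redis": 1}
	next, ops := converge(desired, current)
	fmt.Println(ops)           // [start redis start redis]
	fmt.Println(next["redis"]) // 3
}
```

Running the loop repeatedly with fresh observations is what makes the system self-healing: any drift in S reappears as new Δ on the next pass.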
11. Data Model Requirements
● Represent difference in cluster state
● Maximize observability
● Support convergence
● Do this while being extensible and reliable
12. Show me your data structures and I’ll show you your orchestration system
13. Services
● Express desired state of the cluster
● Abstraction to control a set of containers
● Enumerates resources, network availability, placement
● Leave the runtime details to the container process
● Implement these services by distributing processes across a cluster
[Diagram: processes distributed across Node 1, Node 2, Node 3]
14. message ServiceSpec {
// Task defines the task template this service will spawn.
TaskSpec task = 2 [(gogoproto.nullable) = false];
// UpdateConfig controls the rate and policy of updates.
UpdateConfig update = 6;
// Service endpoint specifies the user provided configuration
// to properly discover and load balance a service.
EndpointSpec endpoint = 8;
}
Service Spec
Protobuf Example
15. message Service {
ServiceSpec spec = 3;
// UpdateStatus contains the status of an update, if one is in
// progress.
UpdateStatus update_status = 5;
// Runtime state of service endpoint. This may be different
// from the spec version because the user may not have entered
// the optional fields like node_port or virtual_ip and it
// could be auto allocated by the system.
Endpoint endpoint = 4;
}
Service
Protobuf Example
18. message TaskSpec {
oneof runtime {
NetworkAttachmentSpec attachment = 8;
ContainerSpec container = 1;
}
// Resource requirements for the container.
ResourceRequirements resources = 2;
// RestartPolicy specifies what to do when a task fails or finishes.
RestartPolicy restart = 4;
// Placement specifies node selection constraints
Placement placement = 5;
// Networks specifies the list of network attachment
// configurations (which specify the network and per-network
// aliases) that this task spec is bound to.
repeated NetworkAttachmentConfig networks = 7;
}
Task Spec
Protobuf Example
19. message Task {
// Spec defines the desired state of the task as specified by the user.
// The system will honor this and will *never* modify it.
TaskSpec spec = 3 [(gogoproto.nullable) = false];
// DesiredState is the target state for the task. It is set to
// TaskStateRunning when a task is first created, and changed to
// TaskStateShutdown if the manager wants to terminate the task. This field
// is only written by the manager.
TaskState desired_state = 10;
TaskStatus status = 9 [(gogoproto.nullable) = false];
}
Task
Protobuf Example
21. Kubernetes
type Deployment struct {
// Specification of the desired behavior of the Deployment.
// +optional
Spec DeploymentSpec
// Most recently observed status of the Deployment.
// +optional
Status DeploymentStatus
}
Parallel Concepts
22. Declarative
$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz
$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID NAME REPLICAS IMAGE COMMAND
pe5kyzetw2wu serene_euler 3/3 redis
33. We’ve got a bunch of nodes… now what?
Task management is only half the fun. These tasks can be scheduled across a cluster for HA!
34. We’ve got a bunch of nodes… now what?
Your cluster needs to communicate.
35. Manager <-> Worker Communication
● Not the mechanism, but the sequence and frequency
● Three scenarios important to us now
○ Node registration
○ Workload dispatching
○ State reporting
36. Manager <-> Worker Communication
● Most approaches take the form of two patterns
○ Push, where the Managers push to the Workers
○ Pull, where the Workers pull from the Managers
37. Push Model / Pull Model
[Diagram — Push Model: 1 - Register, 2 - Discover, 3 - Payload; Pull Model: Registration & Payload over one worker-initiated connection]
38. Push Model / Pull Model

Push Model
Pros: Provides better control over communication rate
- Managers decide when to contact Workers
Cons: Requires a discovery mechanism
- More failure scenarios
- Harder to troubleshoot

Pull Model
Pros: Simpler to operate
- Workers connect to Managers and don’t need to bind
- Can easily traverse networks
- Easier to secure
- Fewer moving parts
Cons: Workers must maintain connection to Managers at all times
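A rough sketch of the pull pattern (the `Manager` and `Worker` types below are hypothetical, not SwarmKit’s actual API): because the worker initiates contact and drains work from the manager, no discovery mechanism is needed on the manager side.

```go
package main

import "fmt"

// Manager holds a queue of pending tasks to hand out.
type Manager struct{ queue []string }

// Dispatch returns the next pending task, if any.
func (m *Manager) Dispatch() (string, bool) {
	if len(m.queue) == 0 {
		return "", false
	}
	task := m.queue[0]
	m.queue = m.queue[1:]
	return task, true
}

// Worker records the tasks it has picked up.
type Worker struct{ done []string }

// Pull drains tasks from the manager; in a real system this would be
// a long-lived stream rather than a loop of request/response calls.
func (w *Worker) Pull(m *Manager) {
	for {
		task, ok := m.Dispatch()
		if !ok {
			return
		}
		w.done = append(w.done, task)
	}
}

func main() {
	m := &Manager{queue: []string{"task-1", "task-2"}}
	w := &Worker{}
	w.Pull(m)
	fmt.Println(w.done) // [task-1 task-2]
}
```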
39. Kubernetes
● Fetching logs
● Attaching to running pods via kubectl
● Port-forwarding
40. SwarmKit
● Work dispatching in batches
● Next heartbeat timeout
[Diagram: Manager → Worker]
41. SwarmKit
● Always open
● Self-registration
● Healthchecks
● Task status
● ...
[Diagram: Worker → Manager]
43. Rate Control: Heartbeats
● Manager dictates heartbeat rate to Workers
● Rate is configurable (not by end user)
● Managers agree on the same rate by consensus via Raft
● Managers add jitter so pings are spread over time (avoid bursts)
Worker: “Ping?” Manager: “Pong! Ping me back in 5.2 seconds”
44. Rate Control: Workloads
● Worker opens a gRPC stream to receive workloads
● Manager can send data whenever it wants to
● Manager will send data in batches
● Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
● Adds little delay (at most 100 ms) but drastically reduces the amount of communication
Worker: “Give me work to do”
100ms - [Batch of 12 ]
200ms - [Batch of 26 ]
300ms - [Batch of 32 ]
340ms - [Batch of 100]
360ms - [Batch of 100]
460ms - [Batch of 42 ]
560ms - [Batch of 23 ]
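The batching rule above (flush at 100 changes or after 100 ms, whichever comes first) can be sketched with a channel and a timer. This is a simplified illustration of the technique, not SwarmKit’s implementation.

```go
package main

import (
	"fmt"
	"time"
)

// batcher buffers incoming changes and flushes them when the batch
// reaches 100 items or 100ms have passed since the first buffered
// change, whichever occurs first.
func batcher(in <-chan string, out chan<- []string) {
	const maxBatch = 100
	const maxWait = 100 * time.Millisecond
	var batch []string
	var timer <-chan time.Time // nil channel: never ready
	for {
		select {
		case change, ok := <-in:
			if !ok {
				if len(batch) > 0 {
					out <- batch // flush the remainder on shutdown
				}
				close(out)
				return
			}
			if batch == nil {
				timer = time.After(maxWait) // start the 100ms window
			}
			batch = append(batch, change)
			if len(batch) == maxBatch {
				out <- batch
				batch, timer = nil, nil
			}
		case <-timer:
			out <- batch // window expired: flush a partial batch
			batch, timer = nil, nil
		}
	}
}

func main() {
	in := make(chan string)
	out := make(chan []string)
	go batcher(in, out)
	go func() {
		for i := 0; i < 150; i++ {
			in <- fmt.Sprintf("change-%d", i)
		}
		close(in)
	}()
	// typically prints "batch of 100" then "batch of 50"
	for b := range out {
		fmt.Println("batch of", len(b))
	}
}
```

The nil-timer trick keeps the select simple: a nil channel is never ready, so the timeout case only fires while a batch is actually buffered.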
46. The Raft Consensus Algorithm
● This talk is about SwarmKit, but Raft is also used by Kubernetes
● The implementation is slightly different but the behavior is the same
47. The Raft Consensus Algorithm
SwarmKit and Kubernetes don’t decide how to handle leader election and log replication; Raft does!
github.com/kubernetes/kubernetes/tree/master/vendor/github.com/coreos/etcd/raft
github.com/docker/swarmkit/tree/master/vendor/github.com/coreos/etcd/raft
No really...
48. The Raft Consensus Algorithm
Want to know more about Raft?
secretlivesofdata.com
49. Single Source of Truth
● SwarmKit implements Raft directly, instead
of via etcd
● The logs live in /var/lib/docker/swarm (SwarmKit) or /var/lib/etcd/member/wal (etcd)
● Easy to observe (and even read)
53. Reducing Network Load
● Followers multiplex all workers to the Leader using a single connection
● Backed by gRPC channels (HTTP/2 streams)
● Reduces Leader networking load by spreading the connections evenly
Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.
[Diagram: Leader, Follower, Follower]
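The arithmetic in the example can be checked directly; with workers spread evenly, each manager terminates workers/managers connections, regardless of which one is the leader:

```go
package main

import "fmt"

// connectionsPerManager returns how many worker connections each
// manager terminates when workers are spread evenly across managers.
func connectionsPerManager(workers, managers int) int {
	return workers / managers
}

func main() {
	// 10,000 workers across 5 managers, as in the slide's example
	fmt.Println(connectionsPerManager(10000, 5)) // 2000
}
```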
54. Leader Failure (Raft!)
● Upon Leader failure, a new one is elected
● All managers start redirecting worker traffic to the new one
● Transparent to workers
[Diagram: Leader, Follower, Follower]
55. Leader Election (Raft!)
[Diagram: Follower, Follower, Leader]
56. Manager Failure
● Manager sends list of all managers’ addresses to Workers
- Manager 1 Addr
- Manager 2 Addr
- Manager 3 Addr
● When a new manager joins, all workers are notified
● Upon manager failure, workers will reconnect to a different manager
[Diagram: Leader, Follower, Follower]
57. Manager Failure (Worker POV)
[Diagram: Leader, Follower, Follower]
58. Manager Failure (Worker POV)
[Diagram: Leader, Follower, Follower — worker reconnects to a random manager]
61. Presence
● Swarm still handles node management, even if you use the Kubernetes scheduler
● Manager Leader commits Worker state (Up or Down) into Raft
○ Propagates to all managers via Raft
○ Recoverable in case of leader re-election
62. Presence
● Heartbeat TTLs kept in Manager Leader memory
○ Otherwise, every ping would result in a quorum write
○ Leader keeps worker<->TTL in a heap (time.AfterFunc)
○ Upon leader failover, workers are given a grace period to reconnect
■ Workers are considered Unknown until they reconnect
■ If they do, they move back to Up
■ If they don’t, they move to Down
64. Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version; they are rejected if it is out of date
● Provides CAS (compare-and-swap) semantics
● Also exposed through API calls that change objects in the store
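The sequencer’s versioning rule can be sketched as a compare-and-swap store. This is a toy model with an in-process counter standing in for the Raft index; the type and method names are illustrative, not SwarmKit’s API.

```go
package main

import (
	"errors"
	"fmt"
)

// Object carries the Raft index at which it was last written.
type Object struct {
	Version uint64
	Value   string
}

// Store rejects any update whose base Version is stale.
type Store struct {
	objects map[string]Object
	index   uint64 // simulated Raft log index
}

var ErrSequenceConflict = errors.New("update out of sequence")

// Update succeeds only if the caller's base Version matches the
// stored one (CAS semantics), then stamps the new Raft index.
func (s *Store) Update(id string, base uint64, value string) error {
	cur, ok := s.objects[id]
	if ok && cur.Version != base {
		return ErrSequenceConflict // someone updated it since we read
	}
	s.index++
	s.objects[id] = Object{Version: s.index, Value: value}
	return nil
}

func main() {
	s := &Store{objects: map[string]Object{}}
	s.Update("svc", 0, "replicas=1")
	obj := s.objects["svc"]
	// a writer using the current version succeeds...
	fmt.Println(s.Update("svc", obj.Version, "replicas=3")) // <nil>
	// ...a writer reusing the now-stale version is rejected
	fmt.Println(s.Update("svc", obj.Version, "replicas=5")) // update out of sequence
}
```

This is why stale clients cannot silently overwrite newer state: they must re-read, pick up the new Version, and retry.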