Container orchestration from theory to practice

v
Stephen Day & Laura Frank Tacho
Container Orchestration
from Theory to Practice

@stevvooe
Container enthusiast
Stephen Day
@rhein_wein
Director of Engineering, CloudBees
Laura Tacho

● Understanding the SwarmKit object model
● Node topology
● Distributed consensus and Raft
Agenda

An open source framework for building
orchestration systems
github.com/docker/swarmkit
SwarmKit

containerd
kubernetes
docker
swarmkitorchestration
container runtime
user tooling

The Old Way
Cluster
https://en.wikipedia.org/wiki/Control_theory
Orchestration

A control system for your cluster
ClusterO
-
Δ S
D
D = Desired State
O = Orchestrator
C = Cluster
S = State
Δ = Operations to converge S to D
https://en.wikipedia.org/wiki/Control_theory
Orchestration

Convergence
A functional view D = Desired State
O = Orchestrator
C = Cluster
St
= State at time t
f(D, St-1
, C) → St
| min(St
-D)

The Problem
Low Observability High Observability
Failure Process State User Input
Observability and Controllability

Data Model Requirements
● Represent difference in cluster state
● Maximize observability
● Support convergence
● Do this while being extensible and reliable

Show me your data structures and I’ll
show you your orchestration system

Services
● Express desired state of the cluster
● Abstraction to control a set of containers
● Enumerates resources, network availability, placement
● Leave the details of runtime to container process
● Implement these services by distributing processes across a cluster
Node 1 Node 2 Node 3

message ServiceSpec {
// Task defines the task template this service will spawn.
TaskSpec task = 2 [(gogoproto.nullable) = false];
// UpdateConfig controls the rate and policy of updates.
UpdateConfig update = 6;
// Service endpoint specifies the user provided configuration
// to properly discover and load balance a service.
EndpointSpec endpoint = 8;
}
Service Spec
Protobuf Example

message Service {
ServiceSpec spec = 3;
// UpdateStatus contains the status of an update, if one is in
// progress.
UpdateStatus update_status = 5;
// Runtime state of service endpoint. This may be different
// from the spec version because the user may not have entered
// the optional fields like node_port or virtual_ip and it
// could be auto allocated by the system.
Endpoint endpoint = 4;
}
Service
Protobuf Example

Object
Current State
Spec
Desired State
Reconciliation
Spec -> Object

Prepare: setup resources
Start: start the task
Wait: wait until task exits
Shutdown: stop task, cleanly
Task Model
Runtime

message TaskSpec {
oneof runtime {
NetworkAttachmentSpec attachment = 8;
ContainerSpec container = 1;
}
// Resource requirements for the container.
ResourceRequirements resources = 2;
// RestartPolicy specifies what to do when a task fails or finishes.
RestartPolicy restart = 4;
// Placement specifies node selection constraints
Placement placement = 5;
// Networks specifies the list of network attachment
// configurations (which specify the network and per-network
// aliases) that this task spec is bound to.
repeated NetworkAttachmentConfig networks = 7;
}
Task Spec
Protobuf Example

message Task {
// Spec defines the desired state of the task as specified by
// the user.
// The system will honor this and will *never* modify it.
TaskSpec spec = 3 [(gogoproto.nullable) = false];
// DesiredState is the target state for the task. It is set to
// TaskStateRunning when a task is first created, and changed
// to TaskStateShutdown if the manager wants to terminate the
// task. This field is only written by the manager.
TaskState desired_state = 10;
TaskStatus status = 9 [(gogoproto.nullable) = false];
}
Task
Protobuf Example

Orchestrator
Task Model
Atomic Scheduling Unit of SwarmKit
Object
Current State
Spec
Desired
State
Task0
Task1
…
Taskn Scheduler

type Deployment struct {
// Specification of the desired behavior of the Deployment.
// +optional
Spec DeploymentSpec
// Most recently observed status of the Deployment.
// +optional
Status DeploymentStatus
}
Kubernetes
Parallel Concepts

Declarative - SwarmKit
$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz
$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID NAME REPLICAS IMAGE COMMAND
pe5kyzetw2wu serene_euler 3/3 redis

Declarative - Kubernetes
$ kubectl create deployment --image redis redis
deployment.extensions "redis" created
$ kubectl scale deployment redis --replicas=3
deployment.extensions "redis" scaled
$ kubectl rollout status deploy redis
deployment "redis" successfully rolled out

Manager
Task
Task
Data Flow
ServiceSpec
TaskSpec
Service
ServiceSpec
TaskSpec
Task
TaskSpec
Worker

Only one component of the system
can write to a field
Field Ownership
Consistency

message Task {
TaskSpec spec = 3;
string service_id = 4;
uint64 slot = 5;
string node_id = 6;
TaskStatus status = 9;
TaskState desired_state = 10;
repeated NetworkAttachment networks = 11;
Endpoint endpoint = 12;
Driver log_driver = 13;
}
Owner
User
Orchestrator
Allocator
Scheduler
Shared
Task
Protobuf Example

State Owner
< Assigned Manager
>= Assigned Worker
Task Status
Field Handoff

Worker
Pre-Run
Preparing
Manager
Terminal States
Task State
New Allocated Assigned
Ready Starting
Running
Complete
Shutdown
Failed
Rejected

Observability and Controllability
The Problem
Low Observability High Observability
Failure Process State User Input

We’ve got a bunch of nodes…
now what?
Task management is only half the fun. These
tasks can be scheduled across a cluster for HA!

We’ve got a bunch of nodes…
now what?
Your cluster needs to communicate!

Not the mechanism, but the sequence and frequency
Three scenarios important to us now
• Node registration
• Workload dispatching
• State reporting
Manager <-> Worker Communication

Most approaches take the form of two patterns
• Push, where the Managers push to the Workers
• Pull, where the Workers pull from the Managers
Manager <-> Worker Communication

3 - Payload
1 - Register
2 - Discover
Registration &
Payload
Push Model Pull Model
Payload

Pros Provides better control over
communication rate
- Managers decide when to
contact Workers
Cons Requires a discovery
mechanism
- More failure scenarios
- Harder to troubleshoot
Pros Simpler to operate
- Workers connect to Managers
and don’t need to bind
- Can easily traverse networks
- Easier to secure
- Fewer moving parts
Cons Workers must maintain connection
to Managers at all times
Push Model Pull Model

Fetching logs
Attaching to running pods via kubectl
Port forwarding
Manager
Kubernetes

Work dispatching in batches
Next heartbeat timeout
Manager
Worker
SwarmKit

Always open
Self-registration
Healthchecks
Task status
...
Manager
Worker
SwarmKit

Rate Control: Heartbeats
Manager dictates heartbeat rate to
Workers
Rate is configurable (not by end user)
Managers agree on same rate by
consensus via Raft
Managers add jitter so pings are
spread over time (avoid bursts)
Ping? Pong!
Ping me back in
5.2 seconds

Rate Control: Workloads
Worker opens a gRPC stream to receive
workloads
Manager can send data whenever it wants
Manager will send data in batches
Changes are buffered and sent in batches
of 100 or every 100 ms, whichever occurs
first
Adds little delay (at most 100ms) but
drastically reduces amount of
communication
Give me
work to do
100ms - [Batch of 12 ]
340ms - [Batch of 100]
360ms - [Batch of 100]

● Backbone of distributed consensus in etcd and
SwarmKit (and other places!)
The Raft Consensus Algorithm

SwarmKit and Kubernetes don’t make the decision about
how to handle leader election and log replication, Raft
does!
github.com/kubernetes/kubernetes/tree/master/vendor/github.com/coreos/etcd/raft
github.com/docker/swarmkit/tree/master/vendor/github.com/coreos/etcd/raft

secretlivesofdata.com
Want to know more about Raft?

Single Source of Truth
● SwarmKit implements Raft directly, instead of via etcd
● The logs live in /var/lib/docker/swarm
● Or /var/lib/etcd/member/wal
● Easy to observe (and even read)

v
in SwarmKit
Manager <> Worker Communication

Communication
Leader FollowerFollower Worker can connect to any
reachable Manager
Followers will forward traffic to
the Leader

Reducing Network Load
Followers multiplex all
workers to the Leader using a
single connection
Backed by gRPC channels
(HTTP/2 streams)
Reduces Leader networking
load by spreading the
connections evenly
Example: On a cluster with 10,000 workers and 5 managers,
each will only have to handle about 2,000 connections. Each
follower will forward its 2,000 workers using a single socket to
the leader.
Leader FollowerFollower

Leader Failure (Raft!)
Upon Leader failure, a new
one is elected
All managers start redirecting
worker traffic to the new one
Transparent to workers

Leader Election (Raft!)
Follower FollowerLeader Upon Leader failure, a new
one is elected
All managers start redirecting
worker traffic to the new one
Transparent to workers

Manager Failure
- Manager 1 Addr
- Manager 2 Addr
- Manager 3 Addr
Manager sends list of all
managers’ addresses to
Workers
When a new manager joins, all
workers are notified
Upon manager failure, workers
will reconnect to a different
manager

Manager Failure (Worker POV)
Workers
manager

Manager Failure (Worker POV)
Reconnect to
random manager
Workers
manager

v
Presence
Scalable presence in a distributed system

Presence
Swarm still handles node management, even if
you use the Kubernetes scheduler
Manager Leader commits Worker state (Up or
Down) into Raft
• Propagates to all managers via Raft
• Recoverable in case of leader re-election

Presence
● Heartbeat TTLs kept in Manager Leader memory
○ Otherwise, every ping would result in a quorum write
○ Leader keeps worker<->TTL in a heap (time.AfterFunc)
○ Upon leader failover workers are given a grace period to
reconnect
■ Workers considered Unknown until they reconnect
■ If they do they move back to Up
■ If they don’t they move to Down

Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last
updated
● Updates must provide a base Version; are rejected if it is out
of date
● Provides CAS semantics
● Also exposed through API calls that change objects in the
store

Versioned Updates
Consistency
service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
update(spec, service.Version)

Sequencer
Original object:
Service ABC
Version = 189
Spec
Replicas = 4
Image = registry:2.3.0
...
Raft index when it was last updated

Sequencer
Service ABC
Spec
Replicas = 4
...
Version = 189
Service ABC
Spec
Replicas = 4
...
Version = 189
Update request:Original object:

Sequencer
Service ABC
Spec
Replicas = 4
...
Version = 189
Original object:
Service ABC
Spec
Replicas = 4
...
Version = 189
Update request:

Sequencer
Service ABC
Spec
Replicas = 4
...
Version = 190
Updated object:

Sequencer
Service ABC
Spec
Replicas = 4
...
Version = 190
Service ABC
Spec
Replicas = 5
...
Version = 189
Update request:Updated object:

github.com/docker/swarmkit
github.com/coreos/etcd (Raft)

Thank you!
@stevvooe
@rhein_wein

#TODO
- Rate control (Kubernetes)
- Where is push vs pull used in k8s?
- Object Model: Field ownership
- TLS in k8s
- Kubernetes node visualization (if using a stack deploy, does it show on the
docker visualizer)

Container orchestration from theory to practice

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Container orchestration from theory to practice

Semelhante a Container orchestration from theory to practice (20)

Mais de Docker, Inc.

Mais de Docker, Inc. (20)

Último

Último (20)

Container orchestration from theory to practice