Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts form the foundation of every modern orchestration system, and each technical explanation is illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice and walk away with more insight into your production applications.
7. Orchestration
A control system for your cluster
[Diagram: control loop — the Orchestrator (O) compares Desired State (D) with the Cluster’s State (S) and applies operations (Δ)]
D = Desired State
O = Orchestrator
C = Cluster
S = State
Δ = Operations to converge S to D
https://en.wikipedia.org/wiki/Control_theory
8. Convergence
A functional view

f(D, S_{t-1}, C) → S_t | min(S_t − D)

D = Desired State
O = Orchestrator
C = Cluster
S_t = State at time t
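A minimal sketch of this convergence step in Go, under illustrative assumptions (the `State` type and `converge` function are hypothetical, not SwarmKit’s API): each step compares the desired state D against the last observed state S_{t-1} and emits the operations Δ that minimize the difference.

```go
package main

import "fmt"

// State maps a service name to its running replica count.
type State map[string]int

// converge computes the next state and the operations (Δ) needed to
// move the current state toward the desired state.
func converge(desired, current State) (State, []string) {
	next := State{}
	var ops []string
	for svc, want := range desired {
		have := current[svc]
		for have < want {
			ops = append(ops, fmt.Sprintf("start %s", svc))
			have++
		}
		for have > want {
			ops = append(ops, fmt.Sprintf("stop %s", svc))
			have--
		}
		next[svc] = have
	}
	return next, ops
}

func main() {
	desired := State{"redis": 3}
	current := State{"redis": 1}
	next, ops := converge(desired, current)
	fmt.Println(ops)           // [start redis start redis]
	fmt.Println(next["redis"]) // 3
}
```

Running the loop repeatedly with fresh observations is what makes the system self-healing: any drift in S reappears as new Δ on the next pass.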
11. Data Model Requirements
● Represent difference in cluster state
● Maximize observability
● Support convergence
● Do this while being extensible and reliable
12. Show me your data structures and I’ll show you your orchestration system
13. Services
● Express desired state of the cluster
● Abstraction to control a set of containers
● Enumerates resources, network availability, placement
● Leave the runtime details to the container process
● Implement these services by distributing processes across a cluster
[Diagram: processes distributed across Node 1, Node 2, Node 3]
14. message ServiceSpec {
// Task defines the task template this service will spawn.
TaskSpec task = 2 [(gogoproto.nullable) = false];
// UpdateConfig controls the rate and policy of updates.
UpdateConfig update = 6;
// Service endpoint specifies the user provided configuration
// to properly discover and load balance a service.
EndpointSpec endpoint = 8;
}
Service Spec
Protobuf Example
15. message Service {
ServiceSpec spec = 3;
// UpdateStatus contains the status of an update, if one is in
// progress.
UpdateStatus update_status = 5;
// Runtime state of service endpoint. This may be different
// from the spec version because the user may not have entered
// the optional fields like node_port or virtual_ip and it
// could be auto allocated by the system.
Endpoint endpoint = 4;
}
Service
Protobuf Example
18. message TaskSpec {
oneof runtime {
NetworkAttachmentSpec attachment = 8;
ContainerSpec container = 1;
}
// Resource requirements for the container.
ResourceRequirements resources = 2;
// RestartPolicy specifies what to do when a task fails or finishes.
RestartPolicy restart = 4;
// Placement specifies node selection constraints
Placement placement = 5;
// Networks specifies the list of network attachment
// configurations (which specify the network and per-network
// aliases) that this task spec is bound to.
repeated NetworkAttachmentConfig networks = 7;
}
Task Spec
Protobuf Example
19. message Task {
// Spec defines the desired state of the task as specified by the user.
// The system will honor this and will *never* modify it.
TaskSpec spec = 3 [(gogoproto.nullable) = false];
// DesiredState is the target state for the task. It is set to
// TaskStateRunning when a task is first created, and changed to
// TaskStateShutdown if the manager wants to terminate the task. This field
// is only written by the manager.
TaskState desired_state = 10;
TaskStatus status = 9 [(gogoproto.nullable) = false];
}
Task
Protobuf Example
21. Kubernetes
type Deployment struct {
// Specification of the desired behavior of the Deployment.
// +optional
Spec DeploymentSpec
// Most recently observed status of the Deployment.
// +optional
Status DeploymentStatus
}
Parallel Concepts
22. Declarative
$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz
$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID NAME REPLICAS IMAGE COMMAND
pe5kyzetw2wu serene_euler 3/3 redis
33. We’ve got a bunch of nodes… now what?
Task management is only half the fun. These tasks can be scheduled across a cluster for HA!
34. We’ve got a bunch of nodes… now what?
Your cluster needs to communicate.
35. Manager <-> Worker Communication
● Not the mechanism, but the sequence and frequency
● Three scenarios important to us now
○ Node registration
○ Workload dispatching
○ State reporting
36. Manager <-> Worker Communication
● Most approaches take the form of two patterns
○ Push, where the Managers push to the Workers
○ Pull, where the Workers pull from the Managers
37. Push Model / Pull Model
[Diagram — Push Model: 1 - Register, 2 - Discover, 3 - Payload; Pull Model: Registration & Payload over one worker-initiated connection]
38. Push Model / Pull Model

Push Model
Pros: Provides better control over communication rate
- Managers decide when to contact Workers
Cons: Requires a discovery mechanism
- More failure scenarios
- Harder to troubleshoot

Pull Model
Pros: Simpler to operate
- Workers connect to Managers and don’t need to bind
- Can easily traverse networks
- Easier to secure
- Fewer moving parts
Cons: Workers must maintain connection to Managers at all times
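A rough sketch of the pull pattern (the `Manager` and `Worker` types below are hypothetical, not SwarmKit’s actual API): because the worker initiates contact and drains work from the manager, no discovery mechanism is needed on the manager side.

```go
package main

import "fmt"

// Manager holds a queue of pending tasks to hand out.
type Manager struct{ queue []string }

// Dispatch returns the next pending task, if any.
func (m *Manager) Dispatch() (string, bool) {
	if len(m.queue) == 0 {
		return "", false
	}
	task := m.queue[0]
	m.queue = m.queue[1:]
	return task, true
}

// Worker records the tasks it has picked up.
type Worker struct{ done []string }

// Pull drains tasks from the manager; in a real system this would be
// a long-lived stream rather than a loop of request/response calls.
func (w *Worker) Pull(m *Manager) {
	for {
		task, ok := m.Dispatch()
		if !ok {
			return
		}
		w.done = append(w.done, task)
	}
}

func main() {
	m := &Manager{queue: []string{"task-1", "task-2"}}
	w := &Worker{}
	w.Pull(m)
	fmt.Println(w.done) // [task-1 task-2]
}
```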
39. Kubernetes
● Fetching logs
● Attaching to running pods via kubectl
● Port-forwarding
40. SwarmKit
● Work dispatching in batches
● Next heartbeat timeout
[Diagram: Manager → Worker]
41. SwarmKit
● Always open
● Self-registration
● Healthchecks
● Task status
● ...
[Diagram: Worker → Manager]
43. Rate Control: Heartbeats
● Manager dictates heartbeat rate to Workers
● Rate is configurable (not by end user)
● Managers agree on the same rate by consensus via Raft
● Managers add jitter so pings are spread over time (avoid bursts)
Worker: “Ping?” Manager: “Pong! Ping me back in 5.2 seconds”
44. Rate Control: Workloads
● Worker opens a gRPC stream to receive workloads
● Manager can send data whenever it wants to
● Manager will send data in batches
● Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
● Adds little delay (at most 100 ms) but drastically reduces the amount of communication
Worker: “Give me work to do”
100ms - [Batch of 12 ]
200ms - [Batch of 26 ]
300ms - [Batch of 32 ]
340ms - [Batch of 100]
360ms - [Batch of 100]
460ms - [Batch of 42 ]
560ms - [Batch of 23 ]
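The batching rule above (flush at 100 changes or after 100 ms, whichever comes first) can be sketched with a channel and a timer. This is a simplified illustration of the technique, not SwarmKit’s implementation.

```go
package main

import (
	"fmt"
	"time"
)

// batcher buffers incoming changes and flushes them when the batch
// reaches 100 items or 100ms have passed since the first buffered
// change, whichever occurs first.
func batcher(in <-chan string, out chan<- []string) {
	const maxBatch = 100
	const maxWait = 100 * time.Millisecond
	var batch []string
	var timer <-chan time.Time // nil channel: never ready
	for {
		select {
		case change, ok := <-in:
			if !ok {
				if len(batch) > 0 {
					out <- batch // flush the remainder on shutdown
				}
				close(out)
				return
			}
			if batch == nil {
				timer = time.After(maxWait) // start the 100ms window
			}
			batch = append(batch, change)
			if len(batch) == maxBatch {
				out <- batch
				batch, timer = nil, nil
			}
		case <-timer:
			out <- batch // window expired: flush a partial batch
			batch, timer = nil, nil
		}
	}
}

func main() {
	in := make(chan string)
	out := make(chan []string)
	go batcher(in, out)
	go func() {
		for i := 0; i < 150; i++ {
			in <- fmt.Sprintf("change-%d", i)
		}
		close(in)
	}()
	// typically prints "batch of 100" then "batch of 50"
	for b := range out {
		fmt.Println("batch of", len(b))
	}
}
```

The nil-timer trick keeps the select simple: a nil channel is never ready, so the timeout case only fires while a batch is actually buffered.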
46. The Raft Consensus Algorithm
● This talk is about SwarmKit, but Raft is also used by Kubernetes
● The implementation is slightly different but the behavior is the same
47. The Raft Consensus Algorithm
SwarmKit and Kubernetes don’t decide how to handle leader election and log replication; Raft does!
github.com/kubernetes/kubernetes/tree/master/vendor/github.com/coreos/etcd/raft
github.com/docker/swarmkit/tree/master/vendor/github.com/coreos/etcd/raft
No really...
48. The Raft Consensus Algorithm
Want to know more about Raft?
secretlivesofdata.com
49. Single Source of Truth
● SwarmKit implements Raft directly, instead
of via etcd
● The logs live in /var/lib/docker/swarm (SwarmKit) or /var/lib/etcd/member/wal (etcd)
● Easy to observe (and even read)
53. Reducing Network Load
● Followers multiplex all workers to the Leader using a single connection
● Backed by gRPC channels (HTTP/2 streams)
● Reduces Leader networking load by spreading the connections evenly
Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.
[Diagram: Leader, Follower, Follower]
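The arithmetic in the example can be checked directly; with workers spread evenly, each manager terminates workers/managers connections, regardless of which one is the leader:

```go
package main

import "fmt"

// connectionsPerManager returns how many worker connections each
// manager terminates when workers are spread evenly across managers.
func connectionsPerManager(workers, managers int) int {
	return workers / managers
}

func main() {
	// 10,000 workers across 5 managers, as in the slide's example
	fmt.Println(connectionsPerManager(10000, 5)) // 2000
}
```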
54. Leader Failure (Raft!)
● Upon Leader failure, a new one is elected
● All managers start redirecting worker traffic to the new one
● Transparent to workers
[Diagram: Leader, Follower, Follower]
55. Leader Election (Raft!)
[Diagram: Follower, Follower, Leader]
56. Manager Failure
● Manager sends list of all managers’ addresses to Workers
- Manager 1 Addr
- Manager 2 Addr
- Manager 3 Addr
● When a new manager joins, all workers are notified
● Upon manager failure, workers will reconnect to a different manager
[Diagram: Leader, Follower, Follower]
57. Manager Failure (Worker POV)
[Diagram: Leader, Follower, Follower]
58. Manager Failure (Worker POV)
[Diagram: Leader, Follower, Follower — worker reconnects to a random manager]
61. Presence
● Swarm still handles node management, even if you use the Kubernetes scheduler
● Manager Leader commits Worker state (Up or Down) into Raft
○ Propagates to all managers via Raft
○ Recoverable in case of leader re-election
62. Presence
● Heartbeat TTLs kept in Manager Leader memory
○ Otherwise, every ping would result in a quorum write
○ Leader keeps worker<->TTL in a heap (time.AfterFunc)
○ Upon leader failover, workers are given a grace period to reconnect
■ Workers are considered Unknown until they reconnect
■ If they do, they move back to Up
■ If they don’t, they move to Down
64. Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version; they are rejected if it is out of date
● Provides CAS (compare-and-swap) semantics
● Also exposed through API calls that change objects in the store
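The sequencer’s versioning rule can be sketched as a compare-and-swap store. This is a toy model with an in-process counter standing in for the Raft index; the type and method names are illustrative, not SwarmKit’s API.

```go
package main

import (
	"errors"
	"fmt"
)

// Object carries the Raft index at which it was last written.
type Object struct {
	Version uint64
	Value   string
}

// Store rejects any update whose base Version is stale.
type Store struct {
	objects map[string]Object
	index   uint64 // simulated Raft log index
}

var ErrSequenceConflict = errors.New("update out of sequence")

// Update succeeds only if the caller's base Version matches the
// stored one (CAS semantics), then stamps the new Raft index.
func (s *Store) Update(id string, base uint64, value string) error {
	cur, ok := s.objects[id]
	if ok && cur.Version != base {
		return ErrSequenceConflict // someone updated it since we read
	}
	s.index++
	s.objects[id] = Object{Version: s.index, Value: value}
	return nil
}

func main() {
	s := &Store{objects: map[string]Object{}}
	s.Update("svc", 0, "replicas=1")
	obj := s.objects["svc"]
	// a writer using the current version succeeds...
	fmt.Println(s.Update("svc", obj.Version, "replicas=3")) // <nil>
	// ...a writer reusing the now-stale version is rejected
	fmt.Println(s.Update("svc", obj.Version, "replicas=5")) // update out of sequence
}
```

This is why stale clients cannot silently overwrite newer state: they must re-read, pick up the new Version, and retry.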