1. Ready to get shipped?
By Chafik Belhaoues
@XebiaFr
2. Introduction [History & newness of the idea]
Anatomy of the building blocks
Namespaces
cgroups
Storage backends
Execution environments
3. A little bit of history:
Shipping containers were created in 1956 by Malcolm McLean in New York, simply because
time is money (-90% of transport costs).
BEFORE AFTER
6. The need for containerization:
Develop, ship, and run applications {everywhere}.
Concept? Product? Life-cycle engine? …you said {DevOps} tool?
A single, runnable, distributable executable.
How does it differ from other forms of virtualization, then?
Open source [CS version].
Not OS-related [theoretically].
No hypervisor needed.
A different [new] vision of IT.
Closer to most IT needs.
7. Hardware-centric:
A VM packages a full stack (virtual hardware, kernel, a user space).
Designed with machine operators in mind, not software developers.
VMs offer no facilities for application versioning, monitoring, configuration, logging or service
discovery…
Application-centric:
Packages only the user space; there is no kernel or virtual hardware.
Sandboxing method known as containerization = Application virtualization.
8. Overview:
Docker is based on a client-server architecture. The client {user commands} talks to the Docker
daemon.
Daemon: runs on a host machine.
Client: accepts commands from the user and communicates back and forth with a Docker
daemon through an API.
3 components involved: build..ship..run
Images: read-only templates, the build component of Docker.
Registries: hold images, the distribution component of Docker.
Containers: hold everything that is needed for an application to run, the run component of
Docker.
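To make the client/daemon split concrete, here is a minimal sketch of a client that asks the daemon whether it is alive. It assumes the daemon listens on the Unix socket /var/run/docker.sock (a common default, but an assumption here) and uses the Engine API's /_ping endpoint. The names UnixHTTPConnection and ping_daemon are hypothetical helpers written for this illustration, not part of any Docker library.

```python
import http.client
import socket

DOCKER_SOCKET = "/var/run/docker.sock"  # assumed default path; adjust for your host


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that travels over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=2.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        # Replace the usual TCP connect with a connect to the daemon's socket file.
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)


def ping_daemon(socket_path=DOCKER_SOCKET):
    """Return the daemon's reply to GET /_ping, or None if it cannot be reached."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        return conn.getresponse().read().decode()
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    print("daemon reply:", ping_daemon())
```

On a host without a running daemon (or without permission to open the socket) the function simply returns None.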
10. Anatomy of the building blocks:
Apartment complex analogy:
1. Each apartment will require water and electricity and these resources should be distributed
fairly {resources}.
2. The apartments are isolated with walls to keep people separate from their respective neighbors
{isolation}.
3. Each apartment also has a door, lock, and keys {security}.
4. Finally, most apartment complexes benefit from a manager who works to ensure a consistent
and clean steady state of operations {management}.
By analogy to system resources required for a container, the kernel should implement 4
elements:
- Resource Management.
- Process Isolation.
- Security.
- Tooling (CLI).
11. Resource management is provided by control groups (cgroups).
Process isolation is provided by kernel namespaces.
Security is provided by a policy manager such as SELinux.
Overall management by Docker CLI.
12. Namespace:
Wraps a global system resource in an abstraction.
Changes are visible only inside the namespace.
Kernel namespaces allow the new process to have its own hostname, IP address and a whole
network stack, filesystem, PID, IPC stack, and even user mapping.
This makes the container look like a VM.
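On Linux, the namespaces a process belongs to can be inspected as symbolic links under /proc/<pid>/ns. The sketch below lists them for the current process; two processes that show the same link target for a given type share that namespace. The function name list_namespaces is a hypothetical helper, and on a system without /proc it returns an empty dict.

```python
import os


def list_namespaces(pid="self"):
    """Map each namespace type (pid, net, uts, ...) to its kernel identifier."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # no /proc, or no such process
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for ns_type, ident in list_namespaces().items():
        print(ns_type, "->", ident)
```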
Kernel space:
Strictly reserved for running a privileged operating system kernel, kernel extensions, and most
device drivers. Since kernel 2.2, the gate to this land is guarded by the CAP_SYS_ADMIN
capability [before that, by the superuser, or root, ID 0].
User space [userland]:
The memory area where application software and some drivers execute.
14. Playing with Syscalls:
clone:
Creates a new process, in a manner similar to fork, and creates a new namespace for every
CLONE_NEW* flag passed.
Unlike fork, the child process is allowed to share parts of its execution context with the calling
process (the memory space, the table of file descriptors, the table of signal handlers…).
setns:
Allows the calling process to join an existing namespace.
unshare:
Moves the calling process to a new namespace. In other words, it disassociates parts of its
execution context that are currently being shared with other processes (or threads).
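As a rough illustration of unshare, the sketch below forks a child, asks the kernel to move that child into new user and UTS namespaces, and changes the hostname there. The parent's hostname should stay untouched either way. This is a sketch under assumptions: it calls the libc unshare wrapper through ctypes, the flag values are the ones from the Linux headers, and many hosts or sandboxes forbid unprivileged namespace creation, in which case the child only reports that it was denied. The function name try_unshare_uts and the hostname "sandbox-demo" are hypothetical.

```python
import ctypes
import os
import socket

# Flag values as defined in the Linux kernel headers (sched.h).
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


def try_unshare_uts(new_name="sandbox-demo"):
    """Fork a child that tries to rename the host inside its own UTS namespace."""
    before = socket.gethostname()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(read_fd)
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0:
                raise OSError(ctypes.get_errno(), "unshare failed")
            socket.sethostname(new_name)  # only affects the new namespace
            message = socket.gethostname()
        except Exception as exc:
            message = "denied: " + type(exc).__name__
        try:
            os.write(write_fd, message.encode())
        finally:
            os._exit(0)  # never fall through into the parent's code
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_view = pipe.read()
    os.waitpid(pid, 0)
    return {
        "parent_before": before,
        "child_view": child_view,
        "parent_after": socket.gethostname(),
    }


if __name__ == "__main__":
    print(try_unshare_uts())
```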
15. Namespace   Date   Kernel version
mount       2002   2.4.19
uts         2006   2.6.19
ipc         2006   2.6.19
pid         2008   2.6.24
net         2009   2.6.29
user        2013   3.8
16. MNT namespace:
Isolates the set of filesystem mount points.
Means that processes in different mount namespaces can have different views of the filesystem
hierarchy.
The container “thinks” that a directory which is actually mounted from the host OS is exclusive to
the container.
Interaction with this namespace goes through the mount/umount syscalls.
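A process can see its own view of the mount table in /proc/self/mounts; processes in different mount namespaces may read different contents from that same path. The sketch below parses the file into (mount point, filesystem type) pairs. The function name mount_points is a hypothetical helper, and it returns an empty list where the file does not exist.

```python
def mount_points(path="/proc/self/mounts"):
    """Return (mount point, filesystem type) pairs visible to this process."""
    points = []
    try:
        with open(path) as mounts:
            for line in mounts:
                # Fields: device, mount point, fs type, options, dump, pass
                fields = line.split()
                if len(fields) >= 3:
                    points.append((fields[1], fields[2]))
    except OSError:
        pass  # not Linux, or /proc is not mounted
    return points


if __name__ == "__main__":
    for where, fstype in mount_points():
        print(f"{fstype:12} {where}")
```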
All about Isolation…
17. PID namespace:
Isolates the process ID number space = processes in different "PID namespaces" can have the same
PID.
The container thinks it has a separate standalone instance of the OS.
Technically, the first process created in a new namespace becomes the famous PID 1 "init".
Inside the namespace, fork/clone syscalls produce processes with PIDs that are unique within that
namespace.
This mechanism allows containers to provide functionality such as:
suspending/resuming the set of processes.
PID consistency on migration.
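On reasonably recent kernels, the NSpid line of /proc/<pid>/status shows the ID a process has in each PID namespace it is a member of, from the outer namespace to the innermost one. The sketch below reads that line. The function name pid_in_each_namespace is hypothetical, and the function returns an empty list when the field or the file is missing.

```python
def pid_in_each_namespace(status_path="/proc/self/status"):
    """Return the process ID as seen from each nested PID namespace."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("NSpid:"):
                    # One number per namespace level, outermost first.
                    return [int(value) for value in line.split()[1:]]
    except OSError:
        pass
    return []


if __name__ == "__main__":
    print(pid_in_each_namespace())
```

A process inside a container would typically show two numbers here: its PID on the host, then its PID inside the container.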
18. NETNS namespace:
Logically another copy of the network stack, with its own routes, firewall rules, and network
devices.
It means each network namespace has its own network devices, IP addresses, IP routing tables,
/proc/net directory, port numbers...
It allows a container to have its own IP address, independent of that of the host.
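Because network devices belong to a network namespace, listing the interfaces from inside a container usually gives a different answer than on the host. A minimal sketch using Python's standard socket.if_nameindex(); the wrapper name visible_interfaces is hypothetical.

```python
import socket


def visible_interfaces():
    """Return the names of the network devices in the current network namespace."""
    try:
        return [name for _index, name in socket.if_nameindex()]
    except OSError:
        return []  # the platform or sandbox does not expose interface data


if __name__ == "__main__":
    print(visible_interfaces())
```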
19. UTS namespace [UNIX Time Sharing]:
Historically the term "UTS" derives from the name of the structure passed to the uname() system
call: struct utsname.
{Time sharing was initially invented to allow a large number of users to interact concurrently
with a single computer, sharing one computing resource among many users by means of
multiprogramming and multi-tasking}.
This mechanism isolates two system identifiers: nodename and domainname.
It allows each container to have its own separate identity {hostname and NIS domain name}.
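The two identifiers can be read back through the uname() call itself. The sketch below declares struct utsname with ctypes, assuming the Linux layout of six 65-byte character arrays (other systems use different sizes, so the function returns None off Linux). The names UtsName and read_utsname are hypothetical.

```python
import ctypes
import sys


class UtsName(ctypes.Structure):
    """struct utsname as laid out on Linux: six NUL-terminated 65-byte fields."""

    _fields_ = [
        (field, ctypes.c_char * 65)
        for field in ("sysname", "nodename", "release",
                      "version", "machine", "domainname")
    ]


def read_utsname():
    """Return the utsname fields as a dict of strings, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None  # the layout above is only assumed for Linux
    libc = ctypes.CDLL(None, use_errno=True)
    buf = UtsName()
    if libc.uname(ctypes.byref(buf)) != 0:
        return None
    return {
        field: getattr(buf, field).decode(errors="replace")
        for field, _ctype in UtsName._fields_
    }


if __name__ == "__main__":
    info = read_utsname()
    if info:
        print("nodename:", info["nodename"], "| domainname:", info["domainname"])
```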
20. IPC namespace:
IPC (POSIX/SysV IPC) namespace provides isolation/separation of IPC resources:
Named shared memory segments.
Semaphores.
Message queues.
Why is this needed?
Shared memory segments are used to accelerate inter-process communication at memory speed,
rather than through pipes or through the network stack. They are commonly used by databases and
custom-built high-performance applications for scientific computing and the financial services
industry. If these types of applications are broken into multiple containers, the containers
might need to share their IPC mechanisms.
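The System V objects visible in the current IPC namespace are listed under /proc/sysvipc on Linux, one file per kind with a header line followed by one line per object. The sketch below counts them; a freshly created IPC namespace would normally start out with no objects. The function name count_sysv_ipc_objects is hypothetical.

```python
def count_sysv_ipc_objects(base="/proc/sysvipc"):
    """Count shared memory segments, semaphore sets and message queues in view."""
    counts = {}
    for kind in ("shm", "sem", "msg"):
        try:
            with open(f"{base}/{kind}") as listing:
                # The first line is a column header, every other line is one object.
                counts[kind] = max(sum(1 for _ in listing) - 1, 0)
        except OSError:
            counts[kind] = None  # file not available on this system
    return counts


if __name__ == "__main__":
    print(count_sysv_ipc_objects())
```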
21. User namespace:
The last namespace to be implemented, merged into the mainline kernel in 3.8
but still in technical preview in almost all Linux distributions.
A process's user and group IDs can be different inside and outside a user namespace. That means
a process can have a normal unprivileged user ID outside a user namespace while at the same
time having a user ID of 0 inside the namespace. In terms of isolation, this makes the user and
group ID number spaces totally separate.
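The mapping between inside and outside IDs is exposed in /proc/<pid>/uid_map, where each line reads "first ID inside, first ID outside, range length". The sketch below parses that file and translates an inside UID to the outside one. Both function names are hypothetical, and the example mapping (UID 0 inside corresponding to UID 1000 outside) is an illustrative value, not taken from the slides.

```python
def read_uid_map(path="/proc/self/uid_map"):
    """Parse a uid_map file into a list of {inside, outside, length} ranges."""
    mappings = []
    try:
        with open(path) as uid_map:
            for line in uid_map:
                fields = line.split()
                if len(fields) == 3:
                    inside, outside, length = map(int, fields)
                    mappings.append(
                        {"inside": inside, "outside": outside, "length": length}
                    )
    except OSError:
        pass
    return mappings


def to_outside_uid(inside_uid, mappings):
    """Translate a UID seen inside the namespace to the UID outside, if mapped."""
    for entry in mappings:
        offset = inside_uid - entry["inside"]
        if 0 <= offset < entry["length"]:
            return entry["outside"] + offset
    return None  # this UID has no mapping


if __name__ == "__main__":
    example = [{"inside": 0, "outside": 1000, "length": 1}]  # illustrative only
    print("root inside maps to uid", to_outside_uid(0, example), "outside")
    print("current process map:", read_uid_map())
```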
22. cgroups:
Traditionally, all processes received a similar amount of system resources, and all tuning went
through the process niceness value.
A mechanism to organize processes hierarchically and distribute system resources — such as CPU
time, system memory, network bandwidth, or combinations of these resources — along
the hierarchy in a controlled and configurable manner.
Every process belongs to one and only one cgroup.
Initially, only the root cgroup exists to which all processes belong.
All processes are put in the cgroup that the parent process belongs to at the time.
Two parts of cgroups:
1. core: primarily responsible for hierarchically organizing processes.
2. controller: responsible for distributing or applying limits to a specific type of system resource.
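A process can check which cgroup it belongs to by reading /proc/self/cgroup, where each line has the form "hierarchy-ID:controller-list:path". The sketch below parses that format. The function names are hypothetical and the sample paths in the usage are invented for illustration.

```python
def parse_cgroup_membership(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) == 3:
            hierarchy_id, controllers, path = parts
            entries.append(
                (int(hierarchy_id),
                 controllers.split(",") if controllers else [],
                 path)
            )
    return entries


def current_cgroups(path="/proc/self/cgroup"):
    """Return the cgroup membership of the current process, or [] if unknown."""
    try:
        with open(path) as handle:
            return parse_cgroup_membership(handle.read())
    except OSError:
        return []


if __name__ == "__main__":
    sample = "4:cpu,cpuacct:/docker/abc\n0::/user.slice\n"  # invented sample
    print(parse_cgroup_membership(sample))
    print(current_cgroups())
```

An empty controller list with hierarchy ID 0 is how the unified (v2) hierarchy shows up in this file.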
23. blkio: sets limits on input/output access to and from block devices.
cpu: uses the CPU scheduler to give cgroup tasks access to the CPU.
cpuacct: creates automatic reports on CPU resources used by tasks in a cgroup.
cpuset: assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
devices: allows or denies access to devices for tasks in a cgroup.
freezer: suspends or resumes tasks in a cgroup.
memory: sets limits on memory used by tasks in a cgroup.
net_cls: tags network packets with a class identifier (classid) that allows the traffic controller to
identify packets originating from a particular cgroup task.
perf_event: enables monitoring cgroups with the perf tool.
hugetlb: allows the use of large virtual memory pages and enforces resource limits on these
pages.
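Which of these controllers a given kernel knows about can be read from /proc/cgroups, a small table with the columns subsys_name, hierarchy, num_cgroups and enabled. A minimal parsing sketch follows; the function names are hypothetical and the sample text in the usage is invented.

```python
def parse_controllers(text):
    """Map each controller name in /proc/cgroups content to its enabled flag."""
    controllers = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header line
        fields = line.split()
        if len(fields) == 4:
            name, _hierarchy, _num_cgroups, enabled = fields
            controllers[name] = enabled == "1"
    return controllers


def available_controllers(path="/proc/cgroups"):
    """Return the controllers known to the running kernel, or {} if unknown."""
    try:
        with open(path) as handle:
            return parse_controllers(handle.read())
    except OSError:
        return {}


if __name__ == "__main__":
    sample = "#subsys_name\thierarchy\tnum_cgroups\tenabled\nmemory\t0\t90\t1\n"
    print(parse_controllers(sample))
    print(available_controllers())
```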
24. Union filesystem:
A stackable unification file system, which merges the contents of several directories (branches),
while keeping their physical content separate.
Builds file systems that operate by creating layers, allowing files and directories of separate file
systems {branches} to be transparently overlaid, forming a single coherent file system.
It allows any combination of read-only and read-write branches, as well as insertion and deletion
of branches anywhere in the tree.
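The layering idea can be modeled in a few lines with Python's collections.ChainMap: lookups fall through the branches from top to bottom, while writes land only in the topmost, writable branch. This is a toy model of the concept under that simplification, not how any real union filesystem is implemented, and the paths and contents are invented. The helper name make_union is hypothetical.

```python
from collections import ChainMap


def make_union(*read_only_branches):
    """Stack a fresh writable branch on top of the given read-only branches."""
    writable = {}
    return ChainMap(writable, *read_only_branches), writable


if __name__ == "__main__":
    base = {"/etc/hostname": "base-image"}   # lowest branch
    app = {"/app/run.sh": "echo hello"}      # branch stacked above it
    view, top = make_union(app, base)

    print(view["/etc/hostname"])             # found in the lowest branch
    view["/etc/hostname"] = "my-container"   # the write goes to the top branch only
    print(view["/etc/hostname"], "| base still holds:", base["/etc/hostname"])
```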
25. AUFS [Another Union File System]:
Since V2 it stands for "advanced multi layered unification filesystem".
It was the first storage driver used with Docker. It was developed in 2006 as a complete rewrite of
the earlier UnionFS.
According to Docker:
AUFS is not included in the mainline (upstream) Linux kernel. It was rejected because of the
dense, unreadable, and uncommented code.
26. OverlayFS:
Merged in the Linux kernel in 2014, kernel version 3.18.
The natural successor to aufs.
Combines two filesystems - an 'upper' filesystem and a 'lower' filesystem.
When a name exists in both filesystems, the object in the 'upper' filesystem is visible while the
object in the 'lower' filesystem is either hidden or, in the case of directories, merged with the
'upper' object.
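The name-resolution rule above can be sketched with nested dictionaries, where a dict stands for a directory and a string for a file's content. This toy model covers only the merge rule (no whiteouts, no copy-up), and the function name overlay and the sample trees are invented for illustration.

```python
def overlay(lower, upper):
    """Merge two directory trees following the 'upper wins, directories merge' rule."""
    merged = dict(lower)
    for name, upper_obj in upper.items():
        lower_obj = lower.get(name)
        if isinstance(upper_obj, dict) and isinstance(lower_obj, dict):
            # Both sides are directories: merge their contents recursively.
            merged[name] = overlay(lower_obj, upper_obj)
        else:
            # Otherwise the upper object hides whatever the lower one was.
            merged[name] = upper_obj
    return merged


if __name__ == "__main__":
    lower = {"etc": {"hosts": "from lower", "motd": "welcome"}, "bin": "a file"}
    upper = {"etc": {"hosts": "from upper"}, "bin": {"sh": "shell"}}
    print(overlay(lower, upper))
```

In the usage, etc is a directory on both sides and gets merged, while bin is a file below and a directory above, so only the upper directory stays visible.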
27. DeviceMapper [storage backend ]:
Initially developed by Red Hat as an alternative to AUFS.
Based on snapshots.
Uses allocate-on-demand.
28. Container format:
Docker wraps all the previous components into an execution environment or driver called
{container format}.
Traditional container drivers: OpenVZ, systemd-nspawn, libvirt-lxc, libvirt-sandbox, qemu/kvm,
BSD Jails, Solaris Zones, and even good old chroot.
The new execution drivers: moving from libcontainer to runc & containerd.