Presentation on the Linux namespaces and system calls used to provide container isolation with Docker. Presented in March 2015 at http://www.meetup.com/Docker-Phoenix/ in Tempe, Arizona.
2. Eight Aspects of Isolation
PID – Process ID and capabilities
UTS – Host and domain name
MNT – File system access and structure
IPC – Communication over shared memory
NET – Network access and structure
USR – (New in PR) Map host users/uids to container users
chroot() – Set the root of a file system for a process
Cgroups – Resource protection
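On a running Linux system, the namespaces a process belongs to show up as symbolic links under /proc/self/ns. A minimal Python sketch, assuming a Linux host with /proc mounted (the function name is illustrative and not part of Docker):

```python
import os

def current_namespaces(ns_dir="/proc/self/ns"):
    """Return {namespace name: link target} for the calling process.

    Returns an empty dict where the directory is unavailable
    (for example on non-Linux hosts).
    """
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in names:
        try:
            # Each entry is a symlink whose target names the namespace
            # type and an inode number identifying the namespace.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return namespaces

if __name__ == "__main__":
    for name, target in current_namespaces().items():
        print(f"{name:10} {target}")
```

Two processes share a namespace when the corresponding links have the same target. chroot() and cgroups do not appear in this directory because they are not namespaces.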
3. PID Namespace
Process IDs
Every process has a numeric process ID (PID)
The PID Namespace lets you reuse PIDs
Each container has its own PID 1
Process capabilities
Process capabilities are specified at runtime (a kernel feature
separate from the PID namespace itself)
Isolation benefit: Process IDs leak all sorts of information, and
being able to reference processes outside of a container
opens several attack vectors. Capabilities let you grant a
process only the privileges it needs.
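One way to see PID reuse from the host side: kernels from Linux 4.1 onward list a process's PID in every nested PID namespace on the NSpid line of /proc/<pid>/status. A small sketch that parses that line; the sample text is made up for illustration and was not captured from a real system:

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first.

    Returns an empty list if the line is missing (older kernels).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []

# Illustrative sample: PID 4242 as seen by the reader, PID 1 inside
# the container's own PID namespace.
sample = "Name:\tps\nPid:\t4242\nNSpid:\t4242\t1\n"
print(parse_nspid(sample))  # [4242, 1]
```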
4. PID Namespace Example
$ docker run --rm --name bob busybox:latest ps
PID USER COMMAND
1 root ps
$ docker run --rm --name tom --cap-add NET_ADMIN \
    busybox:latest ps
PID USER COMMAND
1 root ps
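The ps output looks the same in both containers because capabilities are not shown there. They are exposed as hexadecimal bit masks (for example the CapEff line) in /proc/<pid>/status. A hedged sketch of checking one bit; the two masks below are illustrative values, not output recorded from these containers:

```python
CAP_NET_ADMIN = 12  # bit number defined in <linux/capability.h>

def has_capability(cap_mask_hex, cap_bit):
    """Return True if the given bit is set in a hex capability mask."""
    return bool((int(cap_mask_hex, 16) >> cap_bit) & 1)

# Illustrative masks: the second differs only in bit 12.
without_net_admin = "00000000a80425fb"
with_net_admin = "00000000a80435fb"

print(has_capability(without_net_admin, CAP_NET_ADMIN))  # False
print(has_capability(with_net_admin, CAP_NET_ADMIN))     # True
```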
5. UTS Namespace
Host name
Domain name
Processes in a specific UTS namespace will identify the
host they are running on with the same host and domain
name.
Isolation benefit: Combined with the NET namespace the
UTS namespace allows processes to identify their container
by name in addition to its virtual network address. Self-
identification breaks dependency on the host identification.
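From inside a process, the name supplied by the UTS namespace is what the usual host name calls return. A minimal sketch, assuming a Unix-like system (the function name is illustrative):

```python
import os
import socket

def uts_identity():
    """Return the host name as reported by uname() and gethostname()."""
    uts = os.uname()
    return {
        "nodename": uts.nodename,             # from the uname() system call
        "gethostname": socket.gethostname(),  # from the C library
    }

if __name__ == "__main__":
    for source, name in uts_identity().items():
        print(f"{source:12} {name}")
```

Run inside a container started with --hostname, both values would be expected to show the container's name rather than the host's.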
6. UTS Namespace Example
$ hostname
name.of.the.host
$ docker run --rm \
    --hostname something.specific.per.container \
    busybox:latest hostname
something.specific.per.container
7. chroot()
Linux system call
Sets the root of the file system for a process
Part of Unix since the 1970s
A core component of any container or jail strategy
Isolation benefit: If a process cannot reference part of a file
system, that process cannot use or modify the part which is
beyond its scope.
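The effect of chroot() on path lookup can be modelled in a few lines. The sketch below is a simplified illustration: resolve_in_jail is a hypothetical helper that ignores symbolic links, and enter_jail shows the real call, which needs root privileges and is not invoked here.

```python
import os
import posixpath

def resolve_in_jail(jail_root, path):
    """Model where `path` lands on the host once the root is jail_root.

    Every path is treated as starting at the jail's "/", and ".."
    cannot climb above it. Symbolic links are not modelled.
    """
    inside = posixpath.normpath(posixpath.join("/", path))
    return posixpath.normpath(jail_root + inside)

def enter_jail(jail_root):
    """Change the root of the calling process (requires root)."""
    os.chroot(jail_root)
    os.chdir("/")  # make sure the working directory is inside the new root

print(resolve_in_jail("/var/jail", "/etc/passwd"))       # /var/jail/etc/passwd
print(resolve_in_jail("/var/jail", "../../etc/passwd"))  # /var/jail/etc/passwd
```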
8. MNT Namespace
File system access and structure
Combined with chroot() to build and abstract the details of a
contained file system
Provides us features like Volumes
Isolation benefit: Build effectively full file systems in a
subtree. Augment simple chroot with bind mounted subtrees.
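Each MNT namespace has its own mount table, which a process can read from /proc/self/mounts. A minimal parsing sketch; the sample lines are illustrative and not taken from a real container:

```python
def parse_mounts(mounts_text):
    """Parse /proc/self/mounts style text into a list of dicts."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        source, target, fstype, options = fields[:4]
        mounts.append({
            "source": source,
            "target": target,
            "fstype": fstype,
            "options": options.split(","),
        })
    return mounts

# Illustrative sample: a root file system plus one bind-mounted volume.
sample = (
    "overlay / overlay rw,relatime 0 0\n"
    "/dev/sda1 /data ext4 rw,relatime 0 0\n"
)
for mount in parse_mounts(sample):
    print(mount["target"], mount["fstype"])
```

On a Linux host, passing the contents of /proc/self/mounts from inside and outside a container to this function would be expected to give different lists.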
9. IPC Namespace
IPC Namespace
SysV shared memory blocks
POSIX message queues
POSIX Semaphores
Container Types
Closed – No access to shared memory pools outside of the
container.
Joined – Reuse a namespace created for another container.
Open – Full access to the shared memory pools on the host.
Isolation benefit: Protects processes from snooping on the
shared memory of processes in other containers.
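The SysV shared memory segments visible in the current IPC namespace are listed in /proc/sysvipc/shm, one row per segment under a header row. A hedged parsing sketch; the sample table is shortened and made up for illustration:

```python
def sysv_shm_segments(table_text):
    """Parse /proc/sysvipc/shm style text into one dict per segment."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Pair each row's values with the column names from the header row.
    return [dict(zip(header, row.split())) for row in lines[1:]]

# Illustrative sample with only the first few columns.
sample = (
    "key shmid perms size cpid lpid nattch\n"
    "0 32768 1600 524288 1201 1305 2\n"
)
for segment in sysv_shm_segments(sample):
    print(segment["shmid"], segment["size"], segment["nattch"])
```

In a closed container the table would be expected to list only segments created inside that container; a joined container would see the segments of the container it joined.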
10. IPC Namespace Example
$ docker run --rm --name bob myserver
$ docker run --rm --ipc container:bob myclient
11. NET Namespace
Logical network devices
IPv4 and IPv6 network stacks
IP routing tables
Firewalls
/proc/net
Port numbers (sockets)
Isolation benefit: Containers can be treated like hosts.
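Because /proc/net is per namespace, reading /proc/net/dev shows only the devices in the caller's NET namespace. A small sketch that extracts the interface names; the sample text is illustrative, with shortened counter columns:

```python
def interface_names(net_dev_text):
    """Return interface names from /proc/net/dev style text."""
    names = []
    # The first two lines are column headers.
    for line in net_dev_text.splitlines()[2:]:
        name, sep, _ = line.partition(":")
        if sep:
            names.append(name.strip())
    return names

# Illustrative sample: a namespace that contains only the loopback device.
sample = (
    "Inter-|   Receive  |  Transmit\n"
    " face |bytes packets|bytes packets\n"
    "    lo:    1234      10     1234      10\n"
)
print(interface_names(sample))  # ['lo']
```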
12. NET Namespace Example
$ docker run --net none --name roy --expose 8080 \
    busybox:latest \
    nc -l -p 8080
$ docker run --net container:roy busybox:latest \
    nc 127.0.0.1 8080
13. Cgroups - Resource Protection
Memory Limits
Hard byte limits. Limits are not checked against the memory
available on the host, so a limit may exceed it.
CPU Weight
The ratio of container weights determines the percentage of
CPU time made available for each container.
Processes may burst beyond that proportion if the CPU is otherwise
idle.
CPU Set Restrictions
Limit the process to executing on a specific set of CPUs.
Device Access
Mount devices in containers (think specialized hardware)
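The CPU weight arithmetic can be sketched as follows. The container names and weights are illustrative; with Docker the weight is set with --cpu-shares, and the resulting fractions only apply while the containers are competing for the CPU.

```python
def cpu_fractions(shares):
    """Map each container's weight to its fraction of contended CPU time."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {name: weight / total for name, weight in shares.items()}

# Illustrative weights: bob has twice the weight of tom.
for name, fraction in cpu_fractions({"bob": 1024, "tom": 512}).items():
    print(f"{name}: {fraction:.1%}")
```

With these weights bob is entitled to about two thirds of the CPU time and tom to about one third; if bob is idle, tom may use more than its share.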
15. USR Namespace
… Docker interface has yet to be implemented
But support has been built in LXC and libcontainer
Let’s talk about how it works now and how it could work using
the USR namespace to map users…
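At the kernel level, a user namespace mapping is a list of lines in /proc/<pid>/uid_map, each giving a starting ID inside the namespace, a starting ID outside it, and a range length. The sketch below translates a container UID to a host UID under such a mapping. It illustrates the kernel format only, not a Docker interface, and the sample range is an assumption.

```python
def to_host_uid(uid_map_text, container_uid):
    """Translate a UID inside a user namespace to the UID outside it.

    Returns None when the UID is not covered by any mapping line.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = (int(f) for f in fields)
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None

# Illustrative mapping: container UIDs 0..65535 map to host UIDs
# 100000..165535, so root in the container is unprivileged on the host.
sample = "0 100000 65536\n"
print(to_host_uid(sample, 0))      # 100000
print(to_host_uid(sample, 1000))   # 101000
print(to_host_uid(sample, 70000))  # None
```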
16. Bonus Round
Extreme Isolation Systems
SELinux – Labeling ALL THE THINGS!
AppArmor – Build an execution profile (file path based)
GRSEC Kernel
“Grsecurity is an extensive security enhancement to the Linux
kernel that defends against a wide range of security threats
through intelligent access control, memory corruption-based exploit
prevention, and a host of other system hardening that generally
require no configuration.”