Vaibhav Sharma
 This session is not about DevOps, CI/CD, or testing, but it covers what you must know to design state-of-the-art DevOps and SecDevOps solutions.
 There are no new concepts here; most date back to 2002, and in some cases to the 1970s.
 Presentation is designed in two parts
 Information for all
 Information for system programmers
 Examples are based on the RHEL 7 platform
 What is not covered
 In-depth discussion of storage-related topics such as copy-on-write.
 Topics and issues related to containers and systemd/AppArmor.
 Basics of OS-level virtualization.
 Products of interest.
 Features of OS-level virtualization.
 OS-level virtualization features in brief.
 Linux container building blocks.
 Samples
INTRODUCTION
 It is server-level virtualization and works at the OS layer.
 A single physical instance is virtualized into multiple isolated partitions.
 Common hardware and a common OS kernel host multiple isolated partitions.
 It cannot host a guest OS kernel different from the host OS kernel.
 OS-level virtualization requires orienting the host kernel and system services to support multiple isolated partitions.
 Hardware resources are limited on a per-process basis.
 OS Containers
 Application Container
OS Containers:
 Share the kernel of the host operating system but provide userspace isolation.
 System resources (RAM, processors, libraries, etc.) are shared among containers.
 System resources are controlled by quotas, created per policy, on the container controller or host system.
 Run multiple processes and services.
 No layered filesystem in the default configuration.
 Built on top of native process resource isolation.
 Examples: LXC, OpenVZ, Linux-VServer, BSD Jails, Solaris Zones, etc.
 Application containers are designed to run a single process/service.
 Built on top of OS containers.
(OS Container)
Host operating system → Container-1 → App1, App2, App3
(Application Container)
Host operating system → Container-1 (App1), Container-2 (App2), Container-3 (App3)
 Chroot
 Docker
 LXC
 Systemd-nspawn
 Singularity
 OpenVZ
 Solaris Containers/Zone
 AIX- WPAR
 Linux-VServer [Windows/Linux]
 Why limit hardware resources?
 CPU quotas
 Network isolation
 Memory limits
 IO Rate limit
 Disk quotas
 Partitioning
 Check pointing
 Live migration
 File system isolation
 Root privilege isolation
 https://nodramadevops.com/2019/10/the-importance-of-docker-container-resource-
limits/
 https://nodramadevops.com/2019/10/docker-cpu-resource-limits/
 The kernel needs help from userspace processes to understand which processes are important and have higher priority [NICE].
 CPU quotas limit the usage of a given process.
 Without CPU quotas, many container processes can starve and slow the system.
 Every OS provides certain controls to manage resource usage on a per-process basis.
 An administrator can designate container-specific CPUs/cores.
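As a concrete example of the NICE hint mentioned above, nice(1) adjusts a single process's scheduling priority (from -20, highest, to 19, lowest). A minimal sketch, assuming GNU coreutils, where `nice` with no command prints the niceness currently in force:

```shell
# Run a command at reduced priority; the inner `nice` (no arguments)
# prints the niceness the outer invocation put in force.
nice -n 10 nice
```

Container engines generalize this idea: cgroup CPU shares weight whole groups of processes rather than a single one.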
 Networking is based on isolation, not virtualization.
 Why
 To leverage existing infrastructure and scale up as and when required.
 To provide security through sandboxing.
 To make network resources transparent to the host.
 Obsolete/Old type
 Links and Ambassador
 Container Mapped Networking
 Modern Container networking
 None
 Bridge
 Host
 Overlay
 Underlays
 MACVLAN
 IPVLAN
 DIRECT ROUTING
 FAN Networking
 Point-to-Point
 Benefit
 OS support
 Memory limit
 A container is a process, and the operating system is bound to ensure the amount of memory it needs, provided the operating system has it available.
 A memory-intensive task can consume all of your system's memory.
 Limiting memory per process is part of the operating system's framework in general.
 A container solution can use the OS-provided framework to control memory on a per-process basis.
 Example: a container with a memory setting can use at most the value set as its memory limit in RAM.
 Not setting this limit may throw your container into an uninterruptible sleep state.
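The per-process framework the slides refer to predates cgroups: POSIX resource limits, set via setrlimit(2) and exposed in the shell as ulimit. Container engines use the cgroup memory controller instead, but the kernel-enforced-cap idea is the same. A minimal unprivileged sketch:

```shell
# Cap the subshell's virtual address space at 256 MiB (ulimit -v takes
# KiB), then print the limit now in force. Allocations beyond the cap
# will fail inside the subshell; the parent shell is unaffected.
(
  ulimit -v 262144
  ulimit -v
)
```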
 I/O rate limit
 The same OS framework that controls memory limiting also does I/O rate limiting.
 All containers share the same CPU system time.
 This setting is needed so that containers run in parallel instead of getting preempted all the time.
 Defining the CPU share is the key.
 Disk quotas
 An admin may need to give multiple users/services access to a container,
 and a user/service should not be able to consume all the disk space.
 In general, three parameters determine how much disk space and how many inodes a container can use:
 Disk space
 Disk inode
 Quota time
 Partitioning
 By definition, partitioning is running multiple OSes on a single physical system while sharing hardware resources.
 Approaches
 Hosted Architecture
 Hypervisor(Bare Metal Architecture)
 Application level partitioning
 Check Pointing
 A running container makes changes to the filesystem that remain intact across container engine starts/stops.
 In-memory data can be lost in such container engine start/stop events.
 If the container or host system crashes, the container instance and its data may be left inconsistent on the filesystem.
 A robust container solution must allow freezing a running container and creating a checkpoint as a collection of files.
 Linux provides the CRIU mechanism for Checkpoint/Restore in Userspace.
 [https://criu.org/Main_Page]
 Live migration
 The process of moving a live container from one physical server to another, or to the cloud, without disconnecting clients.
 Two kinds of live migration:
1) pre-copy memory 2) post-copy memory (lazy migration)
 FileSystem Isolation
 How do we restrict a container to read/write within its own filesystem?
 chroot is the most basic form of filesystem isolation.
 Two types of isolators in general:
 Filesystem/posix
 Works on all POSIX-compliant systems.
 Shares the host filesystem.
 This isolator handles persistent volumes by creating symlinks in the container sandbox.
 These symlinks point to specific persistent volumes on the host filesystem.
 Example: Mesos
 Filesystem/linux
 The container gets its own mount.
 Uses Unix permissions to secure container sandboxes.
 Example: Docker, Mesos
 Root Privilege Isolation
 We can run any application as a container without caring about the underlying host OS or hardware, as long as the host OS/machine guarantees availability.
 But what if a user wants to test some kernel functionality?
 use virtual kernels
 Compile and execute kernel code in userspace
 Example
 Vkernel
 RUMP kernel
 Usermode linux
 Unikernel
LINUX CONTAINER
BUILDING BLOCKS
 Namespace
 Control groups
 Capabilities
 CRIU (Checkpoint-Restore in userspace)
 Storage
 SELINUX
 The Linux kernel allows developers to partition kernel resources in such a manner that distinct processes get distinct views of these kernel resources.
 This feature uses the same namespace for a set of resources and processes.
 Namespaces are the basic building blocks of Linux containers.
 There are different namespaces for different resources.
 USER isolates user and groups IDs
 MNT isolates mount points
 PID isolates process IDs
 Network isolates network devices, ports, stacks, etc.
 UTS isolates hostname and NIS domain name.
 IPC isolates system-V IPC and POSIX message queue
 TIME isolates boot and monotonic clocks
 CGROUP isolates cgroup directories
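The namespaces listed above can be inspected directly: every process exposes one symlink per namespace under /proc/<pid>/ns, and two processes share a namespace exactly when the links point to the same inode. For example:

```shell
# List this process's namespace memberships; each link resolves to
# something like "uts:[4026531838]", where the number is the inode
# that identifies the namespace.
ls /proc/self/ns
readlink /proc/self/ns/uts
```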
 Very often an application can start consuming system resources to the extent that users see hang-like situations while other processes starve for resources.
 This may lead to a system crash, or more seriously, to a crash of the whole ecosystem.
 Developers started addressing this problem in 2006, and the work was merged into the mainline Linux kernel in 2008 (v2.6.24) under the name CGROUPS.
 The main goal of CGROUPS was to provide a single interface to realize whole operating-system-level virtualization.
 CGROUP provides following functionalities:
 Resource Limiting
 Prioritization
 Accounting
 Control (like device node access control)
 Every process on Linux is a child of the common init process, so the Linux process model is a single hierarchy or tree.
 Except for init, every other process in Linux inherits the environment (e.g. PATH) and some other attributes, like open file descriptors, from its parent.
 Cgroups are somewhat similar to processes in that:
 They are hierarchical.
 Child cgroups inherit attributes from their parent cgroup.
 Caveat: many different cgroup hierarchies can coexist, while processes live in a single process tree.
 Multiple cgroup hierarchies allow processes to be part of many subsystems simultaneously.
 A subsystem is a kernel component that modifies the behavior of the processes in a cgroup.
 cpuset - assigns individual processor(s) and memory nodes to task(s) in a group;
 cpu - uses the scheduler to provide cgroup tasks access to the processor resources;
 cpuacct - generates reports about processor usage by a group;
 io - sets limit to read/write from/to block devices;
 memory - sets limit on memory usage by a task(s) from a group;
 devices - allows access to devices by a task(s) from a group;
 freezer - allows suspending/resuming task(s) from a group;
 net_cls - allows to mark network packets from task(s) from a group;
 net_prio - provides a way to dynamically set the priority of network traffic per network
interface for a group;
 perf_event - provides access to perf events for a group;
 hugetlb - activates support for huge pages for a group;
 pid - sets limit to number of processes in a group, to avoid fork bomb.
 Example:
#lscgroup
perf_event:/
cpuset:/
memory:/
net_cls,net_prio:/
cpu,cpuacct:/
freezer:/
hugetlb:/
devices:/
devices:/machine.slice
devices:/user.slice
devices:/system.slice
devices:/system.slice/ldt-wipx2dtests.mount
blkio:/
pids:/
pids:/machine.slice
pids:/user.slice
pids:/system.slice
pids:/system.slice/ldt-wipx2dtests.mount
[vasharma@vasharma ~]$ mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=1743648k,nr_inodes=435912,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuacct,cpu)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
configfs on /sys/kernel/config type configfs (rw,relatime)
• As a container feature designer, one cannot give root access to the host system to everyone.
• Capabilities allow the designer to segregate processes into privileged and unprivileged processes.
• A privileged process bypasses all kernel permission checks based on process credentials.
• List of important capabilities implemented in Linux:
• CAP_AUDIT_CONTROL
• CAP_AUDIT_READ
• CAP_AUDIT_WRITE
• CAP_CHOWN
• CAP_FOWNER
• CAP_IPC_LOCK
• CAP_IPC_OWNER
• CAP_KILL
• CAP_LINUX_IMMUTABLE
• CAP_MKNOD
• CAP_NET_ADMIN
• CAP_SETGID
• CAP_SETUID
• CAP_SYS_ADMIN
• CAP_SYS_BOOT
• CAP_SYS_CHROOT
 The CRIU feature allows stopping a process and saving its state to the filesystem.
 CRIU also allows restoring the saved state.
 This helps achieve load balancing when a container solution is deployed in a high-availability environment.
 There can be a PID collision while restoring the saved state of a process unless the process being restored had its own PID namespace.
 The container use case creates two problems when maintaining multiple containers at a time:
 Inefficient disk space utilization
 10 containers running on a native filesystem of 1 GB each will consume 10 GB of disk space, which is very inefficient utilization.
 Latency in creating new containers
 Container processes are all created as children of the container engine.
 Containers share a copy of the memory segments of the parent process.
 To create a container, the engine copies a container image; this should complete in a few seconds.
 So the footprint of the image should be small enough that its physical memory segments can be shared among containers.
 Union filesystems and similar solutions with copy-on-write support (OverlayFS, UnionMount, AUFS, etc.) are the basic building blocks of any Linux-based container solution.
 A union filesystem works on top of any filesystem native to the Linux environment.
 All major Linux distributions ship a security framework consisting of either AppArmor or SELinux.
 SELinux/AppArmor restrict the capabilities of a process running on the host operating system.
 Both SELinux and AppArmor provide security labels to secure container processes and files.
 Example of a container process secured with SELinux:
 system_u:system_r:container_t:s0:c940,c967
 system_u : user [user designated to run system services]
 system_r : role [this role is for all system processes except user processes]
 container_t : type [prebuilt SELinux type to run containers]
 Running a Docker container with AppArmor security on Ubuntu:
 docker run --rm -it --security-opt apparmor=unconfined debian:jessie bash -i
LITTLE BIT MORE DETAIL
From MAN page of CGROUP
The kernel's cgroup interface is provided through a pseudo-filesystem called
cgroupfs. Grouping is implemented in the core cgroup kernel code, while
resource tracking and limits are implemented in a set of per-resource-type
subsystems (memory, CPU, and so on).
 Two versions:
 CGROUP-v1 [Linux kernel 2.6.24 and later]
 CGROUP-v2 [Linux kernel 4.5 and later]
 The two versions are orthogonal.
 Currently, cgroups v2 implements only a subset of the controllers available in cgroups v1.
 The two systems are implemented so that both v1 controllers and v2 controllers can be mounted on the same system, but a controller cannot be employed simultaneously in both.
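Because the two versions can coexist but a controller can live in only one of them, it is worth checking which hierarchy a host actually mounts. A small sketch, assuming the conventional systemd mount point /sys/fs/cgroup:

```shell
# cgroup v2 mounts a single unified hierarchy; v1 mounts one hierarchy
# per controller. The marker file cgroup.controllers only exists on v2.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "this host uses the unified (v2) hierarchy"
else
  echo "this host uses legacy (v1) per-controller hierarchies:"
  ls /sys/fs/cgroup
fi
```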
 CGROUP-v1 uses named hierarchies.
 Multiple instances of such hierarchies can be mounted; each hierarchy must have a unique name.
The only purpose of such hierarchies is to track processes.
mount -t cgroup -o none,name=somename none /some/mount/point
 CGROUP-v2 uses a unified hierarchy.
 Cgroups v2 provides a single unified hierarchy against which all controllers are mounted.
 "Internal" processes are not permitted. With the exception of the root cgroup, processes may reside only in leaf nodes (cgroups that do not
themselves contain child cgroups). The details are somewhat more subtle than this, and are described below.
 Active controllers must be specified via the files cgroup.controllers and cgroup.subtree_control.
 The tasks file has been removed. In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.
 An improved mechanism for notification of empty cgroups is provided by the cgroup.events file.
mount -t cgroup2 none /mnt/cgroup2
 A cgroup v2 controller is available only if it is not currently in use via a mount against a cgroup v1 hierarchy.
 Cgroups v2 controllers
 cpu, cpuset, freezer, hugetlb, io, memory, perf_event, pids, rdma
 There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to
allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.
 cgroup in the v2 hierarchy contains the following two files:
 cgroup.controllers : This read-only file exposes a list of the controllers that are available in this cgroup.
 cgroup.subtree_control : This is a list of controllers that are active (enabled) in the cgroup.
 Example : echo '+pids -memory' > x/y/cgroup.subtree_control
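A read-only look at these two files on a cgroup v2 host; actually writing '+pids'/'-memory' into cgroup.subtree_control, as in the echo example above, needs root, so this sketch only inspects:

```shell
# Compare what is available at the hierarchy root against what has been
# delegated to child cgroups; falls back cleanly on a v1-only host.
ROOT=/sys/fs/cgroup
if [ -f "$ROOT/cgroup.controllers" ]; then
  echo "available:            $(cat "$ROOT/cgroup.controllers")"
  echo "enabled for children: $(cat "$ROOT/cgroup.subtree_control")"
else
  echo "no cgroup v2 hierarchy mounted at $ROOT"
fi
```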
 “No Internal Process" rule of CGROUP-v2
 if cgroup /cg1/cg2 exists, then a process may reside in /cg1/cg2, but not in /cg1. This is to avoid an ambiguity in cgroups v1 with respect to the
delegation of resources between processes in /cg1 and its child cgroups.
 In /cg1/cg2 path cg2 directory is called leaf node.
 So above rule can be stated as
 “A (nonroot) cgroup can't both (1) have member processes, and (2) distribute resources into child cgroups—that is, have a nonempty
cgroup.subtree_control file.”
 The implementation of cgroups requires a few simple hooks into the rest of the kernel, none in performance-critical paths:
 In the boot phase (init/main.c) to perform various initializations.
 In process creation and destroy methods, fork() and exit().
 A new file system of type "cgroup" (VFS)
 Process descriptor additions (struct task_struct)
 Add procfs entries:
 For each process: /proc/pid/cgroup.
 System-wide: /proc/cgroups
 CGROUP code location:
 mm/memcontrol.c for memory
 kernel/cpuset.c for cpu set
 And as per functionality requirement in different directories of kernel source
 CGROUPs are not dependent on Namespaces.
 CGROUP is a very complex feature and comes with a very large number of rules when controlling resources for a container in a given environment. Most container solutions provide a wrapper around it.
 A single hierarchy can have one or more subsystems attached to it.
 Any single subsystem (e.g. cpuacct) cannot be attached to more than one hierarchy if one of those hierarchies already has a different subsystem attached.
 A process cannot be part of two different cgroups in the same hierarchy.
 A forked process inherits the same cgroups as its parent process.
 A child process created via fork(2) inherits its parent's cgroup memberships. A process's cgroup memberships are preserved across
execve(2).
 The clone3(2) CLONE_INTO_CGROUP flag can be used to create a child process that begins its life in a different version 2 cgroup from the parent process.
 CGROUP-v1/v2 related files
# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 3 1 1
cpu 9 1 1
cpuacct 9 1 1
memory 4 1 1
devices 11 92 1
freezer 7 1 1
net_cls 8 1 1
blkio 10 1 1
perf_event 5 1 1
hugetlb 6 1 1
pids 2 92 1
net_prio 8 1 1
# cat /proc/[pid]/cgroup
11:devices:/system.slice/gdm.service
10:blkio:/
9:cpuacct,cpu:/
/sys/kernel/cgroup/delegate : This file exports a list of the cgroups v2 files (one per line) that are delegatable.
/sys/kernel/cgroup/features : This file contains list of cgroups v2 features that are provided by the kernel.
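Any process can read its own cgroup membership from these files; for instance:

```shell
# Each line is hierarchy-ID:controller-list:cgroup-path. Under v1 an
# entry looks like "4:memory:/user.slice"; under cgroup v2 there is a
# single entry of the form "0::/some/path".
cat /proc/self/cgroup
```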
 Development library : libcgroup
 yum install libcgroup ( this will install cgconfig)
 yum install libcgroup-tools
 Set up the cgconfig service and restart it [edit /etc/cgconfig.conf]
mount {
controller_name = /sys/fs/cgroup/controller_name;
…
}
# systemctl restart cgconfig.service
 CGROUP uses VFS.
 CGROUP actions are filesystem operations, i.e. mount/unmount, create/delete directory, etc.
 Mounting CGROUP
# mkdir /sys/fs/cgroup/name
# mount -t cgroup -o controller_name none /sys/fs/cgroup/controller_name
 The mount command will attach the controller to the cgroup hierarchy.
 Verify that the cgroup is attached to the hierarchy correctly by listing all available hierarchies along with their current mount points using the lssubsys command
# lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls /sys/fs/cgroup/net_cls
blkio /sys/fs/cgroup/blkio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
net_prio /sys/fs/cgroup/net_prio
 Unmount hierarchy :
# umount /sys/fs/cgroup/controller_name
 Use cgcreate command
 cgcreate -t uid:gid -a uid:gid -g controllers:path
 -g — specifies the hierarchy in which the cgroup should be created, as a comma-separated list of the controllers associated with hierarchies.
 Alternatively, we can create a child cgroup directly using the mkdir command
 mkdir /sys/fs/cgroup/controller/name/child_name
 To delete cgroup :
 cgdelete controllers:path
 Modify /etc/cgconfig.conf to set parameter of a control group.
perm {
task {
uid = task_user;
gid = task_group;
}
admin {
uid = admin_name;
gid = admin_group;
}
}
 Alternatively we can use cgset command.
cgset -r parameter=value path_to_cgroup
 Now we can move a desired process to cgroup
# cgclassify -g controllers:path_to_cgroup pidlist
 Start a process in control group
# cgexec -g controllers:path_to_cgroup command arguments
 Displaying Parameters of Control Groups
cgget -r parameter list_of_cgroups
# cgget -g cpuset /
group name {
[permissions]
controller {
param_name =
param_value; … } …
}
$ cgget -g cpuset /
/:
cpuset.memory_pressure_enabled: 0
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 0
cpuset.sched_relax_domain_level: -1
 Things to discuss
 Namespace - Recap
 Linux processes and Namespace
 CGROUP namespace
 PID namespace
 USER namespace
 NET namespace
 MNT namespace
 UTS namespace
 IPC namespace
 TIME namespace
 A namespace wraps a global system resource in an abstraction that makes it
appear to the processes within the namespace that they have their own isolated
instance of the global resource. Changes to the global resource are visible to other
processes that are members of the namespace, but are invisible to other processes.
One use of namespaces is to implement containers.
Namespace  Flag             Man page               Isolates
Cgroup     CLONE_NEWCGROUP  cgroup_namespaces(7)   Cgroup root directory
IPC        CLONE_NEWIPC     ipc_namespaces(7)      System V IPC, POSIX message queues
Network    CLONE_NEWNET     network_namespaces(7)  Network devices, stacks, ports, etc.
Mount      CLONE_NEWNS      mount_namespaces(7)    Mount points
PID        CLONE_NEWPID     pid_namespaces(7)      Process IDs
Time       CLONE_NEWTIME    time_namespaces(7)     Boot and monotonic clocks
User       CLONE_NEWUSER    user_namespaces(7)     User and group IDs
UTS        CLONE_NEWUTS     uts_namespaces(7)      Hostname and NIS domain name
 The namespace API contains the following system calls:
 clone()
 setns()
 unshare()
 nsenter command
 clone() creates a new process.
 Unlike fork(2), it allows a child process to share parts of its execution context with the parent process:
 Memory space
 File descriptor table
 Signal handler table
 Important flags
 CLONE_FS : allows the child process to share the same filesystem information
 CLONE_IO : allows the child process to share I/O context with the parent
 CLONE_PARENT : if set, the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process; otherwise the child's parent is the calling process.
 CLONE_NEWIPC : Create the process in a new IPC namespace.
 CLONE_NEWNET : create the process in a new network namespace.
 CLONE_NEWNS : the cloned child is started in a new mount namespace, initialized with a copy of the
namespace of the parent
 CLONE_NEWPID: create the process in a new PID namespace.
 CLONE_NEWUSER: create the process in a new user namespace.
 CLONE_NEWUTS: create the process in a new UTS namespace, whose identifiers are initialized by
duplicating the identifiers from the UTS namespace of the calling process.
 This system call reassociates a thread with a namespace.
 Signature : int setns(int fd, int nstype);
 nstype argument specifies which type of namespace the calling thread may be
reassociated with.
 0: Allow any type of namespace to be joined
 CLONE_NEWIPC: fd must refer to an IPC namespace.
 CLONE_NEWNET: fd must refer to a network namespace.
 CLONE_NEWUTS: fd must refer to a UTS namespace.
 unshare() enables a process to disassociate parts of its execution context that are currently being shared with other processes.
 int unshare(int flags); // defined in sched.h
 The CLONE_FS flag reverses the effect of the clone(2) CLONE_FS flag: it unshares filesystem attributes, so that the calling process no longer shares its root directory.
 The following flags unshare the given namespace, so that the calling process gets a private copy of the given namespace which is not shared with any other process:
 CLONE_NEWIPC
 CLONE_NEWNET
 CLONE_NEWNS
 CLONE_NEWUTS
 NOTE: If flags is specified as zero, then unshare() is a no-op; no changes are made
to the calling process's execution context.
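The same flags are exposed on the command line by unshare(1) from util-linux. A guarded sketch: with -r the command first creates a user namespace and maps root into it, which on many kernels lets an unprivileged user take a private UTS namespace; the fallback covers hosts where this is disabled. "demo-host" is an arbitrary example name:

```shell
# Change the hostname inside a private UTS namespace; the host's own
# hostname is untouched.
unshare -r -u sh -c 'hostname demo-host; hostname' 2>/dev/null \
  || echo "unprivileged user namespaces are disabled here"
```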
struct task_struct {
[...]
/* process credentials */
const struct cred __rcu *cred; /* effective (overridable) subjective task *
credentials (COW) */
[...]
/* namespaces */
struct nsproxy *nsproxy;
struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
};
struct cred {
[...]
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
[...]
struct user_namespace {
[...]
struct user_namespace *parent;
struct ns_common ns;
[...]
};
 clone() -> do_fork() -> copy_process() -> copy_namespaces()
 If no namespace flags are present in the do_fork() call, the child simply uses the parent's namespaces; otherwise a new nsproxy struct is created and all namespaces are copied.
 The child process is responsible for changing any namespace data.
 The unshare() system call allows a process to disassociate parts of its execution context that are being shared with other processes.
 When a process ends, all namespaces it belongs to that have no other process attached are cleaned up.
 nsenter stands for "namespace enter".
 The nsenter command allows entering a specified namespace.
 Use the nsenter command to demystify containers and understand their internals.
 [vasharma@vasharma ~]$ lsns
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 2 9943 vasharma -bash
4026531837 user 2 9943 vasharma -bash
4026531838 uts 2 9943 vasharma -bash
4026531839 ipc 2 9943 vasharma -bash
4026531840 mnt 2 9943 vasharma -bash
4026531956 net 2 9943 vasharma -bash
 To check list of namespace associated with a given process
 lsns –p <pid of a container process>
 Example 1: check the IP address and routing table in a network namespace
 nsenter -t <pid of a container process> -n ip a s
 nsenter -t <pid of a container process> -n ip route
 Example 2: check the hostname through the UTS namespace
 nsenter -t <pid of a container process> -u hostname
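lsns and nsenter both work off the /proc/<pid>/ns links shown earlier; comparing inode numbers is how you tell whether two processes share a namespace:

```shell
# A shell and its child subshell share every namespace, so the two
# links resolve to the same net:[inode] value.
mine=$(readlink /proc/self/ns/net)
child=$(sh -c 'readlink /proc/self/ns/net')
echo "$mine"
[ "$mine" = "$child" ] && echo "same network namespace"
```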
 Processes running in different PID namespaces can have the same PID.
 The PID of the first process created in a new PID namespace is 1.
 PID 1 in a namespace behaves like the init process.
 getppid() on a newly created process with PID 1 will return 0.
 PID namespaces can be nested up to 32 levels.
 A process created in a user namespace can have different UIDs and GIDs.
 This allows mapping a UID in the container to a UID on the host.
 UID 0 in the container can be mapped to an unprivileged user on the host.
 User can check the current mapping in
 /proc/PID/uid_map
 /proc/PID/gid_map
 These files have 3 values
 ID-inside-ns ID-outside-ns length
 The writing process must have the CAP_SETUID (CAP_SETGID for gid_map)
capability in the user namespace of the process PID.
 The writing process must be in either the user namespace of the process PID or
inside the (immediate) parent user namespace of the process PID.
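Reading these files works for any process; in the initial user namespace the mapping is the identity over the full 32-bit range:

```shell
# Columns: ID-inside-ns ID-outside-ns length. On a plain host this
# prints "0 0 4294967295"; inside a user-namespaced container the
# first two columns differ, revealing the container-to-host mapping.
cat /proc/self/uid_map
cat /proc/self/gid_map
```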
 The mount namespace allows processes to have their own private mounts and root filesystem.
 A container can have its own /proc, /sys, and NFS mounts.
 A container can have a private /tmp mounted per service or per user.
 Each mount namespace has an owning user namespace.
 When creating a less privileged mount namespace, shared mounts are reduced to slave mounts.
 When a user creates a process within a new network namespace, it gets its own network stack, available privately to the newly created process.
 The process will see its own:
 Network interface
 Routing table rules
 Firewall rules
 Sockets
 To create a new network namespace
 ip netns add <new namespace name>
 Assign an interface to the network namespace
 Create a virtual ethernet adapter
 ip link add veth0 type veth peer name <virtual adapter name>
 Move this virtual network adapter to the newly created namespace
 ip link set <virtual adapter name> netns <network namespace name>
 List network interfaces in the given network namespace
 ip netns exec <network namespace name> ip link list
 Configure a network interface in the network namespace
 ip netns exec <network namespace name> <command to run in that namespace>
 Connecting network namespaces to the physical network
 ip link set dev <device> netns <network namespace name>
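The steps above can be sketched end to end. Creating namespaces and veth pairs needs root (CAP_NET_ADMIN), so this is guarded and falls back gracefully; the names demo-ns, veth0, and veth1 are made up for the example:

```shell
if ip netns add demo-ns 2>/dev/null; then
  ip link add veth0 type veth peer name veth1   # virtual ethernet pair
  ip link set veth1 netns demo-ns               # move one end inside
  ip netns exec demo-ns ip link list            # shows loopback + veth1
  ip netns del demo-ns                          # deleting the ns also destroys veth1
  ip link del veth0 2>/dev/null                 # clean up if the peer survived
else
  echo "need root (CAP_NET_ADMIN) and iproute2 to create network namespaces"
fi
```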
 IPC namespace allows us to isolate following IPC resources,
 System V IPC (man 7 sysvipc)
 POSIX message queues
 /proc interfaces are different for each IPC namespace:
 POSIX message queue interfaces in /proc/sys/fs/mqueue.
 System V IPC interfaces in /proc/sys/kernel: shmmni, shmmax, shmall, shm_rmid_forced, sem, msgmax, msgmnb, msgmni.
 UTS : Unix Time-Sharing
 The UTS namespace isolates the hostname and NIS domain name.
 System calls: uname()/sethostname()/gethostname()
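These are the calls behind everyday tools: uname -n and hostname(1) both read the UTS nodename field, while setting it (sethostname) needs privilege or a private UTS namespace as shown earlier:

```shell
uname -n                           # nodename from the UTS namespace
hostname 2>/dev/null || uname -n   # hostname(1) reads the same field
```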
 Namespaces in operation, part 1: namespaces overview
 Namespaces in operation, part 2: the namespaces API
 Namespaces in operation, part 3: PID namespaces
 Namespaces in operation, part 4: more on PID namespaces
 Namespaces in operation, part 5: User namespaces
 Namespaces in operation, part 6: more on user namespaces
 Namespaces in operation, part 7: Network namespaces
 Mount namespaces and shared subtrees
 Mount namespaces, mount propagation, and unbindable mounts
#?

Introduction to OS LEVEL Virtualization & Containers

  • 10. Products of interest:
 Chroot
 Docker
 LXC
 systemd-nspawn
 Singularity
 OpenVZ
 Solaris Containers/Zones
 AIX WPAR
 Linux-VServer [Windows/Linux]
  • 11. Why limit hardware resources? Key features of OS-level virtualization:
 CPU quotas
 Network isolation
 Memory limits
 I/O rate limits
 Disk quotas
 Partitioning
 Checkpointing
 Live migration
 Filesystem isolation
 Root privilege isolation
  • 13. CPU quotas
 The kernel needs help from userspace to know which processes are important and should get higher priority [NICE].
 Limits the CPU usage of a given process.
 Without CPU quotas, one container's processes can starve the others and slow the whole system.
 Every OS provides some control for managing per-process resource usage.
 An administrator can pin a container to specific CPUs/cores.
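The NICE mechanism named above can be tried directly from a shell; a minimal sketch (the value 10 is an arbitrary choice, and the cgroup path in the comment is a hypothetical example):

```shell
# 'nice' with no arguments prints the current nice value of the calling
# process; wrapping it in 'nice -n 10' raises the child's nice value
# (i.e. lowers its priority) by 10. Prints 10 when the parent is at nice 0.
nice -n 10 nice

# cgroups go further than nice: cpu.cfs_quota_us / cpu.cfs_period_us set a
# hard CPU-time ceiling rather than a relative priority (requires root):
#   echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us   # ~50% of one core
```

Nice values range from -20 (highest priority, root only) to 19 (lowest); only root may lower a nice value.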
  • 14. Network isolation
 Container networking is based on isolation, not virtualization.
 Why?
 To leverage existing infrastructure and scale up as and when required.
 To provide security through sandboxing.
 To keep network resources transparent to the host.
 Obsolete/old types: Links and Ambassadors, Container-Mapped Networking
 Modern container networking: None, Bridge, Host, Overlay, Underlays (MACVLAN, IPVLAN, Direct Routing, FAN Networking, Point-to-Point)
 Each type differs in its benefits and OS support.
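The isolation described above is visible from procfs: each network namespace has its own interface list, and /proc/net/dev shows only the devices of the caller's namespace. The `demo` namespace name in the comment is a hypothetical example:

```shell
# List the network interfaces visible in the current network namespace.
# In the default namespace this includes 'lo' plus the host's interfaces;
# a freshly created, empty namespace would show only 'lo'.
cat /proc/net/dev

# Creating and entering an isolated namespace requires privileges (as root):
#   ip netns add demo
#   ip netns exec demo cat /proc/net/dev    # only lo is listed
```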
  • 15. Memory limits
 A container is a process, and the operating system must ensure it gets the memory it needs, provided the OS has it available.
 A memory-intensive task can otherwise consume all of your system's memory.
 Limiting memory is part of the operating system's general framework; container solutions use this OS-provided framework to control memory on a per-process basis.
 Example: a container with a memory setting can use at most the configured limit of RAM.
 Not setting a limit may throw your container into an uninterruptible sleep state.
  I/O rate limits
 The same OS framework that limits memory also does I/O rate limiting.
 All containers share the same CPU time.
 These settings ensure containers run in parallel instead of being preempted all the time; defining the CPU share is the key.
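The per-process limit framework referred to above is setrlimit(2), exposed in the shell as ulimit; cgroup memory limits apply the same idea per group of processes. The 524288 KB value and the `mygroup` cgroup path are arbitrary illustrations:

```shell
# Cap the virtual address space of a child shell to 512 MiB and read the
# limit back; allocations beyond the cap fail with ENOMEM in that shell.
bash -c 'ulimit -v 524288; ulimit -v'    # prints: 524288

# The cgroup v1 equivalent (requires root) is a per-group, not per-process,
# cap enforced on all tasks in the group:
#   echo 512M > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
```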
  • 16. Disk quotas
 Needed when an admin gives multiple users/services access to a container and no single user/service should be able to consume all the disk space.
 In general, three parameters determine how much disk space and how many inodes a container can use:
 Disk space
 Disk inodes
 Quota time
  Partitioning
 By definition, partitioning is running multiple OSes on a single physical system while sharing the hardware resources.
 Approaches:
 Hosted architecture
 Hypervisor (bare-metal architecture)
 Application-level partitioning
  • 17. Checkpointing
 A running container makes changes to the filesystem, and those changes remain intact across container-engine starts/stops.
 In-memory data, however, can be lost in such start/stop events.
 If the container or the host system crashes, the container instance and its data may be left inconsistent on the filesystem.
 A robust container solution must therefore allow freezing a running container and creating a checkpoint as a collection of files.
 Linux provides CRIU, a mechanism for Checkpoint/Restore In Userspace. [https://criu.org/Main_Page]
  Live migration
 The process of moving a live container from one physical server to another, or to the cloud, without disconnecting clients.
 Two kinds of live migration: 1) pre-copy memory, 2) post-copy memory (lazy migration).
  • 18. Filesystem isolation
 How do we restrict a container to reading/writing only within its own filesystem?
 chroot is the most basic form of filesystem isolation.
 Two types of isolators in general:
 Filesystem/posix
 Works on all POSIX-compliant systems.
 Shares the host filesystem.
 This isolator handles persistent volumes by creating symlinks in the container sandbox; the symlinks point to specific persistent volumes on the host filesystem.
 Example: Mesos
 Filesystem/linux
 The container gets its own mounts.
 Uses Unix permissions to secure container sandboxes.
 Examples: Docker, Mesos
  Root privilege isolation
  • 19. Nice — we can run any application as a container without caring about the underlying host OS or even the hardware, as long as the host guarantees the availability of the OS.
 But what if a user wants to test some kernel functionality? Use virtual kernels:
 Compile and execute kernel code in userspace.
 Examples: vkernel, rump kernel, User-mode Linux, unikernels
  • 21. Linux container building blocks:
 Namespaces
 Control groups (cgroups)
 Capabilities
 CRIU (Checkpoint/Restore In Userspace)
 Storage
 SELinux
  • 22. Namespaces
 The Linux kernel allows developers to partition kernel resources in such a way that distinct processes get distinct views of those resources.
 A namespace wraps a set of resources for a set of processes.
 Namespaces are the basic building blocks of Linux containers, and there are different namespaces for different resources:
 USER isolates user and group IDs
 MNT isolates mount points
 PID isolates process IDs
 NET isolates network devices, ports, stacks, etc.
 UTS isolates hostname and NIS domain name
 IPC isolates System V IPC and POSIX message queues
 TIME isolates boot and monotonic clocks
 CGROUP isolates cgroup directories
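The namespaces a process belongs to can be inspected under /proc/<pid>/ns; each entry is a handle that can be compared between processes or entered with setns(2). The `container-1` hostname in the comment is a hypothetical example:

```shell
# One symlink per namespace type the kernel supports for this process,
# e.g.: cgroup ipc mnt net pid user uts (plus time on newer kernels).
ls /proc/self/ns

# unshare(1) creates new namespaces; this gives a shell its own UTS
# (hostname) namespace, so the rename is invisible to the host
# (requires root or unprivileged user-namespace support):
#   unshare --uts sh -c 'hostname container-1; hostname'
```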
  • 23. Control groups
 Quite often an application starts consuming system resources to the point where users see the system hang while other processes starve.
 This can lead to a crash of the system or, more seriously, of the whole ecosystem.
 Developers at Google addressed this problem starting in 2006, and the work was merged into the mainline Linux kernel in 2008 under the name cgroups.
 The main goal of cgroups was to provide a single interface for realizing whole operating-system-level virtualization.
 Cgroups provide the following functionalities:
 Resource limiting
 Prioritization
 Accounting
 Control (e.g. device node access control)
  • 24.
 Every process on Linux is a descendant of the common init process, so the Linux process model is a single hierarchy, or tree.
 Except for init, every process inherits the environment (e.g. PATH) and some other attributes, like open file descriptors, from its parent.
 Cgroups are somewhat similar to processes in that:
 They are hierarchical.
 Child cgroups inherit attributes from their parent cgroup.
 Caveat: any number of different cgroup hierarchies can coexist, while processes live in a single tree.
 Multiple cgroup hierarchies allow cgroups to be part of many subsystems simultaneously.
 A subsystem is a kernel component that modifies the behavior of the processes in a cgroup.
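A process's position in these hierarchies can be read directly from procfs — one line per mounted v1 hierarchy, or a single `0::` line on a v2 unified hierarchy:

```shell
# Show which cgroup(s) the current process belongs to.
# v1 line format: <hierarchy-id>:<controller-list>:<path>
# v2 line format: 0::<path>
cat /proc/self/cgroup
```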
  • 25. Cgroup subsystems (controllers):
 cpuset - assigns individual processor(s) and memory nodes to tasks in a group
 cpu - uses the scheduler to give cgroup tasks access to processor resources
 cpuacct - generates reports about processor usage by a group
 io - sets limits on reads/writes from/to block devices
 memory - sets limits on memory usage by tasks of a group
 devices - controls access to devices by tasks of a group
 freezer - allows suspending/resuming tasks of a group
 net_cls - allows marking network packets sent by tasks of a group
 net_prio - dynamically sets the priority of network traffic per network interface for a group
 perf_event - provides access to perf events for a group
 hugetlb - activates support for huge pages for a group
 pids - limits the number of processes in a group, to avoid fork bombs
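The controllers compiled into the running kernel, together with their v1 hierarchy IDs and cgroup counts, are listed in /proc/cgroups:

```shell
# Columns: subsys_name  hierarchy  num_cgroups  enabled
# The subsystem names correspond to the controllers listed above
# (cpuset, cpu, cpuacct, memory, devices, freezer, pids, ...).
cat /proc/cgroups
```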
  • 27. [vasharma@vasharma ~]$ mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=1743648k,nr_inodes=435912,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuacct,cpu)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
configfs on /sys/kernel/config type configfs (rw,relatime)
  • 28. Capabilities
• As a container feature designer, one does not want to give everyone root access to the host system.
• Capabilities allow the designer to distinguish privileged processes from unprivileged ones.
• A privileged process bypasses all kernel permission checks based on its process credentials.
• Some important capabilities implemented in Linux:
• CAP_AUDIT_CONTROL • CAP_AUDIT_READ • CAP_AUDIT_WRITE • CAP_CHOWN • CAP_FOWNER • CAP_IPC_LOCK • CAP_IPC_OWNER • CAP_KILL • CAP_LINUX_IMMUTABLE • CAP_MKNOD • CAP_NET_ADMIN • CAP_SETGID • CAP_SETUID • CAP_SYS_ADMIN • CAP_SYS_BOOT • CAP_SYS_CHROOT
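The capability sets of a process are exposed in /proc/<pid>/status as hex bitmasks over the CAP_* flags listed above; the capsh tool from libcap (if installed) can decode them. The mask in the comment is just an illustrative value:

```shell
# CapInh/CapPrm/CapEff/CapBnd (and CapAmb on newer kernels): the
# inheritable, permitted, effective, bounding and ambient capability
# sets of the current process, as hex bitmasks.
grep '^Cap' /proc/self/status

# With libcap installed, a mask can be turned into capability names:
#   capsh --decode=00000000a80425fb
```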
  • 29. CRIU
 The CRIU feature allows stopping a process and saving its state to the filesystem.
 CRIU also allows restoring that saved state.
 This helps achieve load balancing when the container solution is deployed in a high-availability environment.
 There can be a PID collision when restoring the saved state of a process, unless the process being restored had its own PID namespace.
  • 30. Storage
 The container use case creates two problems when maintaining multiple containers at a time:
 Inefficient disk space utilization
 10 containers running on a native filesystem of 1 GB each would consume 10 GB of disk — very inefficient utilization.
 Latency in creating new containers
 All container processes are created as children of the container engine, and containers share a copy of the parent process's memory segments.
 To create a container, the engine copies a container image, and that should complete in a few seconds.
 So the image footprint should be small enough that physical memory segments can be shared among containers.
 Union filesystems and similar solutions with copy-on-write support (OverlayFS, Union Mounts, AUFS, etc.) are basic building blocks of any Linux-based container solution.
 A union filesystem works on top of any filesystem native to the Linux environment.
  • 31.  All major Linux distributions have a security framework consisting of either AppArmor or SELinux.  SELinux/AppArmor restrict the capabilities of a process running on the host operating system.  Both SELinux and AppArmor provide security labels to secure container processes and files.  Example of a container process secured with SELinux:  system_u:system_r:container_t:s0:c940,c967  system_u : user [the user designated to run system services]  system_r : role [the role for all system processes except user processes]  container_t : type [a prebuilt SELinux type for running containers]  Running a Docker container with AppArmor security on Ubuntu:  docker run --rm -it --security-opt apparmor=unconfined debian:jessie bash -i
  • 32. A LITTLE MORE DETAIL
  • 33. From MAN page of CGROUP The kernel's cgroup interface is provided through a pseudo-filesystem called cgroupfs. Grouping is implemented in the core cgroup kernel code, while resource tracking and limits are implemented in a set of per-resource-type subsystems (memory, CPU, and so on).
  • 34.  Two versions:  cgroups v1 [Linux kernel 2.6.24 and later]  cgroups v2 [Linux kernel 4.5 and later]  The two versions can coexist on the same system.  Currently, cgroups v2 implements only a subset of the controllers available in cgroups v1.  The two systems are implemented so that both v1 controllers and v2 controllers can be mounted on the same system, but the same controller cannot be simultaneously employed in both.  Cgroups v1 supports named hierarchies.  Multiple instances of such hierarchies can be mounted; each hierarchy must have a unique name. The only purpose of such hierarchies is to track processes. mount -t cgroup -o none,name=somename none /some/mount/point
  • 35.  Cgroups v2 provides a unified hierarchy, against which all controllers are mounted.  "Internal" processes are not permitted: with the exception of the root cgroup, processes may reside only in leaf nodes (cgroups that do not themselves contain child cgroups). The details are somewhat more subtle than this, and are described below.  Available and active controllers are exposed via the files cgroup.controllers and cgroup.subtree_control.  The tasks file has been removed, as has the cgroup.clone_children file that was employed by the cpuset controller.  An improved mechanism for notification of empty cgroups is provided by the cgroup.events file. mount -t cgroup2 none /mnt/cgroup2  A cgroup v2 controller is available only if it is not currently in use via a mount against a cgroup v1 hierarchy.  Cgroups v2 controllers:  cpu, cpuset, freezer, hugetlb, io, memory, perf_event, pids, rdma  There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.  Each cgroup in the v2 hierarchy contains the following two files:  cgroup.controllers : this read-only file exposes the list of controllers that are available in this cgroup.  cgroup.subtree_control : the list of controllers that are active (enabled) in the cgroup.  Example: echo '+pids -memory' > x/y/cgroup.subtree_control  The "no internal processes" rule of cgroups v2:  if cgroup /cg1/cg2 exists, then a process may reside in /cg1/cg2, but not in /cg1. This avoids an ambiguity in cgroups v1 with respect to the delegation of resources between processes in /cg1 and its child cgroups.  In the path /cg1/cg2, the cg2 directory is called a leaf node.  So the rule can be stated as:  "A (nonroot) cgroup can't both (1) have member processes, and (2) distribute resources into child cgroups—that is, have a nonempty cgroup.subtree_control file."
  • 36.  The implementation of cgroups requires a few simple hooks into the rest of the kernel, none in performance-critical paths:  In the boot phase (init/main.c), to perform various initializations.  In the process creation and exit paths, fork() and exit().  A new filesystem of type "cgroup" (VFS).  Process descriptor additions (struct task_struct).  Added procfs entries:  For each process: /proc/pid/cgroup.  System-wide: /proc/cgroups.  Cgroup code locations:  mm/memcontrol.c for memory  kernel/cpuset.c for cpusets  and, per functionality, in other directories of the kernel source.  Cgroups do not depend on namespaces.  Cgroups are a very complex feature and come with a very large number of rules for controlling resources in a given environment for a container; most container solutions provide a wrapper around them.
  • 37.  A single hierarchy can have one or more subsystems attached to it.  Any single subsystem (e.g. cpuacct) cannot be attached to more than one hierarchy if one of those hierarchies already has a different subsystem attached.  A process cannot be a part of two different cgroups in the same hierarchy.  A forked process inherits the same cgroups as its parent process.
  • 38.  A child process created via fork(2) inherits its parent's cgroup memberships. A process's cgroup memberships are preserved across execve(2).  The clone3(2) CLONE_INTO_CGROUP flag can be used to create a child process that begins its life in a different version 2 cgroup from the parent process.  Cgroups v1/v2 related files: # cat /proc/cgroups #subsys_name hierarchy num_cgroups enabled cpuset 3 1 1 cpu 9 1 1 cpuacct 9 1 1 memory 4 1 1 devices 11 92 1 freezer 7 1 1 net_cls 8 1 1 blkio 10 1 1 perf_event 5 1 1 hugetlb 6 1 1 pids 2 92 1 net_prio 8 1 1 # cat /proc/[pid]/cgroup 11:devices:/system.slice/gdm.service 10:blkio:/ 9:cpuacct,cpu:/ /sys/kernel/cgroup/delegate : this file exports a list of the cgroups v2 files (one per line) that are delegatable. /sys/kernel/cgroup/features : this file contains the list of cgroups v2 features provided by the kernel.
  • 39.  Development library: libcgroup  yum install libcgroup (this will install cgconfig)  yum install libcgroup-tools  Set up the cgconfig service and restart it [edit /etc/cgconfig.conf]: mount { controller_name = /sys/fs/cgroup/controller_name; … } # systemctl restart cgconfig.service  Cgroups use the VFS.  Cgroup actions are filesystem operations, i.e. mount/unmount, create/delete directory, etc.  Mounting a cgroup: # mkdir /sys/fs/cgroup/name # mount -t cgroup -o controller_name none /sys/fs/cgroup/controller_name  The mount command will attach the controller to the cgroup hierarchy.  Verify whether the cgroup is attached to the hierarchy correctly by listing all available hierarchies along with their current mount points using the lssubsys command: # lssubsys -am cpuset /sys/fs/cgroup/cpuset cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct memory /sys/fs/cgroup/memory devices /sys/fs/cgroup/devices freezer /sys/fs/cgroup/freezer net_cls /sys/fs/cgroup/net_cls blkio /sys/fs/cgroup/blkio perf_event /sys/fs/cgroup/perf_event hugetlb /sys/fs/cgroup/hugetlb net_prio /sys/fs/cgroup/net_prio  Unmount a hierarchy: # umount /sys/fs/cgroup/controller_name
  • 40.  Use cgcreate command  cgcreate -t uid:gid -a uid:gid -g controllers:path  -g — specifies the hierarchy in which the cgroup should be created, as a comma-separated list of the controllers associated with hierarchies.  Alternatively we can create a child of cgroup directly using mkdir command  mkdir /sys/fs/cgroup/controller/name/child_name  To delete cgroup :  cgdelete controllers:path  Modify /etc/cgconfig.conf to set parameter of a control group. perm { task { uid = task_user; gid = task_group; } admin { uid = admin_name; gid = admin_group; } }  Alternatively we can use cgset command. cgset -r parameter=value path_to_cgroup  Now we can move a desired process to cgroup # cgclassify -g controllers:path_to_cgroup pidlist  Start a process in control group # cgexec -g controllers:path_to_cgroup command arguments  Displaying Parameters of Control Groups cgget -r parameter list_of_cgroups # cgget -g cpuset / group name { [permissions] controller { param_name = param_value; … } … } $ cgget -g cpuset / /: cpuset.memory_pressure_enabled: 0 cpuset.memory_spread_slab: 0 cpuset.memory_spread_page: 0 cpuset.memory_pressure: 0 cpuset.memory_migrate: 0 cpuset.sched_relax_domain_level: -1
  • 41.  Things to discuss  Namespace - Recap  Linux processes and Namespace  CGROUP namespace  PID namespace  USER namespace  NET namespace  MNT namespace  UTS namespace  IPC namespace  TIME namespace
  • 42.  A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.
Namespace | Flag | Man page | Isolates
Cgroup | CLONE_NEWCGROUP | cgroup_namespaces(7) | Cgroup root directory
IPC | CLONE_NEWIPC | ipc_namespaces(7) | System V IPC, POSIX message queues
Network | CLONE_NEWNET | network_namespaces(7) | Network devices, stacks, ports, etc.
Mount | CLONE_NEWNS | mount_namespaces(7) | Mount points
PID | CLONE_NEWPID | pid_namespaces(7) | Process IDs
Time | CLONE_NEWTIME | time_namespaces(7) | Boot and monotonic clocks
User | CLONE_NEWUSER | user_namespaces(7) | User and group IDs
UTS | CLONE_NEWUTS | uts_namespaces(7) | Hostname and NIS domain name
  • 43.  The namespace API consists of the following system calls:  clone()  setns()  unshare()  plus the nsenter command
  • 44.  clone() creates a new process.  Unlike fork(2), it allows a child process to share parts of its execution context with the parent process:  memory space  file descriptor table  signal handler table  Important flags:  CLONE_FS : allows the child process to share the same filesystem attributes.  CLONE_IO : allows the child process to share I/O context with the parent.  CLONE_PARENT : if set, the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process; otherwise the child's parent is the calling process.  CLONE_NEWIPC : create the process in a new IPC namespace.  CLONE_NEWNET : create the process in a new network namespace.  CLONE_NEWNS : the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent.  CLONE_NEWPID : create the process in a new PID namespace.  CLONE_NEWUSER : create the process in a new user namespace.  CLONE_NEWUTS : create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process.
  • 45.  This system call reassociates the calling thread with a namespace.  Signature: int setns(int fd, int nstype);  The nstype argument specifies which type of namespace the calling thread may be reassociated with:  0 : allow any type of namespace to be joined.  CLONE_NEWIPC : fd must refer to an IPC namespace.  CLONE_NEWNET : fd must refer to a network namespace.  CLONE_NEWUTS : fd must refer to a UTS namespace.
  • 46.  unshare() enables a process to disassociate parts of its execution context that are currently being shared with other processes.  int unshare(int flags); // declared in sched.h  The CLONE_FS flag reverses the effect of the clone(2) CLONE_FS flag: it unshares filesystem attributes, so that the calling process no longer shares its root directory with other processes.  The following flags unshare the given namespace, so that the calling process has a private copy of that namespace which is not shared with any other process:  CLONE_NEWIPC  CLONE_NEWNET  CLONE_NEWNS  CLONE_NEWUTS  NOTE: if flags is specified as zero, then unshare() is a no-op; no changes are made to the calling process's execution context.
  • 47.
struct task_struct {
    [...]
    /* process credentials */
    const struct cred __rcu *cred; /* effective (overridable) subjective task
                                    * credentials (COW) */
    [...]
    /* namespaces */
    struct nsproxy *nsproxy;
  • 48.
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
};
struct cred {
    [...]
    struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
    [...]
};
struct user_namespace {
    [...]
    struct user_namespace *parent;
    struct ns_common ns;
    [...]
};
  • 49.  clone() -> do_fork() -> copy_process() -> copy_namespaces()  If no namespace flags are present in the do_fork() call, the child simply uses the parent's namespaces; otherwise a new nsproxy struct is created and all namespaces are copied.  The child process is responsible for changing any namespace data.  The unshare() system call allows a process to disassociate parts of its execution context that are being shared with other processes.  When a process exits, any namespace it belonged to that has no other process attached is cleaned up.
  • 50.  nsenter stands for "namespace enter".  The nsenter command allows entering a specified namespace.  Use the nsenter command to demystify containers and understand their internals.
  • 51.  [vasharma@vasharma ~]$ lsns
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 2 9943 vasharma -bash
4026531837 user 2 9943 vasharma -bash
4026531838 uts 2 9943 vasharma -bash
4026531839 ipc 2 9943 vasharma -bash
4026531840 mnt 2 9943 vasharma -bash
4026531956 net 2 9943 vasharma -bash
 To check the list of namespaces associated with a given process:  lsns -p <pid of a container process>
  • 52.  Example 1: check the IP address and routing table in a network namespace  nsenter -t <pid of a container process> -n ip a s  nsenter -t <pid of a container process> -n ip route  Example 2: check the hostname through the UTS namespace  nsenter -t <pid of a container process> -u hostname
  • 53.  Processes running in different PID namespaces can have the same PID.  The first process created in a new PID namespace gets PID 1.  PID 1 in a namespace behaves like the init process.  getppid() in a newly created process with PID 1 returns 0.  PID namespaces can be nested up to 32 levels deep.
  • 54.  A process created in a user namespace can have different UIDs and GIDs.  It allows mapping a UID in the container to a UID on the host.  UID 0 in the container can be mapped to a non-privileged user on the host.  The current mapping can be checked in:  /proc/PID/uid_map  /proc/PID/gid_map  These files have 3 values per line:  ID-inside-ns ID-outside-ns length  The writing process must have the CAP_SETUID (CAP_SETGID for gid_map) capability in the user namespace of the process PID.  The writing process must be in either the user namespace of the process PID or the (immediate) parent user namespace of the process PID.
  • 55.  The mount namespace allows processes to have their own private mounts and root filesystem.  A container can have its own /proc, /sys, and NFS mounts.  A container can have a private /tmp mounted per service or per user.  Each mount namespace has an owning user namespace.  When creating a less privileged mount namespace, shared mounts are reduced to slave mounts.
  • 56.  When a user creates a process within a given network namespace, that process gets its own network stack, available privately to it.  The process will see its own:  network interfaces  routing table rules  firewall rules  sockets  To create a new network namespace:  ip netns add <namespace name>  To assign an interface to a network namespace:  Create a virtual ethernet pair:  ip link add veth0 type veth peer name <virtual adapter name>  Move the virtual network adapter into the newly created namespace:  ip link set <virtual adapter name> netns <network namespace name>  List the network interfaces in a given network namespace:  ip netns exec <network namespace name> ip link list  Run a command against a network namespace:  ip netns exec <network namespace name> <command to run against that namespace>  Connecting network namespaces to the physical network:  ip link set dev <device> netns <network namespace name>
  • 57.  The IPC namespace allows us to isolate the following IPC resources:  System V IPC (man 7 sysvipc)  POSIX message queues  The /proc interfaces are different for each IPC namespace:  the POSIX message queue interfaces in /proc/sys/fs/mqueue  the System V IPC interfaces in /proc/sys/kernel: shmmni, shmmax, shmall, shm_rmid_forced, sem, msgmax, msgmnb, msgmni
  • 58.  UTS : Unix Time-Sharing  The UTS namespace isolates the hostname and the NIS domain name.  System calls: uname()/sethostname()/gethostname()
  • 59.  Namespaces in operation, part 1: namespaces overview  Namespaces in operation, part 2: the namespaces API  Namespaces in operation, part 3: PID namespaces  Namespaces in operation, part 4: more on PID namespaces  Namespaces in operation, part 5: User namespaces  Namespaces in operation, part 6: more on user namespaces  Namespaces in operation, part 7: Network namespaces  Mount namespaces and shared subtrees  Mount namespaces, mount propagation, and unbindable mounts
  • 60. #?

Editor's Notes

  1. https://blog.risingstack.com/operating-system-containers-vs-application-containers/#:~:text=OS%20containers%20are%20virtual%20environments,of%20OS%20containers%20as%20VMs.&text=OS%20containers%20are%20useful%20when,or%20different%20flavors%20of%20distros.
  2. https://scoutapm.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
  3. Networking: https://thenewstack.io/container-networking-breakdown-explanation-analysis/
  4. https://dzone.com/articles/docker-container-resource-management-cpu-ram-and-I
  5. https://www.linode.com/community/questions/10445/quota-management-of-lxc-containers
  6. https://technology.amis.nl/2018/04/08/first-steps-with-docker-checkpoint-to-create-and-restore-snapshots-of-running-containers/#:~:text=First%20steps%20with%20Docker%20Checkpoint%20%E2%80%93%20to%20create,restore%20snapshots%20of%20running%20containers&text=Linux%20has%20a%20mechanism%20called,collection%20of%20files%20on%20disk.
  7. https://www.lanl.gov/projects/national-security-education-center/information-science-technology/_assets/docs/2015-si-docs/TeamVermillion-presentation.pdf
  8. https://www.linuxjournal.com/content/everything-you-need-know-about-linux-containers-part-i-linux-control-groups-and-process
  9. https://www.linuxjournal.com/article/5737 https://www.kernel.org/doc/ols/2008/ols2008v1-pages-163-172.pdf https://blog.pentesteracademy.com/linux-security-understanding-linux-capabilities-series-part-i-4034cf8a7f09
  10. https://blog.knoldus.com/unionfs-a-file-system-of-a-container/
  11. https://www.usenix.org/conference/usenixsecurity18/presentation/sun https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security https://cloud.google.com/container-optimized-os/docs/how-to/secure-apparmor docker run --rm -it --security-opt apparmor=unconfined debian:jessie bash -i [ --rm will remove the container once work is done] https://opensource.com/article/18/2/understanding-selinux-labels-container-runtimes
  12. https://events.static.linuxfound.org/sites/events/files/slides/cgroups_0.pdf
  13. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-relationships_between_subsystems_hierarchies_control_groups_and_tasks https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-implications_for_resource_management
  14. https://lwn.net/Articles/679786/ [Understanding the new control groups API] https://lwn.net/Articles/484251/ [Fixing control groups ]
  15. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/chap-using_control_groups
  16. https://www.redhat.com/sysadmin/container-namespaces-nsenter
  17. https://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/
  18. http://jancorg.github.io/blog/2015/01/05/linux-kernel-namespaces-pt-i/ Pathc of nsproxy : https://lwn.net/Articles/183046/