TRANSFORMING NETWORKING & STORAGE
About Myself:
I work for Intel on various projects, primarily kernel
networking.
My website: http://ramirose.wix.com/ramirosen
I am the author of a book titled “Linux Kernel Networking”, published by Apress
(648 pages, 2014).
Agenda:
Overview of the cgroup subsystem and the namespace subsystem
cgroups
The PIDs cgroup controller
cgroup v2
namespaces
Backup
• The namespace subsystem and the cgroup
subsystem are the basis of lightweight process
virtualization.
• They form the basis of Linux containers.
• They can also be used for setting up a testing environment, or as a resource
management/resource isolation setup, and for accounting.
• We will talk mainly about the kernel implementation with
some userspace usage examples.
lightweight process virtualization: a process which gives the user
the illusion of running its own Linux operating system. You can run
many such processes on a machine, and all such processes in
fact share the single Linux kernel which runs on the machine.
This is opposed to hypervisor solutions, like Xen or KVM, where you
run another instance of the kernel.
The idea is not really a new paradigm: we have had Solaris Zones and BSD
jails for several years already.
It seems that Hypervisor-based VMs like KVM are here to stay (at least
for the next several years). There is an ecosystem of cloud infrastructure
around solutions like KVMs.
Advantages of Hypervisor-based VMs (like KVM):
• You can create VMs of other operating systems (Windows, BSDs).
• Security.
• Though there have been cases of security vulnerabilities which were found and
required installing patches to handle them (like VENOM).
Containers – advantages:
• Lightweight: occupies significantly fewer resources (such as memory) than a
hypervisor.
• Density: you can run many more containers on a given host than KVM-based
VMs.
• Elasticity: startup and shutdown times are much shorter, almost
instantaneous. Creating a container has the overhead of creating a Linux
process, which can be on the order of milliseconds, while creating a VM based
on Xen/KVM can take seconds.
The lightness of containers is in fact what provides both their density and their
elasticity.
Containers versus Hypervisor-based VMs
The cgroup (control groups) subsystem is a resource management
and resource accounting/tracking solution, providing a generic
process-grouping framework.
● It handles resources such as memory, cpu, network, and more;
mostly needed at both ends of the spectrum (servers and embedded).
● Development was started by engineers at Google in 2006 under the
name "process containers”.
● Merged into kernel 2.6.24 (2008).
● cgroup core has 3 maintainers, and each cgroup controller has its
own maintainers.
cpu, memory and io are the most important resources that cgroup deals with. For
networking, it is mostly about allowing the network layer to match packets to cgroups;
the actual control still happens on the network side, through iptables and tc.
The cgroup subsystem: background
No new system call was needed in order to support cgroups.
Instead, a new virtual file system type, "cgroup" (also sometimes referred to as
cgroupfs), was added.
The implementation of the cgroup subsystem required a few, simple
hooks into the rest of the kernel, none in performance-critical paths:
– In boot phase (init/main.c) to perform various initializations.
– In process creation and termination methods, fork() and exit().
– Minimal additions to the process descriptor (struct task_struct)
– Add procfs entries:
● For each process: /proc/pid/cgroup.
● System-wide: /proc/cgroups
cgroup subsystem implementation
• The cgroups subsystem (v1 and v2) is composed of 2 ingredients:
– cgroup core.
• Mostly in kernel/cgroup.c (~6k lines of code).
• The same code serves both cgroup V1 and cgroup V2.
– cgroup controllers.
• Currently there are 12 cgroup v1 controllers and 3 cgroup
v2 controllers (memory, io, and pids); other v2
controllers are work in progress.
• For both v1 and v2, these controllers are represented by
cgroup_subsys objects.
cgroup subsystem implementation - contd
In order to use the cgroup filesystem (browse it, attach tasks to cgroups,
etc.) it must be mounted, as any other filesystem. The cgroup filesystem
can be mounted on any path in the filesystem. Systemd and various
container projects use /sys/fs/cgroup as a mounting point.
When mounting, we can specify with mount options (-o) which cgroup
controllers will be used (a comma-separated list of controllers, or all
controllers when no option is given):
Example: mounting the net_prio cgroup controller:
mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
Mounting cgroups
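A minimal sketch of the other mounting variants mentioned above, assuming the target directories already exist and none of these controllers is mounted elsewhere (the mount points are arbitrary choices, not mandated):
mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct   # a comma-separated list of controllers
mount -t cgroup none /sys/fs/cgroup/all                          # no -o: all available controllers in one hierarchy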
Name            Kernel object name              Source file
blkio io_cgrp_subsys block/blk-cgroup.c
cpuacct cpuacct_cgrp_subsys kernel/sched/cpuacct.c
cpu cpu_cgrp_subsys kernel/sched/core.c
cpuset cpuset_cgrp_subsys kernel/cpuset.c
devices devices_cgrp_subsys security/device_cgroup.c
freezer freezer_cgrp_subsys kernel/cgroup_freezer.c
hugetlb hugetlb_cgrp_subsys mm/hugetlb_cgroup.c
memory memory_cgrp_subsys mm/memcontrol.c
net_cls net_cls_cgrp_subsys net/core/netclassid_cgroup.c
net_prio net_prio_cgrp_subsys net/core/netprio_cgroup.c
perf_event perf_event_cgrp_subsys kernel/events/core.c
pids pids_cgrp_subsys kernel/cgroup_pids.c
Kernel 4.4 cgroup v1 – 12 controllers
tasks                                   release_agent
cgroup.clone_children                   notify_on_release
cgroup.event_control                    cgroup.sane_behavior
Note: the release_agent and sane_behavior entries appear only in the root cgroup.
cgroup.procs memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate memory.memsw.max_usage_in_bytes
memory.numa_stat memory.memsw.failcnt
memory.oom_control memory.memsw.limit_in_bytes
memory.kmem.limit_in_bytes memory.pressure_level
memory.kmem.max_usage_in_bytes memory.soft_limit_in_bytes
memory.kmem.slabinfo memory.stat
memory.kmem.tcp.failcnt memory.swappiness
memory.kmem.tcp.limit_in_bytes memory.usage_in_bytes
memory.kmem.tcp.max_usage_in_bytes memory.use_hierarchy
memory.kmem.tcp.usage_in_bytes memory.failcnt
memory.kmem.usage_in_bytes memory.force_empty
memory.limit_in_bytes memory.kmem.failcnt
Memory controller interface files
mkdir /sys/fs/cgroup/memory/group0
• The tasks entry that is created under group0 is empty (processes are
called tasks in cgroup terminology).
echo 0 > /sys/fs/cgroup/memory/group0/tasks
• The PID of the current bash shell is thereby moved from the memory
cgroup in which it resides into the group0 memory cgroup.
echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
You can disable the out of memory killer with memcg:
echo 1 > /sys/fs/cgroup/memory/group0/memory.oom_control
Example 1: memcg (memory control groups)
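A hedged end-to-end sketch of the sequence above, with a quick verification step (group0 and the 40M limit are just the values used in this example):
mkdir /sys/fs/cgroup/memory/group0
echo $$ > /sys/fs/cgroup/memory/group0/tasks                    # attach the current shell
echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/group0/memory.limit_in_bytes          # 41943040
cat /sys/fs/cgroup/memory/group0/memory.usage_in_bytes          # current charge of the group
cat /sys/fs/cgroup/memory/group0/memory.failcnt                 # how many times the limit was hit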
memcg (memory control groups) - contd
With containers, the same can be done from the host:
lxc-cgroup -n myfedora memory.limit_in_bytes 40M
cat /sys/fs/cgroup/memory/lxc/myfedora/memory.limit_in_bytes
41943040
Or for assigning the processors 0 and 1 to the container:
lxc-cgroup -n myfedora cpuset.cpus "0,1"
cghost:/$cat /sys/fs/cgroup/cpuset/lxc/myfedora/cpuset.cpus
0-1
docker run -i --cpuset=0,2 -t fedora /bin/bash
cat /sys/fs/cgroup/cpuset/system.slice/docker-64bit_ID.scope/cpuset.cpus
0,2
The release agent mechanism is invoked when the last process of a
cgroup terminates.
● The cgroup release_agent entry should be set to a path to an
executable/script to be invoked when the last process in a group
terminates.
● The cgroup notify_on_release entry should be set so that
release_agent will be invoked.
echo 1 > /sys/fs/cgroup/memory/notify_on_release
The release_agent can also be set via a mount option; systemd, for
example, uses this mechanism. For example, on Fedora 21, mount shows:
cgroup on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/
systemd-cgroups-agent,name=systemd)
Example 2: release_agent in memcg
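A minimal sketch of a release agent, assuming the memory hierarchy is mounted at /sys/fs/cgroup/memory; the script path and log file below are arbitrary choices. The kernel passes the path of the emptied cgroup, relative to the hierarchy root, as the first argument:
cat > /usr/local/bin/cgroup-release-agent.sh << 'EOF'
#!/bin/sh
# $1 is the path of the emptied cgroup, relative to the hierarchy root
echo "cgroup $1 is now empty" >> /tmp/cgroup-release.log
EOF
chmod +x /usr/local/bin/cgroup-release-agent.sh
echo /usr/local/bin/cgroup-release-agent.sh > /sys/fs/cgroup/memory/release_agent
echo 1 > /sys/fs/cgroup/memory/group0/notify_on_release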
Also referred to as: devcg (devices control group).
• It is more of an access controller than a resource controller.
● The devices cgroup enforces restrictions on reading, writing and
creating (mknod) device files.
● 3 control files: devices.allow, devices.deny, devices.list.
– devices.allow can be considered a device whitelist.
– devices.deny can be considered a device blacklist.
– devices.list shows the available devices.
● Each entry in these files consists of 4 fields:
– type: can be a (all), c (char device), or b (block device).
● a means all device types, and all major and minor numbers.
– Major number.
– Minor number.
– Access: composition of 'r' (read), 'w' (write) and 'm' (mknod).
Example 3: devices control group
/dev/null major number is 1 and minor number is 3 (see
Documentation/devices.txt)
mkdir /sys/fs/cgroup/devices/group0
By default, for a new group, you have full permissions:
cat /sys/fs/cgroup/devices/group0/devices.list
a *:* rwm
echo 'c 1:3 rmw' > /sys/fs/cgroup/devices/group0/devices.deny
This denies read, write and mknod access to the /dev/null device.
echo 0 > /sys/fs/cgroup/devices/group0/tasks # Attaches the
current shell to group0
echo "test" > /dev/null
bash: /dev/null: Operation not permitted
devices control group – example (contd)
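For completeness, a hedged sketch of the opposite direction, re-allowing the device through devices.allow (continuing with the same group0 as above):
echo 'c 1:3 rwm' > /sys/fs/cgroup/devices/group0/devices.allow   # remove the deny exception
echo "test" > /dev/null                                          # succeeds again
cat /sys/fs/cgroup/devices/group0/devices.list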
PIDs cgroup controller
• An anti-fork-bomb solution implemented as a new cgroup controller.
• Adds a cgroup controller that enables limiting the number of processes that
can be forked inside a cgroup.
• Essentially prlimit(2)/RLIMIT_NPROC, but applied to a cgroup rather than to a
process.
• The PID space is a limited resource: about 4 million PIDs system-wide.
• PID_MAX_LIMIT = 4,194,304 (0x400000); see include/linux/threads.h.
 With today's RAM capacities, all of them can be used up by a single container,
making the whole system unusable. The PIDs cgroup controller helps avoid
this by limiting the number of processes per cgroup.
• Developed by Aleksa Sarai and part of the kernel since v4.3.
 A simple and short module, only ~300 lines of code: kernel/cgroup_pids.c
PIDs cgroup controller - contd
The PIDs cgroup controller has two interface files:
• pids.max (read/write)
– The maximum number of processes for the cgroup.
– Does not exist in the root cgroup directory.
– For subgroups, the default value is “max”, i.e. no limit (about 4,000,000 in practice).
 pids.current (read only)
– The number of processes currently in the cgroup and its descendants,
including processes in child cgroups on which the PIDs controller is not enabled.
– Does not include zombies.
– Does not exist in the root cgroup directory.
– When the number of processes in a group reaches its pids.max, further
forks fail with: fork: Resource temporarily unavailable.
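A short sketch of the controller in action, assuming the v1 pids hierarchy is mounted at the usual /sys/fs/cgroup/pids:
mkdir /sys/fs/cgroup/pids/group0
echo 2 > /sys/fs/cgroup/pids/group0/pids.max
echo $$ > /sys/fs/cgroup/pids/group0/cgroup.procs   # the shell itself now counts as 1
cat /sys/fs/cgroup/pids/group0/pids.current         # 1
sleep 100 &                                         # ok, pids.current becomes 2
sleep 100 &                                         # fails: fork: Resource temporarily unavailable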
cgroup v2 - background
• cgroup v1 is the existing cgroup kernel implementation (also referred to as
legacy cgroup on LKML and other mailing lists).
• The “Unified Hierarchy” development features were available for about three
years behind a special mount option (__DEVEL__sane_behavior).
• Since kernel 4.4 (January 2016), cgroup v2 is an official part of the kernel,
with its own filesystem type.
• Systemd has had initial support for cgroup v2 since v226, September 2015.
• cgmanager has also added initial support for cgroup v2 (by Serge Hallyn).
• Why cgroup v2 ?
• There is a lot of chaos in cgroup v1:
– No consistency across cgroup controllers; for example:
– When creating a new cgroup, in several controllers the attribute values
are inherited from the parent (like net_prio and net_cls), while in several
others the attribute values are the defaults, regardless of the parent.
– For cpuset, changes in the parent are propagated to its descendants,
whereas with other controllers they are not (clone_children must be set for
this).
– In cgroup v1, some controllers have controller-specific interface files
in the root cgroup while others don't.
– cgroup v2 establishes strict and consistent interfaces.
– In cgroup v2, there is only one hierarchy, “the unified hierarchy”. Each
process can belong to only a single cgroup. cgroup-v2.txt in the kernel
Documentation describes the cgroup v1 inconsistencies in detail.
cgroup v2 - background
tree -L 1 /sys/fs/cgroup/ on Fedora 23 (cgroup v1, multiple
hierarchies):
├── blkio
├── cpu -> cpu,cpuacct
├── cpuacct -> cpu,cpuacct
├── cpu,cpuacct
├── cpuset
├── devices
├── freezer
├── hugetlb
├── memory
├── net_cls -> net_cls,net_prio
├── net_cls,net_prio
├── net_prio -> net_cls,net_prio
├── perf_event
└── systemd
Unlike cgroup v1, cgroup v2 has only a
single hierarchy and is strict about
hierarchical behavior.
• Enabling/disabling a controller is always done by the cgroup
parent rather than by the cgroup itself (subtree_control).
• Interface files – semantics:
• When a controller implements an absolute resource limit, the interface files
should be named "min" and "max" respectively (for example, pids.max
for the PIDs controller).
• When a controller implements a best-effort resource limit, the interface files
should be named "low" and "high" respectively.
• Used, for example, in the memory controller.
• A special token, "max", is used in these interface files (for both reads and
writes), representing no limit (infinity).
• Mounting cgroup v2 is done by:
– mount -t cgroup2 none $MOUNT_POINT
● The mount point can be anywhere in the filesystem.
cgroup v2 controllers
 As opposed to cgroups v1, there are no cgroup mount options in
cgroup v2.
 When the system boots, both cgroup v1 filesystem and cgroup v2
filesystem are registered, so you can work with a mixture of cgroup v1
and cgroup v2 controllers.
 You cannot use the same type of controller simultaneously both in
cgroup v1 and cgroup v2.
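As a hedged example of freeing a controller for v2 use (the paths are the common defaults, not mandated): a controller shows up in the v2 hierarchy only once its v1 hierarchy is unmounted, which requires that the v1 hierarchy has no subgroups:
umount /sys/fs/cgroup/pids        # release pids from its cgroup v1 hierarchy
mount -t cgroup2 none /cgroup2
cat /cgroup2/cgroup.controllers   # pids is now listed here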
The root cgroup object
• After mounting /cgroup2 with mount -t cgroup2 none /cgroup2, a root
cgroup object is created.
• The following three cgroup core interface files are created under the
/cgroup2 mount point:
/cgroup2/
├── cgroup.controllers
├── cgroup.procs
└── cgroup.subtree_control
Next we will describe these three cgroup core interface files.
There is a single root cgroup object, and it does not
have any resource control interface files.
• cgroup.controllers (a read-only file)
– Shows the supported cgroup controllers. cgroup v2 currently supports
only the memory, io and pids controllers. All v2
controllers which are not bound to a v1 hierarchy are automatically bound
to the v2 hierarchy and show up at the root, so reading cgroup.controllers
will give: io memory pids
• cgroup.procs (A read-write file)
– The list of PIDs of processes in the group; contains the PIDs of all
processes in the system after mount (zombie processes do not
appear in "cgroup.procs" and thus can't be moved to another
cgroup).
The root cgroup object – contd.
• cgroup.subtree_control (a read-write file)
– This entry is empty after mount, as no controller is enabled by default.
– Enabling a cgroup v2 controller is done by writing to cgroup.subtree_control.
– For example, enabling the memory controller is done by:
– echo “+memory” > /cgroup2/cgroup.subtree_control
– Disabling the memory controller can be done, for example, by:
- echo “-memory” > /cgroup2/cgroup.subtree_control
• Creating a cgroup is done by mkdir (as in cgroup v1), for example:
– mkdir /cgroup2/group1
– After running this command, four cgroup core “cgroup.”-prefixed entries
are created, as well as interface files for the cgroup controllers
enabled in the parent, as we will immediately see.
group1/
├── cgroup.controllers
├── cgroup.procs
├── cgroup.events
└── cgroup.subtree_control
All cgroup core interface files are prefixed with "cgroup.“
Next we will see the entries which are created when running “mkdir
/cgroup2/group1”.
Creating a subgroup
The set of these 4 cgroup
interface files is the v2 hierarchy
itself and is referred to internally
as “the default hierarchy”.
• /cgroup2/group1/cgroup.procs
– The list of PIDs of processes in this group; empty upon creation.
• /cgroup2/group1/cgroup.controllers
– The controllers enabled for this subgroup. For subgroups, this shows the controllers that
were enabled in the parent by writing to subtree_control. When changing
subtree_control in the parent, the changes are propagated to the child's cgroup.controllers.
• /cgroup2/group1/cgroup.subtree_control
– Initialized to be empty for the child group. Also here, you can enable only controllers
which appear in the cgroup.controllers of this cgroup (group1).
• /cgroup2/group1/cgroup.events
– Contains only one field, "populated”; 0 means there is no live process in this cgroup and
its descendants, 1 otherwise. Upon creation of a subgroup, populated is 0.
– Changes of populated can be monitored from userspace with poll(), inotify and
dnotify, as opposed to call_usermodehelper() in cgroup v1.
subgroup interface files
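Putting the pieces together, a hedged sketch of enabling the memory controller for a subgroup and using its v2 interface files (memory.max and memory.current are listed later in these slides; /cgroup2 is just the mount point used throughout):
mount -t cgroup2 none /cgroup2
echo +memory > /cgroup2/cgroup.subtree_control
mkdir /cgroup2/group1
ls /cgroup2/group1                      # cgroup.* core files plus memory.* interface files
echo 40M > /cgroup2/group1/memory.max
echo $$ > /cgroup2/group1/cgroup.procs
cat /cgroup2/group1/memory.current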
You can attach processes only to leaves.
You cannot attach a process to an internal subgroup if it has any
controller enabled.
• Controllers can be enabled either by writing to the subtree_control of the
parent or implicitly via controller dependencies.
• The idea is that only processes of the leaves compete for
resources. This scheme is better organized.
• The only exception to this is the root cgroup.
• This is opposed to cgroup v1, which allowed threads to be attached to any cgroup.
• See “2-4-3. No Internal Process Constraint” in cgroup-v2.txt.
No Internal Process rule
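A hedged illustration of the constraint, using the /cgroup2 mount point from the other examples (the exact error text may vary by kernel version):
mkdir /cgroup2/group1
echo $$ > /cgroup2/group1/cgroup.procs              # fine: group1 has no controllers enabled
echo +pids > /cgroup2/group1/cgroup.subtree_control
# typically fails with "Device or resource busy" because group1 now has member processes;
# move the processes to a leaf first, then enable controllers in the internal node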
A controller can't be disabled if one or more descendants have it
enabled.
Multi-destination migration as a result of subtree_control enabling:
Run this sequence:
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/group1
mkdir /cgroup2/group1/nested1
mkdir /cgroup2/group1/nested2
echo +pids > /cgroup2/cgroup.subtree_control
cgroup2 example
echo +pids > /cgroup2/group1/cgroup.subtree_control
• What happens if there were processes in nested1 and nested2
before running:
echo +pids > /cgroup2/group1/cgroup.subtree_control ?
• An inner cgroup_subsys_state (css) object is created for that group.
• The processes in nested1 and nested2 should be migrated to this css.
• With the PIDs controller, attaching a process to a cgroup will never fail.
• For other controllers, there are cases when attaching a process to a cgroup
will fail, for example when CLONE_IO is set (for cgroup v1 as well as cgroup
v2).
• Migrating the processes should also handle implicit controllers.
Migrating processes and threads
• cgroup v2 is process-granular.
• Every process in the system belongs to one and only one cgroup.
• All threads of a process belong to the same cgroup.
• A process can be migrated into a cgroup by writing its PID to the target
cgroup's cgroup.procs file.
• Writing the PID of any thread of a process to cgroup.procs of a destination
cgroup migrates all the threads of the process into the destination cgroup
(including the main process).
• This is opposed to cgroup v1 thread granularity, which allowed different
threads of a process to belong to different cgroups.
• When forking other processes from inside a process, migrating of a parent
process to another cgroup does not affect the existing child processes, and
migrating of a child process does not affect the parent process.
Finding matches based on the cgroup name in cgroup v2 (which is done by the --
path parameter) is based on getting the cgroup to which the process holding the
socket belongs. Practically, this mechanism is not possible in cgroup v1, as a
process can belong to more than one cgroup. In cgroup v2 we have only the
match-by-path capability. An example of a rule for matching traffic which
originates from sockets created in a group called “test”, or its subgroups:
iptables -A OUTPUT -m cgroup --path test -j LOG
This is based on an extension of the xt_cgroup netfilter match module (a new
revision), net/netfilter/xt_cgroup.c. The xt_cgroup module can be used for both
cgroup v1 and cgroup v2.
Create a cgroupv2 group named test and move the current shell to it:
mkdir /cgroup2/test
echo 0 > /cgroup2/test/cgroup.procs
cgroup v2- match subgroup by path
Now every socket created in this shell will have a pointer to the cgroup
subgroup in which it was created, namely the “test” group.
A sock_cgroup_data object, which contains per-socket cgroup information,
was added to the sock object. It includes:
• A pointer to the cgroup in which the socket was created,
• assigned when the socket is created.
• prioidx
• classid
• is_data – 0 indicates a cgroup pointer, 1 indicates prioidx or classid.
• Note: once net_prio or net_cls is used, that pointer in the socket
will no longer point to the cgroup, but to the priority or classid.
rr:/$echo 5 > /sys/fs/cgroup/net_prio/net_cls.classid
rr:/$mkdir /sys/fs/cgroup/net_prio/group1
/sys/fs/cgroup/net_prio/group1/net_cls.classid will be 5 as it is inherited.
echo $$ > /sys/fs/cgroup/net_prio/group1/tasks
iptables -A OUTPUT -m cgroup --cgroup 5 -j LOG
This will again trigger logging of the packets to syslog.
cgroup v2- match cgroup example -contd
Development took over a decade: the namespaces implementation started
in about 2002. There are currently 6 namespaces in Linux:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)
A process can be created in Linux by the fork(), clone() or vfork()
system calls. In order to support namespaces, 6 flags (CLONE_NEW*)
were added. These flags (or a combination of them) can be used in the
clone() or unshare() system calls to create a namespace.
Namespaces
Clone flag Kernel Version Required capability
CLONE_NEWNS 2.4.19 CAP_SYS_ADMIN
CLONE_NEWUTS 2.6.19 CAP_SYS_ADMIN
CLONE_NEWIPC 2.6.19 CAP_SYS_ADMIN
CLONE_NEWPID 2.6.24 CAP_SYS_ADMIN
CLONE_NEWNET 2.6.29 CAP_SYS_ADMIN
CLONE_NEWUSER 3.8 No capability is required
Namespaces clone flags
Namespaces API consists of these 3 system calls:
● clone() - creates a new process and a new namespace; the newly
created process is attached to the new namespace.
– The process creation and process termination methods, fork() and
exit(), were patched to handle the new namespace CLONE_NEW* flags.
● unshare() – gets only a single parameter, flags. Does not create a
new process; creates a new namespace and attaches the calling
process to it.
– unshare() was added in 2005.
see “new system call, unshare” : http://lwn.net/Articles/135266/
● setns() - a new system call, for attaching the calling process to an
existing namespace; prototype: int setns(int fd, int nstype);
Namespaces system calls
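The util-linux unshare(1) and nsenter(1) tools are thin wrappers around the unshare() and setns() system calls; a quick hedged demo with the UTS namespace (run as root, since CLONE_NEWUTS requires CAP_SYS_ADMIN):
unshare --uts /bin/bash      # new shell in a new UTS namespace
hostname ns1                 # changes the hostname only inside the namespace
hostname                     # ns1
exit
hostname                     # the original hostname is unchanged
# nsenter --target <PID> --uts <command> uses setns() to join an existing UTS namespace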
UTS namespace provides a way to get information about the system with
commands like uname or hostname.
The UTS namespace was the simplest one to implement.
There is a member in the process descriptor called nsproxy.
A member named uts_ns (uts_namespace object) was added to it.
The uts_ns object includes an object (new_utsname struct ) with 6
members:
• sysname
• nodename
• release
• version
• machine
• domainname
UTS namespace
In order to implement the UTS namespace, usage of global variables
was replaced by accessing the corresponding members of the UTS
namespace via the new nsproxy object of the process descriptor.
See, for example, the implementation of the gethostname(), sethostname()
and uname() syscalls.
For IPC namespaces, the same principle as in UTS namespace.
For Mount namespace, a member named mnt_ns (mnt_namespace
object) was added to the nsproxy.
● In the new mount namespace, all previous mounts will be visible; and
from now on, mounts/unmounts in that mount namespace are invisible to
the rest of the system.
● mounts/unmounts in the global namespace are visible in that
namespace.
● Added a member named pid_ns (pid_namespace object) to the
nsproxy struct.
● Processes in different PID namespaces can have the same process ID.
● When creating the first process in a new PID namespace, its PID is 1.
● Behavior like the “init” process:
– When a process dies, all its orphaned children will now have the
process with PID 1 as their parent (child reaping).
– Sending SIGKILL signal does not kill process 1, regardless of in which
namespace the command was issued (initial namespace or other pid
namespace).
● pid namespaces can be nested, up to 32 nesting levels.
(MAX_PID_NS_LEVEL)
• A usecase for PID namespaces: the CRIU project
PID namespaces
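A hedged userspace demo of a PID namespace, run as root (unshare needs --fork so that the new child, not the calling shell, becomes PID 1, and --mount-proc so that ps reads the namespace's own /proc):
unshare --pid --fork --mount-proc /bin/bash
echo $$        # 1, the shell is the init of the new PID namespace
ps aux         # shows only the processes of this PID namespace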
● A member named user_ns (a user_namespace object) was added to the
credentials object (struct cred).
• Also, each namespace includes a pointer to a user_namespace object.
Each process will have a distinct set of UIDs, GIDs and capabilities.
The user namespace enables a non-root user to create a process in which it
will be root.
Anatomy of a user namespaces vulnerability, by Michael Kerrisk, March 2013:
an article about CVE-2013-1858, an exploitable security
vulnerability.
http://lwn.net/Articles/543273/
User Namespaces
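A hedged demo using unshare(1), run as an ordinary user (-r / --map-root-user maps the caller's UID to root inside the new user namespace):
unshare --user --map-root-user /bin/bash
id             # uid=0(root) gid=0(root) ... , but only inside the user namespace
# capabilities apply inside the namespace; the user is still unprivileged on the host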
● A network namespace is logically another copy of the network stack,
with its own routing tables, firewall rules, and network devices.
● The network namespace is represented by a huge struct net. (defined
in include/net/net_namespace.h).
Methods in the stack which change state were adjusted to take the
net object as a parameter and set its members accordingly.
struct net includes all network stack ingredients, like:
• – Loopback device.
• – SNMP stats. (netns_mib)
• – All network tables: routing, neighboring, etc.
• – All sockets
• – /procfs and /sysfs entries.
Network Namespaces
At a given moment -
• A network device belongs to exactly one network namespace.
• A socket belongs to exactly one network namespace.
The default initial network namespace, init_net (instance of struct net),
includes the loopback device and all the physical devices, the networking
tables, etc.
● Each newly created network namespace includes only the loopback
device.
Create two namespaces, called "myns1" and "myns2":
● ip netns add myns1
● ip netns add myns2
• This triggers the creation of the empty /var/run/netns/myns1 and /var/run/netns/myns2
entries and invokes the unshare() system call with CLONE_NEWNET.
You delete a namespace by:
● ip netns del myns1
– This unmounts and removes /var/run/netns/myns1
Example
You can move a network interface (eth0) to the myns1 network namespace:
● ip link set eth0 netns myns1
There are other subcommands for monitoring namespaces, running a
command in a namespace/in all namespaces, listing the network
namespaces which were created with the ip netns command and
more:
ip netns monitor, ip netns exec myns1 bash, ip -all netns exec ip
link, ip netns list
Applications which usually look for configuration files under /etc (like
/etc/hosts or /etc/resolv.conf) will first look under /etc/netns/NAME/, and
only if nothing is available there will they look under /etc.
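For example, a hedged sketch of a per-namespace resolv.conf (the nameserver address is arbitrary), using the myns1 namespace created earlier:
mkdir -p /etc/netns/myns1
echo "nameserver 10.0.0.1" > /etc/netns/myns1/resolv.conf
ip netns exec myns1 cat /etc/resolv.conf     # shows the namespace-specific file
cat /etc/resolv.conf                         # the host copy is untouched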
cgroup namespace
• Provides a new namespace, which gives the ability to remember the cgroup
of the process at the point of creation of the cgroup namespace. Supports
both cgroup v1 and cgroup v2.
• Implementation: a new CLONE flag was added, CLONE_NEWCGROUP, and
a new object was added to nsproxy, representing cgroup namespace,
cgroup_ns.
The /proc/<pid>/cgroup file may leak potential system level information to the
isolated processes. For example, without cgroup namespaces:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
But with cgroup namespaces, from within new cgroupns,
cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
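A hedged demo of the same effect from the shell, assuming a util-linux version whose unshare(1) already supports cgroup namespaces (-C / --cgroup), run as root:
cat /proc/self/cgroup        # e.g. ...:/batchjobs/container_id1
unshare --cgroup /bin/bash
cat /proc/self/cgroup        # paths are now reported relative to the new cgroupns root: ...:/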
• The ability to make one controller dependent on another is one of the new
features of cgroup v2.
• controller dependency is not possible in cgroup v1.
• This can be defined in code by setting the depends_on member of the
cgroup_subsys.
• For example, the io controller depends on the memory controller.
struct cgroup_subsys io_cgrp_subsys = {
...
.depends_on = 1 << memory_cgrp_id,
...
};
• This means that enabling 'io' enables 'memory' *implicitly*, but it is not
visible (no interface files)
Implicit controllers (cgroup v2)
The net_cls controller – when creating a new subgroup, the parent's net_cls.classid
value is propagated to it:
rr:/sys/fs/cgroup/net_cls$mkdir group1
rr:/sys/fs/cgroup/net_cls/group1$echo 0x2 > net_cls.classid
r:/sys/fs/cgroup/net_cls/group1$mkdir nested1
rr:/sys/fs/cgroup/net_cls/group1$cat nested1/net_cls.classid
2
Note: after the child groups are created, changes in the parent are not
propagated to the existing child groups:
rr:/sys/fs/cgroup/net_cls/group1$echo 0x1 > net_cls.classid
rr:/sys/fs/cgroup/net_cls/group1$cat nested1/net_cls.classid
2
Example 1 (cgroup v1 propagation)
The following sequence shows propagation from the parent when creating a new group:
rr:/sys/fs/cgroup/cpuset$ echo 1 > cgroup.clone_children
rr:/sys/fs/cgroup/cpuset$ mkdir group1
rr:/sys/fs/cgroup/cpuset$ echo 1-2 > group1/cpuset.cpus
rr:/sys/fs/cgroup/cpuset$ cat group1/cpuset.cpus
1-2
rr:/sys/fs/cgroup/cpuset$ mkdir group1/nested1
rr:/sys/fs/cgroup/cpuset$ cat group1/nested1/cpuset.cpus
1-2
Notes:
Without echo 1 > cgroup.clone_children this propagation won’t work.
The clone_children flag is effective only with the cpuset controller.
example 2 (cgroup v1 clone_children)
s:/$mkdir /sys/fs/cgroup/net_prio/group1
s:/$echo "eth0 4" > /sys/fs/cgroup/net_prio/group1/net_prio.ifpriomap
This sets the netprio_map object of eth0 net_device.
This will set the priority of outgoing (egress) traffic of packets (skbs) of
processes attached to group1 to be 4.
This is implemented by the skb_update_prio() method:
http://lxr.free-electrons.com/source/net/core/dev.c#L2926
This is done prior to queuing the packet with the qdisc (Queuing
Discipline).
Setting the priority of a socket can be done by setting the SO_PRIORITY
socket option, but this option is not always available.
Example 3 – cgroup v1 net_prio
cgroup v1:
Run as root:
mkdir -p /sys/fs/cgroup/devices/group1/nested1
su user1
echo $$ > /sys/fs/cgroup/devices/group1/nested1/cgroup.procs
You will get -EPERM.
But after you set access permissions on nested1 by running as root:
chown -R user1:user1 /sys/fs/cgroup/devices/group1/nested1/
running it again will succeed.
Example 4 : Delegation Containment
In order to support delegation, three conditions should be met:
– The writer's euid must match either the uid or the suid of the target process.
– The writer must have write access to the "cgroup.procs" file.
– The writer must have write access to the "cgroup.procs" file of the common
ancestor of the source and destination cgroups.
Run as root:
echo $$
4767
mkdir -p /cgroup2/group1/nested1
chown -R user1:user1 /cgroup2/group1
echo $$ > /cgroup2/group1/cgroup.procs
su user1
echo 4767 > /cgroup2/group1/nested1/cgroup.procs
Example 4 : delegation containment – v2
How do I know to which cgroup a process belongs?
cat /proc/$PID/cgroup shows this info.
• The entry for cgroup v2 is always in the format "0::$PATH".
• So for example, if we created a cgroup named group1 and attached a task
with PID 1000 to it, then running:
cat "/proc/1000/cgroup
0::/group1
And for a nested group:
cat /proc/869/cgroup
0::/group1/nested
Getting info about cgroups
/proc/cgroups shows info on both cgroup v1 and cgroup v2 controllers. The hierarchy_id for cgroup v2 controllers is 0.
#subsys_name hierarchy num_cgroups enabled
cpuset 2 1 1
cpu 3 1 1
cpuacct 3 1 1
blkio 0 1 1
memory 0 1 1
devices 6 61 1
freezer 7 1 1
net_cls 8 1 1
perf_event 9 1 1
net_prio 8 1 1
hugetlb 10 1 1
pids 0 1 1
For the memory controller:
• memory.current
• memory.events
• memory.high
• memory.low
• memory.max
• For the IO controller:
• io.max
• io.weight
Note: work is currently being done by Vladimir
Davydov, one of the memory resource controller
(memcg) maintainers, on adding cgroup v2 kmem
accounting, so entries like memory.swap.current and
memory.swap.max are coming soon.
Cgroup v2 Control files (Interface files)
Minor advantage: there is a single Linux kernel infrastructure for
containers (namespaces and cgroups), while for Xen and KVM we have
two different implementations without any common code.
Now run the following iptables rule (from anywhere in the system):
iptables -A OUTPUT -m cgroup --path test -j LOG
And then ping anywhere from the shell that is now a process in “test”.
The socket created has a pointer to “test” v2 cgroup, and the iptables
match rule is for “test” (--path test). So the packets will be dumped to
syslog.
• Running this test from any subgroup of test will have the same
result.
• The sum of the allocations of the immediate subgroups cannot exceed
the amount of resources available to the parent.
So, for example:
If in /cgroup2/group1
pids.max = 4
Then if in
/cgroup2/group1/nested1
/cgroup2/group1/nested2
there are 4 processes in total, we cannot fork a fifth in either of them.
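A hedged sketch of that situation, continuing the group1/nested1/nested2 layout built earlier (with +pids enabled in the root and in group1):
echo 4 > /cgroup2/group1/pids.max
echo $$ > /cgroup2/group1/nested1/cgroup.procs
sleep 1000 & sleep 1000 & sleep 1000 &
# the shell plus the three sleeps already account for group1's limit of 4,
# so any further fork in nested1 or nested2 fails: fork: Resource temporarily unavailable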
The script in the following slide shows how to connect two namespaces
with veth (virtual Ethernet) devices so that you will be able to ping and
send traffic between them:
Script for connecting two namespaces
ip netns add netA
ip netns add netB
ip link add name vm1-eth0 type veth peer name vm1-eth0.1
ip link add name vm2-eth0 type veth peer name vm2-eth0.1
ip link set vm1-eth0.1 netns netA
ip link set vm2-eth0.1 netns netB
ip netns exec netA ip l set lo up
ip netns exec netA ip l set vm1-eth0.1 up
ip netns exec netB ip l set lo up
ip netns exec netB ip l set vm2-eth0.1 up
ip netns exec netA ip a add 192.168.0.10 dev vm1-eth0.1
ip netns exec netB ip a add 192.168.0.20 dev vm2-eth0.1
ip netns exec netA ip r add 192.168.0.0/24 dev vm1-eth0.1
ip netns exec netB ip r add 192.168.0.0/24 dev vm2-eth0.1
brctl addbr mybr
ip l set mybr up
ip l set vm1-eth0 up
brctl addif mybr vm1-eth0
ip l set vm2-eth0 up
brctl addif mybr vm2-eth0
Script for connecting two namespaces
The RDMA cgroup controller – patches were already sent by Parav Pandit:
https://lwn.net/Articles/674161/
The RDMA cgroup controller will support both v1 and v2.
Upcoming…
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel TLV
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Kernel TLV
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityKernel TLV
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to BottomKernel TLV
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsKernel TLV
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Kernel TLV
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and WhereKernel TLV
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptablesKernel TLV
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernel TLV
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentKernel TLV
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageKernel TLV
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesKernel TLV
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival GuideKernel TLV
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingKernel TLV
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and DriversKernel TLV
 

Mais de Kernel TLV (20)

DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
SGX Trusted Execution Environment
SGX Trusted Execution EnvironmentSGX Trusted Execution Environment
SGX Trusted Execution Environment
 
Fun with FUSE
Fun with FUSEFun with FUSE
Fun with FUSE
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem Security
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to Bottom
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance Tools
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and Where
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptables
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker Guidelines
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future Development
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and Drivers
 

Último

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 

Último (20)

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 

TRANSFORMING NETWORKING & STORAGE WITH CONTAINERS

  • 1.
  • 2. 2 TRANSFORMING NETWORKING & STORAGE About Myself: I am a working for Intel for various projects, primarily Kernel networking. My website: http://ramirose.wix.com/ramirosen I am the author of a book titled “Linux Kernel Networking”by Apress, 648 pages, 2014:
  • 3. Agenda: Overview of the cgroup subsystem and the namespace subsystem cgroups The PIDs cgroup controller cgroup v2 namespaces Backup
  • 4. • The namespace subsystem and the cgroup subsystem are the basis of lightweight process virtualization. • They form the basis of Linux containers. • Can be used also for setting a testing environment or as a resource management/resource isolation setup and for accounting. • We will talk mainly about the kernel implementation with some userspace usage examples. lightweight process virtualization: A process which gives the user an illusion that he runs a Linux operating system. You can run many such processes on a machine, and all such processes in fact share a single Linux kernel which runs on the machine.
  • 5. This is opposed to hypervisor solutions, like Xen or KVM, where you run another instance of the kernel. The idea is not really a new paradigm - we have Solaris Zones and BSD jails already several years ago. It seems that Hypervisor-based VMs like KVM are here to stay (at least for the next several years). There is an ecosystem of cloud infrastructure around solutions like KVMs. Advantages of Hypervisor-based VMs (like KVM) : • You can create VMs of other operating systems (windows, BSDs). • Security • Though there were cases of security vulnerabilities which were found and required installing patches to handle them (like VENOM).
  • 6. Containers – advantages: • Lightweight: occupies less resources (like memory) significantly then a hypervisor. • Density: you can install many more containers on a given host then KVM based VMs. • Elasticity: startup time and shutdown time is much shorter, almost instantaneous. Creation of a container has the overhead of creating a Linux process, which can be of the order of milliseconds, while creating a VM based on XEN/KVM can take seconds. The lightness of the containers in fact provides both their density and their elasticity. Containers versus Hypervisor-based VMs
  • 7. The cgroup (control groups) subsystem is a Resource Management and ResourceAccounting/Tracking solution, providing a generic process-grouping framework. ● It handles resources such as memory, cpu, network, and more; mostly needed in both ends of the spectrum (servers and embedded). ● Development was started by engineers at Google in 2006 under the name "process containers”. ● Merged into kernel 2.6.24 (2008). ● cgroup core has 3 maintainers, and each cgroup controller has its own maintainers. cpu, memory and io are the most important resources that cgroup deals with. For networking, it's mostly about allowing network layer to match packets to cgroups. The actual control part still happens on the network side through iptables and tc. The cgroup subsystem: background
● System-wide: /proc/cgroups
cgroup subsystem implementation - contd
• The cgroups subsystem (v1 and v2) is composed of two ingredients:
  – cgroup core.
    • Mostly in kernel/cgroup.c (~6k lines of code).
    • The same code serves both cgroup v1 and cgroup v2.
  – cgroup controllers.
    • Currently there are 12 cgroup v1 controllers and 3 cgroup v2 controllers (memory, io and pids); other v2 controllers are work in progress.
    • For both v1 and v2, these controllers are represented by cgroup_subsys objects.
Mounting cgroups
In order to use the cgroups filesystem (browse it, attach tasks to cgroups, etc.) it must be mounted, like any other filesystem.
The cgroup filesystem can be mounted on any path in the filesystem. Systemd and various container projects use /sys/fs/cgroup as the mount point.
When mounting, we can specify with mount options (-o) which cgroup controllers will be used (a combination of controllers, or all controllers).
Example: mounting the net_prio cgroup controller:
mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
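A minimal sketch of co-mounting several v1 controllers on one hierarchy and verifying the result, assuming the cpu and cpuacct controllers are not already mounted (on systemd-based distributions they typically are, under /sys/fs/cgroup/cpu,cpuacct):
mkdir -p /sys/fs/cgroup/cpu,cpuacct
mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
# verify which controllers ended up on this mount
grep cgroup /proc/mounts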
Kernel 4.4 cgroup v1 – 12 controllers

Name        Kernel object name       Module
blkio       io_cgrp_subsys           block/blk-cgroup.c
cpuacct     cpuacct_cgrp_subsys      kernel/sched/cpuacct.c
cpu         cpu_cgrp_subsys          kernel/sched/core.c
cpuset      cpuset_cgrp_subsys       kernel/cpuset.c
devices     devices_cgrp_subsys      security/device_cgroup.c
freezer     freezer_cgrp_subsys      kernel/cgroup_freezer.c
hugetlb     hugetlb_cgrp_subsys      mm/hugetlb_cgroup.c
memory      memory_cgrp_subsys       mm/memcontrol.c
net_cls     net_cls_cgrp_subsys      net/core/netclassid_cgroup.c
net_prio    net_prio_cgrp_subsys     net/core/netprio_cgroup.c
perf_event  perf_event_cgrp_subsys   kernel/events/core.c
pids        pids_cgrp_subsys         kernel/cgroup_pids.c
Memory controller interface files
Generic (cgroup core) entries:
tasks, cgroup.procs, cgroup.clone_children, cgroup.event_control, cgroup.sane_behavior, notify_on_release, release_agent
Note: the release_agent and sane_behavior entries appear only in the root cgroup.
Memory controller entries:
memory.failcnt, memory.force_empty, memory.limit_in_bytes, memory.usage_in_bytes, memory.soft_limit_in_bytes, memory.stat, memory.numa_stat, memory.swappiness, memory.use_hierarchy, memory.move_charge_at_immigrate, memory.oom_control, memory.pressure_level
memory.memsw.failcnt, memory.memsw.limit_in_bytes, memory.memsw.max_usage_in_bytes, memory.memsw.usage_in_bytes
memory.kmem.failcnt, memory.kmem.limit_in_bytes, memory.kmem.max_usage_in_bytes, memory.kmem.slabinfo, memory.kmem.usage_in_bytes
memory.kmem.tcp.failcnt, memory.kmem.tcp.limit_in_bytes, memory.kmem.tcp.max_usage_in_bytes, memory.kmem.tcp.usage_in_bytes
Example 1: memcg (memory control groups)
mkdir /sys/fs/cgroup/memory/group0
• The tasks entry created under group0 is empty (processes are called tasks in cgroup terminology).
echo 0 > /sys/fs/cgroup/memory/group0/tasks
• The PID of the current bash shell process is moved from the memory cgroup in which it resides into the group0 memory cgroup (writing 0 attaches the writing process itself).
echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
You can disable the out-of-memory killer with memcg:
echo 1 > /sys/fs/cgroup/memory/group0/memory.oom_control
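A small sketch, assuming the group0 cgroup created above with the current shell already attached to it, showing how the limit can be observed through the interface files listed earlier:
# current memory usage and limit of group0 (in bytes)
cat /sys/fs/cgroup/memory/group0/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
# number of times the limit was hit so far
cat /sys/fs/cgroup/memory/group0/memory.failcnt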
memcg (memory control groups) - contd
With containers, the same can be done from the host:
lxc-cgroup -n myfedora memory.limit_in_bytes 40M
cat /sys/fs/cgroup/memory/lxc/myfedora/memory.limit_in_bytes
41943040
Or for assigning the processors 0 and 1 to the container:
lxc-cgroup -n myfedora cpuset.cpus "0,1"
cghost:/$ cat /sys/fs/cgroup/cpuset/lxc/myfedora/cpuset.cpus
0-1
docker run -i --cpuset=0,2 -t fedora /bin/bash
cat /sys/fs/cgroup/cpuset/system.slice/docker-64bit_ID.scope/cpuset.cpus
0,2
Example 2: release_agent in memcg
The release agent mechanism is invoked when the last process of a cgroup terminates.
● The cgroup release_agent entry should be set to the path of an executable/script to be invoked when the last process in a group terminates.
● The cgroup notify_on_release entry should be set so that the release_agent will be invoked:
echo 1 > /sys/fs/cgroup/memory/notify_on_release
The release_agent can also be set via a mount option; systemd, for example, uses this mechanism. For example, in Fedora 21, mount shows:
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
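A minimal sketch of a release agent, assuming a hypothetical script path /usr/local/bin/cgroup-release.sh; the kernel invokes the agent with the path of the emptied cgroup, relative to the hierarchy mount point, as its first argument:
#!/bin/sh
# /usr/local/bin/cgroup-release.sh (hypothetical path)
# $1 is the path of the now-empty cgroup, relative to the hierarchy root
echo "cgroup $1 became empty at $(date)" >> /var/log/cgroup-release.log
# optionally remove the empty group:
# rmdir "/sys/fs/cgroup/memory/$1"
The agent is registered in the root of the hierarchy, for example:
echo /usr/local/bin/cgroup-release.sh > /sys/fs/cgroup/memory/release_agent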
Example 3: devices control group
Also referred to as devcg (devices control group).
• It is more of an access controller than a resource controller.
● The devices cgroup allows enforcing restrictions on read, write and create (mknod) operations on device files.
● 3 control files: devices.allow, devices.deny, devices.list.
  – devices.allow can be considered a devices whitelist.
  – devices.deny can be considered a devices blacklist.
  – devices.list shows the available devices.
● Each entry in these files consists of 4 fields:
  – type: can be a (all), c (char device), or b (block device).
    ● "all" means all types of devices, and all major and minor numbers.
  – Major number.
  – Minor number.
  – Access: a composition of 'r' (read), 'w' (write) and 'm' (mknod).
devices control group – example (contd)
/dev/null has major number 1 and minor number 3 (see Documentation/devices.txt).
mkdir /sys/fs/cgroup/devices/group0
By default, a new group has full permissions:
cat /sys/fs/cgroup/devices/group0/devices.list
a *:* rwm
echo 'c 1:3 rmw' > /sys/fs/cgroup/devices/group0/devices.deny
This denies rmw access to the /dev/null device.
echo 0 > /sys/fs/cgroup/devices/group0/tasks   # attaches the current shell to group0
echo "test" > /dev/null
bash: /dev/null: Operation not permitted
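A short follow-up sketch, assuming the same group0 and shell as above, that adds read access back through the whitelist and verifies that writes are still denied:
# allow read access to /dev/null (char device 1:3) again
echo 'c 1:3 r' > /sys/fs/cgroup/devices/group0/devices.allow
cat /dev/null            # now succeeds
echo "test" > /dev/null  # still fails: Operation not permitted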
PIDs cgroup controller
• An anti-fork-bomb solution via a new cgroup controller.
• Adds a cgroup controller that enables limiting the number of processes that can be forked inside a cgroup.
• An implementation of prlimit(2)/RLIMIT_NPROC, but applied to a cgroup rather than to a process.
• The PID space is a limited resource: about 4 million PIDs system-wide.
  • PID_MAX_LIMIT = 4,194,304 (0x400000), see include/linux/threads.h.
  • With today's RAM capacities, all of them can be used up by a single container, making the whole system unusable. The PIDs cgroup controller can help avoid this by limiting the number of processes per cgroup.
• Developed by Aleksa Sarai and included in the kernel since v4.3.
• A simple and short module, only ~300 lines of code: kernel/cgroup_pids.c
PIDs cgroup controller - contd
The PIDs cgroup controller has two interface files:
• pids.max (read/write)
  – The maximum number of processes for the cgroup.
  – Does not exist for the root cgroup directory.
  – For subgroups, the default value is "max", which is about 4,000,000.
• pids.current (read only)
  – The number of processes currently in the cgroup and its children, including processes of a child on which the PIDs controller is not enabled.
  – Does not include zombies.
  – Does not exist for the root cgroup directory.
• When a fork would push the number of processes in a group beyond its pids.max, it fails with: fork: Resource temporarily unavailable.
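A small sketch, assuming the v1 pids controller is mounted under /sys/fs/cgroup/pids (as systemd does on recent distributions), that limits a group to 3 processes and triggers the failure:
mkdir /sys/fs/cgroup/pids/group0
echo 3 > /sys/fs/cgroup/pids/group0/pids.max
echo $$ > /sys/fs/cgroup/pids/group0/cgroup.procs   # move this shell into group0
# the shell plus two background sleeps reach the limit; the third fork fails with
# "fork: Resource temporarily unavailable"
sleep 100 &
sleep 100 &
sleep 100 &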
cgroup v2 - background
• cgroup v1 is the existing cgroup kernel implementation (also referred to as "legacy cgroup" on the cgroup/LKML/other mailing lists).
• The "unified hierarchy" development features were available for over three years behind a special mount option (__DEVEL__sane_behavior).
• Since kernel 4.4 (January 2016), cgroup v2 is an official part of the kernel, with a dedicated filesystem type.
• Systemd has initial support for cgroup v2 since v226, September 2015.
• cgmanager also added initial support for cgroup v2 (by Serge Hallyn).
• Why cgroup v2?
  • A lot of chaos in cgroup v1.
cgroup v2 - background (contd)
cgroup-v2.txt in the kernel Documentation describes the cgroup v1 inconsistencies in detail:
– There is no consistency across cgroup controllers; for example:
  – When creating a new cgroup, in several controllers the attribute values are inherited from the parent (like net_prio and net_cls), while in several others the attribute values are the defaults, regardless of the parent.
  – For cpuset, changes in the parent are propagated to its descendants, whereas with all other controllers they are not (clone_children should be set for this).
  – With cgroup v1, some controllers have controller-specific interface files in the root cgroup while others don't.
– cgroup v2 establishes strict and consistent interfaces.
– In cgroup v2 there is only one hierarchy, "the unified hierarchy". Each process can belong to only a single cgroup.
tree -L 1 /sys/fs/cgroup/ on Fedora 23 (cgroup v1 – multiple hierarchies):
├── blkio
├── cpu -> cpu,cpuacct
├── cpuacct -> cpu,cpuacct
├── cpu,cpuacct
├── cpuset
├── devices
├── freezer
├── hugetlb
├── memory
├── net_cls -> net_cls,net_prio
├── net_cls,net_prio
├── net_prio -> net_cls,net_prio
├── perf_event
└── systemd
Unlike cgroup v1, cgroup v2 has only a single hierarchy and is strict about hierarchical behavior.
• Enabling/disabling of a controller is always done by the cgroup parent rather than by the cgroup itself (subtree_control).
• Interface file semantics:
  • When a controller implements an absolute resource limit, the interface files should be named "min" and "max" (for example, pids.max for the PIDs controller).
  • When a controller implements a best-effort resource limit, the interface files should be named "low" and "high".
    • Used, for example, in the memory controller.
  • A special token "max" is used in these interface files (read/write), representing infinity.
• Mounting cgroup v2 is done by:
  – mount -t cgroup2 none $MOUNT_POINT
  ● The mount point can be anywhere in the filesystem.
cgroup v2 controllers
• As opposed to cgroup v1, there are no cgroup mount options in cgroup v2.
• When the system boots, both the cgroup v1 filesystem and the cgroup v2 filesystem are registered, so you can work with a mixture of cgroup v1 and cgroup v2 controllers.
• You cannot use the same type of controller simultaneously in both cgroup v1 and cgroup v2.
The root cgroup object
• After mounting /cgroup2 with mount -t cgroup2 none /cgroup2, a root cgroup object is created.
• The following three cgroup core interface files are created under the /cgroup2 mount point:
/cgroup2/
├── cgroup.controllers
├── cgroup.procs
└── cgroup.subtree_control
• Next we will describe these three cgroup core interface files.
• There is a single root cgroup object, and it does not have any resource control interface files.
The root cgroup object – contd.
• cgroup.controllers (a read-only file)
  – Shows the supported cgroup controllers. In cgroup v2 we currently have support for the memory, io and pids cgroup controllers only. All v2 controllers which are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root, so reading cgroup.controllers will give:
    io memory pids
• cgroup.procs (a read-write file)
  – The list of PIDs of processes in the group; contains the PIDs of all processes in the system after mount (zombie processes do not appear in "cgroup.procs" and thus can't be moved to another cgroup).
The root cgroup object – contd.
• cgroup.subtree_control (a read-write file)
  – This entry is empty after mount, as no controller is enabled by default.
  – Enabling a cgroup v2 controller is done by writing to cgroup.subtree_control.
  – For example, enabling the memory controller is done by:
    echo "+memory" > /cgroup2/cgroup.subtree_control
  – Disabling the memory controller can be done by:
    echo "-memory" > /cgroup2/cgroup.subtree_control
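Putting the pieces above together, a minimal sketch of mounting the v2 hierarchy and enabling controllers for child groups, assuming the memory and pids controllers are not bound to any v1 hierarchy on this machine:
mkdir -p /cgroup2
mount -t cgroup2 none /cgroup2
cat /cgroup2/cgroup.controllers            # e.g. "io memory pids"
echo "+memory +pids" > /cgroup2/cgroup.subtree_control
mkdir /cgroup2/group1
cat /cgroup2/group1/cgroup.controllers     # now shows "memory pids"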
Creating a subgroup
• Creating a cgroup is done by mkdir (as in cgroup v1), for example:
  – mkdir /cgroup2/group1
  – After running this command, four cgroup core "cgroup."-prefixed entries are created, as well as interface files for the cgroup controllers enabled in the parent, as we will see immediately.
• The entries created when running "mkdir /cgroup2/group1":
group1/
├── cgroup.controllers
├── cgroup.procs
├── cgroup.events
└── cgroup.subtree_control
• All cgroup core interface files are prefixed with "cgroup.".
• The v2 hierarchy which these cgroups form is referred to internally as "the default hierarchy".
subgroup interface files
• /cgroup2/group1/cgroup.procs
  – The list of PIDs of processes in this group; empty upon creation.
• /cgroup2/group1/cgroup.controllers
  – The controllers enabled for this subgroup. For subgroups, this shows the controllers that were enabled in the parent by writing to its subtree_control. When changing subtree_control in the parent, the changes are propagated to the child's cgroup.controllers.
• /cgroup2/group1/cgroup.subtree_control
  – Initialized to be empty for the child group. Here too, you can enable only controllers which appear in the cgroup.controllers of this cgroup (group1).
• /cgroup2/group1/cgroup.events
  – Contains a single field, "populated": 0 means there is no live process in this cgroup and its descendants, 1 otherwise. Upon creation of a subgroup, populated is 0.
  – Changes of "populated" can be monitored from userspace with poll(), inotify() and dnotify(), as opposed to the call_usermodehelper() release agent mechanism in cgroup v1.
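A small sketch of watching "populated" from userspace, assuming the inotify-tools package (inotifywait) is installed; the kernel generates a modify notification on cgroup.events when the populated state flips:
# block until the populated field of group1 changes
inotifywait -e modify /cgroup2/group1/cgroup.events
grep populated /cgroup2/group1/cgroup.events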
No Internal Process rule
• You can attach processes only to leaves: you cannot attach a process to an internal subgroup if it has any controller enabled.
• Controllers can be enabled either by writing to subtree_control of the parent or implicitly via controller dependencies.
• The idea is that only processes in the leaves compete for resources. This scheme is better organized.
• The only exception to this is the cgroup root object.
• This is opposed to cgroup v1, which allowed processes and threads to be in any cgroup.
• See "2-4-3. No Internal Process Constraint" in cgroup-v2.txt.
A controller can't be disabled if one or more descendants have it enabled.
cgroup2 example
Multi-destination migration as a result of subtree_control enabling.
Run this sequence:
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/group1
mkdir /cgroup2/group1/nested1
mkdir /cgroup2/group1/nested2
echo +pids > /cgroup2/cgroup.subtree_control
echo +pids > /cgroup2/group1/cgroup.subtree_control
• What happens if there were processes in nested1 and nested2 before running:
  echo +pids > /cgroup2/group1/cgroup.subtree_control ?
• An inner cgroup_subsys_state (css) object is created for that group.
• The processes in nested1 and nested2 should be migrated to this css.
• With the PIDs controller, attaching a process to a cgroup will never fail.
• For other controllers, there are cases when attaching a process to a cgroup will fail, for example when CLONE_IO is set (for cgroup v1 as well as cgroup v2).
• Migrating the processes should also handle implicit controllers.
Migrating processes and threads
• cgroup v2 is process-granular.
• Every process in the system belongs to one and only one cgroup.
• All threads of a process belong to the same cgroup.
• A process can be migrated into a cgroup by writing its PID to the target cgroup's cgroup.procs file.
• Writing the PID of any thread of a process to cgroup.procs of a destination cgroup migrates all the threads of the process into the destination cgroup (including the main process).
• This is opposed to cgroup v1 thread granularity, which allowed different threads of a process to belong to different cgroups.
• Migrating a parent process to another cgroup does not affect its existing child processes, and migrating a child process does not affect the parent process.
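A minimal sketch of such a migration, using a hypothetical freshly created leaf subgroup /cgroup2/group2 (so the No Internal Process rule is not an issue):
mkdir /cgroup2/group2
echo $$ > /cgroup2/group2/cgroup.procs   # move this shell (all of its threads) into group2
cat /proc/self/cgroup                    # the v2 entry now reads 0::/group2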
cgroup v2 - match subgroup by path
Matching based on the cgroup name in cgroup v2 (done with the --path parameter) is based on getting the cgroup to which the process holding the socket belongs. Practically, this mechanism is not possible in cgroup v1, as a process can belong to more than one cgroup. In cgroup v2 we only have the match-by-path capability.
An example of a rule matching traffic which originates from sockets created in a group called "test", or its subgroups:
iptables -A OUTPUT -m cgroup --path test -j LOG
This is based on an extension of the xt_cgroup netfilter match module (adding a new revision), net/netfilter/xt_cgroup.c. The xt_cgroup module can be used both for cgroup v1 and cgroup v2.
Create a cgroup v2 group named test and move the current shell into it:
mkdir /cgroup2/test
echo 0 > /cgroup2/test/cgroup.procs
Now every socket created in this shell will have a pointer to the cgroup subgroup in which it was created, namely the "test" group.
A sock_cgroup_data object, which contains per-socket cgroup information, was added to the sock object. It includes:
• A pointer to the cgroup in which the socket was created.
  • Assigned when the socket is created.
• prioidx
• classid
• is_data – 0 indicates a cgroup pointer, 1 indicates prioidx or classid.
• Note: once net_prio or net_cls is used, that pointer in the socket will no longer point to the cgroup, but to the priority or classid.
cgroup v2 - match cgroup example - contd
rr:/$ echo 5 > /sys/fs/cgroup/net_prio/net_cls.classid
rr:/$ mkdir /sys/fs/cgroup/net_prio/group1
/sys/fs/cgroup/net_prio/group1/net_cls.classid will be 5, as it is inherited.
echo $$ > /sys/fs/cgroup/net_prio/group1/tasks
iptables -A OUTPUT -m cgroup --cgroup 5 -j LOG
This will again trigger logging of the packets to syslog.
Namespaces
Development took over a decade: the namespaces implementation started in about 2002. There are currently 6 namespaces in Linux:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)
A process can be created in Linux by the fork(), clone() or vfork() system calls. In order to support namespaces, 6 flags (CLONE_NEW*) were added. These flags (or a combination of them) can be used in the clone() or unshare() system calls to create a namespace.
Namespaces clone flags

Clone flag       Kernel version   Required capability
CLONE_NEWNS      2.4.19           CAP_SYS_ADMIN
CLONE_NEWUTS     2.6.19           CAP_SYS_ADMIN
CLONE_NEWIPC     2.6.19           CAP_SYS_ADMIN
CLONE_NEWPID     2.6.24           CAP_SYS_ADMIN
CLONE_NEWNET     2.6.29           CAP_SYS_ADMIN
CLONE_NEWUSER    3.8              No capability is required
Namespaces system calls
The namespaces API consists of these 3 system calls:
● clone() - creates a new process and a new namespace; the newly created process is attached to the new namespace.
  – The process creation and process termination methods, fork() and exit(), were patched to handle the new namespace CLONE_NEW* flags.
● unshare() – takes only a single parameter, flags. Does not create a new process; creates a new namespace and attaches the calling process to it.
  – unshare() was added in 2005, see "new system call, unshare": http://lwn.net/Articles/135266/
● setns() - a new system call for attaching the calling process to an existing namespace; prototype: int setns(int fd, int nstype);
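A quick sketch using the util-linux unshare(1) wrapper (which calls the unshare() system call), run as root, to get a new UTS namespace; the hostname change is visible only inside it:
hostname                 # e.g. "myhost"
unshare --uts bash       # new shell in a new UTS namespace
hostname container1      # change the hostname inside the namespace
hostname                 # prints "container1"
exit
hostname                 # still "myhost" in the original namespace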
UTS namespace
The UTS namespace provides isolation of the system identifiers returned by commands like uname or hostname.
The UTS namespace was the simplest one to implement.
There is a member in the process descriptor called nsproxy. A member named uts_ns (a uts_namespace object) was added to it.
The uts_ns object includes an object (a new_utsname struct) with 6 members:
• sysname
• nodename
• release
• version
• machine
• domainname
In order to implement the UTS namespace, usage of global variables was replaced by accessing the corresponding members of the UTS namespace via the new nsproxy object of the process descriptor.
See, for example, the implementation of the gethostname(), sethostname() and uname() syscalls.
For the IPC namespace, the same principle as in the UTS namespace applies.
For the mount namespace, a member named mnt_ns (a mnt_namespace object) was added to nsproxy.
● In a new mount namespace, all previous mounts will be visible; from then on, mounts/unmounts in that mount namespace are invisible to the rest of the system.
● Mounts/unmounts in the global namespace are visible in that namespace.
PID namespaces
● A member named pid_ns (a pid_namespace object) was added to the nsproxy struct.
● Processes in different PID namespaces can have the same process ID.
● When creating the first process in a new PID namespace, its PID is 1.
● It behaves like the "init" process:
  – When a process dies, all its orphaned children will now have the process with PID 1 as their parent (child reaping).
  – Sending the SIGKILL signal does not kill process 1, regardless of the namespace in which the command was issued (the initial namespace or another PID namespace).
● PID namespaces can be nested, up to 32 nesting levels (MAX_PID_NS_LEVEL).
• A use case for PID namespaces: the CRIU project.
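A short sketch, again using unshare(1) (assuming a util-linux version that supports --fork and --mount-proc), showing that the first process in a new PID namespace sees itself as PID 1:
unshare --pid --fork --mount-proc bash   # new shell in a new PID namespace
echo $$                                  # prints 1
ps aux | head                            # only the processes of this namespace are visible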
User namespaces
● A member named user_ns (a user_namespace object) was added to the credentials object (struct cred).
• Each namespace also includes a pointer to a user_namespace object.
• Each process has a distinct set of UIDs, GIDs and capabilities per user namespace.
• The user namespace enables a non-root user to create a process in which it will be root.
"Anatomy of a user namespaces vulnerability", by Michael Kerrisk, March 2013 – an article about CVE 2013-1858, an exploitable security vulnerability:
http://lwn.net/Articles/543273/
Network Namespaces
● A network namespace is logically another copy of the network stack, with its own routing tables, firewall rules, and network devices.
● The network namespace is represented by a huge struct net (defined in include/net/net_namespace.h). Methods in the stack which change state were adjusted to take the net object as a parameter and to set its members accordingly.
struct net includes all the network stack ingredients, like:
– The loopback device.
– SNMP stats (netns_mib).
– All the networking tables: routing, neighboring, etc.
– All sockets.
– /procfs and /sysfs entries.
At a given moment -
• A network device belongs to exactly one network namespace.
• A socket belongs to exactly one network namespace.
The default, initial network namespace, init_net (an instance of struct net), includes the loopback device, all the physical devices, the networking tables, etc.
● Each newly created network namespace includes only the loopback device.
Example
Create two namespaces, called "myns1" and "myns2":
● ip netns add myns1
● ip netns add myns2
• This triggers the creation of the empty entries /var/run/netns/myns1 and /var/run/netns/myns2, and invokes the unshare() system call with CLONE_NEWNET.
You delete a namespace with:
● ip netns del myns1
  – This unmounts and removes /var/run/netns/myns1.
You can move a network interface (eth0) to the myns1 network namespace:
● ip link set eth0 netns myns1
There are other subcommands for monitoring namespaces, running a command in one namespace or in all namespaces, listing the network namespaces which were created with the ip netns command, and more:
ip netns monitor
ip netns exec myns1 bash
ip -all netns exec ip link
ip netns list
Applications which usually look for configuration files under /etc (like /etc/hosts or /etc/resolv.conf) will first look under /etc/netns/NAME/, and only if nothing is available there will they look under /etc.
cgroup namespace
• Provides a new namespace which gives the ability to remember the cgroup of the process at the point of creation of the cgroup namespace. It supports both cgroup v1 and cgroup v2.
• Implementation: a new CLONE flag was added, CLONE_NEWCGROUP, and a new object representing the cgroup namespace, cgroup_ns, was added to nsproxy.
• The /proc/<pid>/cgroup file may leak potential system-level information to isolated processes. For example, without cgroup namespaces:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
But with cgroup namespaces, from within the new cgroupns:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
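A small sketch, assuming a recent util-linux whose unshare(1) supports the --cgroup option (which maps to CLONE_NEWCGROUP):
cat /proc/self/cgroup    # full cgroup paths of this shell
unshare --cgroup bash    # new shell in a new cgroup namespace
cat /proc/self/cgroup    # paths are now shown relative to the cgroups at namespace creation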
  • 53. Legal Disclaimer 53 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFIC ERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literatur e.htm% 20 Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. 
Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt. 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technolog y/security/ for more information. †Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor. * Other names and brands may be claimed as the property of others. Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice. Copyright © 2016, Intel Corporation. All rights reserved.
Implicit controllers (cgroup v2)
• The ability to make one controller dependent on another is one of the new features of cgroup v2.
• Controller dependency is not possible in cgroup v1.
• This can be defined in code by setting the depends_on member of the cgroup_subsys object.
• For example, the io controller depends on the memory controller:
struct cgroup_subsys io_cgrp_subsys = {
        ...
        .depends_on = 1 << memory_cgrp_id,
        ...
};
• This means that enabling 'io' enables 'memory' *implicitly*, but the memory controller is not visible (no interface files are created for it).
Example 1 (cgroup v1 propagation)
The net_cls controller – when creating a new subgroup, the parent's net_cls.classid value is propagated to it:
rr:/sys/fs/cgroup/net_cls$ mkdir group1
rr:/sys/fs/cgroup/net_cls/group1$ echo 0x2 > net_cls.classid
rr:/sys/fs/cgroup/net_cls/group1$ mkdir nested1
rr:/sys/fs/cgroup/net_cls/group1$ cat nested1/net_cls.classid
2
Note: after the child groups are created, changes in the parent are not propagated to the existing child groups:
rr:/sys/fs/cgroup/net_cls/group1$ echo 0x1 > net_cls.classid
rr:/sys/fs/cgroup/net_cls/group1$ cat nested1/net_cls.classid
2
Example 2 (cgroup v1 clone_children)
The following sequence shows propagation from the parent when creating a new group:
rr:/sys/fs/cgroup/cpuset$ echo 1 > cgroup.clone_children
rr:/sys/fs/cgroup/cpuset$ mkdir group1
rr:/sys/fs/cgroup/cpuset$ echo 1-2 > group1/cpuset.cpus
rr:/sys/fs/cgroup/cpuset$ cat group1/cpuset.cpus
1-2
rr:/sys/fs/cgroup/cpuset$ mkdir group1/nested1
rr:/sys/fs/cgroup/cpuset$ cat group1/nested1/cpuset.cpus
1-2
Notes:
Without echo 1 > cgroup.clone_children, this propagation won't work.
The clone_children entry is effective only with the cpuset controller.
Example 3 – cgroup v1 net_prio
s:/$ mkdir /sys/fs/cgroup/net_prio/group1
s:/$ echo "eth0 4" > /sys/fs/cgroup/net_prio/group1/net_prio.ifpriomap
This sets the netprio_map object of the eth0 net_device.
This will set the priority of outgoing (egress) traffic, i.e. of packets (skbs) of processes attached to group1, to be 4.
This is implemented by the skb_update_prio() method:
http://lxr.free-electrons.com/source/net/core/dev.c#L2926
This is done prior to queuing the packet with the qdisc (queuing discipline).
Setting the priority of a socket can also be done by setting the SO_PRIORITY socket option, but this option is not always available.
Example 4: Delegation Containment
cgroup v1:
Run as root:
mkdir -p /sys/fs/cgroup/devices/group1/nested1
su user1
echo $$ > /sys/fs/cgroup/devices/group1/nested1/cgroup.procs
You will get -EPERM.
But after you set access permissions on nested1 by running as root:
chown -R user1:user1 /sys/fs/cgroup/devices/group1/nested1/
running it will succeed.
Example 4: delegation containment – v2
In order to support delegation, three conditions should be met:
• The writer's euid must match either the uid or the suid of the target process.
• The writer must have write access to the "cgroup.procs" file of the destination cgroup.
• The writer must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups.
Run as root:
echo $$
4767
mkdir -p /cgroup2/group1/nested1
chown -R user1:user1 /cgroup2/group1
echo $$ > /cgroup2/group1/cgroup.procs
su user1
echo 4767 > /cgroup2/group1/nested1/cgroup.procs
Getting info about cgroups
How do I know which cgroup a process belongs to?
cat /proc/$PID/cgroup shows this info.
• The entry for cgroup v2 is always in the format "0::$PATH".
• So, for example, if we created a cgroup named group1 and attached a task with PID 1000 to it, then running:
cat /proc/1000/cgroup
0::/group1
And for a nested group:
cat /proc/869/cgroup
0::/group1/nested
/proc/cgroups shows info on both cgroup v1 and cgroup v2 controllers. The hierarchy_id for cgroup v2 controllers is 0.

#subsys_name  hierarchy  num_cgroups  enabled
cpuset        2          1            1
cpu           3          1            1
cpuacct       3          1            1
blkio         0          1            1
memory        0          1            1
devices       6          61           1
freezer       7          1            1
net_cls       8          1            1
perf_event    9          1            1
net_prio      8          1            1
hugetlb       10         1            1
pids          0          1            1
cgroup v2 control files (interface files)
For the memory controller:
• memory.current
• memory.events
• memory.high
• memory.low
• memory.max
For the io controller:
• io.max
• io.weight
Note: work is currently being done by Vladimir Davydov, one of the memory resource controller (memcg) maintainers, on adding cgroup v2 kmem accounting, so entries like memory.swap.current and memory.swap.max are coming soon.
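A brief sketch, assuming the memory controller was enabled for group1 via subtree_control as shown earlier, of using the v2 memory interface files:
echo 100M > /cgroup2/group1/memory.high   # best-effort (soft) limit
echo 200M > /cgroup2/group1/memory.max    # absolute (hard) limit
cat /cgroup2/group1/memory.current        # current usage in bytes
cat /cgroup2/group1/memory.events         # low/high/max/oom event counters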
• A minor advantage of containers: there is a single Linux kernel infrastructure for containers (namespaces and cgroups), whereas for Xen and KVM there are two different implementations without any common code.
Back to the match-by-path example: now run the following iptables rule (from anywhere on the system):
iptables -A OUTPUT -m cgroup --path test -j LOG
Then ping anywhere from the shell that is now a process in "test". The socket created has a pointer to the "test" v2 cgroup, and the iptables match rule is for "test" (--path test), so the packets will be dumped to syslog.
• Running this test from any subgroup of "test" will have the same result.
• The sum of the allocations of the immediate subgroups cannot exceed the amount of resources available to the parent. So, for example:
If in /cgroup2/group1, pids.max = 4, and in /cgroup2/group1/nested1 and /cgroup2/group1/nested2 there are 4 processes altogether, we cannot fork a fifth process in either of them.
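A quick sketch of this hierarchical limit, assuming the pids controller was enabled for group1 and its children via subtree_control as in the earlier example:
echo 4 > /cgroup2/group1/pids.max
echo max > /cgroup2/group1/nested1/pids.max   # no local limit in the children
echo max > /cgroup2/group1/nested2/pids.max
cat /cgroup2/group1/pids.current   # total number of processes in group1's subtree
# once pids.current reaches 4, any further fork in nested1 or nested2 fails with
# "fork: Resource temporarily unavailable"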
Script for connecting two namespaces
The following script shows how to connect two network namespaces with veth (virtual Ethernet) device pairs and a bridge, so that you can ping and send traffic between them:
ip netns add netA
ip netns add netB
ip link add name vm1-eth0 type veth peer name vm1-eth0.1
ip link add name vm2-eth0 type veth peer name vm2-eth0.1
ip link set vm1-eth0.1 netns netA
ip link set vm2-eth0.1 netns netB
ip netns exec netA ip l set lo up
ip netns exec netA ip l set vm1-eth0.1 up
ip netns exec netB ip l set lo up
ip netns exec netB ip l set vm2-eth0.1 up
ip netns exec netA ip a add 192.168.0.10 dev vm1-eth0.1
ip netns exec netB ip a add 192.168.0.20 dev vm2-eth0.1
ip netns exec netA ip r add 192.168.0.0/24 dev vm1-eth0.1
ip netns exec netB ip r add 192.168.0.0/24 dev vm2-eth0.1
brctl addbr mybr
ip l set mybr up
ip l set vm1-eth0 up
brctl addif mybr vm1-eth0
ip l set vm2-eth0 up
brctl addif mybr vm2-eth0
Upcoming…
The RDMA cgroup controller – patches were already sent by Parav Pandit:
https://lwn.net/Articles/674161/
The RDMA cgroup will support both v1 and v2.