Managing Memory Bandwidth Antagonism @ Scale
David Lo, Dragos Sbirlea, Rohit Jnagal
Borg Model
● Large clusters with multi-tenant hosts.
● Run a mix of:
○ high- and low-priority workloads.
○ latency-sensitive and batch workloads.
● Isolation through bare-metal containers (cgroups/namespaces).
○ Cgroups and perf to monitor host and job performance.
○ Cgroups and h/w controls to manage on-node performance.
○ Cluster scheduling and balancing manage service performance.
[Figure: efficiency, availability, and performance as the three goals in tension.]
The Memory Bandwidth Problem
● Large variation in performance on multi-tenant hosts.
● On average, saturation events are few, but:
○ they periodically cause significant cluster-wide performance degradation.
● Some workloads are much more seriously affected than others.
○ Impact does not necessarily correlate with the victim's memory bandwidth use.
[Figure: latency over time, spiking after an antagonist task starts.]
Note: This talk is focused on the membw problem for general servers and does not cover GPUs and other special devices. Similar techniques apply there too.
Memory BW Saturation is Increasing Over Time
[Figure: fraction of machines that experienced mem BW saturation, rising from Jan 2018 to Nov 2018.]
Why It Is a (Bigger) Problem Now
● Large machines need to pack more jobs to maintain utilization, resulting in more "noisy neighbor" problems.
● ML workloads are memory BW intensive.
Understanding the Scope : Socket-Level Monitoring
● Track per-socket local and remote memory bandwidth use.
● Identify per-platform thresholds for performance dips (saturation).
● Characterize saturation by platform and cluster.
[Figure: two sockets, each with local and remote read/write traffic to memory.]
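As an illustrative sketch (not the talk's implementation), socket-level DRAM traffic can be sampled by shelling out to perf and reading uncore IMC CAS counters. This assumes a Linux machine whose kernel exposes Intel uncore_imc cas_count events; event names, availability, and reporting units all vary by platform and kernel, so treat the numbers as approximate.

```python
# Hedged sketch: machine-wide DRAM traffic from Intel uncore IMC CAS counters.
# Assumes `perf` is installed and uncore_imc/cas_count_*/ events exist
# (platform-dependent); perf pre-scales these counts to MiB on many kernels.
import subprocess

def dram_traffic_mib(interval_s: float = 1.0) -> float:
    """Approximate MiB of DRAM read+write traffic over one interval."""
    cmd = [
        "perf", "stat", "-a", "-x", ",",
        "-e", "uncore_imc/cas_count_read/,uncore_imc/cas_count_write/",
        "sleep", str(interval_s),
    ]
    # perf writes its counter report to stderr.
    result = subprocess.run(cmd, capture_output=True, text=True)
    total = 0.0
    for line in result.stderr.splitlines():
        if "cas_count" in line:
            value = line.split(",")[0]
            # Skip "<not supported>"/"<not counted>" rows on machines
            # without these uncore events.
            if value.replace(".", "", 1).isdigit():
                total += float(value)
    return total

if __name__ == "__main__":
    print(f"~{dram_traffic_mib():.0f} MiB of DRAM traffic in 1s (all sockets)")
```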
Platform and Cluster Variation
Saturation behavior varies with platform and cluster, due to:
● hardware differences (membw/core ratio)
● workload (large CPU consumers run on bigger platforms)
[Figures: saturation broken down by platform and by cluster.]
Monitoring Sockets ↣ Monitoring Tasks
● Socket-level information gives the magnitude of the problem and the hot-spots.
● Need task-level information to identify:
○ Abusers : tasks using a disproportionate amount of bandwidth.
○ Victims : tasks seeing a performance drop.
● New platforms provide task-level memory bandwidth monitoring, but:
○ the RDT cgroup was on its way out;
○ we have no data on older platforms.
For our purposes, a rough attribution of memory bandwidth was good enough.
[Figure: total memory bandwidth against the saturation threshold, with a per-task memory BW breakdown.]
Per-task Memory Bandwidth Estimation
● Summary of requirements:
○ Local and remote bandwidth breakdown.
○ Compatible with the cgroup model.
● What's available in hardware?
○ Uncore counters (IMC, CHA)
■ Difficult to attribute to a HyperThread => cgroup.
○ CPU PMU counters
■ Counters are HyperThread-local.
■ Work with cgroup profiling mode.
[Figure: DDR DIMMs behind the IMC, with CPU cores (HT0/HT1) and CHAs on the path to memory.]
Which CPU Perfmon to Use?
● OFFCORE_RESPONSE for Intel CPUs (Intel SDM Vol 3).
● Programmable filter to specify events of interest (e.g. DRAM local and DRAM remote).
● Captures both demand-load and HW-prefetcher traffic.
● Online documentation of the meaning of bits, per CPU (download.01.org).
● How to interpret: cache lines/sec × 64 bytes/cache line = BW.
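To make that interpretation concrete, here is a minimal sketch of the conversion: a count of cache lines transferred over an interval becomes bytes per second. Collecting the per-cgroup OFFCORE_RESPONSE counts themselves (via perf in cgroup profiling mode) is elided.

```python
# Minimal sketch: cache-line counts -> bandwidth, per the formula above.
CACHE_LINE_BYTES = 64  # x86 cache lines are 64 bytes

def bandwidth_bytes_per_sec(cache_lines: int, interval_s: float) -> float:
    """Cache lines transferred during the interval -> bytes/sec."""
    return cache_lines * CACHE_LINE_BYTES / interval_s

# Example: 30M local-DRAM lines observed in 1s ~= 1.9 GB/s of local bandwidth.
print(bandwidth_bytes_per_sec(30_000_000, 1.0))  # 1920000000.0
```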
Insights from Task Measurement
Abuser insights
● A large percentage of the time, a single consumer uses up most of the bandwidth.
● That consumer's share of CPU is much lower than its share of membw.
Victim insights
● Many jobs are sensitive to membw saturation.
● Jobs are sensitive even though they are not big users of membw.
Guidance on enforcement options
● How much saturation would we avoid if we did X?
● Which jobs would get caught in the crossfire?
[Figures: distribution of per-job CPI degradation on saturation (as a fraction), and combinations of jobs (by CPU requirements) present during saturation.]
Enforcement : Controlling Different Workloads

Priority | Moderate BW usage | Heavy BW usage
High     | Isolate           | Disable
Medium   | Isolate           | Reactive rescheduling
Low      | Throttle          | Throttle
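As a sketch, the matrix translates directly into a policy lookup. The exact cell assignments below follow the table as reconstructed above and should be treated as illustrative, not as the production policy.

```python
# Illustrative policy lookup mirroring the enforcement matrix above.
# The (priority, usage) -> action cells are an assumption.
ACTIONS = {
    ("high",   "moderate"): "isolate",
    ("high",   "heavy"):    "disable",
    ("medium", "moderate"): "isolate",
    ("medium", "heavy"):    "reactive-reschedule",
    ("low",    "moderate"): "throttle",
    ("low",    "heavy"):    "throttle",
}

def enforcement_action(priority: str, bw_usage: str) -> str:
    return ACTIONS[(priority, bw_usage)]

assert enforcement_action("low", "heavy") == "throttle"
```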
What Can We Do? Node- and Cluster-Level Actuators
Node
Memory Bandwidth Allocation in hardware
Use HW QoS to apply max limits to tasks overusing memory bandwidth.
CPU throttling for indirect control
Limit the CPU access of over-using tasks to indirectly limit the memory bandwidth used.
Cluster
Reactive evictions & re-scheduling
Hosts experiencing memory BW saturation signal the scheduler to redistribute the bigger memory bandwidth users to lightly loaded machines.
Disabling heavy antagonist workloads
Tasks that saturate a socket by themselves cannot be effectively redistributed. If slowing them down is not an option, de-schedule them.
Node : CPU Throttling
+ Very effective in reducing saturation.
+ Works on all platforms.
- Too coarse in granularity.
- Interacts poorly with autoscaling & load balancing.
[Figure: CPUs running memBW over-users are throttled on the saturated Socket 0; Socket 1 is unaffected.]
Throttling - Enforcement Algorithm
1. Every x seconds, the socket memory BW saturation detector reads the socket perf counters.
2. If socket BW > saturation threshold, the cgroup memory BW estimator profiles potentially eligible tasks using socket and cgroup perf counters.
3. A policy filter selects the eligible tasks for throttling.
4. The memory BW enforcer restricts the CPU runnable mask of those tasks.
5. If socket BW < unthrottle threshold, unthrottle the tasks.
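A minimal sketch of this loop follows, with the platform-specific pieces injected as callables. read_socket_bw, estimate_cgroup_bw, throttle, and unthrottle are hypothetical stand-ins for the perf-counter reads and CPU-mask updates, and the policy filter is reduced to "not already throttled" for brevity.

```python
# Hedged sketch of the throttling loop; all injected callables are stand-ins.
import time
from typing import Callable, Dict, Iterable, Set

def throttle_loop(
    sockets: Iterable[int],
    read_socket_bw: Callable[[int], float],                 # socket perf counters
    estimate_cgroup_bw: Callable[[int], Dict[str, float]],  # cgroup -> BW on socket
    throttle: Callable[[str], None],                        # shrink CPU runnable mask
    unthrottle: Callable[[str], None],                      # restore CPU runnable mask
    saturation_bw: float,
    unthrottle_bw: float,   # < saturation_bw: the hysteresis gap avoids flapping
    period_s: float = 5.0,  # "every x seconds"
) -> None:
    throttled: Set[str] = set()
    while True:
        for s in sockets:
            bw = read_socket_bw(s)
            if bw > saturation_bw:
                # Profile per-cgroup usage; throttle the biggest eligible
                # over-user first (real policy filter is richer than this).
                for cg, _ in sorted(estimate_cgroup_bw(s).items(),
                                    key=lambda kv: -kv[1]):
                    if cg not in throttled:
                        throttle(cg)
                        throttled.add(cg)
                        break
            elif bw < unthrottle_bw:
                for cg in list(throttled):
                    unthrottle(cg)
                    throttled.discard(cg)
        time.sleep(period_s)
```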
Node : Memory Bandwidth Allocation
Intel RDT Memory Bandwidth Allocation, supported through resctrl in the kernel (more on that later).
+ Reduces bandwidth without lowering CPU utilization.
+ Somewhat finer-grained than CPU-level controls.
- Newer platforms only.
- Can't isolate well between hyperthreads.
Cluster : Reactive Re-Scheduling
In many cases, there are:
● a low percentage of saturated sockets in the cluster, and
● multiple tasks contributing to saturation.
Re-scheduling the tasks to less loaded machines can avoid slow-downs.
Does not help with large antagonists that can saturate any socket they run on.
[Figure: an observer on a saturated host calls the scheduler for help; the scheduler evicts a task and reschedules it onto a lightly loaded host.]
Handling Cluster-Wide Saturation
Low-priority jobs can be dealt with at the node level through throttling.
If SLOs do not permit throttling and the antagonists cannot be redistributed:
● Disable (kick out of the cluster).
● Users can then reconfigure their service to use a different product.
● Area of continual work.
Alternative:
● Colocate multiple antagonists (that's just working around SLOs).
[Figures: a cluster membw distribution amenable to rescheduling vs. one amenable to job disabling, each shown against the saturation threshold.]
Results : CPU Throttling + Rescheduling
[Results figure.]
Results : Rebalancing
[Results figure.]
resctrl : HW QoS Support in Kernel
● New, unified interface: resctrl.
● resctrl is a big improvement over the previous non-standard cgroup interface.
● Uniform way of monitoring/controlling HW QoS across vendors/architectures:
○ AMD, ARM, Intel.
● (Non-exhaustive) list of HW features supported:
○ Memory BW monitoring
○ Memory BW throttling
○ L3 cache usage monitoring
○ L3 cache partitioning
Intro to HW QoS Terms and Concepts
● Below uses x86 terminology.
● Class of Service ID (CLOSID): maps to a QoS configuration. Typically O(10) unique ones in HW.
● Resource Monitoring ID (RMID): used to tag workloads so that their resource usage can be aggregated. Typically O(100) unique ones in HW.
[Figure: two CLOSIDs — high priority (CLOSID 0: 100% L3 cache, 100% mem BW) and low priority (CLOSID 1: 50% L3 cache, 20% mem BW) — with RMID0 through RMID4 tagging workloads A, B, and C.]
Overview of resctrl Filesystem
resctrl/
|- groupA/                 <- a resource control group; represents one unique HW CLOSID
| |- mon_groups/
| | |- monA/               <- a monitoring group; represents one unique HW RMID
| | | |- mon_data/         <- resource usage data for the monitoring group
| | | |- tasks             <- TIDs in the monitoring group
| | | |- ...
| | |- monB/
| | | |- mon_data/
| | | |- ...
| |- mon_data/             <- resource usage data for the entire resource control group
| |- schemata              <- QoS configuration for the resource control group
| |- tasks                 <- TIDs in the resource control group
| |- ...
|- groupB/
|- ...
Documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
Example Usage of resctrl Interfaces
$ cat groupA/schemata
L3:0=ff;1=ff
MB:0=90;1=90
The L3 line allows 8 cache ways on both sockets; the MB line constrains per-core memory BW to 90% on both sockets.
$ READING0=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ sleep 1
$ READING1=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ echo $((READING1-READING0))
1816234126
Compute memory BW by taking a rate over the interval: in this case, BW ≈ 1.8 GB/s.
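The same rate computation fits in a few lines of Python. The path assumes resctrl is mounted at its conventional /sys/fs/resctrl location and that the monitoring domain of interest is mon_L3_00, as in the shell snippet above.

```python
# Sketch: memory BW from resctrl's MBM counter, mirroring the shell above.
import time

def mbm_bandwidth(group: str, domain: str = "mon_L3_00",
                  interval_s: float = 1.0) -> float:
    """Bytes/sec of memory traffic for a resctrl group, from mbm_total_bytes."""
    path = f"/sys/fs/resctrl/{group}/mon_data/{domain}/mbm_total_bytes"
    def read_bytes() -> int:
        with open(path) as f:
            return int(f.read())
    before = read_bytes()
    time.sleep(interval_s)
    return (read_bytes() - before) / interval_s

# e.g. mbm_bandwidth("groupA") ~= 1.8e9 for the reading on this slide
```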
Reconciling resctrl and cgroups: First Try
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Node SW creates 2 resctrl groups: no_throttle and bw_throttled.
2. On cgroup creation, logically assign cgroupX to no_throttle.
3. Create a mongroup for cgroupX in no_throttle.
4. Start cgroupX.
5. Move TIDs into no_throttle/tasks.
6. Move TIDs into no_throttle/mon_groups/cgroupX/tasks.
7. Move TIDs of a high-BW user into bw_throttled.

resctrl/
|- no_throttle/            <- step 1
| |- mon_groups/
| | |- cgroupX/            <- step 3
| | | |- mon_data/
| | | |- tasks             <- step 6, repeatedly
| | | |- ...
| | |- monB/
| | | |- mon_data/
| | | |- ...
| |- schemata
| |- tasks                 <- step 5, repeatedly
| |- ...
|- bw_throttled/           <- step 1
|- ...
Challenges with Naive Approach
Same use case and steps as on the previous slide.
● Steps 5 and 6: a race in moving TIDs if the cgroup is creating threads; expensive if there are lots of TIDs, and expensive to deal with the race.
● Step 7: desynchronization of L3 cache occupancy data, since existing data is tagged with an old RMID.
A Better Approach for resctrl and cgroups
● What if we had a 1:1 mapping of cgroups to resctrl groups?
○ To change QoS configs, just rewrite schemata.
○ More efficient: removes the need to move TIDs around.
○ Keeps the existing RMID, preventing the L3 occupancy desynchronization issue.
○ 100% compatible with the existing resctrl abstraction.
● CHALLENGE: with the existing system, we would run out of CLOSIDs very quickly.
● SOLUTION: share CLOSIDs between resource control groups with the same schemata.
● Google-developed kernel patch for this functionality to be released soon.
● Demonstrates the need to make the cgroup model a first-class consideration for QoS interfaces.
cgroups and resctrl After the Change
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Create a resctrl group cgroupX.
2. Write a no-throttling configuration to cgroupX/schemata.
3. Start cgroupX.
4. Move TIDs into cgroupX/tasks.
5. Rewrite the schemata of the high-BW-using cgroup to throttle it.

resctrl/
|- cgroupX/                <- step 1
| |- mon_groups/
| | |- mon_data/
| | |- ...
| |- schemata              <- step 2
| |- tasks                 <- step 4, repeatedly
| |- ...
|- high_bw_cgroup/
| |- schemata              <- step 5
| |- ...
|- ...
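A sketch of the node-agent side of this flow, assuming the 1:1 CLOSID-sharing patch is in place so each cgroup gets its own resctrl group. The schemata strings and group names are illustrative, not the production configuration.

```python
# Hedged sketch: per-cgroup resctrl groups under the 1:1 model.
import os

RESCTRL = "/sys/fs/resctrl"
NO_THROTTLE = "MB:0=100;1=100\n"  # illustrative schemata values
THROTTLED = "MB:0=20;1=20\n"

def write(path: str, data: str) -> None:
    with open(path, "w") as f:
        f.write(data)

def create_group(cgroup: str) -> None:
    os.makedirs(f"{RESCTRL}/{cgroup}", exist_ok=True)    # step 1
    write(f"{RESCTRL}/{cgroup}/schemata", NO_THROTTLE)   # step 2

def add_tid(cgroup: str, tid: int) -> None:
    write(f"{RESCTRL}/{cgroup}/tasks", str(tid))         # step 4

def set_throttled(cgroup: str, on: bool) -> None:
    # Step 5: no TID moves and the RMID is kept; just rewrite the config.
    write(f"{RESCTRL}/{cgroup}/schemata", THROTTLED if on else NO_THROTTLE)
```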
µArch Features & Container Runtimes
● Measuring µArch impact is not a first-class component of most container runtimes.
○ Can't manage what we can't see...
● Most container runtimes expose isolation knobs per container.
● Managing µArch isolation requires node- and cluster-level feedback loops.
○ Dual operating mode: admins & users.
○ Performance isolation is not necessarily controllable by end-users.
We would love to contribute to a standard framework around performance management for container runtimes.
[Figure: efficiency, availability, and performance goals, as on the Borg Model slide.]
Takeaways and Future Work
● Memory bandwidth and low-level isolation issues are becoming more significant.
● Continuous monitoring is critical to running successful multi-tenant hosts.
● Defining requirements for h/w providers and s/w interfaces on QoS knobs.
○ Critical to have these solutions work for containers / process-groups.
● Increasing the success rate of the current approach:
○ handling minimum guaranteed membw usage;
○ handling logically related jobs (Borg allocs).
● A general framework would help collaboration.
● Future: memory BW scheduling (based on hints)
○ based on membw usage;
○ based on membw sensitivity.
Find us at the conf or reach out at:
davidlo@, dragoss@, jnagal@, eranian@ (all @google.com)
Thanks!