CAT @ Scale
Deploying cache isolation in a mixed-workload environment
Rohit Jnagal jnagal@google
David Lo davidlo@google
Borg: Google cluster manager
● Admits, schedules, starts, restarts, and monitors the full range of applications that Google runs.
● Mixed-workload system with two tiers:
○ latency sensitive (front-end tasks)
○ latency tolerant (batch tasks)
● Uses containers/cgroups to isolate applications.
Borg: Efficiency with multiple tiers
Large Scale Cluster Management at Google with Borg
Isolation in Borg
Borg: CPU isolation for latency-sensitive (LS) tasks
● The Linux Completely Fair Scheduler (CFS) is a throughput-oriented scheduler with no support for differentiated latency
● Google-specific extensions for low-latency scheduling response
● Enforce strict priority for LS tasks over batch workloads
○ LS tasks always preempt batch tasks
○ Batch never preempts latency-sensitive on wakeup
○ Bounded execution time for batch tasks
● Batch tasks treated as minimum weight entities
○ Further tuning to ensure aggressive distribution of batch tasks over available cores
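The "minimum weight" idea above can be sketched with the standard cgroup-v1 CPU controller. This is an illustrative helper, not Borg's actual mechanism; the cgroup root is a parameter so the layout can be exercised anywhere.

```python
from pathlib import Path

# CFS treats cpu.shares as a relative weight: 2 is the kernel floor,
# 1024 the default. Giving batch cgroups the floor means LS tasks get
# essentially all the CPU whenever both tiers are runnable.
MIN_CFS_SHARES = 2
DEFAULT_CFS_SHARES = 1024

def set_tier_weight(cgroup_root, task, tier):
    """Write cpu.shares for a task's cgroup based on its tier."""
    shares = MIN_CFS_SHARES if tier == "batch" else DEFAULT_CFS_SHARES
    cg = Path(cgroup_root) / "cpu" / task
    cg.mkdir(parents=True, exist_ok=True)
    (cg / "cpu.shares").write_text(str(shares))
    return shares
```

On a real machine `cgroup_root` would be `/sys/fs/cgroup`; the "aggressive distribution" tuning mentioned above is Google-internal and not modeled here.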
Borg: NUMA Locality
Good NUMA locality can have a significant performance impact (10-20%)*
Borg isolates LS tasks to a single socket, when possible
Batch tasks are allowed to run on all sockets for better throughput
* The NUMA experience
Borg: Enforcing locality for performance
Borg isolates LS tasks to a single socket, when possible
Batch tasks are allowed to run on all sockets for better throughput
[Figure: affinity masks for tasks on a machine. LS1, LS2, and LS3 are confined to Socket 0 (cores 0-7); the Batch mask spans both Socket 0 and Socket 1 (cores 0-15).]
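The masks in the figure can be expressed directly as CPU sets. A minimal sketch, assuming the two-socket, 16-core topology shown; the helper name is ours:

```python
# Assumed topology from the slide: two sockets, eight cores each.
SOCKET0 = frozenset(range(0, 8))    # cores 0-7
SOCKET1 = frozenset(range(8, 16))   # cores 8-15

def affinity_for(tier, ls_socket=SOCKET0):
    """LS tasks are confined to a single socket; batch may run anywhere."""
    return set(ls_socket) if tier == "ls" else set(SOCKET0 | SOCKET1)

# On Linux the mask would be applied with, e.g.:
#   os.sched_setaffinity(pid, affinity_for("ls"))
```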
Borg: Dealing with LS-LS interference
Use reserved CPU sets to limit interference for highly sensitive jobs
○ Better wakeup latencies
○ Still allows batch workloads as they have minimum weight and always yield
[Figure: affinity masks for tasks on a machine, now with LS2 on a reserved CPU set; LS1, LS3, and LS4 use the non-reserved cores, while Batch retains its full mask since it has minimum weight and always yields.]
Borg: Microarchitectural interference
● Use exclusive CPU sets to limit microarchitectural interference
○ Disallow batch tasks from running on cores of an LS task
[Figure: affinity masks for tasks on a machine. LS3 now holds exclusive cores, which are removed from the Batch mask; LS2 keeps its reserved set.]
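Combining the reserved and exclusive rules, the batch-allowed mask can be derived as follows. This is a sketch with hypothetical task records, not Borg code:

```python
def batch_allowed(all_cores, ls_tasks):
    """Compute the cores batch tasks may use. Batch has minimum weight,
    so it may still share 'reserved' LS cores (it always yields), but it
    is banned from 'exclusive' LS cores to avoid microarchitectural
    interference."""
    banned = set()
    for task in ls_tasks:
        if task.get("exclusive"):
            banned |= set(task["cores"])
    return set(all_cores) - banned
```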
Borg: Isolation for highly sensitive tasks
● CFS offers low scheduling latency
● NUMA locality provides local memory and cache
● Reserved cores keep LS tasks with comparable weights from interfering
● Exclusive cores keep cache-heavy batch tasks away from L1 and L2 caches
This should be as good as running on non-shared infrastructure!
Co-located Exclusive LS & streaming MR
[Figure: latency of an exclusive LS job (measured with github.com/google/multichase) over time, annotated at the start of the streaming MR. The exclusive job shows great latency until the MR begins.]
Performance for latency-sensitive tasks
At lower utilization, latency-sensitive tasks need more cache protection.
Interference can degrade performance by up to 300% even when all other resources are well isolated.
Mo Cores, Mo Problems
*Heracles
CAT
Resource Director Technology (RDT)
● Monitoring:
○ Cache Monitoring Technology (CMT)
○ Memory Bandwidth Monitoring (MBM)
● Allocation:
○ Cache Allocation Technology (CAT)
■ L2 and L3 Caches
○ Code and Data Prioritization (CDP)
Actively allocate resources to achieve better QoS and performance
Allows general grouping to enable monitoring/allocation for VMs, containers, and arbitrary threads and processes
Introduction to CAT
Cache Allocation Technology (CAT)
● Provides software control to isolate last-level cache access between applications.
● CLOS: class of service corresponding to a cache allocation setting
● CBM: cache capacity bitmask mapping a CLOS id to an allocation mask
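Hardware CBMs must be contiguous runs of set bits, so a small helper can build and sanity-check them. The names and the 20-way L3 example are ours:

```python
def cbm(first_way, n_ways):
    """Build a capacity bitmask covering n_ways contiguous cache ways,
    starting at first_way."""
    if n_ways < 1:
        raise ValueError("a CBM must have at least one way set")
    return ((1 << n_ways) - 1) << first_way

def is_contiguous(mask):
    """Check the CAT contiguity constraint: after stripping trailing
    zeros, the mask must be of the form 2**k - 1."""
    if mask == 0:
        return False
    while mask & 1 == 0:
        mask >>= 1
    return (mask & (mask + 1)) == 0

# e.g. on a 20-way L3: low 10 ways for one CLOS, high 10 for another
# cbm(0, 10) -> 0x3FF; cbm(10, 10) -> 0xFFC00
```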
Setting up CAT
Let’s add CAT to our service ...
Add CAT to the mix
[Figure: the same latency experiment, annotated at the start of the streaming MR and at the point where CAT restricts the MR's cache use to 50%.]
CAT Deployment: Batch Jails
[Figure: the L3 cache split into two regions. The batch jail holds data for batch jobs and is shared between all tasks, including LS; the remaining region holds data for latency-sensitive jobs and is dedicated to LS (only LS can use it).]
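A batch jail is just two overlapping CBMs. A sketch, assuming a 20-way L3 (the function name is ours): LS masks cover everything, while batch is confined to the jail.

```python
def jail_masks(total_ways, jail_ways):
    """Batch tasks share the low 'jail_ways' of the cache; LS tasks may
    use every way, so the region above the jail is effectively LS-only."""
    batch_cbm = (1 << jail_ways) - 1
    ls_cbm = (1 << total_ways) - 1
    return batch_cbm, ls_cbm

# e.g. a 50% jail on a 20-way L3:
# jail_masks(20, 10) -> (0x3FF, 0xFFFFF)
```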
CAT Deployment: Cache cgroup
[Figure: cgroup hierarchy with cache, CPU, and memory controllers under a common root; tasks T1 and T2 appear in each controller.]
● Every app gets its own cgroup
● Set CBM for all batch tasks to the same mask
● Easy to inspect, recover
● Easy to integrate into existing container mechanisms
○ Docker
○ Kubernetes
CAT experiments with YouTube transcoder
CPI is a good measure of cache interference.
[Figure: transcoder CPI (lower is better) as antagonist cache occupancy grows from 0% to 100% of L3.]
Production rollout
Impact of batch jails
[Figure: LS tasks' average CPI comparison (lower is better, +0% baseline); gains are higher for a smaller jail.]
Batch jails deployment
[Figures: comparison of LS tasks' CPI PDF and CPI percentiles (+0% baseline). Batch jailing shifts CPI lower, with higher benefits of CAT for tail tasks.]
Batch jails deployment
[Figure: batch tasks' average CPI comparison (lower is better, +0% baseline); smaller jails lead to higher impact on batch jobs.]
The Downside: Increased memory pressure
Jailing the LLC increases DRAM bandwidth pressure for batch.
[Figure: system memory bandwidth over time; a spike appears when a bandwidth-hungry batch job starts and subsides when it stops.]
Controlling memory bandwidth impact
Intel RDT: CMT (Cache Monitoring Technology)
- Monitor and profile cache usage patterns for all applications
Intel RDT: MBM (Memory Bandwidth Monitoring)
- Monitor memory bandwidth usage per application
Controls:
- CPU throttling
- Scheduling
- Newer platforms will provide per-application control of memory bandwidth
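Until per-application bandwidth control arrives, the "CPU throttling" knob can be driven by MBM readings. A toy control step; the thresholds, factors, and names are ours:

```python
def throttle_step(bw_mbps, limit_mbps, quota, floor=0.1, factor=0.8):
    """Shrink a task's relative CPU quota while its measured memory
    bandwidth (as MBM would report it) exceeds the limit; recover the
    quota toward 1.0 otherwise."""
    if bw_mbps > limit_mbps:
        return max(floor, quota * factor)
    return min(1.0, quota / factor)
```

A real controller would smooth the bandwidth signal and apply the quota via the scheduler; this only shows the decision logic.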
Controlling infrastructure processes
Many system daemons tend to periodically thrash caches
- None of them are latency sensitive
- Stable behavior, easy to identify
Jailing for daemons!
- Requires the ability to restrict kernel threads to a mask
What about the noisy neighbors?
Noisy neighbors hurting performance (Intel RDT)
● Use CMT to detect; CAT to control
● Integrated into CPI² signals
○ CPI² was built for noisy-neighbor detection
○ Dynamically throttle noisy tasks
○ Possibly tag for scheduling hints
[Figure: CPI² pipeline with an observer, master, and nodes exchanging CPI samples and a CPI spec.]
CMT issues with cgroups
● Usage model: many, many cgroups, but we can't run perf on all of them all the time
○ Run perf periodically on a sample of cgroups
○ Use the same RMID for a bunch of cgroups
○ Rotate cgroups out every sampling period
● HW counts cache allocations minus deallocations, not occupancy:
○ Cache lines allocated before perf runs are not accounted for
○ Can get nonsensical results, even zero cache occupancy
○ The workaround requires running perf for the lifetime of the monitored cgroup
○ Unacceptable context-switch overhead
● David Carrillo-Cisneros & Stephane Eranian are working on a newer version of CMT support with perf
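The rotation scheme in the first bullet can be simulated in a few lines. A sketch of the idea only, not the kernel implementation:

```python
from itertools import cycle

def rmid_schedule(cgroups, rmids, periods):
    """Round-robin a scarce RMID pool over many cgroups: in each sampling
    period the next batch of cgroups is monitored while the rest wait."""
    nxt = cycle(cgroups)
    return [{rmid: next(nxt) for rmid in rmids} for _ in range(periods)]
```

With 4 cgroups and 2 RMIDs, every cgroup gets monitored once every two periods, which is exactly the sampling trade-off the slide describes.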
CAT implementation
Cache Cgroup
[Figure: cgroup hierarchy with cache, CPU, and memory controllers under a common root; tasks T1 and T2 appear in each controller.]
● Every app gets its own cgroup
● Set CBM for all batch tasks to the same mask
● Easy to inspect, recover
● Easy to integrate into existing container mechanisms
○ Docker
○ Kubernetes
● Issues with the patch:
○ Per-socket masks
○ Not a good fit?
○ Thread-based isolation vs cgroup v2
New patch: rscctrl interface
● Patches by Fenghua Yu at Intel
○ Mounted under /sys/fs/rscctrl
○ Currently used for L2 and L3 cache masks
○ Create a new grouping with mkdir /sys/fs/rscctrl/LS1
○ Files under /sys/fs/rscctrl/LS1:
■ tasks: threads in the group
■ cpus: CPUs to control with the settings in this group
■ schemas: write L2 and L3 CBMs to this file
● Aligns better with the hardware capabilities provided
● Gives finer control without worrying about cgroup restrictions
● Gives control over kernel threads as well as user threads
● Allows resource allocation policies to be tied to certain CPUs across all contexts
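The workflow above maps onto a handful of file writes. A sketch with the mount point as a parameter so the layout can be exercised anywhere; the schema string format shown is illustrative:

```python
from pathlib import Path

def make_group(root, name, schema, pids=(), cpus=""):
    """Create an rscctrl-style group: mkdir the group directory, then
    populate its schemas, tasks, and cpus files."""
    group = Path(root) / name
    group.mkdir(parents=True, exist_ok=True)
    (group / "schemas").write_text(schema + "\n")
    (group / "tasks").write_text("\n".join(str(p) for p in pids) + "\n")
    (group / "cpus").write_text(cpus + "\n")
    return group

# On a real system:
#   make_group("/sys/fs/rscctrl", "LS1", "L3:0=fffff", [1234])
```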
Current kernel patch progress
David Carrillo-Cisneros, Fenghua Yu, Vikas Shivappa, and others at Intel are working on improving CMT and MBM support for cgroups.
Changes support cgroup monitoring, as opposed to the attach-to-a-process-forever model.
Challenges being faced:
● Sampled collections
● Not enough RMIDs to go around
○ Use per-package allocation of RMIDs
○ Reserved RMIDs (do not rotate)
Takeaways
● With larger machines, isolation between workloads is more important than ever.
● RDT extensions work well at scale:
○ Easy to set up static policies.
○ Lots of flexibility.
● CAT is only one of the first isolation/monitoring features.
○ Avoid ad-hoc solutions
● At Google, we ♥ cgroups and containers:
○ Rolled out cgroup-based CAT support to the fleet.
● Let's get the right abstractions in place.
If you are interested, talk to us here or find us online: jnagal, davidlo, davidcc, eranian @google
Thanks!
Join our Microservices Customer Roundtable†
● Friday 8am - 1pm @ Google's Toronto office
● Hear real-life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA

Cat @ scale

  • 1. CAT @ Scale Deploying cache isolation in a mixed-workload environment Rohit Jnagal jnagal@google David Lo davidlo@google
  • 2. David Rohit Borg : Google cluster manager ● Admits, schedules, starts, restarts, and monitors the full range of applications that Google runs. ● Mixed workload system - two tiered : latency sensitive ( front-end tasks) : latency tolerant (batch tasks) ● Uses containers/cgroups to isolate applications.
  • 3. Borg: Efficiency with multiple tiers Large Scale Cluster Management at Google with Borg
  • 5. Borg: CPU isolation for latency-sensitive (LS) tasks ● Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; no support for differentiated latency ● Google-specific extensions for low-latency scheduling response ● Enforce strict priority for LS tasks over batch workloads ○ LS tasks always preempt batch tasks ○ Batch never preempts latency-sensitive on wakeup ○ Bounded execution time for batch tasks ● Batch tasks treated as minimum weight entities ○ Further tuning to ensure aggressive distribution of batch tasks over available cores
  • 6. Borg : NUMA Locality Good NUMA locality can have a significant performance impact (10-20%)* Borg isolates LS tasks to a single socket, when possible Batch tasks are allowed to run on all sockets for better throughput * The NUMA experience
  • 7. Borg : Enforcing locality for performance Borg isolates LS tasks to a single socket, when possible Batch tasks are allowed to run on all sockets for better throughput LS1 LS2 LS3 Batch Affinity masks for tasks on a machine 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 Socket 0 Socket 1 8 9 10 11 12 13 14 15
  • 8. Borg : Dealing with LS-LS interference Use reserved CPU sets to limit interference for highly sensitive jobs ○ Better wakeup latencies ○ Still allows batch workloads as they have minimum weight and always yield Socket 0 Affinity masks for tasks on a machine 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 Socket 1 LS1 LS2 (reserved) LS3 Batch 8 9 10 11 12 13 14 15LS4
  • 9. Borg : Micro-architectural interference ● Use exclusive CPU sets to limit microarchitectural interference ○ Disallow batch tasks from running on cores of an LS task 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 1 2 3 4 5 6 7 Socket 0 Socket 1 LS1 LS2 (reserved) LS3 (exclusive) LS4 Batch Affinity masks for tasks on a machine 8 9 10 11 12 13 14 15
  • 10. Borg : Isolation for highly sensitive tasks ● CFS offers low scheduling latency ● NUMA locality provides local memory and cache ● Reserved cores keep LS tasks with comparable weights from interfering ● Exclusive cores keep cache-heavy batch tasks away from L1, L2 caches This should be as good as running on a non-shared infrastructure!
  • 11. Co-located Exclusive LS & streaming MR Start of streaming MR github.com/google/multichase Exclusive job with great latency
  • 12. Performance for latency sensitive tasks At lower utilization, latency sensitive tasks need more cache protection. Interference can degrade performance up to 300% even when all other resources are well isolated. Mo Cores, Mo Problems *Heracles
  • 13. CAT
  • 14. Resource Director Technology (RDT) ● Monitoring: ○ Cache Monitoring Technology (CMT) ○ Memory Bandwidth Monitoring (MBM) ● Allocation: ○ Cache Allocation Technology (CAT) ■ L2 and L3 Caches ○ Code and Data Prioritization (CDP) Actively allocate resources to achieve better QoS and performance Allows general grouping to enable monitoring/allocation for VMs, containers, and arbitrary threads and processes Introduction to CAT
  • 15. Cache Allocation Technology (CAT) ● Provides software control to isolate last-level cache access between applications. ● CLOS: Class of service corresponding to a cache allocation setting ● CBM: Cache Capacity Bitmasks to map a CLOS id to an allocation mask Introduction to CAT
  • 17. Let’s add CAT to our service ...
  • 18. Add CAT to the mix Start of streaming MR Restricting MR cache use to 50%
  • 19. CAT Deployment: Batch Jails Data for batch jobs Data for latency sensitive jobs Batch jail (shared between all tasks, including LS) Dedicated for latency sensitive (only LS can use)
  • 20. CAT Deployment: Cache cgroup Cache T1 T2 CPU T1 T2 Memory T1 T2 Cgroup Root ● Every app gets its own cgroup ● Set CBM for all batch tasks to same mask ● Easy to inspect, recover ● Easy to integrate into existing container mechanisms ○ Docker ○ Kubernetes
  • 21. CAT experiments with YouTube transcoder
  • 22. CAT experiments with YouTube CPI is a good measure of cache interference (lower is better) [Chart: CPI vs. antagonist cache occupancy, from 0% to 100% of L3]
  • 24. Impact of batch jails [Chart: LS tasks' average CPI comparison against a +0% baseline (lower is better); higher gains for smaller jails]
  • 25. Batch jails deployment [Chart: comparison of LS tasks' CPI PDF and percentiles against a +0% baseline; batch jailing shifts CPI lower, with higher benefit from CAT for tail tasks]
  • 26. Batch jails deployment [Chart: batch tasks' average CPI comparison against a +0% baseline (lower is better); smaller jails lead to higher impact on batch jobs]
  • 27. The Downside: Increased memory pressure Jailing the LLC increases DRAM bandwidth pressure from batch tasks [Chart: system memory bandwidth over time; a bandwidth spike when a BW-hungry batch job starts, subsiding when it stops]
  • 28. Controlling memory bandwidth impact Intel RDT: CMT (Cache Monitoring Technology) - Monitor and profile cache usage pattern for all applications Intel RDT: MBM (Memory Bandwidth Monitoring) - Monitor memory bandwidth usage per application Controls: - CPU throttling - Scheduling - Newer platforms will provide control for memory bandwidth per application
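Until per-application bandwidth controls land in hardware, the "CPU throttling" control amounts to a feedback loop: watch batch bandwidth (e.g. via MBM) and shrink the batch CPU quota when it blows past a budget. The sketch below is hypothetical; `read_bw_mbps` and `set_cpu_quota` stand in for real MBM reads and cgroup `cpu.cfs_quota_us` writes, and the thresholds are invented.

```python
def bandwidth_controller(read_bw_mbps, set_cpu_quota,
                         budget_mbps=20000, full_quota=100, min_quota=10):
    """Return a tick() function that halves the batch CPU quota while
    measured bandwidth exceeds the budget, and restores it slowly."""
    quota = full_quota
    def tick():
        nonlocal quota
        if read_bw_mbps() > budget_mbps:
            quota = max(min_quota, quota // 2)   # back off aggressively
        else:
            quota = min(full_quota, quota + 10)  # recover slowly
        set_cpu_quota(quota)
        return quota
    return tick

# Drive the loop with canned bandwidth samples to show the behavior.
samples = iter([25000, 30000, 15000, 15000])
applied = []
tick = bandwidth_controller(lambda: next(samples), applied.append)
for _ in range(4):
    tick()
print(applied)  # [50, 25, 35, 45]
```

The asymmetric back-off (halve fast, recover slowly) is a common choice so a bandwidth-hungry batch job cannot oscillate the system at full amplitude.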
  • 29. Controlling infrastructure processes Many system daemons tend to periodically thrash caches - None of them are latency sensitive - Stable behavior, easy to identify Jailing for daemons! - Requires ability to restrict kernel threads to a mask
  • 30. What about the noisy neighbors? Noisy neighbors hurting performance (Intel RDT) ● Use CMT to detect; CAT to control ● Integrated into CPI2 signals ○ CPI2 built for noisy-neighbor detection ○ Dynamically throttle noisy tasks ○ Possibly tag for scheduling hints [Diagram: nodes stream CPI samples to an observer; the master pushes a CPI spec back out]
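The heart of a CPI2-style signal is an outlier check: compare each task's measured CPI against the job's expected CPI distribution (the "CPI spec") and flag tasks that deviate too far. The sketch below is a heavily simplified, hypothetical stand-in for that check, not the production algorithm.

```python
from statistics import mean

def cpi_outliers(samples, spec_mean, spec_stdev, k=2.0):
    """Flag tasks whose average CPI exceeds the job's CPI spec mean by
    more than k standard deviations -- a rough outlier test."""
    flagged = []
    for task, cpis in samples.items():
        if mean(cpis) > spec_mean + k * spec_stdev:
            flagged.append(task)
    return flagged

samples = {
    "ls-task-1": [1.1, 1.0, 1.2],
    "ls-task-2": [2.9, 3.1, 3.0],   # suffering from a noisy neighbor
}
print(cpi_outliers(samples, spec_mean=1.0, spec_stdev=0.3))  # ['ls-task-2']
```

A flagged task is the victim; CMT occupancy data then identifies which co-runner is the antagonist to throttle with CAT.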
  • 31. CMT issues with cgroups ● Usage model: many, many cgroups, but can't run perf on all of them all the time ○ Run perf periodically on a sample of cgroups ○ Use the same RMID for a bunch of cgroups ○ Rotate cgroups out every sampling period ● HW counts cache allocations minus deallocations, not occupancy: ○ Cache lines allocated before perf runs are not accounted ○ Can get nonsensical results, even zero cache occupancy ○ Work-around requires running perf for the lifetime of the monitored cgroup ○ Unacceptable context-switch overhead ● David Carrillo-Cisneros & Stephane Eranian working on a newer version of CMT support with perf
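The RMID-rotation idea in the first bullet can be sketched as a round-robin schedule: a small RMID pool is cycled over a much larger set of cgroups, with a fresh batch monitored each sampling period. This is a toy model of the policy, not kernel code.

```python
from collections import deque

def rmid_rotation(cgroups, n_rmids, periods):
    """Round-robin a small RMID pool over many cgroups. Each sampling
    period, the cgroups just monitored rotate to the back of the queue.
    Returns the list of cgroups monitored in each period."""
    waiting = deque(cgroups)
    schedule = []
    for _ in range(periods):
        monitored = [waiting[i % len(waiting)] for i in range(n_rmids)]
        waiting.rotate(-n_rmids)   # recycle these RMIDs to the next cgroups
        schedule.append(monitored)
    return schedule

cgs = ["cg0", "cg1", "cg2", "cg3", "cg4"]
print(rmid_rotation(cgs, n_rmids=2, periods=3))
# [['cg0', 'cg1'], ['cg2', 'cg3'], ['cg4', 'cg0']]
```

The occupancy caveat from the slide applies directly: a cgroup only gets credited for lines allocated while it holds an RMID, so each rotation restarts the count from zero.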
  • 33. Cache Cgroup [Diagram: cgroup hierarchy with cache, CPU, and memory controllers under the cgroup root, each holding tasks T1 and T2] ● Every app gets its own cgroup ● Set CBM for all batch tasks to same mask ● Easy to inspect, recover ● Easy to integrate into existing container mechanisms ○ Docker ○ Kubernetes ● Issues with the patch: ○ Per-socket masks ○ Not a good fit? ○ Thread-based isolation vs cgroup v2
  • 34. New patch: rscctrl interface ● Patches by Intel from Fenghua Yu ○ Mounted under /sys/fs/rscctrl ○ Currently used for L2 and L3 cache masks ○ Create new grouping with mkdir /sys/fs/rscctrl/LS1 ○ Files under /sys/fs/rscctrl/LS1: ■ tasks: threads in the group ■ cpus: cpus to control with the setting in this group ■ schemas: write L2 and L3 CBMs to this file ● Aligns better with the h/w capabilities provided ● Gives finer control without worrying about cgroup restrictions ● Gives control over kernel threads as well as user threads ● Allows resource allocation policies to be tied to certain cpus across all contexts
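The file layout described above can be sketched as a small helper that creates a group, assigns threads, and writes a schema. To keep the sketch runnable off-hardware, it takes a root directory argument standing in for the real /sys/fs/rscctrl mount; the schema line syntax is illustrative, matching the slide's description rather than a verified kernel ABI.

```python
import os
import tempfile

def create_rsc_group(root, name, tids, l3_cbm):
    """Create a rscctrl-style group: mkdir the group directory, assign
    threads via its `tasks` file, and write an L3 CBM to `schemas`.
    `root` stands in for /sys/fs/rscctrl so this runs anywhere."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write("\n".join(str(t) for t in tids))
    with open(os.path.join(path, "schemas"), "w") as f:
        f.write(f"L3:0={l3_cbm:x}\n")   # schema syntax is illustrative
    return path

# Exercise the sketch against a scratch directory.
root = tempfile.mkdtemp()
create_rsc_group(root, "LS1", [1234, 1235], 0xfffff)
print(sorted(os.listdir(os.path.join(root, "LS1"))))  # ['schemas', 'tasks']
```

Against the real mount, the kernel itself populates `tasks`, `cpus`, and `schemas` on mkdir; only the writes would remain.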
  • 35. Current Kernel patch progress David Carrillo-Cisneros, Fenghua Yu, Vikas Shivappa, and others at Intel working on improving CMT and MBM support for cgroups Changes to support cgroup monitoring as opposed to attach to process forever model Challenges that are being faced: ● Sampled collections ● Not enough RMIDs to go around ○ Use per-package allocation of RMIDs ○ Reserved RMIDs (do not rotate)
  • 36. Takeaways ● With larger machines, isolation between workloads is more important than ever. ● RDT extensions work really well at scale: ○ Easy to set up static policies. ○ Lots of flexibility. ● CAT is only one of the first isolation/monitoring features. ○ Avoid ad-hoc solutions ● At Google, we ♥ cgroups and containers: ○ Rolled out cgroup-based CAT support to the fleet. ● Let's get the right abstractions in place. If you are interested, talk to us here or find us online: jnagal davidlo davidcc eranian @google Thanks!
  • 37. ● Friday 8am - 1pm @ Google's Toronto office ● Hear real life experiences of two companies using GKE ● Share war stories with your peers ● Learn about future plans for microservice management from Google ● Help shape our roadmap g.co/microservicesroundtable † Must be able to sign digital NDA Join our Microservices Customer Roundtable