Hardware-aware thread scheduling: the
case of asymmetric multicore processors
    Achille Peternier*, Danilo Ansaloni, Daniele Bonetta,
             Cesare Pautasso and Walter Binder

                 * achille.peternier@usi.ch
                  http://sosoa.inf.unisi.ch
Introduction

CONTEXT AND OVERALL IDEA


Context
• Modern CPUs increase the computational
  power through additional cores
• HW architectures are becoming increasingly
  more complex
  – Shared caches
  – Non Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units


Context
• Operating System (OS) kernel and scheduler
  try to automatically optimize applications’
  performance according to the available
  resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU
    time, memory usage, etc.)



“Today it is impossible to estimate performance:
you have to measure it. Programming has become
an empirical science.”

 Performance Anxiety: Performance analysis in the new millennium
                                       Joshua Bloch, Google Inc.




Contributions
1) An automated workload analysis technique relying on a
specific set of performance metrics that common OS
schedulers currently do not use


2) A hardware-aware scheduler that makes decisions based
on hardware resource usage and the output of the
workload analysis
       - to improve processing-unit occupancy on
       SMT/asymmetric processors

The big picture
[Figure labels: Monitoring daemon; OS threads and processes; Workload characterization (FPU, INT)]

The big picture



[Figure labels: Hardware-aware scheduler; Workload characterization (FPU, INT)]

Target architecture

AMD BULLDOZER PROCESSOR


AMD Bulldozer
• AMD Bulldozer architecture
  – Each CPU is built from a series of modules (sometimes
    themselves called “cores”), each with two cores (a.k.a.
    “processing units” or “SMT units”)
  – Only the integer Arithmetic-Logic Units (ALUs) are truly
    available per SMT unit; floating-point hardware is
    shared within a module
  – A module is more similar to:
     • A dual core when doing integer ops
     • A single core with SMT=2 when
       doing floating point ops
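
The module layout above can be sketched in a few lines of Python. This is only an illustration: it assumes that logical core IDs are numbered so that cores 2m and 2m+1 belong to module m, which is not stated in the slides (a real tool would query the topology, for example through hwloc). The names `module_of` and `shares_fpu` are hypothetical.

```python
# Hypothetical helpers; assumes cores 2m and 2m+1 sit in module m.
CORES_PER_MODULE = 2


def module_of(core_id: int) -> int:
    """Return the index of the module that hosts the given core."""
    return core_id // CORES_PER_MODULE


def shares_fpu(core_a: int, core_b: int) -> bool:
    """Two distinct cores compete for floating-point hardware
    only when they belong to the same module."""
    return core_a != core_b and module_of(core_a) == module_of(core_b)
```

Under this numbering, cores 0 and 1 would contend for floating-point resources, while cores 1 and 2 would not.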

WORKLOAD CHARACTERIZATION


Workload characterization

• Used to rank processes and threads by how
  floating-point intensive they are
  – Among the X most active threads
     • (where X = the number of cores available)


• Based on a real-time monitoring system using
  Hardware Performance Counters (HPCs)
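
A minimal sketch of this selection step, assuming each sample is a dictionary with a thread ID, a cycle count for the last interval, and an already computed FPU usage value between 0 and 1. The field names and the function `rank_by_fpu` are hypothetical, not taken from BulldOver.

```python
def rank_by_fpu(samples, num_cores):
    """Keep the num_cores most active threads (by CPU cycles),
    then order them from most to least FPU-intensive."""
    most_active = sorted(samples, key=lambda s: s["cycles"], reverse=True)
    most_active = most_active[:num_cores]
    return sorted(most_active, key=lambda s: s["fpu_usage"], reverse=True)


samples = [
    {"tid": 1, "cycles": 100, "fpu_usage": 0.90},
    {"tid": 2, "cycles": 900, "fpu_usage": 0.20},
    {"tid": 3, "cycles": 800, "fpu_usage": 0.70},
    {"tid": 4, "cycles": 50, "fpu_usage": 0.99},
]
ranked = rank_by_fpu(samples, num_cores=2)
```

Threads 1 and 4 are FPU-heavy but barely running, so they are dropped before the ranking by FPU usage is applied.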

…about HPCs…
• Registers embedded into processors to keep track
  of hardware-related events such as cache misses,
  number of CPU cycles, branch mispredictions,
  etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resource: only a few of them can be used
  at the same time
  – So far this has limited their adoption on a large scale
• HW-specific

Workload characterization
• HPCs used:
  – PERF_COUNT_HW_CPU_CYCLES: measures the
    total number of CPU cycles consumed by a thread
    during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number
    of CPU cycles during which the floating point units
    are not used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2
    cache misses generated by a thread during its
    execution time
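
The slides do not give the exact formula that combines these counters, so the following is only one plausible way to turn them into per-thread indicators: the fraction of cycles in which the FPU was busy, and L2 misses per thousand cycles. Function names are hypothetical.

```python
def fpu_usage(cpu_cycles: int, fpu_empty_cycles: int) -> float:
    """Fraction of a thread's cycles during which the FPU was in use.
    Assumed formula: 1 - CYCLES_FPU_EMPTY / PERF_COUNT_HW_CPU_CYCLES."""
    if cpu_cycles <= 0:
        return 0.0  # thread did not run in this interval
    return 1.0 - fpu_empty_cycles / cpu_cycles


def l2_misses_per_kcycle(cpu_cycles: int, l2_misses: int) -> float:
    """L2_CACHE_MISSES normalized per 1000 CPU cycles."""
    if cpu_cycles <= 0:
        return 0.0
    return 1000.0 * l2_misses / cpu_cycles
```

Normalizing by cycles makes threads comparable even when they ran for different amounts of time within the sampling interval.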

MONITORING AND SCHEDULING
INFRASTRUCTURE DESIGN

BulldOver design
• Bulldozer Overseer -> BulldOver
• Client-server architecture




BulldOver design
• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per second)
     • We target scientific workloads; short-lived threads are
       not well suited
  – Applies scheduling policies
  – libHpcOverseer, hwloc, libpfm
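
The slides describe the policy only as "very simple" and do not spell it out. As an illustration of what a policy of this kind could look like (an assumption, not the authors' algorithm), the sketch below spreads the most FPU-intensive threads across modules first, then fills the second core of each module so that the least FPU-intensive thread joins the most FPU-intensive one. It again assumes cores 2m and 2m+1 form module m; `place_threads` is a hypothetical name.

```python
def place_threads(fpu_usage_by_tid, num_modules):
    """Map thread IDs to core IDs so that FPU-heavy threads
    share a module with FPU-light ones (assumed policy)."""
    ranked = sorted(fpu_usage_by_tid, key=fpu_usage_by_tid.get, reverse=True)
    if len(ranked) > 2 * num_modules:
        raise ValueError("more threads than cores")
    placement = {}
    # First core of each module: the most FPU-intensive threads.
    for module, tid in enumerate(ranked[:num_modules]):
        placement[tid] = 2 * module
    # Second core: remaining threads, least FPU-intensive first,
    # so module 0 pairs the heaviest thread with the lightest one.
    for module, tid in enumerate(reversed(ranked[num_modules:])):
        placement[tid] = 2 * module + 1
    return placement


usage = {1: 0.9, 2: 0.1, 3: 0.8, 4: 0.2}
placement = place_threads(usage, num_modules=2)
```

On Linux, a mapping like this could then be applied with `os.sched_setaffinity(tid, {core})`; whether BulldOver pins threads this way is not stated in the slides.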


BulldOver design
• Client
  – Command-line tool
     • prompt> bulldover java myprogram
  – Traces the creation/termination of
    threads/processes
  – Shares information with the server through
    shared memory
  – libmonitor, boost


BulldOver design



[Figure label: User space]

EVALUATION


Testing environment
• Dell PowerEdge M915
  – 4x AMD Opteron 6282 SE 2.6 GHz CPUs (16 cores/8
    modules each)
     • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
     • Avoids latencies and other well-known NUMA-related
       effects
  – Turbo mode and frequency scaling disabled


Benchmark suites
• SPEC CPU 2006
  – Perfect match for evaluating Integer vs. Floating point
    behaviors

• SciMark 2.0
  – Java based
  – Noisy environment (additional threads for garbage
    collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several
    random benchmarks over a thread-pool

Workload characterization
SPEC CPU 2006
[Chart; series: Empty FPU Cycles, Total CPU Cycles]

Workload characterization
SciMark 2.0
[Chart; series: Empty FPU Cycles, Total CPU Cycles]

FPU usage and caches




Results for SPEC CPU 2006
Running 4x Int and 4x FPU benchmarks on a single NUMA
node (4 modules/8 cores)
[Chart; series: Inefficient baseline, Improved scheduling, Default OS scheduling]

Discussion
• BulldOver avoids the worst-case scenario
  – The default OS scheduler is not aware of the
    workload characterization
• Benefits come from both improved cache
  usage and better FPU/integer unit
  occupancy




Results for SciMark 2.0
Running 8x benchmarks that change randomly over time on a
single NUMA node (4 modules/8 cores)
[Chart; series: Default OS scheduling, Improved scheduling]

Discussion
• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage
  intensity varies over time
  – BulldOver reacts accordingly
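
Because hardware counters typically accumulate over a thread's lifetime, tracking how FPU intensity changes over time means working on differences between consecutive readings. A minimal sketch, assuming cumulative `(cpu_cycles, fpu_empty_cycles)` readings taken once per sampling interval; the function name and data layout are hypothetical.

```python
def interval_fpu_usage(readings):
    """Given cumulative (cpu_cycles, fpu_empty_cycles) readings,
    return the FPU usage fraction for each sampling interval."""
    usage = []
    for (c0, e0), (c1, e1) in zip(readings, readings[1:]):
        cycles = c1 - c0
        empty = e1 - e0
        # A thread that did not run in the interval counts as 0 usage.
        usage.append(0.0 if cycles <= 0 else 1.0 - empty / cycles)
    return usage


# Three readings give two intervals.
per_interval = interval_fpu_usage([(0, 0), (1000, 500), (2000, 750)])
```

Here the thread goes from using the FPU in half of its cycles to three quarters of them, which a scheduler re-evaluating placement every interval could react to.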




Conclusions
- We showed that thread scheduling that is unaware of the
  shared HW resources on the AMD Bulldozer processor
  can incur a significant performance penalty
- We presented a monitoring system that characterizes
  the most active threads according to their
  FPU/integer usage
- Thanks to the real-time analysis, improved scheduling can
  be applied and performance improved
- Our system is minimally intrusive:
   -   Low overhead (below 2%)
   -   No kernel patching required
   -   No code instrumentation
   -   Works on any application

Conclusions
• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 second in our case;
    it could be shorter but cannot be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used




Thanks!




   Achille Peternier
                       achille.peternier@usi.ch
                       http://sosoa.inf.unisi.ch

“Pow7Over”
• Work in progress on IBM Power7 processors
   – 1 CPU, 8 cores, up to 4 SMT units per core
   – Completely different…
      •   …operating system: RHEL 6.3
      •   …architecture: PowerPC
      •   …HPCs: IBM-specific ones (more than 500 available…)
      •   …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
   – But this is a full SMT architecture
   – Similar overall behavior for both the PUs and the L2 caches

 

Hardware-aware thread scheduling: the case of asymmetric multicore processors

  • 1. Hardware-aware thread scheduling: the case of asymmetric multicore processors Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder * achille.peternier@usi.ch http://sosoa.inf.unisi.ch
  • 3. Context • Modern CPUs increase computational power through additional cores • HW architectures are becoming increasingly complex – Shared caches – Non-Uniform Memory Access (NUMA) – Single Instruction Multiple Data (SIMD) registers – Simultaneous MultiThreading (SMT) units
  • 4. Context • The Operating System (OS) kernel and scheduler try to automatically optimize application performance according to the available resources – Based on the underlying HW – Using a limited set of performance indicators (CPU time, memory usage, etc.)
  • 5. “Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.” Performance Anxiety: Performance analysis in the new millennium, Joshua Bloch, Google Inc.
  • 6. Contributions 1) An automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers 2) A hardware-aware scheduler that makes decisions based on hardware resource usage and the output of the workload analysis, to improve processing-unit occupancy on SMT/asymmetric processors
  • 7. The big picture [figure: monitoring daemon; OS threads and processes; workload characterization (FPU / INT)]
  • 8. The big picture [figure: hardware-aware scheduler; workload characterization (FPU / INT)]
  • 10. AMD Bulldozer • AMD Bulldozer architecture – Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing units” or “SMT units”) – Only the Arithmetic-Logic Units (ALUs) are truly available per SMT unit; the floating-point unit is shared within a module – A module is therefore more similar to: • A dual core when doing integer ops • A single core with SMT=2 when doing floating-point ops
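
The module layout described above can be sketched as a small lookup structure. This is only an illustration: it assumes that consecutive logical core IDs (0 and 1, 2 and 3, and so on) belong to the same module, which is not stated in the slides. A real tool would query the topology (the deck mentions hwloc for this) instead of assuming it. The function names are hypothetical.

```python
from collections import defaultdict


def group_cores_by_module(core_ids, cores_per_module=2):
    """Group logical core IDs into modules.

    Assumption (not from the slides): consecutive core IDs share a module.
    """
    modules = defaultdict(list)
    for core in sorted(core_ids):
        modules[core // cores_per_module].append(core)
    return dict(modules)


def shares_fpu(core_a, core_b, cores_per_module=2):
    """Two cores compete for the same FPU only if they sit in the same module."""
    return core_a // cores_per_module == core_b // cores_per_module


# The test setup in the slides uses 8 cores / 4 modules.
modules = group_cores_by_module(range(8))
print(modules)
```

Under this assumption, two floating-point heavy threads placed on cores 0 and 1 contend for one FPU, while the same threads on cores 0 and 2 do not.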
  • 15. Workload characterization • Used to sort processes and threads by floating-point intensity – Among the X most active threads • (where X = the number of cores available) • Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
  • 16. …about HPCs… • Registers embedded in processors to keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc. • Very low overhead (about 1%) • Extremely accurate • Limited resources: only a few of them can be used at the same time – This still limits their wide adoption at large scale • HW-specific
  • 17. Workload characterization • HPCs used: – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time – CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating-point units are not being used by a thread during its execution time – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
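
One way to turn the counters above into a single ranking value is sketched below. The formula (share of cycles in which the FPU was not empty) is an assumption made for illustration; the slides name the counters but do not give the exact metric. `ThreadSample`, `fpu_intensity` and `rank_by_fpu` are hypothetical names, and the counter values are made up.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    tid: int
    cpu_cycles: int        # PERF_COUNT_HW_CPU_CYCLES over the interval
    fpu_empty_cycles: int  # CYCLES_FPU_EMPTY over the interval
    l2_cache_misses: int   # L2_CACHE_MISSES over the interval


def fpu_intensity(sample):
    """Fraction of cycles with a busy FPU, clamped to [0, 1] (assumed metric)."""
    if sample.cpu_cycles <= 0:
        return 0.0
    busy = sample.cpu_cycles - sample.fpu_empty_cycles
    return max(0.0, min(1.0, busy / sample.cpu_cycles))


def rank_by_fpu(samples, num_cores):
    """Keep the num_cores most active threads, then sort them by FPU intensity."""
    most_active = sorted(samples, key=lambda s: s.cpu_cycles, reverse=True)[:num_cores]
    return sorted(most_active, key=fpu_intensity, reverse=True)


samples = [
    ThreadSample(1, 1000, 900, 5),
    ThreadSample(2, 2000, 500, 7),
    ThreadSample(3, 10, 0, 0),      # nearly idle thread, dropped by the activity filter
    ThreadSample(4, 1500, 1500, 1),
]
ranking = [s.tid for s in rank_by_fpu(samples, num_cores=3)]
print(ranking)
```

The activity filter runs first so that a thread that barely ran does not end up at the top of the ranking just because its few cycles happened to use the FPU.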
  • 19. BulldOver design • Bulldozer Overseer -> BulldOver • Client-server architecture
  • 20. BulldOver design • Server – Daemon – Scans the underlying architecture – Time-based HPC monitoring (once per second) • We target scientific workloads; short-lived threads are not well suited to this approach – Applies scheduling policies – libHpcOverseer, hwloc, libpfm
  • 21. BulldOver design • Client – Command-line tool • prompt> bulldover java myprogram – Traces the creation/termination of threads/processes – Shares information with the server through shared memory – libmonitor, boost
  • 22. BulldOver design [figure: user space]
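
The time-based monitoring done by the server can be sketched as a loop that turns cumulative counter readings into per-interval deltas. This is a minimal sketch, not BulldOver's actual code: `read_counters` is a hypothetical callback returning `{tid: {counter_name: cumulative_value}}`, standing in for whatever the HPC library provides, and the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time


def monitor(read_counters, interval_s=1.0, iterations=3, sleep=time.sleep):
    """Yield per-thread counter deltas once per sampling interval."""
    previous = read_counters()
    for _ in range(iterations):
        sleep(interval_s)
        current = read_counters()
        deltas = {}
        for tid, counters in current.items():
            # A thread first seen in this interval has no baseline yet,
            # so it is only reported from the next interval on.
            if tid in previous:
                deltas[tid] = {
                    name: value - previous[tid].get(name, 0)
                    for name, value in counters.items()
                }
        yield deltas
        previous = current


# Usage with fake readings instead of real hardware counters.
readings = iter([
    {1: {"cycles": 100}},
    {1: {"cycles": 350}, 2: {"cycles": 40}},
    {1: {"cycles": 400}, 2: {"cycles": 90}},
])
observed = list(monitor(lambda: next(readings), iterations=2, sleep=lambda s: None))
print(observed)
```

A thread that starts and ends within one interval never gets a baseline in this loop, which is consistent with the remark in the slides that short-lived threads are a poor fit for time-based sampling.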
  • 24. Testing environment • Dell PowerEdge M915 – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each) • Limited to 1 CPU with 8 cores/4 modules – Test limited to a single NUMA node • Avoids latencies and other well-known NUMA-related effects – Turbo mode and frequency scaling disabled
  • 25. Benchmark suites • SPEC CPU 2006 – Perfect match for evaluating integer vs. floating-point behavior • SciMark 2.0 – Java-based – Noisy environment (additional threads for garbage collection, JIT, etc.) – Mainly FPU-oriented, with different levels of stress – Modified multi-threaded version running several random benchmarks on a thread pool
  • 26. Workload characterization: SPEC CPU 2006 [figure: empty FPU cycles, total CPU cycles]
  • 27. Workload characterization: SciMark 2.0 [figure: empty FPU cycles, total CPU cycles]
  • 28. FPU usage and caches [figure]
  • 29. Results for SPEC CPU 2006 • Running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores) [figure: inefficient baseline, default OS scheduling, improved scheduling]
  • 30. Discussion • BulldOver avoids the worst-case scenario – The default OS scheduler is not aware of the workload characterization • Benefits come from both improved cache usage and better FPU/integer unit occupancy
  • 31. Results for SciMark 2.0 • Running 8x benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores) [figure: default OS scheduling, improved scheduling]
  • 32. Discussion • All the threads are FPU-intensive – But at different levels • Still a reasonable speedup “for free” • Dynamic adaptation, since the FPU usage intensity varies over time – BulldOver reacts accordingly
  • 33. Conclusions - We showed that thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty - We presented a monitoring system that characterizes the most active threads according to their FPU/integer usage - Thanks to the real-time analysis, improved scheduling can be applied and performance improved - Our system is minimally intrusive: - Low overhead (below 2%) - No kernel patching required - No code instrumentation - Works on any application
  • 34. Conclusions • Currently tuned for a specific HW architecture • Good for scientific workloads – A sampling interval is required (1 sec in our case; it could be shorter but cannot be 0) • Based on a very simple scheduling policy – More sophisticated policies could be used
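
The slides do not spell out the “very simple scheduling policy”. One plausible sketch, assumed here purely for illustration, pairs the most FPU-intensive thread with the least FPU-intensive one on each module, so that two FPU-heavy threads avoid sharing a module's floating-point unit. The function name, the input format and the module list are all hypothetical.

```python
def pair_threads_on_modules(fpu_intensity_by_tid, modules):
    """Assign threads to cores, mixing FPU-heavy and FPU-light threads per module.

    fpu_intensity_by_tid: {tid: intensity in [0, 1]}
    modules: list of (core_a, core_b) tuples, one per module
    Returns {tid: core}. Threads beyond the available cores are left unassigned.
    """
    ranked = sorted(fpu_intensity_by_tid, key=fpu_intensity_by_tid.get, reverse=True)
    placement = {}
    heavy, light = 0, len(ranked) - 1
    for core_a, core_b in modules:
        if heavy > light:
            break
        placement[ranked[heavy]] = core_a
        if heavy < light:
            placement[ranked[light]] = core_b
        heavy += 1
        light -= 1
    return placement


intensities = {10: 0.9, 11: 0.8, 12: 0.1, 13: 0.05}
placement = pair_threads_on_modules(intensities, [(0, 1), (2, 3)])
print(placement)
```

On Linux, such a placement could then be applied with `os.sched_setaffinity(tid, {core})`. Re-running the pairing after every sampling interval would give the kind of dynamic adaptation mentioned in the discussion slides.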
  • 35. Thanks! Achille Peternier achille.peternier@usi.ch http://sosoa.inf.unisi.ch
  • 36. “Pow7Over” • Work in progress on IBM Power7 processors – 1 CPU, 8 cores, up to 4 SMT units per core – Completely different… • …operating system: RHEL 6.3 • …architecture: PowerPC • …HPCs: IBM-specific ones (more than 500 available…) • …compiler: autotools 6.0 • Similar approach • Slightly less significant speedup – But this is a full SMT – Similar overall behavior both for the PUs and L2 caches