A Survey on in-a-box parallel computing and its implications on system software research

        Changwoo Min (multics69@gmail.com)
Motivation
   "Technology ratios matter." – Jim Gray

   "In the face of such '10X' forces, you can lose control of your destiny." – Andrew S. Grove

   What are the implications of the multicore evolution for system software researchers?
Survey Scope and Strategy
   Layered software/hardware stack covered by the survey (top to bottom):
       Parallel Application / Parallel Middleware
       Parallel Programming Model
       System Library
       Operating System
       Virtual Machine Monitor
       Hardware: multicore CPUs and GPGPUs
Contents
   Background

   Parallel Programming Model and Productivity Tools

   Optimization of System Software

   Supporting GPU in a Virtualized Environment

   Utilizing GPU in Middleware

   Conclusion
Background
Why multicore?
   Multicore CPU
       Power wall
       ILP(instruction level parallelism) wall
       Memory wall
       Wire delay

   GPGPU (General-Purpose computing on a Graphics Processing Unit)
       A GPU traditionally handles computation only for computer graphics.
       Adds the following to the rendering pipeline:
           programmable stages
           higher-precision arithmetic
       Uses stream processing on non-graphics data.
Architecture of a GPGPU core
Parallel Programming Model and Productivity Tools
OpenMP
   Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran

   Uses a language extension – “#pragma omp”
       Requires compiler support
OpenMP (cont’d)
   Fork-and-join model
       Bounded parallel loops, reductions

   Task-creation-and-join model
       Unbounded loops, recursive algorithms, producer/consumer
       (both models are illustrated in the sketch below)
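
A minimal OpenMP sketch in C++ (an assumed example, not taken from the slides; the function names parallel_sum and fib are illustrative) showing both models: a bounded parallel loop with a reduction (fork-and-join) and a recursive computation using tasks (task-creation-and-join).

// Minimal OpenMP sketch (assumed example): fork-and-join loop with a
// reduction, plus a task-based recursive computation.
// Build with e.g.: g++ -fopenmp openmp_sketch.cpp
#include <cstdio>
#include <vector>

// Fork-and-join: iterations are split across threads; partial sums are
// combined by the reduction clause at the implicit join.
double parallel_sum(const std::vector<double>& v) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)v.size(); ++i)
        sum += v[i];
    return sum;
}

// Task-creation-and-join: each recursive call may run as an OpenMP task,
// which fits unbounded/recursive work better than a bounded loop.
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main() {
    std::vector<double> v(1000000, 1.0);
    std::printf("sum = %f\n", parallel_sum(v));

    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(20);
    std::printf("fib(20) = %ld\n", r);
    return 0;
}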
Intel TBB (Threading Building Blocks)
   Similar to OpenMP
       API for shared memory multiprocessing
       Fork-and-join
           parallel-for, parallel-reduce
       Task-creation-and-join
           Task scheduler

   Different from OpenMP
       C++ template library
       Concurrent container classes
           Hash map, vector, queue
       Various synchronization mechanisms
           mutex, spin lock, …
       Atomic types and operations
       Scalable memory allocator
       (see the parallel_for / parallel_reduce sketch below)
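
A minimal Intel TBB sketch in C++ (an assumed example, not from the slides): parallel_for over a blocked range for fork-and-join work, and parallel_reduce for a sum. Link with -ltbb.

// Minimal Intel TBB sketch (assumed example): parallel_for and parallel_reduce.
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> v(1000000, 1.0);

    // Fork-and-join: TBB's task scheduler splits the range across worker threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0;
        });

    // parallel_reduce combines per-range partial sums with the join functor.
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()), 0.0,
        [&](const tbb::blocked_range<size_t>& r, double local) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                local += v[i];
            return local;
        },
        [](double a, double b) { return a + b; });

    std::printf("sum = %f\n", sum);
    return 0;
}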
Nvidia CUDA (Compute Unified Device Architecture)
   CUDA
       Computing engine in Nvidia GPUs
       Programming framework for Nvidia GPUs
       Uses CUDA-extended C
           declspecs, keywords, intrinsics, runtime API, function launch, …

   Figures: CUDA extended C; compiling CUDA code; processing flow on CUDA
Nvidia CUDA (cont’d)
   Figures: execution model; kernel memory access
OpenCL (Open Computing Language)
   CPU/GPU heterogeneous computing framework standardized by the Khronos Group

   Figures: OpenCL memory model; CUDA and OpenCL example
Lithe: Enabling Efficient Composition of
Parallel Libraries
   Who?
       ParLab, UC Berkeley, HotPar’09

   Problem
       Composing parallel libraries shows performance anomalies
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
   Solution
       Virtualized threads are bad for parallel libraries.
       Harts
           Unvirtualized hardware thread context
           Sharing harts
       Lithe
           Cooperative hierarchical scheduler framework for harts
Concurrency bug detection: DataCollider
   Who?
       Microsoft Research, OSDI’10
   Problem
       Detecting concurrent data-race bugs is difficult.
       For a large system such as the Windows kernel, runtime overhead is
        critical.
   Solution
       Sampling using code breakpoints (see the sketch below)
       When a code breakpoint is trapped:
           Set a data breakpoint on its operand
           Sleep for a while
           If the data has changed, it may be a data race.
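
A hedged, user-space C++ sketch of the core check (a simplified "repeated read" stand-in, not the actual breakpoint-based Windows kernel implementation; sample_access_for_race is an illustrative name): pause at a sampled access, re-read the operand after a delay, and flag a change as a potential race.

// Hypothetical sketch of a DataCollider-style check, simplified to a
// repeated read instead of hardware data breakpoints.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <thread>

// Called when a sampled access to `addr` is trapped. Returns true if some
// other thread modified the location while this thread was paused, which
// indicates a potential data race on that address.
bool sample_access_for_race(const void* addr, size_t size) {
    uint8_t before[8], after[8];
    if (size > sizeof(before)) size = sizeof(before);

    std::memcpy(before, addr, size);                             // snapshot operand
    std::this_thread::sleep_for(std::chrono::milliseconds(1));   // give racers a window
    std::memcpy(after, addr, size);                              // re-read

    return std::memcmp(before, after, size) != 0;
}

int main() {
    int shared = 0;
    std::atomic<bool> stop{false};
    std::thread writer([&] {               // deliberately racy writer for the demo
        int i = 0;
        while (!stop.load()) shared = ++i;
    });
    bool raced = sample_access_for_race(&shared, sizeof(shared));
    stop.store(true);
    writer.join();
    if (raced) std::printf("potential data race on &shared\n");
    return 0;
}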
Concurrency bug detection: SyncFinder
   Who?
       UC San Diego, OSDI ’10
   Problem
       How to find ad-hoc synchronization
   Solution
       Formalize patterns of ad-hoc synchronization
       Detect such patterns using LLVM
Optimization of System Software
Memory Allocation: Hoard
   Who?
       UT, ASPLOS’00
   Problem
       The memory allocator becomes a performance bottleneck in multiprocessor environments.
       Lock contention, false sharing, blowup

   Figure: allocator-induced false sharing
Memory Allocation: Hoard (cont’d)
   Solution (a minimal sketch follows below)
       Per-processor heaps to reduce lock contention and false sharing
       Global heap
           Borrow memory from the global heap to grow a per-processor heap
           Return memory to the global heap if a per-processor heap has too much free memory
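
A hedged C++ sketch of the idea (not the real Hoard allocator; all names such as hoard_alloc and my_heap are illustrative): each processor/thread owns a heap so allocations rarely contend on a shared lock or pack unrelated threads' data into one cache line, and a global heap backs the per-processor heaps and bounds blowup.

// Hedged sketch of the Hoard idea: per-processor free lists of
// cache-line-sized blocks, backed by a single global heap.
#include <cstdlib>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Heap {
    std::mutex lock;
    std::vector<void*> free_blocks;   // simplified: one size class only
};

constexpr size_t kBlockSize = 64;     // one cache line, so blocks never share a line
constexpr size_t kNumHeaps  = 16;     // e.g. one heap per processor

Heap global_heap;
Heap per_cpu_heaps[kNumHeaps];

// Map the calling thread to "its" heap (a real allocator would use the CPU id).
Heap& my_heap() {
    size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id()) % kNumHeaps;
    return per_cpu_heaps[h];
}

void* hoard_alloc() {
    Heap& h = my_heap();
    std::lock_guard<std::mutex> g(h.lock);
    if (h.free_blocks.empty()) {
        // Borrow a batch from the global heap; fall back to the OS if it is empty.
        std::lock_guard<std::mutex> gg(global_heap.lock);
        while (h.free_blocks.size() < 32 && !global_heap.free_blocks.empty()) {
            h.free_blocks.push_back(global_heap.free_blocks.back());
            global_heap.free_blocks.pop_back();
        }
        while (h.free_blocks.size() < 32)
            h.free_blocks.push_back(std::malloc(kBlockSize));
    }
    void* b = h.free_blocks.back();
    h.free_blocks.pop_back();
    return b;
}

void hoard_free(void* p) {
    Heap& h = my_heap();
    std::lock_guard<std::mutex> g(h.lock);
    h.free_blocks.push_back(p);
    // Return excess memory to the global heap so one heap cannot hoard it all.
    if (h.free_blocks.size() > 1024) {
        std::lock_guard<std::mutex> gg(global_heap.lock);
        while (h.free_blocks.size() > 512) {
            global_heap.free_blocks.push_back(h.free_blocks.back());
            h.free_blocks.pop_back();
        }
    }
}

int main() {
    void* p = hoard_alloc();
    hoard_free(p);
    return 0;
}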
Memory Allocation: Xmalloc
   Who?
       UIUC, ICCIT’10
   Problem
       A scalable malloc for CUDA, where hundreds of threads run
        concurrently.
   Solution
       Memory allocation coalescing
System Call: FlexSC
   Who?
       University of Toronto, OSDI’10
   Problem
       The negative performance impact of system calls is huge.
           Direct cost + indirect cost
   Solution
       Batching and asynchronous system calls (see the sketch below)
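
A purely illustrative, user-space C++ sketch of the batching idea (hypothetical code, not the actual FlexSC kernel mechanism; SyscallEntry, submit, and syscall_worker are invented names): callers post requests into a shared table and a separate worker executes them asynchronously, so a caller does not pay a mode switch per call.

// Hypothetical user-space stand-in for batched/asynchronous system calls:
// requests go into a shared table; a worker drains them in batches.
#include <atomic>
#include <cstdio>
#include <thread>
#include <unistd.h>

struct SyscallEntry {
    std::atomic<int> state{0};   // 0 = free, 1 = claimed, 2 = submitted, 3 = done
    int  number = 0;             // which call (only "write" is handled here)
    long args[3] = {0, 0, 0};
    long ret = 0;
};

SyscallEntry table[64];          // stands in for shared syscall pages
std::atomic<bool> stop{false};

// Caller side: post a request and continue; poll the entry for completion.
SyscallEntry* submit(int number, long a0, long a1, long a2) {
    for (auto& e : table) {
        int expected = 0;
        if (e.state.compare_exchange_strong(expected, 1)) {   // claim the slot
            e.number = number;
            e.args[0] = a0; e.args[1] = a1; e.args[2] = a2;
            e.state.store(2, std::memory_order_release);      // mark as submitted
            return &e;
        }
    }
    return nullptr;              // table full (a real design would block or grow)
}

// Worker side: scan for submitted entries and execute them in a batch.
void syscall_worker() {
    while (!stop.load()) {
        for (auto& e : table) {
            if (e.state.load(std::memory_order_acquire) == 2) {
                if (e.number == 1)   // treat 1 as "write"
                    e.ret = write((int)e.args[0],
                                  (const void*)e.args[1], (size_t)e.args[2]);
                e.state.store(3);
            }
        }
        std::this_thread::yield();
    }
}

int main() {
    std::thread worker(syscall_worker);
    const char msg[] = "hello from a batched call\n";
    SyscallEntry* e = submit(1, 1, (long)msg, sizeof(msg) - 1);
    while (e && e->state.load() != 3) std::this_thread::yield();  // wait for completion
    stop.store(true);
    worker.join();
    return 0;
}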
Revisiting OS Architecture
Multikernel
   Who?
       ETH Zurich, Microsoft Research Cambridge, SOSP’09
   Problem
       System diversity
           It is no longer acceptable (or useful) to tune a general-purpose OS
            design for a particular hardware model.
Multikernel (cont’d)
   Problem (cont’d)
       The interconnect matters
           Figures: 8-socket Nehalem topology; on-chip interconnects; shared
            memory vs. message passing (SHM: stalled cycles, no locking!)
       Core diversity
           Programmable NICs
           GPU
           FPGA in CPU sockets
Multikernel (cont’d)
   Solution
       Today’s computer is already a distributed system. Why isn’t your OS?
       Barrelfish
           Implementation of the multikernel approach
           Message passing, shared nothing, replica maintenance
An Analysis of Linux Scalability to Many
Cores
   Who?
       MIT CSAIL, OSDI’10
   Problem
       Do we really need a new OS design – is Linux scalable enough?
   Solution
       Tested Linux scalability on 48 Intel cores with 7 applications
       No fundamental kernel scalability problems up to 48 cores
           3002 LOC of patches
       (a sketch of the sloppy-counter technique follows below)

   Figure: sloppy counter – a replicated (per-core) reference counter
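
A hedged C++ sketch of the per-core ("sloppy") counter idea (SloppyCounter and its constants are illustrative, not the kernel's actual code): each core accumulates updates in its own cache line and only folds them into the shared value when a threshold is crossed, so a reference count no longer bounces one cache line among 48 cores.

// Hedged sketch of a sloppy (per-core, replicated) counter. Assumes each
// local slot is updated only by its owning core (the kernel enforces this,
// e.g. by disabling preemption); read() is therefore approximate here.
#include <atomic>
#include <cstdio>

constexpr int  kCores     = 48;
constexpr long kThreshold = 1024;    // how much a core may accumulate locally

struct SloppyCounter {
    std::atomic<long> global{0};
    struct alignas(64) Local { long value = 0; } local[kCores];  // one cache line each

    void add(int core, long n) {
        local[core].value += n;
        if (local[core].value >= kThreshold || local[core].value <= -kThreshold) {
            global.fetch_add(local[core].value);   // fold the local delta in
            local[core].value = 0;
        }
    }

    long read() {                                  // global count + unflushed deltas
        long sum = global.load();
        for (const auto& l : local) sum += l.value;
        return sum;
    }
};

int main() {
    SloppyCounter c;
    c.add(0, 1);
    c.add(1, 5000);                                // core 1 crosses the threshold
    std::printf("count = %ld\n", c.read());        // prints 5001
    return 0;
}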
Supporting GPU in a Virtualized Environment
HyVM (Hybrid Virtual Machines)

   Who?
       Georgia Tech
   Problem
       Asymmetries in performance, memory, and cache
       Functional differences
           Multiple accelerators
           Vector processors
           Floating point
           Additional instructions for acceleration
   Solution
       Heterogeneity- and asymmetry-aware hypervisors
HyVM (cont’d)
   Solution (cont’d)
       Figures: HyVM architecture; GViM GPU virtualization architecture;
        memory management in GViM; Harmony CPU/GPU co-scheduling
VMGL (Virtualizing OpenGL)

   Who?
       University of Toronto, VEE’07
   Problem
       How to support OpenGL in a virtual machine environment
   Solution
       Forward OpenGL commands to the driver domain
Utilizing GPU in Middleware
StoreGPU
   Who?
       University of British Columbia, HPDC’10
   Problem
       In CAS (content-addressable storage),
           how to minimize hash calculation cost
   Solution
       Offload hashing to the GPU

   Figure: StoreGPU architecture
PacketShader
   Who?
       KAIST, SIGCOMM’10, NSDI’11
   Problem
       How to boost the performance of a software router
   Solution
       Offload stateless (parallelizable) packet processing to the GPU

   Figures: PacketShader architecture; basic workflow of PacketShader
Conclusion