Alberto Parravicini
alberto.parravicini@polimi.it
01/02/2021
for i in range(10):
    flag = f1(x)
    f2(x, y)
    if flag:
        f3(z)
    else:
        f4(z)
    f5(x)
Some random Python code, nothing unusual, right? With our scheduler:
● f2 and f3 (or f4) run concurrently
● f5 waits only for f2, not for f3 or f4
It also works with R, JS, Scala, and more
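How can a runtime know that f5 must wait only for f2? A toy Python model can illustrate the idea (a deliberate simplification for this talk, not GrCUDA's actual implementation): assume each call declares which arrays it touches; a call then depends on the most recent earlier calls that touched the same data.

```python
def build_dag(calls):
    """calls: list of (name, set of arrays touched).
    Each call depends on the last earlier call that touched
    each of its arrays; disjoint calls share no dependency."""
    deps = {}
    last_touch = {}  # array -> name of the last call that touched it
    for name, args in calls:
        deps[name] = {last_touch[a] for a in args if a in last_touch}
        for a in args:
            last_touch[a] = name
    return deps

# The code above, with f1/f2 touching x (and y), f3 touching z:
calls = [("f1", {"x"}), ("f2", {"x", "y"}),
         ("f3", {"z"}), ("f5", {"x"})]
dag = build_dag(calls)
# f3 has no dependencies -> it can run concurrently with f2;
# f5 depends only on f2, not on f3
```

Under this model the scheduler can launch f3 as soon as it is issued, while f5 is held back only until f2 completes.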
f1(x)
f2(y)
f3(x, y)
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaGraphNode_t kernel_1, kernel_2, kernel_3;
cudaKernelNodeParams kernel_1_params = {0}, kernel_2_params = {0}, kernel_3_params = {0};
std::vector<cudaGraphNode_t> nodeDependencies;
cudaGraphCreate(&graph, 0);
void *kernel_1_args[3] = {(void *)&x, (void *)&x1, &N};
void *kernel_2_args[3] = {(void *)&y, (void *)&y1, &N};
void *kernel_3_args[4] = {(void *)&x1, (void *)&y1, (void *)&res, &N};
dim3 tb(block_size_1d); // threads per block
dim3 bs(num_blocks);    // blocks per grid
kernel_1_params.func = (void *)f1;
kernel_1_params.blockDim = tb;
kernel_1_params.gridDim = bs;
kernel_1_params.kernelParams = kernel_1_args;
kernel_1_params.sharedMemBytes = 0;
kernel_1_params.extra = NULL;
cudaGraphAddKernelNode(&kernel_1, graph, nodeDependencies.data(), nodeDependencies.size(), &kernel_1_params);
kernel_2_params.func = (void *)f2;
kernel_2_params.blockDim = tb;
kernel_2_params.gridDim = bs;
kernel_2_params.kernelParams = kernel_2_args;
kernel_2_params.sharedMemBytes = 0;
kernel_2_params.extra = NULL;
cudaGraphAddKernelNode(&kernel_2, graph, nodeDependencies.data(), nodeDependencies.size(), &kernel_2_params);
nodeDependencies.push_back(kernel_1);
nodeDependencies.push_back(kernel_2);
kernel_3_params.func = (void *)f3;
kernel_3_params.blockDim = tb;
kernel_3_params.gridDim = bs;
kernel_3_params.kernelParams = kernel_3_args;
kernel_3_params.sharedMemBytes = 0;
kernel_3_params.extra = NULL;
cudaGraphAddKernelNode(&kernel_3, graph, nodeDependencies.data(), nodeDependencies.size(), &kernel_3_params);
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphLaunch(graphExec, s1);
err = cudaStreamSynchronize(s1);
1. How we can perform heterogeneous & asynchronous GPU scheduling
2. How we support many high-level languages like Python, R, JS, etc.
3. How we achieve the same performance (if not better!) as low-level, cumbersome, hand-optimized CUDA code
The work presented today is the result of the ongoing collaboration between NECSTLab @ Polimi & Oracle Labs
● This work will appear as “DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime” in IPDPS 2021. Preprint: https://arxiv.org/pdf/2012.09646.pdf
● It’s also open-source! Go play with it! github.com/AlbertoParravicini/grcuda
Big thanks to my co-authors Arnaud Delamare, Marco Arnaboldi and Prof. Marco Santambrogio!
Also thanks to the original authors and developers of GrCUDA, Rene Mueller and Lukas Stadler!
We have a puzzle with 4 pieces!
● Why do we care about GPUs?
● GraalVM & Polyglot Magic
● Enter GrCUDA!
● Our marvelous GPU scheduler
GPUs were originally built for 3D graphics rendering
● But now they are everywhere: deep learning, engineering, image processing, finance, etc.
They are still hard to use though, with 3 main issues
1. Programming GPUs is hard: it requires knowledge of the architecture and of the thread-based programming model. Orthogonal to our work
2. They are difficult to integrate: lots of boilerplate host code, and robust APIs are C++ only. GrCUDA to the rescue!
3. The runtime is difficult to exploit: asynchronous execution, CPU+GPU cooperation, etc. Today we focus on this!
GraalVM is a giant mega-project at Oracle Labs
Main idea: a high-performance polyglot JVM runtime
● It supports many languages (Python, R, Scala, JavaScript, etc.), compiled to Java bytecode and run in a JVM
● All languages are intercompatible, e.g. call JS code directly from your Java application
(example from www.graalvm.org/docs/getting-started/#polyglot-capabilities-of-native-images)
// PrettyPrintJSON.java
import java.io.*;
import java.util.stream.*;
import org.graalvm.polyglot.*;

public class PrettyPrintJSON {
  public static void main(String[] args) throws java.io.IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
    String input = reader.lines()
        .collect(Collectors.joining(System.lineSeparator()));
    try (Context context = Context.create("js")) {
      Value parse = context.eval("js", "JSON.parse");
      Value stringify = context.eval("js", "JSON.stringify");
      Value result = stringify.execute(parse.execute(input), null, 2);
      System.out.println(result.asString());
    }
  }
}
GraalVM provides 3 big advantages
1. All languages are built on top of the same backend, and benefit from the same optimizations
2. It is polyglot: languages can easily cooperate
3. It’s easy to add new languages!
All languages are mapped to the same intermediate representation (Truffle)
● This (tree-like) IR is optimized, possibly in part or speculatively, and translated to Java bytecode and then to machine code
● It’s really complex! But it lets you create new dynamic languages without worrying about optimizations!
GrCUDA is a GraalVM-based DSL that exposes the CUDA API to Java, R, Python, JavaScript, etc.
● GPU acceleration for high-level languages through a unified backend
GrCUDA provides many benefits
● Simplified data transfer with Unified Memory
● Just-in-time CUDA compilation
● Support for any CUDA kernel and library
GPU KERNEL
__global__ void inc_kernel(int* x, int N) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N;
         i += gridDim.x * blockDim.x) {
        x[i] += 1;
    }
}

PYTHON
import polyglot
cu = polyglot.eval(language='grcuda', string='CU')
inc_kernel = cu.buildkernel(INC_KERNEL_STR,
    'inc_kernel(x: inout pointer sint32, N: uint64)')
device_array = cu.DeviceArray('int', 100)
for i in range(len(device_array)):
    device_array[i] = i
inc_kernel(32, 256)(device_array, len(device_array))

R
cu <- eval.polyglot('grcuda', 'CU')
inc_kernel <- cu$buildkernel(KERNEL_STR, '...')
num_elements <- 100
device_array <- cu$DeviceArray('int', num_elements)
# (array init omitted)
inc_kernel(32, 256)(device_array, num_elements)

JAVASCRIPT
const cu = Polyglot.eval('grcuda', 'CU')
const inc_kernel = cu.buildkernel(INC_KERNEL_STR,
    `inc_kernel(x: inout pointer sint32, N: sint32)`)
const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) deviceArray[i] = i
inc_kernel(32, 256)(deviceArray, n)
Biggest limitation: no support for asynchronous execution
● Huge performance gains left on the table
GPUs are great for parallel computing, but they also excel at multi-kernel asynchronous computations
1. Run concurrent GPU computations (space-sharing)
2. Run GPU computations concurrently with the CPU
3. Overlap data transfers with computations
Extracting full performance from multi-kernel computations is hard
● Synchronization events and data movement must be hand-optimized
● The full CUDA API is available only from C/C++
Asynchronous execution provides an average 60% speedup on a Tesla P100
Our goals
● Extract every ounce of asynchronicity from GrCUDA
● Do it automatically, transparently to the user
We represent GPU computations as vertices of a DAG, connected through data dependencies
● We schedule parallel computations and limit synchronizations
Some frameworks deal with GPU scheduling, such as TensorFlow and Nvidia's CUDA Graphs. What’s new here?
1. It’s fully transparent to the user: the API of GrCUDA is not modified
2. Dependencies are computed at runtime, not at compile time or eagerly
● GraalVM partial evaluation minimizes the runtime overheads (e.g. repeated array accesses)
3. Updates to the GrCUDA runtime are immediately available to every GraalVM language
● Instead of having different libraries: PyCUDA, JCuda, GPU.js, etc.
● Computations (GPU kernels and CPU array accesses) are abstracted as DAG vertices
● Kernel invocation is asynchronous: the CPU execution blocks only when it needs results
● Computations executed from the host language are captured and added to the DAG
● Data-dependency computation is aware of read-only arguments and finished computations
● Data can be prefetched for maximum performance, and kernels use multiple streams
No user-defined dependencies are required for scheduling
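Read-only awareness matters: two kernels that only read the same array need not be ordered. The toy Python model below (a simplified illustration, not GrCUDA's code; all names are made up) distinguishes read and write sets: a read waits only for the last writer, while a write waits for the last writer and for every reader issued since.

```python
def schedule(ops):
    """ops: list of (name, reads, writes), with reads/writes as sets
    of array names. Returns, per op, the set of ops it must wait for."""
    deps = {}
    last_writer = {}    # array -> last op that wrote it
    readers_since = {}  # array -> ops that read it since the last write
    for name, reads, writes in ops:
        d = set()
        for a in reads:
            if a in last_writer:
                d.add(last_writer[a])            # read-after-write
        for a in writes:
            if a in last_writer:
                d.add(last_writer[a])            # write-after-write
            d.update(readers_since.get(a, ()))   # write-after-read
        for a in reads:
            readers_since.setdefault(a, set()).add(name)
        for a in writes:
            last_writer[a] = name
            readers_since[a] = set()
        deps[name] = d - {name}
    return deps

ops = [("k1", set(), {"x"}),   # k1 writes x
       ("k2", {"x"}, set()),   # k2 only reads x
       ("k3", {"x"}, set()),   # k3 only reads x -> concurrent with k2
       ("k4", set(), {"x"})]   # k4 overwrites x -> waits for k2 and k3
deps = schedule(ops)
```

Here k2 and k3 both depend only on k1 and can run in parallel, while k4 must wait for all three.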
● Kernel invocations are wrapped into computational elements (1)
● The GrCUDA execution context computes data dependencies and updates the DAG (2, 3)
● The computation is assigned a CUDA stream based on dependencies and availability (4)
● The execution context schedules the computation on the GPU (5, 6)
● Data prefetching and event synchronizations are non-blocking and asynchronous
● New components are highlighted in red
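The stream-assignment step (4) can be sketched as a simple policy (a hypothetical illustration of the idea, not GrCUDA's internal API; class and method names are invented): a computation reuses a parent's stream when it has dependencies, since same-stream ordering enforces the dependency without an explicit event; independent computations get separate streams so they can overlap.

```python
import itertools

class StreamPolicy:
    """Toy stream-assignment policy: reuse a parent's stream for
    dependent work, take a free (or fresh) stream for independent work."""
    def __init__(self):
        self._ids = itertools.count()
        self.free = []       # streams whose previous work has finished
        self.stream_of = {}  # computation name -> stream id

    def assign(self, comp, parents):
        if parents:
            # same-stream FIFO ordering gives the dependency for free
            s = self.stream_of[parents[0]]
        elif self.free:
            s = self.free.pop()
        else:
            s = next(self._ids)  # stands in for creating a CUDA stream
        self.stream_of[comp] = s
        return s

p = StreamPolicy()
s1 = p.assign("k1", [])      # independent -> new stream
s2 = p.assign("k2", [])      # independent -> a different stream
s3 = p.assign("k3", ["k1"])  # depends on k1 -> reuses k1's stream
```

With multiple parents on different streams, a real implementation would additionally record CUDA events on the other parents' streams and make the chosen stream wait on them.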
6 custom benchmarks to evaluate multi-task computations
● Tested on an Nvidia Tesla P100 (high-end data-center GPU) and on Nvidia GTX 1660 Super and GTX 960 (consumer-grade GPUs)
● Note: dependency DAGs are shown for clarity, but we never require the full DAGs!
● We are always faster than the original GrCUDA implementation, especially when using automatic prefetching
● We are never slower (and often faster) than the highly optimized CUDA Graphs API, which requires manually specified dependencies
Our scheduler exploits untapped GPU resources
Higher values for
● Device memory throughput
● L2 cache utilization
● Instructions completed per clock (IPC)
● GFLOPS (single and double precision)
● We started developing multi-GPU support
● Big thanks to Qi Zhou!
● Scheduling is more complex: some benchmarks are faster (B&S, 1.8x), some are slower (VEC, 0.35x)
● Other possible directions
● Applications on top of GrCUDA: e.g. sparse linear algebra, where GrCUDA transparently maintains multiple data layouts (CSC, CSR, etc.)
● Integration with DSLs: taking full advantage of asynchronous execution, and simplifying GPU code writing
● A new scheduler for GrCUDA for transparent async execution
● 44% faster than synchronous execution
● Fully integrated with GraalVM, available for Python, R, Java, JavaScript, etc.
● Open source: github.com/AlbertoParravicini/grcuda
● Paper: arxiv.org/pdf/2012.09646.pdf
Alberto Parravicini
alberto.parravicini@polimi.it
01/02/2021
Plenty of use cases:
● GPU graph/database querying (union of subqueries)
● Image processing pipelines (combining multiple filters)
● Ensembles of ML models (combining predictions from different models on the same data)
NECSTMondayTalk - 01/02/2021 - How easy can we make GPU scheduling?

  • 2. 01/02/2021 Alberto Parravicini for i in range(10): flag = f1(x) f2(x, y) if flag: f3(z) else: f4(z) f5(x) Some random Python code, nothing unusual, right?
  • 3. 01/02/2021 Alberto Parravicini for i in range(10): flag = f1(x) f2(x, y) if flag: f3(z) else: f4(z) f5(x) ● f2 and f3 (or f4) run concurrently ● f5 waits only f2, not f3 or f4 It also works with R, JS, Scala, and more
  • 5. 01/02/2021 Alberto Parravicini f1(x) f2(y) f3(x, y) cudaGraphCreate(&graph, 0); void *kernel_1_args[3] = {(void *)&x, (void *)&x1, &N}; void *kernel_2_args[3] = {(void *)&y, (void *)&y1, &N}; void *kernel_3_args[4] = {(void *)&x1, (void *)&y1, (void *)&res, &N}; dim3 tb(block_size_1d); dim3 bs(num_blocks); kernel_1_params.func = (void *)f1; kernel_1_params.blockDim = bs; kernel_1_params.gridDim = tb; kernel_1_params.kernelParams = kernel_1_args; kernel_1_params.sharedMemBytes = 0; kernel_1_params.extra = NULL; cudaGraphAddKernelNode(n, g, nodeDependencies.data(), nodeDependencies.size(), &kernel_1_params); kernel_2_params.func = (void *)f2; kernel_2_params.blockDim = bs; kernel_2_params.gridDim = tb; kernel_2_params.kernelParams = kernel_2_args; kernel_2_params.sharedMemBytes = 0; kernel_2_params.extra = NULL; cudaGraphAddKernelNode(n, g, nodeDependencies.data(), nodeDependencies.size(), &kernel_2_params); nodeDependencies.push_back(kernel_1); nodeDependencies.push_back(kernel_2); kernel_3_params.func = (void *)f3; kernel_3_params.blockDim = bs; kernel_3_params.gridDim = tb; kernel_3_params.kernelParams = kernel_3_args; kernel_3_params.sharedMemBytes = 0; kernel_3_params.extra = NULL; cudaGraphAddKernelNode(n, g, nodeDependencies.data(), nodeDependencies.size(), &kernel_3_params); cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0); cudaGraphLaunch(graphExec, s1); err = cudaStreamSynchronize(s1);
  • 6. 01/02/2021 Alberto Parravicini 1. How we can perform heterogeneous & asynchronous GPU scheduling 2. We support many high-level languages like Python, R, JS, etc. 3. Same performance (if not better!) as low-level, cumbersome, hand-optimized CUDA code
  • 7. 01/02/2021 Alberto Parravicini The work presented today is the result of the ongoing collaboration between NECSTLab @ Polimi & Oracle Labs ● This work will appear as “DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime” in IPDPS 2021. Preprint: https://arxiv.org/pdf/2012.09646.pdf ● It’s also open-source! Go play with it! github.com/AlbertoParravicini/grcuda Big thanks to my co-authors Arnaud Delamare, Marco Arnaboldi and Prof. Marco Santambrogio! Also thanks to the original authors and developers of GrCUDA, Rene Mueller and Lukas Stadler!
• 11. 01/02/2021 Alberto Parravicini
We have a puzzle with 4 pieces!
● Why do we care about GPUs?
● GraalVM & Polyglot Magic
● Enter GrCUDA!
● Our marvelous GPU scheduler
• 14. 01/02/2021 Alberto Parravicini
GPUs were originally built for 3D graphic rendering
● But now they are everywhere: deep learning, engineering, image processing, finance, etc.
They are still hard to use though, with 3 main issues:
1. Programming GPUs is hard: they require knowledge of the architecture and of the thread-based programming model. Orthogonal to our work
2. They are difficult to integrate: lots of boilerplate host code, and robust APIs are C++ only. GrCUDA to the rescue!
3. The runtime is difficult to exploit: asynchronous execution, CPU+GPU cooperation, etc. Today we focus on this!
• 15. 01/02/2021 Alberto Parravicini
GraalVM is a giant mega-project at Oracle Labs
Main idea: a high-performance polyglot JVM runtime
● It supports many languages (Python, R, Scala, JavaScript, etc.), compiled to Java bytecode and run in a JVM
● All languages are intercompatible, e.g. call JS code directly from your Java application
(example from www.graalvm.org/docs/getting-started/#polyglot-capabilities-of-native-images)

// PrettyPrintJSON.java
import java.io.*;
import java.util.stream.*;
import org.graalvm.polyglot.*;

public class PrettyPrintJSON {
  public static void main(String[] args) throws java.io.IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
    String input = reader.lines()
        .collect(Collectors.joining(System.lineSeparator()));
    try (Context context = Context.create("js")) {
      Value parse = context.eval("js", "JSON.parse");
      Value stringify = context.eval("js", "JSON.stringify");
      Value result = stringify.execute(parse.execute(input), null, 2);
      System.out.println(result.asString());
    }
  }
}
• 16. 01/02/2021 Alberto Parravicini
GraalVM provides 3 big advantages:
1. All languages are built on top of the same backend, and benefit from the same optimizations
2. It is polyglot: languages can easily cooperate
3. It’s easy to add new languages!
All languages are mapped to the same Intermediate Representation (Truffle)
● This (tree-like) IR is optimized, possibly in part or speculatively, and translated to Java bytecode and then to machine code
● It’s really complex! But it makes it possible to create new dynamic languages without worrying about optimizations!
• 17. 01/02/2021 Alberto Parravicini
GrCUDA is a GraalVM-based DSL that exposes the CUDA API to Java, R, Python, JavaScript, etc.
● GPU acceleration for high-level languages through a unified backend
GrCUDA provides many benefits:
● Simplified data transfer with Unified Memory
● Just-In-Time CUDA compilation
● Support for any CUDA kernel and library
• 18. 01/02/2021 Alberto Parravicini

GPU KERNEL
__global__ void inc_kernel(int* x, int N) {
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += gridDim.x * blockDim.x) {
    x[i] += 1;
  }
}

PYTHON
import polyglot
cu = polyglot.eval(language='grcuda', string='CU')
inc_kernel = cu.buildkernel(INC_KERNEL_STR, 'inc_kernel(x: inout pointer sint32, N: uint64)')
device_array = cu.DeviceArray('int', 100)
for i in range(len(device_array)):
    device_array[i] = i
inc_kernel(32, 256)(device_array, len(device_array))

R
cu <- eval.polyglot('grcuda', 'CU')
inc_kernel <- cu$buildkernel(KERNEL_STR, '...')
num_elements <- 100
device_array <- cu$DeviceArray('int', num_elements)
# (array init omitted)
inc_kernel(32, 256)(device_array, num_elements)

JAVASCRIPT
const cu = Polyglot.eval('grcuda', 'CU')
const inc_kernel = cu.buildkernel(INC_KERNEL_STR, `inc_kernel(x: inout pointer sint32, N: sint32)`)
const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) deviceArray[i] = i
inc_kernel(32, 256)(deviceArray, n)
• 19. 01/02/2021 Alberto Parravicini
Biggest limitation: no support for asynchronous execution
● Huge performance gains left on the table
GPUs are great for parallel computing, but also excel in multi-kernel asynchronous computations:
1. Run concurrent GPU computations (space-sharing)
2. Run GPU computations concurrently with the CPU
3. Overlap data transfers with computations
Extracting full performance in multi-kernel computations is hard:
● Synchronization events and data movement must be hand-optimized
● The full CUDA API is only available to C/C++
Asynchronous execution provides an average 60% speedup on a Tesla P100
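Why asynchronous execution pays off can be seen with a toy schedule calculation (plain Python, not GrCUDA code; all names here are illustrative): a synchronous runtime pays the sum of all task durations, while an asynchronous scheduler is bounded by the critical path of the dependency DAG.

```python
# Toy illustration of the async-scheduling benefit: a synchronous runtime
# runs tasks back to back, while an asynchronous one is bounded by the
# longest dependency chain (the DAG's critical path).

def serial_time(tasks):
    """Total time when every task waits for the previous one."""
    return sum(duration for duration, _ in tasks.values())

def async_time(tasks):
    """Earliest finish time of the whole DAG (critical-path length)."""
    finish = {}
    def finish_time(name):
        if name not in finish:
            duration, deps = tasks[name]
            finish[name] = duration + max((finish_time(d) for d in deps), default=0)
        return finish[name]
    return max(finish_time(t) for t in tasks)

# f1 and f2 are independent, f3 needs both (the pattern from the earlier slides)
tasks = {"f1": (10, []), "f2": (10, []), "f3": (5, ["f1", "f2"])}
print(serial_time(tasks))  # 25
print(async_time(tasks))   # 15: f1 and f2 overlap, then f3 runs
```

The durations are arbitrary; the point is that overlapping f1 and f2 removes one of them from the total, which is exactly the gain a synchronous runtime leaves on the table.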
• 20. 01/02/2021 Alberto Parravicini
Our goals:
● Extract every ounce of asynchronicity from GrCUDA
● Do it automatically, transparently to the user
We represent GPU computations as vertices of a DAG, connected through data dependencies
● We schedule parallel computations and limit synchronizations
• 21. 01/02/2021 Alberto Parravicini
Some frameworks deal with GPU scheduling, such as TensorFlow and CUDA Graphs by Nvidia. What’s new here?
1. It’s fully transparent to the user: the API of GrCUDA is not modified
2. Dependencies are computed at runtime, not at compile time or eagerly
● GraalVM partial evaluation minimizes the runtime overheads (e.g. repeated array accesses)
3. Updates to the GrCUDA runtime are immediately available to every GraalVM language
● Instead of having different libraries: PyCUDA, JCuda, GPU.js, etc.
• 22. 01/02/2021 Alberto Parravicini
● Computations (GPU kernels and CPU array accesses) are abstracted as DAG vertices
● Kernel invocation is asynchronous; the CPU execution is blocked only when it needs results
● Computations executed from the host language are captured and added to the DAG
● Data dependency computation is aware of read-only arguments and finished computations
● Data can be prefetched for maximum performance, and kernels use multiple streams
● No user-defined dependencies in the scheduling
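The dependency policy described above can be sketched in plain Python (a simplified model with hypothetical names, not GrCUDA's actual implementation): a new computation depends on an earlier one only if the two share an array and at least one of the accesses is a write, and computations that already finished are skipped.

```python
# Simplified model of runtime data-dependency computation: read-read on the
# same array does not serialize, read-write and write-write do, and finished
# computations never create new dependencies.

class Computation:
    def __init__(self, name, reads=(), writes=()):
        self.name = name
        self.reads = set(reads)
        self.writes = set(writes)
        self.finished = False
        self.deps = []

def add_to_dag(new, scheduled):
    """Attach 'new' to the DAG, linking it to conflicting active computations."""
    for old in scheduled:
        if old.finished:
            continue  # no synchronization needed with completed work
        conflict = (new.writes & (old.reads | old.writes)) or (new.reads & old.writes)
        if conflict:
            new.deps.append(old)
    scheduled.append(new)

# The f1/f2/f3 pattern from the earlier slides: f1 and f2 touch disjoint
# arrays, f3 reads the outputs of both.
dag = []
k1 = Computation("f1", reads={"x"}, writes={"x1"})
k2 = Computation("f2", reads={"y"}, writes={"y1"})
k3 = Computation("f3", reads={"x1", "y1"}, writes={"res"})
for k in (k1, k2, k3):
    add_to_dag(k, dag)

print([d.name for d in k2.deps])  # [] -> f2 can run concurrently with f1
print([d.name for d in k3.deps])  # ['f1', 'f2'] -> f3 waits for both
```

Because the read/write sets come from the kernel signatures (`inout` vs `const` arguments), no user-defined dependencies are ever needed: the DAG falls out of the arguments alone.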
• 23. 01/02/2021 Alberto Parravicini
● Kernel invocations are wrapped into computational elements (1)
● The GrCUDA execution context computes data dependencies and updates the DAG (2, 3)
● The computation is assigned a CUDA stream based on dependencies and availability (4)
● The execution context schedules the computation on the GPU (5, 6)
● Data prefetching and event synchronizations are non-blocking and asynchronous
● New components are highlighted in red
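One plausible policy for the stream-assignment step (4) can be sketched in plain Python (a toy model, not GrCUDA's actual logic): an independent computation takes a free stream, creating a new one only if none is available, while a dependent computation inherits a parent's stream so that chains stay serialized on one stream for free.

```python
# Toy stream-assignment policy: independent work spreads across streams
# (enabling space-sharing), dependent work inherits a parent's stream.

class StreamManager:
    def __init__(self):
        self.next_stream = 0   # id of the next stream to create
        self.free_streams = [] # streams whose computations have finished

    def assign(self, parent_streams):
        if parent_streams:
            return parent_streams[0]  # reuse a parent's stream for the chain
        if self.free_streams:
            return self.free_streams.pop()  # recycle an idle stream
        stream = self.next_stream           # otherwise create a new one
        self.next_stream += 1
        return stream

    def release(self, stream):
        self.free_streams.append(stream)  # stream is idle again

mgr = StreamManager()
s1 = mgr.assign([])        # f1: independent -> new stream 0
s2 = mgr.assign([])        # f2: independent -> new stream 1
s3 = mgr.assign([s1, s2])  # f3: depends on f1 and f2 -> inherits stream 0
print(s1, s2, s3)  # 0 1 0
```

In a real runtime, f3 would additionally wait on an event recorded on f2's stream before launching; that cross-stream synchronization is omitted here for brevity.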
• 24. 01/02/2021 Alberto Parravicini
6 custom benchmarks to evaluate multi-task computations
● Tested on an Nvidia Tesla P100 (high-end data-center GPU), and on an Nvidia GTX 1660 Super and GTX 960 (consumer-grade GPUs)
● Note: dependency DAGs are shown for clarity, but we never require the full DAGs!
• 25. 01/02/2021 Alberto Parravicini
● We are always faster than the original GrCUDA implementation, especially when using automatic prefetching
● We are not slower (and often faster) than the highly optimized CUDA Graphs, which requires manual dependencies
• 26. 01/02/2021 Alberto Parravicini
● We are always faster than the original GrCUDA implementation, especially when using automatic prefetching
● We are not slower (and often faster) than the highly optimized CUDA Graphs, which requires manual dependencies
• 27. 01/02/2021 Alberto Parravicini
Our scheduler exploits untapped GPU resources. Higher values for:
● Device memory throughput
● L2 cache utilization
● Instructions completed per clock (IPC)
● GFLOPS (single and double precision)
• 28. 01/02/2021 Alberto Parravicini
● Started development of multi-GPU support
● Big thanks to Qi Zhou!
● Scheduling is more complex: some benchmarks are faster (B&S, 1.8x), some are slower (VEC, 0.35x)
Other possible directions:
● Applications on top of GrCUDA, e.g. sparse linear algebra: GrCUDA transparently maintains multiple data layouts (CSC, CSR, etc.)
● Integration with DSLs: taking full advantage of asynchronous execution, and simplifying GPU code writing
• 29.
● A new scheduler for GrCUDA for transparent async execution
● 44% faster than synchronous execution
● Fully integrated with GraalVM, available for Python, R, Java, JavaScript, etc.
● Open Source: github.com/AlbertoParravicini/grcuda
● Paper: arxiv.org/pdf/2012.09646.pdf
Alberto Parravicini
alberto.parravicini@polimi.it
01/02/2021
• 30. 01/02/2021 Alberto Parravicini
Our goals:
● Extract every ounce of asynchronicity from GrCUDA
● Do it automatically, transparently to the user
We represent GPU computations as vertices of a DAG, connected through data dependencies
● We schedule parallel computations and limit synchronizations
Plenty of use cases:
● GPU graph/database querying (union of subqueries)
● Image processing pipelines (combine multiple filters)
● Ensembles of ML models: combine predictions from different models on the same data