GPUs are readily available in cloud computing and personal devices, but their adoption for data processing acceleration has been slowed by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming.
In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space-sharing, and transfer-computation overlap, without requiring any advance information about the program's dependency structure.
We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task parallelism, showing an average speedup of 44% over synchronous execution, with no slowdown compared to hand-optimized host code written with the C++ CUDA Graphs API.
for i in range(10):
    flag = f1(x)
    f2(x, y)
    if flag:
        f3(z)
    else:
        f4(z)
    f5(x)
Some random Python code, nothing unusual, right?
With our scheduler:
● f2 and f3 (or f4) run concurrently
● f5 waits only for f2, not for f3 or f4
It also works with R, JS, Scala, and more
1. How to perform heterogeneous & asynchronous GPU scheduling
2. Support for many high-level languages, like Python, R, JS, etc.
3. The same performance (if not better!) as low-level, cumbersome, hand-optimized CUDA code
The work presented today is the result of the ongoing collaboration between NECSTLab @ Polimi & Oracle Labs
● This work will appear as “DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime” in IPDPS 2021. Preprint: https://arxiv.org/pdf/2012.09646.pdf
● It’s also open-source! Go play with it! github.com/AlbertoParravicini/grcuda
Big thanks to my co-authors Arnaud Delamare, Marco Arnaboldi, and Prof. Marco Santambrogio!
Also thanks to the original authors and developers of GrCUDA, Rene Mueller and Lukas Stadler!
We have a puzzle with 4 pieces!
● Why do we care about GPUs?
● GraalVM & Polyglot Magic
● Enter GrCUDA!
● Our marvelous GPU scheduler
GPUs were originally built for 3D graphics rendering
● But now they are everywhere: deep learning, engineering, image processing, finance, etc.
They are still hard to use though, with 3 main issues:
1. Programming GPUs is hard: it requires knowledge of the architecture and of the thread-based programming model. Orthogonal to our work
2. They are difficult to integrate: lots of boilerplate host code, and robust APIs are C++ only. GrCUDA to the rescue!
3. The runtime is difficult to exploit: asynchronous execution, CPU+GPU cooperation, etc. Today we focus on this!
GraalVM is a giant mega-project at Oracle Labs
Main idea: a high-performance polyglot JVM runtime
● It supports many languages (Python, R, Scala, JavaScript, etc.), compiled to Java bytecode and run in a JVM
● All languages are interoperable, e.g. you can call JS code directly from your Java application
(example from www.graalvm.org/docs/getting-started/#polyglot-capabilities-of-native-images)
// PrettyPrintJSON.java
import java.io.*;
import java.util.stream.*;
import org.graalvm.polyglot.*;

public class PrettyPrintJSON {
    public static void main(String[] args) throws java.io.IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String input = reader.lines()
            .collect(Collectors.joining(System.lineSeparator()));
        try (Context context = Context.create("js")) {
            Value parse = context.eval("js", "JSON.parse");
            Value stringify = context.eval("js", "JSON.stringify");
            Value result = stringify.execute(parse.execute(input), null, 2);
            System.out.println(result.asString());
        }
    }
}
GraalVM provides 3 big advantages:
1. All languages are built on top of the same backend, and benefit from the same optimizations
2. It is polyglot: languages can easily cooperate
3. It’s easy to add new languages!
All languages are mapped to the same intermediate representation (Truffle)
● This (tree-like) IR is optimized, possibly partially or speculatively, and translated to Java bytecode and then to machine code
● It’s really complex! But it allows creating new dynamic languages without worrying about optimizations!
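As a small illustration of language cooperation (a hedged sketch: the one-line JS function is made up for this example, but polyglot.eval is the same entry point used by the GrCUDA snippets later on), calling JavaScript from GraalVM's Python looks like this:
# Hedged sketch: call a JavaScript function from GraalVM's Python.
# The JS snippet itself is a made-up example; polyglot.eval is the
# standard GraalVM entry point for cross-language calls.
import polyglot

js_max = polyglot.eval(language='js', string='(a, b) => Math.max(a, b)')
print(js_max(3, 7))  # prints 7, computed by the JS engine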
GrCUDA is a GraalVM-based DSL that exposes the CUDA API to Java, R, Python, JavaScript, etc.
● GPU acceleration for high-level languages through a unified backend
GrCUDA provides many benefits:
● Simplified data transfer with Unified Memory
● Just-in-time CUDA compilation
● Support for any CUDA kernel and library
GPU KERNEL
__global__ void inc_kernel(int* x, int N) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N;
         i += gridDim.x * blockDim.x) {
        x[i] += 1;
    }
}

PYTHON
import polyglot
cu = polyglot.eval(language='grcuda', string='CU')
inc_kernel = cu.buildkernel(INC_KERNEL_STR,
    'inc_kernel(x: inout pointer sint32, N: sint32)')
device_array = cu.DeviceArray('int', 100)
for i in range(len(device_array)):
    device_array[i] = i
inc_kernel(32, 256)(device_array, len(device_array))

R
cu <- eval.polyglot('grcuda', 'CU')
inc_kernel <- cu$buildkernel(KERNEL_STR, '...')
num_elements <- 100
device_array <- cu$DeviceArray('int', num_elements)
# array initialization omitted
inc_kernel(32, 256)(device_array, num_elements)

JAVASCRIPT
const cu = Polyglot.eval('grcuda', 'CU')
const inc_kernel = cu.buildkernel(INC_KERNEL_STR,
    `inc_kernel(x: inout pointer sint32, N: sint32)`)
const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) deviceArray[i] = i
inc_kernel(32, 256)(deviceArray, n)
Biggest limitation: no support for asynchronous execution
● Huge performance gains left on the table
GPUs are great for parallel computing, but they also excel in multi-kernel asynchronous computations:
1. Run concurrent GPU computations (space-sharing)
2. Run GPU computations concurrently with the CPU
3. Overlap data transfer with computations
Extracting full performance from multi-kernel computations is hard:
● Synchronization events and data movement must be hand-optimized
● The full CUDA API is available only in C/C++
Asynchronous execution provides an average of 60% speedup on a Tesla P100
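As a hedged sketch of what asynchronous scheduling buys us (kernel_a, kernel_b, num_blocks, and block_size are hypothetical names, in the style of the earlier GrCUDA snippets): two kernels that touch disjoint arrays can space-share the GPU instead of serializing.
# Hypothetical sketch: two kernels on disjoint arrays.
# With synchronous execution, kernel_a and kernel_b serialize; an
# asynchronous scheduler can space-share them on the GPU and overlap
# their data transfers with computation.
x = cu.DeviceArray('float', n)
y = cu.DeviceArray('float', n)
kernel_a(num_blocks, block_size)(x, n)  # touches only x
kernel_b(num_blocks, block_size)(y, n)  # touches only y
print(x[0], y[0])  # the host blocks only here, when results are read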
Our goals:
● Extract every ounce of asynchronicity from GrCUDA
● Do it automatically, transparently to the user
We represent GPU computations as vertices of a DAG, connected through data dependencies
● We schedule parallel computations and limit synchronizations
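To make this concrete, here is a minimal sketch (not the actual GrCUDA implementation) of how a dependency DAG can be built at runtime, by recording which previously scheduled computation last touched each argument:
# Minimal sketch (not the actual GrCUDA code): derive DAG edges at
# runtime from the arrays each computation uses.
class Vertex:
    def __init__(self, name, args):
        self.name = name
        self.args = args        # arrays touched by this computation
        self.parents = set()    # computations that must finish first

dag, last_user = [], {}

def schedule(name, args):
    v = Vertex(name, args)
    for a in args:
        if a in last_user:           # data dependency found
            v.parents.add(last_user[a])
        last_user[a] = v
    dag.append(v)
    return v

# As in the earlier example: f2(x, y) and f3(z) share no arrays,
# so no edge connects them and they can run concurrently.
schedule('f1', ['x']); schedule('f2', ['x', 'y']); schedule('f3', ['z'])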
Some frameworks already deal with GPU scheduling, such as TensorFlow and Nvidia's CUDA Graphs. What's new here?
1. It's fully transparent to the user: the API of GrCUDA is not modified
2. Dependencies are computed at runtime, not at compile time or eagerly
● GraalVM partial evaluation minimizes the runtime overheads (e.g. repeated array accesses)
3. Updates to the GrCUDA runtime are immediately available to every GraalVM language
● Instead of maintaining separate libraries: PyCUDA, JCuda, GPU.js, etc.
● Computations (GPU kernels and CPU array accesses) are abstracted as DAG vertices
● Kernel invocation is asynchronous: CPU execution is blocked only when it needs results
● Computations executed from the host language are captured and added to the DAG
● Data dependency computation is aware of read-only arguments and finished computations (see the sketch below)
● Data can be prefetched for maximum performance, and kernels use multiple streams
No user-defined dependencies in the scheduling
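A hedged sketch of the read-only refinement (simplified with respect to the real dependency computation): an edge is needed only when at least one of the two accesses to a shared array is a write.
# Hypothetical sketch: read-only-aware dependencies. Two computations
# that only READ the same array stay independent; an edge is created
# only when at least one of them writes it.
def depends(prev_access, new_access):
    (arr1, writes1), (arr2, writes2) = prev_access, new_access
    return arr1 == arr2 and (writes1 or writes2)

assert not depends(('x', False), ('x', False))  # read-read: concurrent
assert depends(('x', True), ('x', False))       # write-read: must wait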
● Kernel invocations are wrapped into computational elements
● The GrCUDA execution context computes data dependencies and updates the DAG
● The computation is assigned a CUDA stream based on dependencies and availability (a simplified sketch follows)
● The execution context schedules the computation on the GPU
● Data prefetching and event synchronizations are non-blocking and asynchronous
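Here is a hedged sketch of the stream-assignment step (a simplification, not GrCUDA's actual policy): a computation with a single parent inherits the parent's stream, while independent computations get a free or freshly created stream.
# Hypothetical sketch of stream assignment, reusing the Vertex class
# from the DAG sketch above (each vertex also stores its stream).
free_streams = []
stream_count = 0

def assign_stream(vertex):
    global stream_count
    if len(vertex.parents) == 1:
        # Single dependency: inherit the parent's stream, no sync needed.
        vertex.stream = next(iter(vertex.parents)).stream
    elif free_streams:
        vertex.stream = free_streams.pop()  # reuse an idle stream
    else:
        stream_count += 1                   # otherwise create a new one;
        vertex.stream = stream_count        # multiple parents sync via events
    return vertex.stream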
6 custom benchmarks to evaluate multi-task computations
● Tested on an Nvidia Tesla P100 (high-end data-center GPU) and on Nvidia GTX 1660 Super and GTX 960 (consumer-grade GPUs)
● Note: dependency DAGs are shown for clarity, but we never require the full DAG in advance!
● We are always faster than the original GrCUDA implementation, especially when using automatic prefetching
● We are never slower (and often faster) than the highly optimized CUDA Graphs, which requires manually specified dependencies
Our scheduler exploits untapped GPU resources, with higher values for:
● Device memory throughput
● L2 cache utilization
● Instructions completed per clock (IPC)
● GFLOPS (single and double precision)
● We started developing multi-GPU support
● Big thanks to Qi Zhou!
● Scheduling is more complex: some benchmarks are faster (B&S, 1.8x), some are slower (VEC, 0.35x)
● Other possible directions:
● Applications on top of GrCUDA: e.g. sparse linear algebra, with GrCUDA transparently maintaining multiple data layouts (CSC, CSR, etc.)
● Integration with DSLs: taking full advantage of asynchronous execution and simplifying GPU code writing
● A new scheduler for GrCUDA for transparent async execution
● 44% faster than synchronous execution
● Fully integrated with GraalVM, available for Python, R, Java, JavaScript, etc.
● Open source: github.com/AlbertoParravicini/grcuda
● Paper: arxiv.org/pdf/2012.09646.pdf
Alberto Parravicini
alberto.parravicini@polimi.it
Plenty of use cases:
● GPU graph/database querying (union of subqueries)
● Image processing pipelines (combining multiple filters)
● Ensembles of ML models: combining predictions from different models on the same data