GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulations. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as Pandas, NumPy, scikit-learn, and XGBoost. Through its Dask wrappers, the platform allows for true large-scale computation with minimal, if any, code changes.
The goal of this talk is to discuss RAPIDS: its functionality, its architecture, and the way it integrates with Spark, providing on many occasions several orders of magnitude of acceleration over its CPU-only counterparts.
#UnifiedDataAnalytics #SparkAISummit
Miguel Martínez. NVIDIA.
Deep Learning Solution Architect
Thomas Graves. NVIDIA.
Distributed Systems Software Engineer
– What is RAPIDS
– How to start
– cuDF
– cuML
– cuGraph
– XGBoost
– Summary
– Coming up next
– XGBoost4j-Spark Deep Dive
– Spark GPU Scheduling
– Spark Stage Level Scheduling
– Spark SQL with RAPIDS
– Summary
GPU Accelerated End-to-End Data Science
Data Preparation → Model Training → Visualization
RAPIDS is a set of open source libraries for GPU-accelerating end-to-end data science and analytics pipelines.
rapids.ai
All stages operate on data kept in GPU memory:
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• Deep Learning frameworks
• cuXfilter: Visualization
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman filter, K-means, k-NN, DBSCAN, tSVD…
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web data visualization library
• DataFrame kept in GPU memory throughout the session
LEARNING FROM APACHE ARROW
Before: each system has its own internal memory format
• Pandas, Spark, Drill, Impala, Parquet, Cassandra, Kudu, and HBase each store data differently, so every hand-off is a "copy & convert" step
• 70-80% of computation is wasted on serialization & deserialization
• Similar functionality is implemented in multiple projects
After: all systems utilize the same Arrow memory format
• No overhead for cross-system communication
• Projects can share functionality
Source: https://bit.ly/apachearrow
A step-by-step installation guide (MS Azure)
1. Create an NC6s_v2 virtual machine, and select the NVIDIA GPU Cloud Image for Deep Learning and HPC.
2. Connect to the virtual machine:
$ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 username@public_ip_address
3. Pull the RAPIDS container from NGC, and run it:
$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
$ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
4. Run JupyterLab:
(rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh
5. Open your browser and navigate to http://localhost:8080 (forwarded to the VM's port 8888 by the SSH tunnel).
6. Navigate to the cuml folder for cuML examples, or the mortgage folder for XGBoost examples.
EXTRACTION IS THE CORNERSTONE
cuDF for Faster Data Loading
• Follows the Pandas APIs and provides >10x speedup:
– CSV Reader/Writer
– Parquet Reader
– ORC Reader
– JSON Reader
– Avro Reader
• GPU Direct Storage integration is in progress to bypass PCIe bottlenecks!
• The key is GPU-accelerating both parsing and decompression wherever possible
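Because cuDF mirrors the pandas I/O API, moving CSV loading to the GPU is typically a one-line import change. A minimal sketch: the try/except fallback is an assumption for machines without a RAPIDS stack, and the in-memory CSV stands in for a real file path.

```python
import io
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment with a CUDA GPU
except ImportError:
    xdf = pd            # cuDF follows the pandas API, so pandas stands in here

# In a real workload this would be a path to a large CSV on disk
csv_data = io.StringIO("a,b\n1,2\n3,4\n")
df = xdf.read_csv(csv_data)
```

The same swap applies to `read_parquet`, `read_orc`, and `read_json`.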
ETL – THE BACKBONE OF DATA SCIENCE
libcuDF is… (CUDA C++ Library)
• A low-level library containing function implementations and a C/C++ API
• Importing/exporting Apache Arrow in GPU memory using CUDA IPC
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames
cuDF is… (Python Library)
• A Python library for manipulating GPU DataFrames following the Pandas API
• A Python interface to the CUDA C++ library with additional functionality
• Creates GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of user-defined functions (UDFs) using Numba
BENCHMARKS
Single-GPU Speedup vs Pandas
Environment
• cuDF v0.9
• Pandas v0.24.2
• GPU: NVIDIA Tesla V100 32GB
• CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Benchmark Setup
• DataFrames: 2 int32 key columns, 3 int32 value columns
• Inner merge
• GroupBy: count, sum, min, max, calculated for each value column
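The benchmarked operations can be sketched as follows. This is a small-scale illustration of the setup above, not the benchmark harness itself; frame size and key ranges are assumptions, and the pandas fallback stands in where cuDF is unavailable.

```python
import numpy as np
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd            # cuDF follows the pandas API, so pandas works as a stand-in

rng = np.random.default_rng(0)
n = 1_000  # the published benchmark used far larger frames

def make_frame():
    # 2 int32 key columns, 3 int32 value columns, as in the benchmark setup
    data = {f"key{i}": rng.integers(0, 50, n).astype("int32") for i in range(2)}
    data.update({f"val{i}": rng.integers(0, 100, n).astype("int32") for i in range(3)})
    return xdf.DataFrame(data)

left, right = make_frame(), make_frame()
merged = left.merge(right, on=["key0", "key1"], how="inner")              # inner merge
agg = left.groupby(["key0", "key1"]).agg(["count", "sum", "min", "max"])  # per value column
```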
LOADING DATA INTO A GPU DATAFRAME
cuDF code examples (shown on the slide):
• Create an empty DataFrame, and add a column
• Create a DataFrame with two columns
• Load a CSV file into a GPU DataFrame
• Use Pandas to load a CSV file, and copy its content into a GPU DataFrame
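The slide's code screenshots do not survive text extraction; a sketch of the same four examples, with a pandas fallback where cuDF is not installed (the `hasattr` guard is there only for that fallback):

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

# Create an empty DataFrame, and add a column
df = xdf.DataFrame()
df["ints"] = [1, 2, 3]

# Create a DataFrame with two columns
df2 = xdf.DataFrame({"ints": [1, 2, 3], "floats": [0.1, 0.2, 0.3]})

# Use pandas to load data, then copy its content into a GPU DataFrame
pdf = pd.DataFrame({"x": [10, 20]})
if hasattr(xdf.DataFrame, "from_pandas"):  # cuDF only
    gdf = xdf.DataFrame.from_pandas(pdf)
else:
    gdf = pdf.copy()
```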
WORKING WITH GPU DATAFRAMES
cuDF code examples (shown on the slide):
• Return the first three rows as a new DataFrame
• Row slicing with column selection
• Find the mean and standard deviation of a column
• Count the number of occurrences per value, and the number of unique values
• Transform column values with a custom function
• Change the data type of a column
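A sketch of the same operations (the original code images are not in the transcript). Note one hedge: newer cuDF releases support `Series.apply` for custom functions, while earlier ones used `applymap`; the pandas fallback below uses `apply`.

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

df = xdf.DataFrame({"a": [1, 2, 2, 4, 5], "b": [1.0, 2.0, 3.0, 4.0, 5.0]})

head3 = df.head(3)                        # first three rows as a new DataFrame
sliced = df.loc[1:3, ["a"]]               # row slicing with column selection
mu, sigma = df["b"].mean(), df["b"].std() # mean and standard deviation
counts = df["a"].value_counts()           # occurrences per value
n_unique = df["a"].nunique()              # number of unique values
doubled = df["a"].apply(lambda x: 2 * x)  # transform with a custom function
as_float = df["a"].astype("float64")      # change the dtype of a column
```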
QUERY, SORT, GROUP, JOIN, …
cuDF code examples (shown on the slide):
• Query a DataFrame with a boolean expression
• Return the first ‘n’ rows ordered by ‘columns’
• Sort a column by its values
• One-hot encoding
• Group by column with aggregate function
• Join and merge DataFrames
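A sketch of these relational operations, again with a pandas fallback (the toy data is an assumption; cuDF supports `query`, `nlargest`, `get_dummies`, `groupby`, and `merge` with the same spellings):

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

df = xdf.DataFrame({"key": ["a", "b", "a", "c"], "val": [3, 1, 2, 4]})

q = df.query("val > 2")                        # boolean expression
top2 = df.nlargest(2, columns=["val"])         # first n rows ordered by a column
by_val = df.sort_values(by="val")              # sort by values
onehot = xdf.get_dummies(df, columns=["key"])  # one-hot encoding
sums = df.groupby("key").agg({"val": "sum"})   # group by with aggregate function

other = xdf.DataFrame({"key": ["a", "b", "c"], "label": [10, 20, 30]})
joined = df.merge(other, on="key", how="left") # join / merge DataFrames
```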
cuML roadmap
October 2019 – RAPIDS 0.10
(The slide is a matrix marking, per algorithm, Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support; the per-cell status is not recoverable from the transcript.) Algorithms listed:
• XGBoost GBDT
• GLM
• Logistic Regression
• Random Forest
• K-Means
• K-NN
• DBSCAN
• UMAP
• ARIMA & Holt-Winters
• Kalman Filter
• t-SNE
• Principal Components
• Singular Value Decomposition
• SVM
cuML roadmap
March 2020 – RAPIDS 0.14
(Same support matrix as the previous slide, updated for the 0.14 release; the per-cell status is not recoverable from the transcript.) Algorithms listed:
• XGBoost GBDT
• GLM
• Logistic Regression
• Random Forest
• K-Means
• K-NN
• DBSCAN
• UMAP
• ARIMA & Holt-Winters
• Kalman Filter
• t-SNE
• Principal Components
• Singular Value Decomposition
• SVM
CPU vs GPU: PRINCIPAL COMPONENT ANALYSIS (PCA)
The slide shows two near-identical scripts side by side.
• Common to both: data loading and algorithm parameters; model training.
• CPU-specific: import the CPU algorithm.
• GPU-specific: import the GPU algorithm; move the DataFrame from Pandas to GPU.
System: AWS p3.8xlarge
• CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
• GPU: Tesla V100 SXM2 16GB
Training results
• CPU: 57.1 seconds
• GPU: 4.28 seconds
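The two scripts differ only in the import and the Pandas-to-GPU hand-off, since cuml.PCA follows the scikit-learn interface. As an illustration of what PCA computes, a minimal NumPy sketch (not the cuML implementation; the data is random):

```python
import numpy as np

# CPU script: from sklearn.decomposition import PCA
# GPU script: from cuml import PCA  (after cudf.DataFrame.from_pandas(df))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

k = 2
Xc = X - X.mean(axis=0)                      # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                          # principal axes, shape (k, 5)
transformed = Xc @ components.T              # projection onto the top-k axes
explained_var = (S[:k] ** 2) / (len(X) - 1)  # variance along each axis
```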
CPU vs GPU: K-NEAREST NEIGHBORS (KNN)
The slide shows two near-identical scripts side by side.
• Common to both: data loading and algorithm parameters.
• CPU-specific: import the CPU algorithm; model training.
• GPU-specific: import the GPU algorithm; move the DataFrame from Pandas to GPU; model training.
System: AWS p3.8xlarge
• CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
• GPU: Tesla V100 SXM2 16GB
Training results
• CPU: 537 seconds
• GPU: 4.28 seconds
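cuml's NearestNeighbors follows the scikit-learn interface; brute-force k-NN itself is just a pairwise-distance computation plus a top-k selection, which this NumPy sketch illustrates (random data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # reference points
Q = rng.normal(size=(5, 3))    # query points
k = 4

# Pairwise squared Euclidean distances, shape (5, 100)
d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest indices per query
dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))  # their distances, ascending
```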
CPU vs GPU: TRAINING TIME COMPARISON
The bigger the dataset, the larger the training-performance gap between CPU and GPU.
Dataset size trained in 15 minutes:
• CPU: ~130,000 rows
• GPU: ~5,900,000 rows
Specs (NC6s_v2)
• Cores (Broadwell 2.6GHz): 6
• GPU: 1 x P100
• Memory: 112 GB
• Local Disk: ~700 GB SSD
• Network: Azure Network
GOALS AND BENEFITS OF CUGRAPH
Focus on Features and User Experience
Seamless Integration with cuDF & cuML
• Property graph support via DataFrames
Breakthrough Performance
• Up to 500 million edges on a single 32GB GPU
• Multi-GPU support for scaling into the billions of edges
Multiple APIs
• Python: familiar NetworkX-like API
• C/C++: lower-level granular control for application developers
Growing Functionality
• Extensive collection of algorithm, primitive, and utility functions
Louvain Single Run

G = cugraph.Graph()
G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"])
df, mod = cugraph.nvLouvain(G)

Louvain returns a cudf.DataFrame with two named columns, plus the modularity score (mod):
louvain["vertex"]: the vertex id.
louvain["partition"]: the assigned partition.
HOW XGBOOST WORKS
Single Decision Tree
Input: age, gender, hair colour, … Question: does the person like computer games?
The example tree first splits on "Age < 30", then on "Is male?"; each leaf holds a prediction score (+2, +1, -1, -1, -1).

Ensembled Decision Trees
Tree 1 is the tree above, with leaf scores +2, +1, -1. Tree 2 splits on "Use computer daily?", with leaf scores +0.9 and -0.9. The ensemble prediction is the sum of the leaf scores across all trees:
f(‘Bill’) = 2 + 0.9 = 2.9
f(‘Sam’) = -1 - 0.9 = -1.9
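The additive scoring above can be written out directly. A toy sketch of the slide's two-tree ensemble: the ages and attributes for Bill and Sam are hypothetical, chosen to land in the leaves the slide highlights.

```python
# Each tree maps a person to a leaf score; the ensemble sums the scores.
def tree1(person):
    if person["age"] < 30:
        return 2.0 if person["is_male"] else 1.0  # leaves under "Age < 30"
    return -1.0

def tree2(person):
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person):
    return tree1(person) + tree2(person)

# Hypothetical individuals matching the slide's figure
bill = {"age": 14, "is_male": True, "uses_computer_daily": True}
sam = {"age": 70, "is_male": False, "uses_computer_daily": False}
print(predict(bill), predict(sam))  # 2.9 -1.9
```

Gradient boosting fits each new tree to the residual error of the current sum, which is why the later trees carry smaller scores.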
Winner of Caterpillar Kaggle Contest 2015
– Machinery component pricing
Winner of CERN Large Hadron Collider Kaggle Contest 2015
– Classification of rare particle decay phenomena
Winner of KDD Cup 2016
– Research institutions’ impact on the acceptance of submitted
academic papers
Winner of ACM RecSys Challenge 2017
– Job posting recommendation
A STRONG HISTORY OF SUCCESS
On a Wide Range of Problems
XGBoost
– Tuned for eXtreme performance and high efficiency
– Multi-GPU and Multi-Node Support
RAPIDS
– E2E data science & analytics pipeline entirely on GPU
– User-friendly Python interfaces
– Relies on CUDA primitives, exposes parallelism and
high-memory bandwidth
– Dask integration for managing workers and data in
distributed environments
BENCHMARKS: END-TO-END
Time in seconds — Shorter is better

Configuration    cuDF (Load & Data Prep)   cuML (XGBoost)   End-to-End
20 CPU Nodes     2,290                     2,741            8,763
30 CPU Nodes     1,956                     1,675            6,147
50 CPU Nodes     1,999                     715              3,926
100 CPU Nodes    1,948                     379              3,221
DGX-2            169                       42               322
5x DGX-1         157                       19               213

(The end-to-end bars stack data loading/preparation, data conversion, and XGBoost training.)

Benchmark
200GB CSV dataset; data preparation includes joins and variable transformations.
CPU Cluster Configuration
CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration
5x DGX-1 on InfiniBand network
GPU Accelerated Data Science
RAPIDS is a set of open source software libraries that give you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs.
www.rapids.ai
Training Performance with T4
Preliminary Results on Google Cloud

                     Accuracy (AUC)   Training Loop (seconds)
Max Tree Depth = 8
  CPU (15 threads)   0.832            1071.0
  GPU                0.832            139.6
  Speedup                             767%
Max Tree Depth = 20
  CPU (15 threads)   0.833            1088.7
  GPU                0.833            165.9
  Speedup                             656%
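The speedup rows follow directly from the timing columns; a quick arithmetic check:

```python
depth8_cpu, depth8_gpu = 1071.0, 139.6
depth20_cpu, depth20_gpu = 1088.7, 165.9

speedup8 = depth8_cpu / depth8_gpu     # ratio of CPU to GPU training time
speedup20 = depth20_cpu / depth20_gpu
print(round(speedup8, 2), round(speedup20, 2))  # 7.67 6.56, i.e. 767% and 656%
```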
What about end-to-end?
● XGBoost training is pretty fast on GPUs
● Data loading/ETL is slow in comparison
● We need to accelerate the machine learning workflow end
to end
RAPIDS cuDF
● GPU-accelerated DataFrames
● Data science operations: filter, sort, join, groupby, …
● Read CSV/Parquet/ORC
● Pandas-like API
● Bare-metal, CUDA/C++ performance
● Apache Arrow-compatible memory format
● Contiguous, column-major data representation
XGBoost using cuDF
● Read CSV/Parquet/ORC directly into GPU memory
● Converts the column-major cuDF DataFrame to a sparse, row-major DMatrix
● Requires a memcpy, but all data stays in GPU memory
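The column-major to row-major copy that the DMatrix build performs can be sketched with NumPy. This is an illustration of the layout change only, not the actual cuDF/XGBoost code, and the toy columns are assumptions.

```python
import numpy as np

# Three "columns" as cuDF would hold them: separate contiguous arrays
col_a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
col_b = np.array([4.0, 5.0, 6.0], dtype=np.float32)
col_c = np.array([7.0, 8.0, 9.0], dtype=np.float32)

# Row-major matrix as XGBoost's DMatrix expects: one copy, rows contiguous
rows = np.ascontiguousarray(np.column_stack([col_a, col_b, col_c]))
assert rows.flags["C_CONTIGUOUS"]
# xgboost.DMatrix would consume this; with cuDF, the equivalent conversion
# happens without the data ever leaving GPU memory.
```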
RAPIDS XGBOOST4J-SPARK
Training a classification model on 17 years of mortgage data (190GB)
On-prem: 9X speedup / 5X cost saving
• CPU: 2 Dell PowerEdge R740 (16 cores)
• GPU: 2 PowerEdge R740, each with 4x T4 (16GB)
AWS: 34X speedup / 6X cost saving
• CPU: 4 r5a.4xlarge servers (16 cores)
• GPU: 1 p3.16xlarge server with 8x V100 (16GB)
● Blog post: https://news.developer.nvidia.com/gpu-accelerated-spark-xgboost/
● Sample applications, notebooks and getting started guides at:
https://github.com/rapidsai/spark-examples
● Submit issues at https://github.com/rapidsai/spark-examples/issues and join
the conversation!
GPU Scheduling
● Accelerator-aware scheduling (SPARK-24615)
○ Request Executor resources (GPU, FPGA, etc.)
○ Request Driver resources
○ Request Task resources
○ Discover resources
○ Scheduler assigns resources to Tasks
○ API to get assigned resources
○ Supported on YARN, Kubernetes, and Standalone
○ Spark 3.0
GPU Scheduling
EXAMPLE:
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh

An example discovery script ships in the Spark GitHub repo (examples/src/main/scripts/getGpusResources.sh).
GPU Scheduling
Task API:

val context = TaskContext.get()
val resources = context.resources()
val assignedGpuAddrs = resources("gpu").addresses
// ... pass assignedGpuAddrs into TensorFlow or other AI code
GPU Scheduling
● SPIP documentation: https://issues.apache.org/jira/browse/SPARK-24615
● User documentation: https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview
● Working with Berkeley RISELab to install GPUs in CI Jenkins nodes
● Fractional GPU support: SPARK-29151
Stage Level Scheduling

// ... do some ETL using default configs, then ...
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048))
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))

val rdd = sc.makeRDD(1 to 10, 5).mapPartitions { it =>
  val context = TaskContext.get()
  val resources = context.resources()
  val assignedGpuAddrs = resources("gpu").addresses.iterator
  // ... feed into ML algorithm ...
}.withResources(rp)
rdd.collect()
SQL GPU Columnar Processing
● Users want to process data in a columnar manner, for instance to run on the GPU:
○ Vectorized R Execution in Apache Spark
○ Making Nested Columns as First Citizens in Apache Spark SQL (Apple)
○ Vectorized Query Execution in Apache Spark at Facebook
● The Dataset/DataFrame API in Spark only exposes one row at a time when processing data. This doesn't take advantage of columnar processing.
SQL GPU Columnar Processing
● Columnar processing (SPARK-27396)
○ Catalyst API for columnar processing
○ Plugins can modify the query plan with columnar operations
○ Spark 3.0
SQL GPU Columnar Processing
RAPIDS for Apache Spark Plugin
● A plugin that allows running Spark on a GPU
● Requires Spark 3.0
● No code changes required
● Only needs the plugin jar, the cuDF jar, and setting spark.sql.extensions
● Runs the operations available on the GPU
● If an operation is not implemented or not compatible with the GPU, it will run on the CPU
● Handles transitioning from row-based to columnar processing and back
● Uses the RAPIDS cuDF library
BENCHMARK: TPCH Query 4

select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '1993-07-01'
  and o_orderdate < date '1993-07-01' + interval '3' month
  and exists (
    select * from lineitem
    where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
  )
group by o_orderpriority
order by o_orderpriority

3.55X speedup
Setup: 100GB data, GCP, 5 workers, n1-highmem-16, using 80 shuffle partitions
SQL GPU Columnar Processing
Future Enhancements
● Optimizer to choose when to run operations on the GPU
● More operations
● Handle larger data: spilling
● GPU Direct Storage
● RDMA and shuffle improvements
● Better exchange format for ETL to ML