GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulations. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as Pandas, NumPy, scikit-learn, and XGBoost. Through its Dask wrappers, the platform allows for true large-scale computation with minimal, if any, code changes.
The goal of this talk is to discuss RAPIDS: its functionality, its architecture, and the way it integrates with Spark, providing on many occasions several orders of magnitude of acceleration over its CPU-only counterparts.
#UnifiedDataAnalytics #SparkAISummit
Miguel Martínez. NVIDIA.
Deep Learning Solution Architect
Thomas Graves. NVIDIA.
Distributed Systems Software Engineer
– What is RAPIDS
– How to start
– cuDF
– cuML
– cuGraph
– XGBoost
– Summary
– Coming up next
– XGBoost4j-Spark Deep Dive
– Spark GPU Scheduling
– Spark Stage Level Scheduling
– Spark SQL with RAPIDS
– Summary
GPU Accelerated End-to-End Data Science
Data Preparation → Model Training → Visualization
RAPIDS is a set of open source libraries for GPU-accelerating end-to-end data science and analytics pipelines.
rapids.ai
All stages operate on data kept in GPU memory:
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• Deep Learning frameworks
• cuXfilter: Visualization
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman filter, K-means, k-NN, DBSCAN, tSVD…
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web data visualization library
• DataFrame kept in GPU memory throughout the session
LEARNING FROM APACHE ARROW
Before: each system has its own internal memory format
• Pandas, Spark, Drill, Impala, Parquet, Cassandra, Kudu, and HBase each store data differently, so every hand-off is a "copy & convert" step
• 70-80% of computation is wasted on serialization & deserialization
• Similar functionality is implemented in multiple projects
After: all systems utilize the same Arrow memory format
• No overhead for cross-system communication
• Projects can share functionality
Source: https://bit.ly/apachearrow
A step-by-step installation guide (MS Azure)
1. Create an NC6s_v2 virtual machine, and select the NVIDIA GPU Cloud Image for Deep Learning and HPC.
2. Connect to the virtual machine:
$ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 username@public_ip_address
3. Pull the RAPIDS container from NGC, and run it:
$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
$ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
4. Run JupyterLab:
(rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh
5. Open your browser and navigate to http://localhost:8080 (forwarded to the VM's port 8888 by the SSH tunnel).
6. Navigate to the cuml folder for cuML examples, or the mortgage folder for XGBoost examples.
EXTRACTION IS THE CORNERSTONE
cuDF for Faster Data Loading
• Follows the Pandas APIs and provides >10x speedup:
– CSV Reader/Writer
– Parquet Reader
– ORC Reader
– JSON Reader
– Avro Reader
• GPU Direct Storage integration is in progress to bypass PCIe bottlenecks!
• The key is GPU-accelerating both parsing and decompression wherever possible
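Because cuDF mirrors the pandas I/O API, moving CSV loading to the GPU is typically a one-line import change. A minimal sketch: the try/except fallback is an assumption for machines without a RAPIDS stack, and the in-memory CSV stands in for a real file path.

```python
import io
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment with a CUDA GPU
except ImportError:
    xdf = pd            # cuDF follows the pandas API, so pandas stands in here

# In a real workload this would be a path to a large CSV on disk
csv_data = io.StringIO("a,b\n1,2\n3,4\n")
df = xdf.read_csv(csv_data)
```

The same swap applies to `read_parquet`, `read_orc`, and `read_json`.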
ETL – THE BACKBONE OF DATA SCIENCE
libcuDF is… (CUDA C++ Library)
• A low-level library containing function implementations and a C/C++ API
• Importing/exporting Apache Arrow in GPU memory using CUDA IPC
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames
cuDF is… (Python Library)
• A Python library for manipulating GPU DataFrames following the Pandas API
• A Python interface to the CUDA C++ library with additional functionality
• Creates GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of user-defined functions (UDFs) using Numba
BENCHMARKS
Single-GPU Speedup vs Pandas
Environment
• cuDF v0.9
• Pandas v0.24.2
• GPU: NVIDIA Tesla V100 32GB
• CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Benchmark Setup
• DataFrames: 2 int32 key columns, 3 int32 value columns
• Inner merge
• GroupBy: count, sum, min, max, calculated for each value column
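The benchmarked operations can be sketched as follows. This is a small-scale illustration of the setup above, not the benchmark harness itself; frame size and key ranges are assumptions, and the pandas fallback stands in where cuDF is unavailable.

```python
import numpy as np
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd            # cuDF follows the pandas API, so pandas works as a stand-in

rng = np.random.default_rng(0)
n = 1_000  # the published benchmark used far larger frames

def make_frame():
    # 2 int32 key columns, 3 int32 value columns, as in the benchmark setup
    data = {f"key{i}": rng.integers(0, 50, n).astype("int32") for i in range(2)}
    data.update({f"val{i}": rng.integers(0, 100, n).astype("int32") for i in range(3)})
    return xdf.DataFrame(data)

left, right = make_frame(), make_frame()
merged = left.merge(right, on=["key0", "key1"], how="inner")              # inner merge
agg = left.groupby(["key0", "key1"]).agg(["count", "sum", "min", "max"])  # per value column
```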
LOADING DATA INTO A GPU DATAFRAME
cuDF code examples (shown on the slide):
• Create an empty DataFrame, and add a column
• Create a DataFrame with two columns
• Load a CSV file into a GPU DataFrame
• Use Pandas to load a CSV file, and copy its content into a GPU DataFrame
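The slide's code screenshots do not survive text extraction; a sketch of the same four examples, with a pandas fallback where cuDF is not installed (the `hasattr` guard is there only for that fallback):

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

# Create an empty DataFrame, and add a column
df = xdf.DataFrame()
df["ints"] = [1, 2, 3]

# Create a DataFrame with two columns
df2 = xdf.DataFrame({"ints": [1, 2, 3], "floats": [0.1, 0.2, 0.3]})

# Use pandas to load data, then copy its content into a GPU DataFrame
pdf = pd.DataFrame({"x": [10, 20]})
if hasattr(xdf.DataFrame, "from_pandas"):  # cuDF only
    gdf = xdf.DataFrame.from_pandas(pdf)
else:
    gdf = pdf.copy()
```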
WORKING WITH GPU DATAFRAMES
cuDF code examples (shown on the slide):
• Return the first three rows as a new DataFrame
• Row slicing with column selection
• Find the mean and standard deviation of a column
• Count the number of occurrences per value, and the number of unique values
• Transform column values with a custom function
• Change the data type of a column
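A sketch of the same operations (the original code images are not in the transcript). Note one hedge: newer cuDF releases support `Series.apply` for custom functions, while earlier ones used `applymap`; the pandas fallback below uses `apply`.

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

df = xdf.DataFrame({"a": [1, 2, 2, 4, 5], "b": [1.0, 2.0, 3.0, 4.0, 5.0]})

head3 = df.head(3)                        # first three rows as a new DataFrame
sliced = df.loc[1:3, ["a"]]               # row slicing with column selection
mu, sigma = df["b"].mean(), df["b"].std() # mean and standard deviation
counts = df["a"].value_counts()           # occurrences per value
n_unique = df["a"].nunique()              # number of unique values
doubled = df["a"].apply(lambda x: 2 * x)  # transform with a custom function
as_float = df["a"].astype("float64")      # change the dtype of a column
```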
QUERY, SORT, GROUP, JOIN, …
cuDF code examples (shown on the slide):
• Query a DataFrame with a boolean expression
• Return the first ‘n’ rows ordered by ‘columns’
• Sort a column by its values
• One-hot encoding
• Group by column with aggregate function
• Join and merge DataFrames
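A sketch of these relational operations, again with a pandas fallback (the toy data is an assumption; cuDF supports `query`, `nlargest`, `get_dummies`, `groupby`, and `merge` with the same spellings):

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path; assumes a RAPIDS environment
except ImportError:
    xdf = pd

df = xdf.DataFrame({"key": ["a", "b", "a", "c"], "val": [3, 1, 2, 4]})

q = df.query("val > 2")                        # boolean expression
top2 = df.nlargest(2, columns=["val"])         # first n rows ordered by a column
by_val = df.sort_values(by="val")              # sort by values
onehot = xdf.get_dummies(df, columns=["key"])  # one-hot encoding
sums = df.groupby("key").agg({"val": "sum"})   # group by with aggregate function

other = xdf.DataFrame({"key": ["a", "b", "c"], "label": [10, 20, 30]})
joined = df.merge(other, on="key", how="left") # join / merge DataFrames
```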
cuML roadmap
October 2019 – RAPIDS 0.10
(The slide is a matrix marking, per algorithm, Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support; the per-cell status is not recoverable from the transcript.) Algorithms listed:
• XGBoost GBDT
• GLM
• Logistic Regression
• Random Forest
• K-Means
• K-NN
• DBSCAN
• UMAP
• ARIMA & Holt-Winters
• Kalman Filter
• t-SNE
• Principal Components
• Singular Value Decomposition
• SVM
cuML roadmap
March 2020 – RAPIDS 0.14
(Same support matrix as the previous slide, updated for the 0.14 release; the per-cell status is not recoverable from the transcript.) Algorithms listed:
• XGBoost GBDT
• GLM
• Logistic Regression
• Random Forest
• K-Means
• K-NN
• DBSCAN
• UMAP
• ARIMA & Holt-Winters
• Kalman Filter
• t-SNE
• Principal Components
• Singular Value Decomposition
• SVM
CPU vs GPU: PRINCIPAL COMPONENT ANALYSIS (PCA)
The slide shows two near-identical scripts side by side.
• Common to both: data loading and algorithm parameters; model training.
• CPU-specific: import the CPU algorithm.
• GPU-specific: import the GPU algorithm; move the DataFrame from Pandas to GPU.
System: AWS p3.8xlarge
• CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
• GPU: Tesla V100 SXM2 16GB
Training results
• CPU: 57.1 seconds
• GPU: 4.28 seconds
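The two scripts differ only in the import and the Pandas-to-GPU hand-off, since cuml.PCA follows the scikit-learn interface. As an illustration of what PCA computes, a minimal NumPy sketch (not the cuML implementation; the data is random):

```python
import numpy as np

# CPU script: from sklearn.decomposition import PCA
# GPU script: from cuml import PCA  (after cudf.DataFrame.from_pandas(df))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

k = 2
Xc = X - X.mean(axis=0)                      # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                          # principal axes, shape (k, 5)
transformed = Xc @ components.T              # projection onto the top-k axes
explained_var = (S[:k] ** 2) / (len(X) - 1)  # variance along each axis
```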
CPU vs GPU: K-NEAREST NEIGHBORS (KNN)
The slide shows two near-identical scripts side by side.
• Common to both: data loading and algorithm parameters.
• CPU-specific: import the CPU algorithm; model training.
• GPU-specific: import the GPU algorithm; move the DataFrame from Pandas to GPU; model training.
System: AWS p3.8xlarge
• CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
• GPU: Tesla V100 SXM2 16GB
Training results
• CPU: 537 seconds
• GPU: 4.28 seconds
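cuml's NearestNeighbors follows the scikit-learn interface; brute-force k-NN itself is just a pairwise-distance computation plus a top-k selection, which this NumPy sketch illustrates (random data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # reference points
Q = rng.normal(size=(5, 3))    # query points
k = 4

# Pairwise squared Euclidean distances, shape (5, 100)
d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest indices per query
dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))  # their distances, ascending
```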
CPU vs GPU: TRAINING TIME COMPARISON
The bigger the dataset, the larger the training-performance gap between CPU and GPU.
Dataset size trained in 15 minutes:
• CPU: ~130,000 rows
• GPU: ~5,900,000 rows
Specs (NC6s_v2)
• Cores (Broadwell 2.6GHz): 6
• GPU: 1 x P100
• Memory: 112 GB
• Local Disk: ~700 GB SSD
• Network: Azure Network
GOALS AND BENEFITS OF CUGRAPH
Focus on Features and User Experience
Seamless Integration with cuDF & cuML
• Property graph support via DataFrames
Breakthrough Performance
• Up to 500 million edges on a single 32GB GPU
• Multi-GPU support for scaling into the billions of edges
Multiple APIs
• Python: familiar NetworkX-like API
• C/C++: lower-level granular control for application developers
Growing Functionality
• Extensive collection of algorithm, primitive, and utility functions
Louvain Single Run

G = cugraph.Graph()
G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"])
df, mod = cugraph.nvLouvain(G)

Louvain returns a cudf.DataFrame with two named columns, plus the modularity score (mod):
louvain["vertex"]: the vertex id.
louvain["partition"]: the assigned partition.
HOW XGBOOST WORKS
Single Decision Tree
Input: age, gender, hair colour, … Question: does the person like computer games?
The example tree first splits on "Age < 30", then on "Is male?"; each leaf holds a prediction score (+2, +1, -1, -1, -1).

Ensembled Decision Trees
Tree 1 is the tree above, with leaf scores +2, +1, -1. Tree 2 splits on "Use computer daily?", with leaf scores +0.9 and -0.9. The ensemble prediction is the sum of the leaf scores across all trees:
f(‘Bill’) = 2 + 0.9 = 2.9
f(‘Sam’) = -1 - 0.9 = -1.9
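The additive scoring above can be written out directly. A toy sketch of the slide's two-tree ensemble: the ages and attributes for Bill and Sam are hypothetical, chosen to land in the leaves the slide highlights.

```python
# Each tree maps a person to a leaf score; the ensemble sums the scores.
def tree1(person):
    if person["age"] < 30:
        return 2.0 if person["is_male"] else 1.0  # leaves under "Age < 30"
    return -1.0

def tree2(person):
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person):
    return tree1(person) + tree2(person)

# Hypothetical individuals matching the slide's figure
bill = {"age": 14, "is_male": True, "uses_computer_daily": True}
sam = {"age": 70, "is_male": False, "uses_computer_daily": False}
print(predict(bill), predict(sam))  # 2.9 -1.9
```

Gradient boosting fits each new tree to the residual error of the current sum, which is why the later trees carry smaller scores.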
Winner of Caterpillar Kaggle Contest 2015
– Machinery component pricing
Winner of CERN Large Hadron Collider Kaggle Contest 2015
– Classification of rare particle decay phenomena
Winner of KDD Cup 2016
– Research institutions’ impact on the acceptance of submitted
academic papers
Winner of ACM RecSys Challenge 2017
– Job posting recommendation
A STRONG HISTORY OF SUCCESS
On a Wide Range of Problems
XGBoost
– Tuned for eXtreme performance and high efficiency
– Multi-GPU and Multi-Node Support
RAPIDS
– E2E data science & analytics pipeline entirely on GPU
– User-friendly Python interfaces
– Relies on CUDA primitives, exposes parallelism and
high-memory bandwidth
– Dask integration for managing workers and data in
distributed environments
BENCHMARKS: END-TO-END
Time in seconds — Shorter is better

Configuration    cuDF (Load & Data Prep)   cuML (XGBoost)   End-to-End
20 CPU Nodes     2,290                     2,741            8,763
30 CPU Nodes     1,956                     1,675            6,147
50 CPU Nodes     1,999                     715              3,926
100 CPU Nodes    1,948                     379              3,221
DGX-2            169                       42               322
5x DGX-1         157                       19               213

(The end-to-end bars stack data loading/preparation, data conversion, and XGBoost training.)

Benchmark
200GB CSV dataset; data preparation includes joins and variable transformations.
CPU Cluster Configuration
CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration
5x DGX-1 on InfiniBand network
GPU Accelerated Data Science
RAPIDS is a set of open source software libraries that give you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs.
www.rapids.ai
Training Performance with T4
Preliminary Results on Google Cloud

                     Accuracy (AUC)   Training Loop (seconds)
Max Tree Depth = 8
  CPU (15 threads)   0.832            1071.0
  GPU                0.832            139.6
  Speedup                             767%
Max Tree Depth = 20
  CPU (15 threads)   0.833            1088.7
  GPU                0.833            165.9
  Speedup                             656%
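The speedup rows follow directly from the timing columns; a quick arithmetic check:

```python
depth8_cpu, depth8_gpu = 1071.0, 139.6
depth20_cpu, depth20_gpu = 1088.7, 165.9

speedup8 = depth8_cpu / depth8_gpu     # ratio of CPU to GPU training time
speedup20 = depth20_cpu / depth20_gpu
print(round(speedup8, 2), round(speedup20, 2))  # 7.67 6.56, i.e. 767% and 656%
```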
What about end-to-end?
● XGBoost training is pretty fast on GPUs
● Data loading/ETL is slow in comparison
● We need to accelerate the machine learning workflow end
to end
RAPIDS cuDF
● GPU-accelerated DataFrames
● Data science operations: filter, sort, join, groupby, …
● Read CSV/Parquet/ORC
● Pandas-like API
● Bare-metal, CUDA/C++ performance
● Apache Arrow-compatible memory format
● Contiguous, column-major data representation
XGBoost using cuDF
● Read CSV/Parquet/ORC directly into GPU memory
● Converts the column-major cuDF DataFrame to a sparse, row-major DMatrix
● Requires a memcpy, but all data stays in GPU memory
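The column-major to row-major copy that the DMatrix build performs can be sketched with NumPy. This is an illustration of the layout change only, not the actual cuDF/XGBoost code, and the toy columns are assumptions.

```python
import numpy as np

# Three "columns" as cuDF would hold them: separate contiguous arrays
col_a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
col_b = np.array([4.0, 5.0, 6.0], dtype=np.float32)
col_c = np.array([7.0, 8.0, 9.0], dtype=np.float32)

# Row-major matrix as XGBoost's DMatrix expects: one copy, rows contiguous
rows = np.ascontiguousarray(np.column_stack([col_a, col_b, col_c]))
assert rows.flags["C_CONTIGUOUS"]
# xgboost.DMatrix would consume this; with cuDF, the equivalent conversion
# happens without the data ever leaving GPU memory.
```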
RAPIDS XGBOOST4J-SPARK
Training a classification model on 17 years of mortgage data (190GB)
On-prem: 9X speedup / 5X cost saving
• CPU: 2 Dell PowerEdge R740 (16 cores)
• GPU: 2 PowerEdge R740, each with 4x T4 (16GB)
AWS: 34X speedup / 6X cost saving
• CPU: 4 r5a.4xlarge servers (16 cores)
• GPU: 1 p3.16xlarge server with 8x V100 (16GB)
● Blog post: https://news.developer.nvidia.com/gpu-accelerated-spark-xgboost/
● Sample applications, notebooks and getting started guides at:
https://github.com/rapidsai/spark-examples
● Submit issues at https://github.com/rapidsai/spark-examples/issues and join
the conversation!
GPU Scheduling
● Accelerator-aware scheduling (SPARK-24615)
○ Request Executor resources (GPU, FPGA, etc.)
○ Request Driver resources
○ Request Task resources
○ Discover resources
○ Scheduler assigns resources to Tasks
○ API to get assigned resources
○ Supported on YARN, Kubernetes, and Standalone
○ Spark 3.0
GPU Scheduling
EXAMPLE:
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh

An example discovery script ships in the Spark GitHub repo (examples/src/main/scripts/getGpusResources.sh).
GPU Scheduling
Task API:

val context = TaskContext.get()
val resources = context.resources()
val assignedGpuAddrs = resources("gpu").addresses
// ... pass assignedGpuAddrs into TensorFlow or other AI code
GPU Scheduling
● SPIP documentation: https://issues.apache.org/jira/browse/SPARK-24615
● User documentation: https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview
● Working with Berkeley RISELab to install GPUs in CI Jenkins nodes
● Fractional GPU support: SPARK-29151
Stage Level Scheduling

// ... do some ETL using default configs, then ...
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048))
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))

val rdd = sc.makeRDD(1 to 10, 5).mapPartitions { it =>
  val context = TaskContext.get()
  val resources = context.resources()
  val assignedGpuAddrs = resources("gpu").addresses.iterator
  // ... feed into ML algorithm ...
}.withResources(rp)
rdd.collect()
SQL GPU Columnar Processing
● Users want to process data in a columnar manner, for instance to run on the GPU:
○ Vectorized R Execution in Apache Spark
○ Making Nested Columns as First Citizens in Apache Spark SQL (Apple)
○ Vectorized Query Execution in Apache Spark at Facebook
● The Dataset/DataFrame API in Spark only exposes one row at a time when processing data. This doesn't take advantage of columnar processing.
SQL GPU Columnar Processing
● Columnar processing (SPARK-27396)
○ Catalyst API for columnar processing
○ Plugins can modify the query plan with columnar operations
○ Spark 3.0
SQL GPU Columnar Processing
RAPIDS for Apache Spark Plugin
● A plugin that allows running Spark on a GPU
● Requires Spark 3.0
● No code changes required
● Only needs the plugin jar, the cuDF jar, and setting spark.sql.extensions
● Runs the operations available on the GPU
● If an operation is not implemented or not compatible with the GPU, it will run on the CPU
● Handles transitioning from row-based to columnar processing and back
● Uses the RAPIDS cuDF library
BENCHMARK: TPCH Query 4

select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '1993-07-01'
  and o_orderdate < date '1993-07-01' + interval '3' month
  and exists (
    select * from lineitem
    where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
  )
group by o_orderpriority
order by o_orderpriority

3.55X speedup
Setup: 100GB data, GCP, 5 workers, n1-highmem-16, using 80 shuffle partitions
SQL GPU Columnar Processing
Future Enhancements
● Optimizer to choose when to run operations on the GPU
● More operations
● Handle larger data: spilling
● GPU Direct Storage
● RDMA and shuffle improvements
● Better exchange format for ETL to ML