Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed on GPU-enabled Kubernetes clusters.
4. Open Standards Data Science Ecosystem: Traditional Python APIs on CPU
[Diagram: the traditional Python stack operating in CPU memory, spanning data preparation, model training, and visualization: Pandas (analytics), Scikit-Learn (machine learning), NetworkX (graph analytics), PyTorch/TensorFlow/MxNet (deep learning), Matplotlib (visualization), with Dask for scaling.]
5. RAPIDS: End-to-End GPU Accelerated Data Science
[Diagram: the same pipeline accelerated in GPU memory: cuDF/cuIO (analytics), cuML and XGBoost (machine learning), cuGraph (graph analytics), PyTorch/TensorFlow/MxNet (deep learning), cuxfilter/pyViz/plotly (visualization), with Dask for scaling.]
6. RAPIDS ETL: GPU Accelerated Data Wrangling and Feature Engineering
[Diagram: the same stack with cuDF/cuIO (analytics) highlighted as the ETL layer feeding cuML, cuGraph, the deep learning frameworks, and the visualization libraries, with Dask for scaling in GPU memory.]
7. Data Processing Evolution: Faster Data Access, Less Data Movement
[Diagram comparing three pipeline architectures:
▸ Hadoop processing, reading from disk: an HDFS read and an HDFS write surround every stage (Query, ETL, ML Train).
▸ Spark in-memory processing: a single HDFS read, then Query, ETL, and ML Train run in memory. 25-100x improvement, less code, language flexible, primarily in-memory.
▸ Traditional GPU processing: each stage additionally incurs a GPU read and a CPU write around the in-memory pipeline. 5-10x improvement, more code, language rigid, substantially on GPU.]
8. Data Processing Evolution: Faster Data Access, Less Data Movement (continued)
[The same diagram, adding a fourth row:
▸ RAPIDS: a single Arrow read, then Query, ETL, and ML Train stay entirely on the GPU. 50-100x improvement, same code, language flexible, primarily on GPU.]
11. ETL - the Backbone of Data Science

cuDF is…
A PYTHON LIBRARY
▸ A Python library for manipulating GPU DataFrames following the Pandas API
▸ A Python interface to the CUDA C++ library, with additional functionality
▸ Creates GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables
▸ JIT compiles User-Defined Functions (UDFs) using Numba
▸ Reads and writes the most common formats: CSV, Parquet, ORC, JSON, AVRO, HDF5, and more...
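To make the Pandas-API point concrete, here is a minimal sketch of cuDF in use (the file and column names are illustrative, not from the talk):

import cudf

# Read a CSV straight into GPU memory (hypothetical file)
df = cudf.read_csv("tips.csv")

# Familiar Pandas-style expressions, executed on the GPU
df["tip_rate"] = df["tip"] / df["total_bill"]
print(df.groupby("day").tip_rate.mean())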
12. Benchmarks: Single-GPU Speedup vs. Pandas
cuDF v0.13, Pandas 0.25.3

▸ Running on NVIDIA DGX-1:
  ▸ GPU: NVIDIA Tesla V100 32GB
  ▸ CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
▸ Benchmark setup:
  ▸ RMM pool allocator enabled
  ▸ DataFrames: 2x int32 key columns, 3x int32 value columns
  ▸ Merge: inner; GroupBy: count, sum, min, max calculated for each value column
[Bar chart: GPU speedup over CPU for Merge, Sort, and GroupBy on 10M- and 100M-row DataFrames, ranging from roughly 320x to 970x.]
13. Machine Learning with RAPIDS: More Models, More Problems
[Diagram: the RAPIDS stack with cuML (machine learning) highlighted, alongside cuDF/cuIO analytics, cuGraph graph analytics, the PyTorch/TensorFlow/MxNet deep learning frameworks, and the cuxfilter/pyViz/plotly visualization libraries, with Dask for scaling in GPU memory.]
19. Forest Inference: Taking Models From Training to Production

cuML's Forest Inference Library (FIL) accelerates prediction (inference) for random forests and boosted decision trees:
▸ Works with existing saved models (XGBoost, LightGBM, scikit-learn RF, cuML RF)
▸ Lightweight Python API
▸ A single V100 GPU can infer up to 34x faster than a dual-CPU XGBoost node
▸ Over 100 million forest inferences/sec on a DGX-1V
[Bar chart: XGBoost CPU inference time in ms (40 cores) vs. FIL on a single V100, 1000 trees, on the Bosch, Airline, Epsilon, and Higgs datasets; FIL is 23x to 36x faster.]
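As a sketch of that lightweight Python API (the model path and the X_test feature matrix are assumptions, not from the talk):

from cuml import ForestInference

# Load an existing saved XGBoost model for GPU-accelerated inference
fil_model = ForestInference.load(
    filename="xgboost.model",   # hypothetical path to a saved model
    model_type="xgboost",
    output_class=True,
)
preds = fil_model.predict(X_test)   # X_test: a feature matrix prepared elsewhere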
20. XGBoost + RAPIDS: Better Together

▸ RAPIDS comes paired with XGBoost 1.2 (as of RAPIDS 0.15)
▸ XGBoost now builds on the GoAI interface standards to provide zero-copy data import from cuDF, cuPy, Numba, PyTorch, and more
▸ The official Dask API makes it easy to scale to multiple nodes or multiple GPUs
▸ The gpu_hist tree builder delivers huge performance gains; memory usage when importing GPU data has decreased by 2/3 or more
▸ New objectives support Learning to Rank on GPU

All RAPIDS changes are integrated upstream and provided to all XGBoost users, via PyPI or RAPIDS conda.
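A minimal sketch of the zero-copy path from cuDF into XGBoost (the file and column names are hypothetical):

import cudf
import xgboost as xgb

# Load data on the GPU and hand it to XGBoost without a host round-trip
df = cudf.read_csv("train.csv")
dtrain = xgb.DMatrix(df.drop(columns=["label"]), label=df["label"])

# gpu_hist keeps tree construction on the GPU
params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
booster = xgb.train(params, dtrain, num_boost_round=100)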
22. 22
Exactly as it sounds—our goal is to make
RAPIDS as usable and performant as
possible wherever data science is done.
We will continue to work with more open
source projects to further democratize
acceleration and efficiency in data science.
RAPIDS Everywhere
The Next Phase of RAPIDS
24. MLflow

"... an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry."
- mlflow.org

…. And it works with RAPIDS, out of the box!
25. Why RAPIDS + MLflow?

RAPIDS: substantial speedups across a wide range of machine learning and ETL tasks, with a scikit-learn compatible API.
MLflow: improved collaboration, experiment tracking, model storage, registration, and deployment.
[Diagram: the ML lifecycle loop: training and validation feed a "Good?" decision gate, approved updates flow to production/engineering, and production updates feed back into training.]
26. HPO Use Case: 100-Job Random Forest Airline Model

Huge speedups translate into a >7x TCO reduction.
Based on sample Random Forest training code from the cloud-ml-examples repository, running on Azure ML: 10 concurrent workers with 100 total runs, 100M rows, 5-fold cross-validation per run.
GPU nodes: 10x Standard_NC6s_v3 (1x V100 16GB, 6 vCPUs, 112GB memory, Xeon E5-2690 v4 Broadwell), $3.366/hour
CPU nodes: 10x Standard_DS5_v2 (16 vCPUs, 56GB memory, Xeon E5-2673 v3 Haswell or v4 Broadwell), $1.017/hour
[Bar chart: cost and time in hours for the CPU vs. GPU configurations.]
29. A Quick Example: Convert an Existing Project

▸ Conversion to RAPIDS and MLflow
▸ Add nesting + HPO and model logging
▸ Add project entry points
▸ Anaconda and Docker training
▸ Deploying a trained model
30. Integration and Training: Basic Conversion

Unmodified training code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train(fpath, max_depth, max_features, n_estimators):
    # load_data is the project's train/test loading helper
    X_train, X_test, y_train, y_test = load_data(fpath)
    mod = RandomForestClassifier(
        max_depth=max_depth,
        max_features=max_features,
        n_estimators=n_estimators
    )
    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    return mod, accuracy

Augmented training code (SKlearn to cuML, plus MLflow tracking):

import mlflow
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score

def train(fpath, max_depth, max_features, n_estimators):
    X_train, X_test, y_train, y_test = load_data(fpath)
    # Start an MLflow 'run'
    with mlflow.start_run(run_name="RAPIDS-MLFlow"):
        # Record parameters
        mlparams = {
            "max_depth": str(max_depth),
            "max_features": str(max_features),
            "n_estimators": str(n_estimators),
        }
        mlflow.log_params(mlparams)
        mod = RandomForestClassifier(
            max_depth=max_depth,
            max_features=max_features,
            n_estimators=n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
        # Record performance metrics
        mlflow.log_metric("accuracy", accuracy)
        return mod
31. Integration: Nesting + HPO and Model Logging

Update nested training (called by the HPO runner):

from cuml.ensemble import RandomForestClassifier
# Import our HPO library
from your_hpo_library import HPO_Runner

# Called by hpo_runner once per trial
def hpo_train(params):
    X_train, X_test, y_train, y_test = load_data(params.fpath)
    with mlflow.start_run(run_name=f"Trial {params.trial}", nested=True):
        mod = RandomForestClassifier(
            max_depth=params.max_depth,
            max_features=params.max_features,
            n_estimators=params.n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
        return mod, accuracy

Add the HPO runner, log runs, and register the best result:

hpo_runner = HPO_Runner(hpo_train)

with mlflow.start_run(run_name="RAPIDS-HPO", nested=True):
    # uniform comes from the HPO library's search-space helpers
    search_space = [
        uniform("max_depth", 5, 20),
        uniform("max_features", 0.1, 1.0),
        uniform("n_estimators", 150, 1000),
    ]
    hpo_results = hpo_runner(fpath, search_space)

# Log and register the best model from the search
artifact_path = "rapids-mlflow-example"
with mlflow.start_run(run_name="Final Classifier", nested=True):
    mlflow.sklearn.log_model(hpo_results.best_model,
                             artifact_path=artifact_path,
                             registered_model_name="rapids-mlflow-example",
                             conda_env="conda/conda.yaml")
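The conda_env argument above points at an environment file used to reproduce the training environment. A minimal sketch of what conda/conda.yaml might contain (the channels and version pins are assumptions, not from the talk):

name: rapids-mlflow-example
channels:
  - rapidsai
  - nvidia
  - conda-forge
dependencies:
  - python=3.8
  - cudatoolkit=10.2
  - cuml=0.15
  - pip
  - pip:
    - mlflow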
33. Integration and Training: Bringing Things Together

Anaconda

## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow

## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow

## Point MLflow at a local SQLite tracking backend
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite

## Create a training run, in a custom Conda environment, with 'mlflow run'
$ mlflow run --experiment-name "RAPIDS-MLflow-Conda" --entry-point hpo_run ./
....
Created version '10' of model 'rapids_mlflow_cli'.
Model uri: ./mlruns/3/c20642df4137490fba2ca96a7b4431b0/artifacts/Airline-Demo
2020/09/29 23:36:37 INFO mlflow.projects: === Run (ID 'c20642df4137490fba2ca96a7b4431b0') succeeded ===
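The --entry-point flag refers to an entry point declared in the project's MLproject file. A minimal sketch of such a file (the parameter and command are illustrative assumptions, not the project's actual file):

name: rapids-mlflow-example
conda_env: conda/conda.yaml
entry_points:
  hpo_run:
    parameters:
      fpath: {type: string, default: "airline_data.parquet"}
    command: "python train.py --fpath {fpath}"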
Docker

## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow

## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow

## Build the training container image, which we can also deploy later
$ docker build --tag mlflow-rapids-example --file ./Dockerfile.training ./
....

## Create a training run with 'mlflow run'
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
$ mlflow run --experiment-name "RAPIDS-MLflow-Docker" --entry-point hpo_run ./
Nvidia-Docker

## Make 'nvidia' the default Docker runtime so containers can see the GPU
$ vi /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { .... }
  }
}
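Finally, a sketch of deploying a trained model (this serve step is not shown in the commands above; the model name and version come from the registration output earlier, and the tracking URI must point at the same SQLite backend):

## Serve version 10 of the registered model as a local REST endpoint
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
$ mlflow models serve --model-uri models:/rapids_mlflow_cli/10 --port 5000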