Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL

"Data science workflows can benefit tremendously from being accelerated, to enable data scientists to explore more and larger datasets. This allows data scientist to drive towards their business goals, faster, and more reliably. Accelerating Apache Spark with GPU is the next step for data science. In this talk, we will share our work in accelerating Spark applications via CUDA and NCCL.

We have identified several bottlenecks in Spark 2.4 in the areas of data serialization and data scalability. To address these, we accelerated Spark-based data analytics with enhancements that allow large columnar datasets to be analyzed directly in CUDA from Python. The GPU dataframe library, cuDF (github.com/rapidsai/cudf), can be used to express advanced analytics easily. By applying Apache Arrow and cuDF, we have achieved an over 20x speedup over regular RDDs.
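
To make the cuDF reference concrete, here is a minimal sketch of a GPU-side aggregation using cuDF's pandas-like API; the file and column names are illustrative placeholders, not code from the talk:

    import cudf  # GPU dataframe library from RAPIDS (github.com/rapidsai/cudf)

    # Read a CSV straight into GPU memory; "data.csv", "id", and "x"
    # are placeholder names for this sketch.
    df = cudf.read_csv("data.csv")

    # Both the groupby and the sum execute on the GPU.
    result = df.groupby("id")["x"].sum()
    print(result)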

For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce-style algorithms. We will demonstrate how the latest NVIDIA NCCL library, NCCL 2, can further scale out distributed learning algorithms such as XGBoost.
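
For readers unfamiliar with barrier execution mode, the sketch below uses the standard Spark 2.4 API that an allreduce-style framework such as NCCL can hook into; the actual training logic is elided and the partition contents are placeholders:

    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def train_partition(rows):
        # All tasks in a barrier stage launch together, so an
        # allreduce library can rendezvous across them here.
        ctx = BarrierTaskContext.get()
        ctx.barrier()  # global synchronization point
        # Addresses of every task, e.g. to bootstrap NCCL communicators.
        hosts = [info.address for info in ctx.getTaskInfos()]
        yield hosts

    rdd = spark.sparkContext.parallelize(range(8), 4)
    print(rdd.barrier().mapPartitions(train_partition).collect())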

Finally, we will introduce an enhancement to the Spark Kubernetes scheduler so that GPU resources can be scheduled from a Kubernetes cluster for Spark applications. We will share our experience deploying Spark on NVIDIA Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4 is an energy-efficient 70-watt, small-PCIe-form-factor GPU, optimized for scale-out computing environments and featuring multi-precision Turing Tensor Cores and new RT Cores.
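
The accelerator-aware scheduling effort referenced here was tracked as SPARK-24615 (see the final slide) and later shipped in Spark 3.0. As a rough illustration of the direction, this is how Spark 3.0 expresses GPU requests; the configuration names are from that release, not from the Spark 2.4-era prototype described in this talk, and the discovery script path is a placeholder:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("gpu-spark")
             # One GPU per executor and per task (Spark 3.0 configs).
             .config("spark.executor.resource.gpu.amount", "1")
             .config("spark.task.resource.gpu.amount", "1")
             # Script that reports the GPU addresses available on each node.
             .config("spark.executor.resource.gpu.discoveryScript",
                     "/opt/spark/getGpus.sh")
             .getOrCreate())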


Transcript

1. Richard Whitcomb, NVIDIA
   Rong Ou, NVIDIA
   Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
   #UnifiedAnalytics #SparkAISummit
2. About Us
   Richard Whitcomb: Senior Engineer working on AI Infrastructure. Previously at Spotify, Twitter.
   Rong Ou: Principal Engineer at NVIDIA working on AI Infrastructure. Previously at Google.
3. Why Spark on GPU?
4. Spark GPU: A Machine Learning Story
   • Problem: predict loan delinquency
   • Dataset: Fannie Mae loan performance data
   • Library: XGBoost
   • Platform: Apache Spark
   • GPU: NVIDIA Tesla T4
5. Dataset
   • Fannie Mae single-family loan performance data
   • 18 years: 2000–2017
   • # loans: 38,964,685
   • # performance records: 2,008,374,244
   • Size (CSV): 168 GB
6. XGBoost
   • Popular gradient boosting library
   • Distributed mode via Spark
   • GPU support via CUDA
   • Multi-GPU support via NCCL 2
   • Recent addition: multi-node GPU support
   • Experimental: running on Spark with GPUs
7. Spark Cluster
   • Standalone cluster on GCP
   • 5 virtual machines, each has:
     – 64 vCPUs (32 physical cores)
     – 416 GB memory
     – 4 x NVIDIA Tesla T4
     – 400 GB SSD persistent disk
     – Default networking
   • 4 Spark workers per VM
8. Sample Code

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   val xgbParam = Map(
     "eta" -> 0.1f,
     "max_depth" -> 20,
     "max_leaves" -> 256,
     "grow_policy" -> "lossguide",
     "num_round" -> 100,
     "num_workers" -> 20,
     "nthread" -> 16,
     "tree_method" -> "gpu_hist")  // "gpu_hist" selects the CUDA histogram tree builder

   val xgbClassifier = new XGBoostClassifier(xgbParam).
     setFeaturesCol("features").
     setLabelCol("labels")

   // Training step implied by the slide; trainDf is a DataFrame with
   // "features" and "labels" columns prepared earlier (not shown).
   val model = xgbClassifier.fit(trainDf)
9. Preliminary Results

                           Accuracy (AUC)   Training Loop (seconds)
   Max Tree Depth = 8
     CPU                   0.832            1071.002
     GPU                   0.832            139.641
     Speedup                                766.97%
   Max Tree Depth = 20
     CPU                   0.833            1088.662
     GPU                   0.833            165.868
     Speedup                                656.34%
10. But...
   • XGBoost training is pretty fast on GPUs
   • ETL is slow in comparison
   • We need to accelerate the machine learning workflow end to end
11. Apache Arrow RDD
   • Store Arrow batches directly in RDDs
   • Already has library support
   • Move between RDD and CUDA with zero copy
   • Eliminates PySpark serialization overhead
     – 20x speed improvement in PySpark vs. Pickle
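
A flavor of the Arrow-over-Pickle win is available in stock Spark: since Spark 2.3, Arrow can replace Pickle when moving data between pandas and Spark. The example below uses only the standard API, not the Arrow RDD prototype described on this slide:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Use Arrow instead of Pickle for pandas <-> Spark transfers
    # (config name as of Spark 2.3/2.4; renamed in Spark 3.x).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    pdf = pd.DataFrame({"id": [1, 2, 1], "x": [0.5, 1.5, 2.5]})
    sdf = spark.createDataFrame(pdf)  # columnar transfer via Arrow
    print(sdf.toPandas())             # Arrow on the return trip as well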
12. Arrow RDD Problems
   • Users are moving to the Dataset/DataFrame API
   • Difficult to use (columns vs. rows)
   • Most Spark features aren't usable; it mostly works on distributed pandas dataframes
   • Users would have to rewrite all of their ETL jobs to make use of GPUs
13. Moving towards DataFrames
   • Can we provide similar speed improvements under the DataFrame API?
   • Little to no code changes for ETL jobs
   • Same API users are already comfortable with
14. ETL on GPUs
   • Ability to process columnar data across ops is key
   • Added an interface so DataFrame ops can "opt in" to consume and produce columnar data
   • Added columnar processing to a few DataFrame ops (CSV parsing, hash join, hash aggregate, etc.)
   • Can switch between row and columnar processing with a config
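
The opt-in columnar interface is the proposal tracked as SPARK-27396 (listed on the final slide). For a sense of what the row-vs-columnar switch looks like in practice, the later open-source RAPIDS Accelerator for Apache Spark exposes it as plain configuration; the plugin class and config names below come from that later project, not from the prototype shown in this talk:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # Plugin and flag from the RAPIDS Accelerator for Apache
             # Spark; flip the flag to fall back to row-based CPU plans.
             .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
             .config("spark.rapids.sql.enabled", "true")
             .getOrCreate())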
15. Simple Benchmark
   18x speedup

   dfc = spark.read.schema(schema).csv("...")
   dfc.groupBy("id").agg(F.sum("x"))

   • No user code changes
   • Config settings to enable GPU acceleration
   • Uses the RAPIDS library under the covers: https://rapids.ai/
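
For reference, here is a self-contained version of the benchmark query that runs on plain Spark; the schema and input path are placeholders, and the 18x figure applies only to the GPU-accelerated build:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType)

    spark = SparkSession.builder.getOrCreate()

    # Placeholder schema covering the "id" and "x" columns in the query.
    schema = StructType([
        StructField("id", StringType()),
        StructField("x", DoubleType()),
    ])

    dfc = spark.read.schema(schema).csv("/path/to/input")  # placeholder path
    dfc.groupBy("id").agg(F.sum("x")).show()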
16. Spark on GPU
   • Encouraging early results with room to improve
     – 6x speedup of the XGBoost training loop
     – 18x speedup of a dataset-based ETL example
   • Eager to collaborate with the Spark community
     – Accelerator-aware scheduling (SPARK-24615)
     – Stage-level resource scheduling (SPARK-27495)
     – Columnar processing (SPARK-27396)
     – cuDF integration into XGBoost (XGBOOST-3997)
     – Out-of-core XGBoost GPU (XGBOOST-4357)
