Miguel Martínez, NVIDIA
Thomas Graves, NVIDIA
Accelerating Apache
Spark with RAPIDS.ai
#UnifiedDataAnalytics #SparkAISummit
Miguel Martínez. NVIDIA.
Deep Learning Solution Architect
Thomas Graves. NVIDIA.
Distributed Systems Software Engineer
– What is RAPIDS
– How to start
– cuDF
– cuML
– cuGraph
– XGBoost
– Summary
– Coming up next
– XGBoost4j-Spark Deep Dive
– Spark GPU Scheduling
– Spark Stage Level Scheduling
– Spark SQL with Rapids
– Summary
WHAT IS RAPIDS
GPU Accelerated End-to-End Data Science
RAPIDS is a set of open source libraries for GPU-accelerating
end-to-end data science and analytics pipelines.
rapids.ai
Pipeline stages, all in GPU memory: Data Preparation, Model Training, Visualization
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• Deep Learning
• cuXfilter: Visualization
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman, K-means, k-NN, DBScan, tSVD…
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web Data Visualization library
• DataFrame kept in GPU-memory throughout the session
LEARNING FROM APACHE ARROW
Systems shown: Pandas, Spark, Drill, Impala, Parquet, Cassandra, Kudu, HBase
Without Arrow (every pair of systems needs Copy & Convert):
• Each system has its own internal memory format
• 70-80% computation wasted on serialization & deserialization
• Similar functionality implemented in multiple projects
With Arrow (shared Arrow Memory):
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality
Source: https://bit.ly/apachearrow
HOW TO START
Available on-premises and in the cloud:
• Source code on GitHub: https://github.com/rapidsai
• Containers on NGC & Docker Hub: https://ngc.nvidia.com
• Conda packages: https://anaconda.org/rapidsai
A step-by-step installation guide (MS Azure)
1. Create an NC6s_v2 virtual machine, and select the NVIDIA GPU Cloud Image for Deep Learning and HPC.
2. Connect to the virtual machine:
$ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 username@public_ip_address
3. Pull the RAPIDS container from NGC and run it:
$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
$ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
4. Run JupyterLab:
(rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh
5. Open your browser, and navigate to http://localhost:8080.
6. Navigate to the cuml folder for cuML examples, or the mortgage folder for XGBoost examples.
cuDF
GPU-Accelerated ETL
The average data scientist spends 90+% of their
time in ETL, as opposed to training models
EXTRACTION IS THE CORNERSTONE
cuDF for Faster Data Loading
• Follow Pandas APIs and provide >10x speedup
– CSV Reader/Writer
– Parquet Reader
– ORC Reader
– JSON Reader
– Avro Reader
• GPU Direct Storage integration in progress for bypassing PCIe bottlenecks!
• Key is GPU-accelerating both parsing and decompression wherever possible
ETL – THE BACKBONE OF DATA SCIENCE
libcuDF is… (CUDA C++ Library)
• Low level library containing function implementations and C/C++ API
• Importing/exporting Apache Arrow in GPU memory using CUDA IPC
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames
cuDF is… (Python Library)
• A Python library for manipulating GPU DataFrames following the Pandas API
• Python interface to CUDA C++ library with additional functionality
• Create GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions (UDFs) using Numba
BENCHMARKS
Single-GPU Speedup vs Pandas
Environment
• cuDF v0.9
• Pandas v0.24.2
• GPU NVIDIA Tesla V100 32GB
• CPU Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Benchmark Setup
• DataFrames: 2x int32 key columns, 3x int32 value columns.
• Inner Merge
• GroupBy: count, sum, min, max calculated for each value column.
LOADING DATA INTO A GPU DATAFRAME
cuDF code examples
• Create an empty DataFrame, and add a column
• Create a DataFrame with two columns
• Load a CSV file into a GPU DataFrame
• Use Pandas to load a CSV file, and copy its content into a GPU DataFrame
WORKING WITH GPU DATAFRAMES
cuDF code examples
• Return the first three rows as a new DataFrame
• Row slicing with column selection
• Find the mean and standard deviation of a column
• Count number of occurrences per value, and number of unique values
• Transform column values with a custom function
• Change the data type of a column
QUERY, SORT, GROUP, JOIN, …
cuDF code examples
• Query a DataFrame with a boolean expression
• Return the first ‘n’ rows ordered by ‘columns’
• Sort a column by its values
• One-hot encoding
• Group by column with aggregate function
• Join and merge DataFrames
cuML
cuML roadmap
October 2019 – RAPIDS 0.10
Algorithms tracked for Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support:
XGBoost GBDT
GLM
Logistic Regression
Random Forest
K-Means
K-NN
DBSCAN
UMAP
ARIMA & Holt-Winters
Kalman Filter
t-SNE
Principal Components
Singular Value Decomposition
SVM
cuML roadmap
March 2020 – RAPIDS 0.14
Algorithms tracked for Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support:
XGBoost GBDT
GLM
Logistic Regression
Random Forest
K-Means
K-NN
DBSCAN
UMAP
ARIMA & Holt-Winters
Kalman Filter
t-SNE
Principal Components
Singular Value Decomposition
SVM
CPU vs GPU
PRINCIPAL COMPONENT ANALYSIS (PCA)
Training results
• CPU: 57.1 seconds
• GPU: 4.28 seconds
CPU code:
• Specific: Import CPU algorithm
• Common: Data loading and algo params
• Common: Model training
GPU code:
• Specific: Import GPU algorithm
• Common: Data loading and algo params
• Specific: DataFrame from Pandas to GPU
• Common: Model training
System: AWS p3.8xlarge
CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
GPU: Tesla V100 SXM2 16GB
CPU vs GPU
K-NEAREST NEIGHBORS (KNN)
Training results
• CPU: 537 seconds
• GPU: 4.28 seconds
CPU code:
• Specific: Import CPU algorithm
• Common: Data loading and algo params
• Specific: Model training
GPU code:
• Specific: Import GPU algorithm
• Common: Data loading and algo params
• Specific: DataFrame from Pandas to GPU
• Specific: Model training
System: AWS p3.8xlarge
CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM
GPU: Tesla V100 SXM2 16GB
CPU vs GPU
TRAINING TIME COMPARISON
The bigger the dataset is, the higher the training performance difference is between CPU and GPU.
Dataset size trained in 15 minutes:
• CPU: ~130,000 rows
• GPU: ~5,900,000 rows
Specs (NC6s_v2):
• Cores (Broadwell 2.6GHz): 6
• GPU: 1 x P100
• Memory: 112 GB
• Local Disk: ~700 GB SSD
• Network: Azure Network
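The two dataset sizes quoted for the 15-minute budget can be turned into a throughput ratio with a little arithmetic. This is a minimal sketch: the helper name `rows_per_second` is made up for illustration, and the row counts are the approximate figures from the slide.

```python
# Approximate rows trained within the 15-minute budget, as quoted on the slide.
BUDGET_SECONDS = 15 * 60
CPU_ROWS = 130_000
GPU_ROWS = 5_900_000


def rows_per_second(rows: int, seconds: int = BUDGET_SECONDS) -> float:
    """Average training throughput over the time budget (illustrative helper)."""
    return rows / seconds


cpu_rate = rows_per_second(CPU_ROWS)
gpu_rate = rows_per_second(GPU_ROWS)
ratio = gpu_rate / cpu_rate

print(f"CPU: {cpu_rate:,.0f} rows/s")
print(f"GPU: {gpu_rate:,.0f} rows/s")
print(f"GPU handled about {ratio:.0f}x more rows in the same time")
```

Because both runs share the same time budget, the ratio depends only on the row counts.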
cuGraph
GOALS AND BENEFITS OF CUGRAPH
Focus on Features and User Experience
Seamless Integration with cuDF & cuML
• Property Graph support via DataFrames
Breakthrough Performance
• Up to 500 million edges on a single 32GB GPU
• Multi-GPU support for scaling into the billions of edges
Multiple APIs
• Python: Familiar NetworkX-like API
• C/C++: lower-level granular control for application developers
Growing Functionality
• Extensive collection of algorithm, primitive, and utility functions
Louvain Single Run
Louvain returns:
cudf.DataFrame with two named columns:
louvain["vertex"]: The vertex id.
louvain["partition"]: The assigned partition.
import cugraph

# gdf is a cudf.DataFrame holding the edge list (sources, destinations, weights)
G = cugraph.Graph()
G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"])
# df is the DataFrame described above; mod is the modularity score
df, mod = cugraph.nvLouvain(G)
cuGraph roadmap
October 2019 – RAPIDS 0.10
Algorithms tracked for Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support:
Jaccard and Weighted Jaccard
Page Rank
Personal Page Rank
SSSP
BFS
Triangle Counting
Subgraph Extraction
Katz Centrality
Betweenness Centrality
Connected Components
Louvain
Spectral Clustering
K-Cores
cuGraph roadmap
March 2020 – RAPIDS 0.14
Algorithms tracked for Single-GPU, Multi-GPU, and Multi-Node Multi-GPU support:
Jaccard and Weighted Jaccard
Page Rank
Personal Page Rank
SSSP
BFS
Triangle Counting
Subgraph Extraction
Katz Centrality
Betweenness Centrality
Connected Components
Louvain
Spectral Clustering
K-Cores
WHAT IS XGBOOST
XGBOOST
DEFINITION
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
It is a powerful tool for solving classification and regression problems in a supervised learning setting.
HOW XGBOOST WORKS
Single Decision Tree
Input: age, gender, hair colour, …
Question: Does the person like computer games?
Splits: Age < 30, then Is male?
Prediction score in each leaf: +2 +1 -1 -1 -1
HOW XGBOOST WORKS
Ensembled Decision Trees
Tree 1: Age < 30, then Is male? Scores: +2 +1 -1 -1 -1
Tree 2: Use computer daily? Scores: +0.9 +0.9 +0.9 -0.9 -0.9
f(‘Bill’) = 2 + 0.9 = 2.9
f(‘Sam’) = -1 - 0.9 = -1.9
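The ensemble prediction is the sum of the leaf scores each tree assigns to a person. The sketch below hard-codes the two trees from the slide to reproduce the figures for Bill and Sam. The feature values for both people, and the assignment of +1 to the "under 30, not male" leaf, are assumptions made for illustration; a real XGBoost model learns its splits and leaf weights from data.

```python
from typing import Any, Callable, Dict, List

Person = Dict[str, Any]


def tree_1(person: Person) -> float:
    """First tree: split on age, then on gender (leaf layout is assumed)."""
    if person["age"] < 30:
        return 2.0 if person["is_male"] else 1.0
    return -1.0


def tree_2(person: Person) -> float:
    """Second tree: split on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def predict(person: Person, trees: List[Callable[[Person], float]]) -> float:
    """An ensemble prediction is the sum of each tree's leaf score."""
    return sum(tree(person) for tree in trees)


# Hypothetical feature values chosen so the scores match the slide.
bill = {"age": 12, "is_male": True, "uses_computer_daily": True}
sam = {"age": 65, "is_male": True, "uses_computer_daily": False}

trees = [tree_1, tree_2]
print("Bill:", round(predict(bill, trees), 1))
print("Sam:", round(predict(sam, trees), 1))
```

Adding a third tree would only mean appending another function to `trees`; the summation stays the same.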
TRAINED MODELS VISUALIZATION
Source: https://goo.gl/GWNdEm
Single Decision Tree vs Ensembled Decision Trees
WHY XGBOOST
A STRONG HISTORY OF SUCCESS
On a Wide Range of Problems
Winner of Caterpillar Kaggle Contest 2015
– Machinery component pricing
Winner of CERN Large Hadron Collider Kaggle Contest 2015
– Classification of rare particle decay phenomena
Winner of KDD Cup 2016
– Research institutions’ impact on the acceptance of submitted academic papers
Winner of ACM RecSys Challenge 2017
– Job posting recommendation
WHICH ML ALGORITHM PERFORMS BETTER
Average Rank Across 165 Datasets (lower is better)
Source: https://goo.gl/R8Y8Pp
XGBOOST + RAPIDS
XGBoost
– Tuned for eXtreme performance and high efficiency
– Multi-GPU and Multi-Node Support
RAPIDS
– E2E data science & analytics pipeline entirely on GPU
– User-friendly Python interfaces
– Relies on CUDA primitives, exposes parallelism and
high-memory bandwidth
– Dask integration for managing workers and data in
distributed environments
BENCHMARKS
Time in seconds (shorter is better)
Configuration    cuDF – Load and Data Prep    cuML – XGBoost    End-to-End
20 CPU Nodes     2,290                        2,741             8,763
30 CPU Nodes     1,956                        1,675             6,147
50 CPU Nodes     1,999                        715               3,926
100 CPU Nodes    1,948                        379               3,221
DGX-2            169                          42                322
5x DGX-1         157                          19                213
End-to-End bars are broken down into cuDF (Load and Data Preparation), Data Conversion, and XGBoost.
Benchmark
200GB CSV dataset; data preparation includes joins, variable transformations.
CPU Cluster Configuration
CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration
5x DGX-1 on InfiniBand network
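The end-to-end chart also has a Data Conversion segment whose values are not printed. Assuming the end-to-end time is the sum of load/prep, data conversion, and XGBoost (an assumption, not stated on the slide), and assuming the first chart is cuDF load/prep and the second is XGBoost training, the conversion time can be estimated as the remainder:

```python
# Times in seconds, read off the three benchmark charts.
CONFIGS = ["20 CPU Nodes", "30 CPU Nodes", "50 CPU Nodes",
           "100 CPU Nodes", "DGX-2", "5x DGX-1"]
LOAD_PREP = [2290, 1956, 1999, 1948, 169, 157]
XGBOOST = [2741, 1675, 715, 379, 42, 19]
END_TO_END = [8763, 6147, 3926, 3221, 322, 213]

# Assumption: end-to-end = load/prep + data conversion + XGBoost,
# so the conversion time is whatever remains.
conversion = [e - l - x for e, l, x in zip(END_TO_END, LOAD_PREP, XGBOOST)]

baseline = END_TO_END[0]  # 20 CPU nodes
for name, conv, total in zip(CONFIGS, conversion, END_TO_END):
    print(f"{name:>13}: conversion ~{conv:>5,} s, "
          f"end-to-end speedup vs 20 CPU nodes {baseline / total:.1f}x")
```

If the stacked bars include other overheads, the remainder overstates the conversion cost, so these values are estimates only.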
LEARN MORE
www.rapids.ai
DEMO
SUMMARY
GPU Accelerated Data Science
RAPIDS is a set of open source software libraries which
gives you the freedom to execute end-to-end data science
and analytics pipelines entirely on GPUs.
www.rapids.ai
COMING UP NEXT
– XGBoost4j-Spark Deep Dive
– Spark GPU Scheduling
– Spark Stage Level Scheduling
– Spark SQL with RAPIDS
– Summary
XGBoost4J-Spark
● XGBoost4J-Spark enables XGBoost to train and run inference on data across the nodes of an Apache Spark cluster.
● Works with Apache Spark 2.x clusters.
DISTRIBUTED TRAINING
MULTI-GPU, MULTI-NODE
● Leverage NCCL
● Similar to dmlc/rabit (Allreduce/Broadcast)
Training Performance with T4
Preliminary Results on Google Cloud
Max Tree Depth    Device              Accuracy (AUC)    Training Loop (Seconds)
8                 CPU (15 threads)    0.832             1071.0
8                 GPU                 0.832             139.6
8                 Speedup                               767%
20                CPU (15 threads)    0.833             1088.7
20                GPU                 0.833             165.9
20                Speedup                               656%
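The speedup percentages appear to be the ratio of CPU to GPU training time expressed as a percentage. That reading is an assumption, but the arithmetic reproduces both figures. The helper name `speedup_percent` is made up for this sketch.

```python
def speedup_percent(cpu_seconds: float, gpu_seconds: float) -> int:
    """CPU time divided by GPU time, expressed as a rounded percentage."""
    return round(cpu_seconds / gpu_seconds * 100)


# Training loop times (seconds) from the T4 results.
depth_8 = speedup_percent(1071.0, 139.6)
depth_20 = speedup_percent(1088.7, 165.9)

print(f"Max tree depth 8:  {depth_8}% (about {depth_8 / 100:.1f}x faster)")
print(f"Max tree depth 20: {depth_20}% (about {depth_20 / 100:.1f}x faster)")
```

Read this way, 767% means the GPU run finished in roughly one 7.7th of the CPU time, not that it was 767% faster on top of the baseline.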
XGBOOST + SPARK + RAPIDS
What about end-to-end?
● XGBoost training is pretty fast on GPUs
● Data loading/ETL is slow in comparison
● We need to accelerate the machine learning workflow end
to end
RAPIDS cuDF
● GPU-accelerated DataFrames
● Data science operations: filter, sort, join, groupby,…
● Read CSV/Parquet/ORC
● Pandas-like API
● Bare-metal, CUDA/C++ performance
● cuDF uses an Apache Arrow compatible memory format
● Contiguous, column-major data representation
XGBoost using cuDF
● Read CSV/Parquet/ORC directly to GPU memory
● Converts column-major cuDF to sparse, row-major DMatrix
● Requires memcpy, but all data stays on GPU memory
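To make the layout change concrete, here is a CPU-side sketch of turning column-major data into a sparse, row-major (CSR-style) representation. It only illustrates the idea and is not XGBoost's implementation, which performs the conversion on the GPU. The function name `columns_to_csr` and the choice to treat `None` as a missing (unstored) value are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Column = List[Optional[float]]


def columns_to_csr(columns: Dict[str, Column]) -> Tuple[List[float], List[int], List[int]]:
    """Convert column-major data to CSR arrays, skipping missing values."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    values: List[float] = []
    col_indices: List[int] = []
    row_offsets: List[int] = [0]
    for row in range(n_rows):
        # Walk across the columns for this row: this is the row-major order.
        for col_idx, name in enumerate(names):
            value = columns[name][row]
            if value is not None:
                values.append(value)
                col_indices.append(col_idx)
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


# Hypothetical two-column frame with one missing value per column.
frame = {
    "age": [25.0, None, 40.0],
    "income": [None, 52.0, 61.0],
}
values, col_indices, row_offsets = columns_to_csr(frame)
print(values)       # [25.0, 52.0, 40.0, 61.0]
print(col_indices)  # [0, 1, 0, 1]
print(row_offsets)  # [0, 1, 2, 4]
```

Row `i` owns the slice `values[row_offsets[i]:row_offsets[i + 1]]`, which is why every value has to be copied once even when the data never leaves GPU memory.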
RAPIDS XGBOOST4J-SPARK
Training a classification model on 17 years of mortgage data (190GB)
On-prem: 9X speedup/5X cost saving
• CPU: 2 Dell PowerEdge R740 (16 cores)
• GPU: 2 PowerEdge R740, each w/ 4 T4 (16GB)
AWS: 34X speedup/6X cost saving
• CPU: 4 r5a.4xlarge servers (16 cores)
• GPU: 1 p3.16xlarge server w/ 8 V100 (16GB)
Databricks XGBoost
DEMO
Collaboration with the community:
● cuDF integration into XGBoost (XGBoost-4745)
● External memory support for GPU/Out-of-core XGBoost
GPU (XGBoost-4357)
● Blog post: https://news.developer.nvidia.com/gpu-accelerated-spark-xgboost/
● Sample applications, notebooks and getting started guides at:
https://github.com/rapidsai/spark-examples
● Submit issues at https://github.com/rapidsai/spark-examples/issues and join
the conversation!
Spark 3.0 Empowers GPU Apps
● GPU scheduling
● Stage Level scheduling
● Columnar processing on the GPU
SPARK GPU SCHEDULING
GPU Scheduling
● GPUs used in deep learning and other workloads.
● Spark unaware of GPUs and cannot schedule directly
● Difficult to use
GPU Scheduling
● Accelerator-aware scheduling (SPARK-24615)
○ Request Executor resources (GPU, FPGA, etc)
○ Request Driver resources
○ Discover resources
○ Request Task resources
○ Scheduler assigns resources to Tasks
○ API to get resources
○ Supported on YARN, Kubernetes, and Standalone
○ Spark 3.0
GPU Scheduling
EXAMPLE:
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
Example discovery script in the Spark GitHub repository
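A discovery script reports the GPUs on a node by printing a JSON object with a resource name and an array of addresses to STDOUT. The example script in the Spark repository is a shell script; the Python sketch below only illustrates the shape of that output. The function name `format_resource` is made up, and the hard-coded addresses stand in for what a real script would query from the GPU tooling on the node.

```python
import json
from typing import List


def format_resource(name: str, addresses: List[str]) -> str:
    """Build the JSON string a discovery script prints to STDOUT."""
    return json.dumps({"name": name, "addresses": addresses})


# Hypothetical node with two GPUs at indices 0 and 1.
print(format_resource("gpu", ["0", "1"]))
```

Spark then hands these addresses out to tasks according to `spark.task.resource.gpu.amount`.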
GPU Scheduling
Task API:
import org.apache.spark.TaskContext

val context = TaskContext.get()
val resources = context.resources()
val assignedGpuAddrs = resources("gpu").addresses
// ... pass assignedGpuAddrs into TensorFlow or other AI code
GPU Scheduling
Driver API:
scala> sc.resources
scala.collection.Map[String,org.apache.spark.resource.ResourceInformation] = Map(gpu -> [name: gpu, addresses: 0])
scala> sc.resources("gpu").addresses
Array[String] = Array(0)
GPU Scheduling
Components: Spark driver (Spark application), cluster manager (YARN/K8s/etc), node running an executor (CPU, GPU) and its tasks (task code)
1. Submit app.
2. The Spark driver requests executor containers w/ GPU(s) from the cluster manager.
3. Spark launches the executor.
4. The executor registers with GPU addrs.
5. The driver assigns GPU(s) and launches the task.
6. The task runs and the user gets the assigned GPU addrs.
7. Task code passes the GPU addrs to TensorFlow or other AI algo.
GPU Scheduling
● SPIP Documentation:
https://issues.apache.org/jira/browse/SPARK-24615
● User Documentation:
https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview
● Working with Berkeley RISELab to install GPUs in CI Jenkins nodes
● Fractional support: SPARK-29151
SPARK STAGE LEVEL SCHEDULING
Stage Level Scheduling
SPARK ML APPLICATION
• ETL Stage: runs on CPU nodes
• ML Stage: runs on a node with a GPU
Stage Level Scheduling
● Stage level resource scheduling (SPARK-27495)
○ Specify Resource requirements per RDD operation
○ Spark dynamically allocates containers to meet resource requirements
○ Spark schedules tasks on appropriate containers
Stage Level Scheduling
// ... do some ETL using default configs, then request GPU resources for the ML stage
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048))
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1,
  Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))
val rdd = sc.makeRDD(1 to 10, 5).mapPartitions { it =>
  val context = TaskContext.get()
  val resources = context.resources()
  val assignedGpuAddrs = resources("gpu").addresses.iterator
  // ... feed into ML algorithm ...
}.withResources(rp)
rdd.collect()
SPARK SQL WITH GPU PROCESSING
SQL GPU Columnar Processing
● Users want to process data in a columnar manner, for instance to run on the GPU.
○ Vectorized R Execution in Apache Spark
○ Making Nested Columns as First Citizen in Apache Spark SQL - Apple
○ Vectorized Query Execution in Apache Spark at Facebook
● The Dataset/DataFrame API in Spark only exposes one row at a time when processing data. This doesn’t take advantage of columnar processing.
SQL GPU Columnar Processing
● Columnar processing (SPARK-27396)
○ Catalyst API for columnar processing
○ Plugins can modify query plan with columnar ops
○ Spark 3.0
SQL GPU Columnar Processing
RAPIDS for Apache Spark Plugin
● Plugin that allows running Spark on a GPU
● Requires Spark 3.0
● No code changes required
● Only need the plugin jar, the cuDF jar, and to set spark.sql.extensions
● Runs the operations available on the GPU.
● If an operation is not implemented or not compatible with the GPU, it runs on the CPU
● Handles transitioning from Row to Columnar and back
● Uses the RAPIDS cuDF library
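The fallback and transition behaviour can be pictured with a toy planner. This is purely an illustration of the idea and not the plugin's code: the supported-operation set, the operation names, and the transition labels are all assumptions.

```python
from typing import List, Set

# Hypothetical set of operations that have a GPU implementation.
GPU_SUPPORTED: Set[str] = {"scan", "filter", "join", "sort"}


def plan(operations: List[str], gpu_supported: Set[str] = GPU_SUPPORTED) -> List[str]:
    """Tag each operation with a device and insert format transitions."""
    steps: List[str] = []
    on_gpu = False  # start from row-based processing on the CPU
    for op in operations:
        use_gpu = op in gpu_supported
        if use_gpu and not on_gpu:
            steps.append("RowToColumnar")
        elif on_gpu and not use_gpu:
            steps.append("ColumnarToRow")
        steps.append(f"{'GPU' if use_gpu else 'CPU'}:{op}")
        on_gpu = use_gpu
    if on_gpu:
        steps.append("ColumnarToRow")  # hand results back as rows
    return steps


for step in plan(["scan", "filter", "custom_udf", "join", "sort"]):
    print(step)
```

Each unsupported operation in the middle of a plan costs two transitions, which is one reason an optimizer that decides when GPU execution pays off is attractive.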
SQL GPU Columnar Processing
RAPIDS for Apache Spark Plugin
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.sql.extensions=ai.rapids.spark.Plugin \
  --jars rapids-4-spark.jar,cudf.jar \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
A.join(B, A("longs") === B("blongs")).sort("longs")
BENCHMARK
TPCH Query 4: 3.55X speedup
select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '1993-07-01'
  and o_orderdate < date '1993-07-01' + interval '3' month
  and exists (
    select * from lineitem
    where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
  )
group by o_orderpriority
order by o_orderpriority
Setup: 100GB data, GCP, 5 workers, n1-highmem-16 using 80 shuffle partitions
SQL GPU Columnar Processing
Future Enhancements
● Optimizer to choose when to run things on GPU
● More operations
● Handle larger data - spilling
● GPU direct storage
● RDMA and Shuffle improvements
● Better exchange format for ETL to ML
DEMO
SUMMARY
Spark 3.0 Empowers GPU Apps
● GPU scheduling
● Stage Level scheduling
● Columnar processing
● RAPIDS for Apache Spark Plugin
• XGBoost on GPU is ready for wide adoption
– Classification and regression
– Single GPU; Multi-GPU; Multi-node
– In-core; External memory
• Available on Spark
– Kubernetes, YARN, Standalone
• Open source collaboration
– XGBoost: External memory
– RAPIDS: cuDF, cuML
88#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Mais conteúdo relacionado

Mais procurados

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsTimothy Spann
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Using ScyllaDB for Distribution of Game Assets in Unreal Engine
Using ScyllaDB for Distribution of Game Assets in Unreal EngineUsing ScyllaDB for Distribution of Game Assets in Unreal Engine
Using ScyllaDB for Distribution of Game Assets in Unreal EngineScyllaDB
 
Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Jeremy Eder
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0Databricks
 

Mais procurados (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Using ScyllaDB for Distribution of Game Assets in Unreal Engine
Using ScyllaDB for Distribution of Game Assets in Unreal EngineUsing ScyllaDB for Distribution of Game Assets in Unreal Engine
Using ScyllaDB for Distribution of Game Assets in Unreal Engine
 
Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0
 

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library