LEVERAGING SPARK FOR SCALABLE DATA PREP AND INFERENCE IN DEEP LEARNING
James Nguyen
Data & AI Cloud Solution Architect, Microsoft
Agenda
INTRODUCTION
LARGE SCALE DATA PREPARATION FOR DEEP LEARNING
SCALABLE DEEP LEARNING INFERENCE IN SPARK
• PANDAS UDFS
• SPARK’S BINARY AND TENSORFLOW FORMATS SUPPORT
• SCORE EXTERNALLY HOSTED ML MODEL
• LOAD AND SCORE DL ML MODEL WITHIN SPARK
Introduction
◦ Distributed deep learning frameworks scale well when input data elements are independent, so parallel processing can start immediately
◦ However, preprocessing and featurization steps, which are crucial to deep learning development, may involve complex business logic with computations across multiple data elements
◦ In addition, support for batch inference is limited compared to online inference
◦ New features in Spark 3.0, namely binary data support and new Pandas UDF types, can address these gaps
Figure: Different stages in ML development. Stages in aqua green can be offloaded to Spark.
Stages shown: Data sources 1 to 3, Collect data, Ingest and transform data, Data transformation, Featurization, Model Ready Data, Training and Testing, Deployment, Inference
Using Spark to Accelerate Data Prep and Featurization
Data prep and featurization in Deep Learning pipeline
1. Data acquisition and initial transformation
2. Data preparation for ML task
3. Featurization
4. ML training
Traditional way: data query/extraction tools, single-node Python Pandas, TensorFlow/PyTorch data APIs
Combine tools and scale out with Spark: multi-node Spark pipeline (Query + Transformation + Featurization)
Pandas UDF
Spark’s Pandas UDFs: Parallelizing Python Computation
Input Spark DataFrame
Grouped/split into parallel batches
Transformation/scoring logic, run in parallel on each batch (input: Pandas DF/Series; output: Pandas DF/Series)
Output Spark DataFrame
Vectorized operation: PyArrow converts JVM data to Pandas DataFrames/Series
Spark’s Pandas UDFs: Types and Performance
◦ Scalar UDFs
  ◦ Column values are split into batches of Pandas Series and passed to the UDF
  ◦ The UDF also returns a Pandas Series
  ◦ Good for direct, parallel computation on column values
◦ Grouped map UDFs
  ◦ Implement the split-apply-combine pattern: rows are grouped by column value into Pandas DataFrames, and each DataFrame is passed to the UDF
  ◦ The UDF returns a Pandas DataFrame
  ◦ All data for a group-by value is loaded into memory
◦ Scalar iterator UDFs (Spark 3.0)
  ◦ Same as scalar UDFs, except:
  ◦ The UDF takes an iterator of batches instead of a single batch
  ◦ It returns an iterator of batches, or yields batches
  ◦ Good for initializing some state once (e.g. loading an ML model)
Pandas UDFs perform much better than row-at-a-time UDFs across the board, with speedups ranging from 3x to over 100x (source: databricks.com)
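As a minimal sketch of the scalar flavor, the block below uses the Python type-hint style available from Spark 3.0. The column name "amount", the function names, and the log transform are illustrative assumptions, not taken from the talk. The pandas and Spark imports are deferred into a factory function so that the plain-Python reference helper can be checked without a cluster.

```python
import math


def log1p_values(values):
    """Plain-Python reference for the transform: log(1 + x) per element."""
    return [math.log1p(v) for v in values]


def make_scalar_udf():
    """Build a scalar Pandas UDF (assumes pyspark >= 3.0, pandas, numpy, pyarrow)."""
    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def log1p_udf(batch: pd.Series) -> pd.Series:
        # Spark hands over one Arrow batch of the column as a pandas Series;
        # the whole batch is transformed in a single vectorized call.
        return np.log1p(batch)

    return log1p_udf


# Hypothetical usage on a DataFrame `df` with a numeric column "amount":
#   df = df.withColumn("log_amount", make_scalar_udf()(df["amount"]))
```

The vectorized call on a whole batch is what separates this from a row-at-a-time UDF, which would invoke the Python function once per row.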
Spark’s Binary Data and TensorFlow’s TFRecords Format Support
Binary files (image, audio, ...) -> spark.read.format("binaryFile") -> Pandas UDF (import custom libraries, transformation/scoring) -> ML training
• Read binary data into a Spark DataFrame using the binaryFile data source
• Pass the binary content column to a UDF to extract features
• Pass other columns, such as the file path, to another UDF as needed (for example, to create a label column from the file name)
• Inside the UDF, import the libraries needed to extract features from the binary data
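The bullets above can be sketched as follows. The binaryFile source yields the columns path, modificationTime, length and content. Everything else here is an assumption made for illustration: the directory layout (the label is taken from the parent directory name), the function names, and the placeholder "feature", which is just the byte count and stands in for a real decoder.

```python
def label_from_path(path):
    """Return the parent directory name of a file path or URI.

    Assumes a layout like '.../<label>/<file>', which is hypothetical.
    """
    parts = path.rstrip("/").split("/")
    return parts[-2]


def build_labeled_features(spark, input_dir):
    """Read binary files and derive a label and a feature column.

    Assumes pyspark >= 3.0, pandas and pyarrow; `spark` is a SparkSession.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def label_udf(paths: pd.Series) -> pd.Series:
        return paths.map(label_from_path)

    @pandas_udf("long")
    def n_bytes_udf(content: pd.Series) -> pd.Series:
        # Placeholder feature: swap len() for a real decoding library,
        # imported here inside the UDF so it is loaded on the workers.
        return content.map(len)

    files = spark.read.format("binaryFile").load(input_dir)
    return files.select(
        label_udf(files["path"]).alias("label"),
        n_bytes_udf(files["content"]).alias("n_bytes"),
    )
```

Keeping the path logic in a plain function makes it easy to test locally before it is wrapped in a UDF.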
Scaling Up Data Prep Example 1: Multivariate Time Series Classification
• ML model to predict customer churn based on historical customer interactions (events)
• Each event is a multivariate entity with categorical, numeric, embedding, ... attributes
• Each training example is a fixed window of 14 days plus the outcome (churn vs. stay)
Challenges:
- There can be millions of customers
- Each customer may have a long history
- Each history needs to generate 100x pairs of training examples, with computation needed to build features
- The result is billions of records; it would take days to run on a single node vs. 2 hours on a 30-node cluster
Data preprocessing plan
Scaling Up Data Prep Example 1: Multivariate Time Series Classification (cont.)
1. Read input data from sources and combine: Spark SQL to select data from sources
2. Collect event history for each customer: group by customer, df.groupby("customer")
3. In each customer history, generate overlapping windows, and
4. Within each window, generate and compute features: @pandas_udf(pandas_dec_str, PandasUDFType.GROUPED_MAP)
5. Output data for training: output_df.orderBy(rand()).repartition(100).write.format("tfrecords")
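The windowing step can be sketched as below. This is not the speaker's code: the column names ("customer", "day", "amount"), the one-day stride, and the two toy features are assumptions, and the churn label is left out. The sketch uses groupby().applyInPandas(), the Spark 3.0 way of expressing a grouped map, and defers the pandas import so the window helper runs on its own.

```python
def window_bounds(first_day, last_day, window=14, stride=1):
    """Yield (start, end) day indices, end exclusive, for overlapping
    fixed-length windows that fit inside [first_day, last_day]."""
    start = first_day
    while start + window <= last_day + 1:
        yield start, start + window
        start += stride


def make_featurizer(window=14, stride=1):
    """Return a function usable with groupby(...).applyInPandas (assumes pandas)."""
    import pandas as pd

    columns = ["customer", "window_start", "n_events", "total_amount"]

    def featurize(events: pd.DataFrame) -> pd.DataFrame:
        # `events` is the full history of ONE customer, held in memory.
        first, last = int(events["day"].min()), int(events["day"].max())
        rows = []
        for start, end in window_bounds(first, last, window, stride):
            in_window = events[(events["day"] >= start) & (events["day"] < end)]
            rows.append({
                "customer": events["customer"].iloc[0],
                "window_start": start,
                "n_events": len(in_window),
                "total_amount": float(in_window["amount"].sum()),
            })
        return pd.DataFrame(rows, columns=columns)

    return featurize


# Hypothetical usage on a Spark DataFrame `df`:
#   out = df.groupby("customer").applyInPandas(
#       make_featurizer(),
#       schema="customer string, window_start long, "
#              "n_events long, total_amount double")
```

Because each group is materialized as one pandas DataFrame, a customer with a very long history has to fit in a single executor's memory.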
Scaling Up Data Prep Example 2: Speech Recognition
• Use deep learning to recognize speech from audio data
• Data is in the form of audio files in WAV format. Large-volume training requires hundreds of thousands of clips, which together with data augmentation can result in millions of training examples
• Processing is compute-intensive and relies on audio libraries
WAV files -> spark.read.format("binaryFile") -> Pandas UDF 1 and Pandas UDF 2 -> ML training
Pandas UDF 1: process the core binary content using librosa and extract the spectrogram as features
Pandas UDF 2: get the input file path, extract the file name, and look up its index position in a label list
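To illustrate how the binary content column can be decoded inside a UDF, here is a small sketch that reads WAV bytes with Python's standard wave module. It is a deliberately simple stand-in: the talk extracts spectrograms with librosa, whereas this example only reads the sample rate and duration. The function names are hypothetical.

```python
import io
import wave


def wav_summary(content):
    """Decode in-memory WAV bytes; return (sample_rate, n_frames, seconds)."""
    with wave.open(io.BytesIO(content), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames, frames / rate


def make_duration_udf():
    """Scalar Pandas UDF over the binaryFile `content` column
    (assumes pyspark >= 3.0, pandas and pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def duration_udf(content: pd.Series) -> pd.Series:
        # Each element holds the raw bytes of one audio file.
        return content.map(lambda raw: wav_summary(raw)[2])

    return duration_udf


# Hypothetical usage:
#   clips = spark.read.format("binaryFile").load("/path/to/wavs")
#   clips = clips.withColumn("seconds", make_duration_udf()(clips["content"]))
```

A real feature extractor would replace wav_summary while keeping the same shape: bytes in, features out, one call per file within each batch.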
Using Spark for large scale batch inference
Big dataset -> Distributed data preprocessing -> Distributed scoring -> Result dataset
Distributed scoring options:
• Calling externally hosted APIs (Hosted ML Service)
• Loading model and scoring (ML Model)
Spark is very good at regular MapReduce-style processing. The same advantage applies to ML batch inference.
Load Model and Score within Spark
Input Spark DataFrame
Model distribution: SparkContext.addFile(), or store the model file on shared storage
Pandas Scalar UDF: input PD Series -> model loading and scoring for each batch -> output PD Series
Pandas Scalar Iterator UDF (recommended for Deep Learning): input iterator of Series -> model loading once, then scoring of each Pandas DF/Series batch -> output iterator of Series, or yield Series
• The model can be loaded from a model file cached on the worker machine by the addFile() method, or from shared cloud storage
• The Pandas Scalar Iterator UDF flavor reduces how often the deep learning model is loaded, which can be an expensive operation
Deep learning models are typically large and often not serializable, so broadcast usually won’t work
Load Model and Score within Spark: Code Example
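The code from the original slide is not present in this text. The block below is a sketch of the load-once pattern, not the speaker's implementation. The load_model argument is a hypothetical zero-argument callable supplied by the caller; it might, for example, open a file located with SparkFiles.get() after SparkContext.addFile(), and it must return a function that maps one batch to its predictions.

```python
from typing import Callable, Iterable, Iterator


def score_batches(batches: Iterable, load_model: Callable[[], Callable]) -> Iterator:
    """Load the model once, then score every batch from the iterator."""
    model = load_model()  # expensive step, paid once rather than per batch
    for batch in batches:
        yield model(batch)


def make_scoring_udf(load_model):
    """Wrap score_batches in a scalar iterator Pandas UDF
    (assumes pyspark >= 3.0, pandas and pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        # Spark passes an iterator of batches; the model stays in memory
        # while the iterator is consumed.
        yield from score_batches(batches, load_model)

    return predict_udf


# Hypothetical usage on a DataFrame `df` with a numeric column "feature":
#   df = df.withColumn("score", make_scoring_udf(my_loader)(df["feature"]))
```

With a plain scalar UDF the same load_model() call would sit inside the function body and run for every batch, which is the cost the iterator flavor avoids.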
Calling External APIs in a UDF
Input Spark DataFrame
Pandas Scalar UDF: input PD Series -> HTTP POST (batch input) -> Hosted ML Service -> (batch output) -> output PD Series
Pandas Scalar Iterator UDF: input PD Series -> HTTP POST (batch input) -> Hosted ML Service -> (batch output) -> output PD Series
Calling External APIs in a UDF
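A minimal sketch of batch scoring against a hosted service follows. The request and response shapes ({"inputs": [...]} and {"scores": [...]}), the chunk size, and all function names are assumptions, since the deck does not describe the service's API. The HTTP call uses only the standard library and is injectable, so the chunking logic can be exercised without a network.

```python
import json
import urllib.request


def post_json(url, payload, timeout=30):
    """POST a JSON payload and return the decoded JSON reply (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def score_remote(values, url, post=post_json, chunk_size=100):
    """Send values to the service in chunks: one HTTP call per chunk."""
    scores = []
    for i in range(0, len(values), chunk_size):
        reply = post(url, {"inputs": values[i:i + chunk_size]})
        scores.extend(reply["scores"])  # assumed response field
    return scores


def make_remote_udf(url):
    """Scalar Pandas UDF that scores each batch through the hosted service
    (assumes pyspark >= 3.0, pandas and pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def remote_udf(batch: pd.Series) -> pd.Series:
        return pd.Series(score_remote(batch.tolist(), url))

    return remote_udf
```

Sending a whole batch per request amortizes HTTP overhead across many rows; the chunk size would need tuning against the service's payload limits.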
References
◦ Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team, Distributed Deep Learning on Big-Data Clusters, 2017.
◦ Databricks, Spark Deep Learning Pipeline, 2017.
◦ Apache Spark Org, Pandas UDF, 2017.
◦ Databricks & Apache Spark Org, Pandas UDF Scalar Iterator, 2019.
◦ Databricks & Apache Spark Org, Spark binaryFiles DataFrame, 2019.
◦ TensorFlow team, Spark TensorFlow connector, 2016.

Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning

  • 1. LEVERAGING SPARK FOR SCALABLE DATA PREP AND INFERENCE IN DEEP LEARNING James Nguyen Data & AI Cloud Solution Architect, Microsoft
  • 2. Agenda INTRODUCTION LARGE SCALE DATA PREPARATION FOR DEEP LEARNING SCALABLE DEEP LEARNING INFERENCE IN SPARK • PANDAS UDFS • SPARK’S BINARY AND TENSORFLOW FORMATS SUPPORT • SCORE EXTERNALLY HOSTED ML MODEL • LOAD AND SCORE DL ML MODEL WITHIN SPARK
  • 3. Introduction ◦ Distributed Deep Learning Frameworks scale Deep Learning well when input data elements are independent, allowing parallel processing to start immediately ◦ However preprocessing and featurization steps, crucial to Deep Learning development, might involve complex business logic with computations across multiple data elements ◦ In addition, support for Batch Inference is limited compared to Online Inference. ◦ We can leverage new features in Spark 3.0 about support for binary data and new Pandas UDFs to address these gaps Different stages in ML development. Stages in aqua green can be offloaded to Spark Data transformation Inference Featurization Model Ready Data Training and Testing Deployment Collect data Ingest and transform data Data source 1 Data source 2 Data source 3
  • 4. Using Spark to Accelerate Data Prep and Featurization Data prep and featurization in Deep Learning pipeline 1.Data acquisition and initial transformation 2. Data preparation for ML task 3. Featurization 4. ML training Data query/extraction tools Single node Python Pandas Traditional way Multi-node Spark pipeline: Query + Transformation + Featurization Tensorflow/Pytorch data APIs Combine tools and scale out with Spark
  • 5. Pandas UDF Spark’s Pandas UDFs: Parallelizing Python Computation Input Spark DataFrame Grouped/split into parallel batches Transformation/ scoring logicInput: Pandas DF/Series Output: Pandas DF/Series Output Spark DataFrame Transformation /scoring logic Transformation /scoring logic …. Vectorized operation: Pyarrow convert from JVM data to Python DF/Series
  • 6. Spark’s Pandas UDFs: Types and Performance ◦ Scalar UDFs ◦ Column values are split into batches of Pandas series to pass to UDF ◦ UDF also returns Pandas Series ◦ Good for direct parallel column values computation ◦ Grouped map UDFs ◦ Implements split-apply-pattern: Group by each column value to form Pandas DataFrames then pass on to UDF ◦ Returns Pandas DataFrame ◦ All data of a group-by value is loaded into memory ◦ Scalar iterator UDFs (Spark 3.0) ◦ Same with Scala UDF except: ◦ UDF takes iterator of batches instead of single batch ◦ Return iterator of batches or yield batches ◦ Good for initializing some state (e.g. load ML model) Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x (source databricks.com)
  • 7. Spark’s Binary Data and Tensorflow’s TFRecords Format Support
    ◦ Flow: binary files (image, audio, ...) are read with spark.read.format("binaryFile"); a Pandas UDF imports custom libraries and runs the transformation/scoring; the output feeds ML training
    ◦ Read binary data into a Spark DataFrame using the binaryFile data source
    ◦ Pass the binary content column to a UDF to extract features
    ◦ Pass other columns, such as the file path, to another UDF as needed (for example, to create a label column from the file name)
    ◦ Inside the UDF, import the libraries needed to extract features from the binary data
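The steps on this slide could look roughly like the sketch below. The binaryFile source yields the columns `path`, `modificationTime`, `length`, and `content`. The byte histogram is a toy featurizer chosen only to keep the example dependency-free; the function names and `path_glob` are hypothetical.

```python
import numpy as np
import pandas as pd


def byte_histogram(content: pd.Series) -> pd.Series:
    """Toy featurizer: normalized 16-bin histogram of byte values per file."""

    def one(blob: bytes):
        data = np.frombuffer(blob, dtype=np.uint8)
        hist, _ = np.histogram(data, bins=16, range=(0, 256))
        return (hist / max(len(data), 1)).tolist()

    return content.apply(one)


def featurize_binary_files(spark, path_glob):
    """Read files as binary rows and extract features with a Pandas UDF."""
    from pyspark.sql.functions import pandas_udf

    extract = pandas_udf(byte_histogram, returnType="array<double>")
    files = spark.read.format("binaryFile").load(path_glob)
    return files.select("path", extract("content").alias("features"))
```

A real pipeline would replace `byte_histogram` with a decoder from an image or audio library imported inside the UDF.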
  • 8. Scaling Up Data Prep Example 1: Multivariate Time Series Classification
    ◦ An ML model predicts customer churn based on customers' historical interactions (events)
    ◦ Each event is a multivariate entity with categorical, numeric, embedding… attributes
    ◦ Each training example is a fixed 14-day window and its outcome (churn vs. stay)
    ◦ Challenges:
      - There can be millions of customers
      - Each customer may have a long history
      - Each history needs to generate 100x pairs of training examples, with computation needed to build features
      - The result is billions of records; it would take days to run on a single node vs. 2 hours on a 30-node cluster
    ◦ Data preprocessing plan
  • 9. Scaling Up Data Prep Example 1: Multivariate Time Series Classification (cont.)
    ◦ Read input data from sources and combine: Spark SQL to select data from sources
    ◦ Collect event history for each customer: group by customer with df.groupby("customer")
    ◦ In each customer history, generate overlapping windows; within each window, generate and compute features: @pandas_udf(pandas_dec_str, PandasUDFType.GROUPED_MAP)
    ◦ Output data for training: output_df.orderBy(rand()).repartition(100).write.format("tfrecords")
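The windowing step can be sketched as a grouped map function. This is an illustration under stated assumptions, not the speaker's code: the event columns `customer`, `day` (integer day index), and `amount`, the one-day stride, and the two features are all hypothetical. Writing with the "tfrecords" format assumes the spark-tensorflow-connector package is available on the cluster.

```python
import pandas as pd

WINDOW_DAYS = 14  # fixed window length from the slides
STRIDE_DAYS = 1   # assumed step between consecutive windows


def windows_for_customer(events: pd.DataFrame) -> pd.DataFrame:
    """Turn one customer's events into one feature row per overlapping window."""
    events = events.sort_values("day")
    first, last = int(events["day"].min()), int(events["day"].max())
    rows = []
    # The last valid start is the one whose window still ends at or before `last`.
    for start in range(first, last - WINDOW_DAYS + 2, STRIDE_DAYS):
        mask = (events["day"] >= start) & (events["day"] < start + WINDOW_DAYS)
        in_window = events[mask]
        rows.append({
            "customer": events["customer"].iloc[0],
            "window_start": start,
            "n_events": len(in_window),
            "total_amount": float(in_window["amount"].sum()),
        })
    columns = ["customer", "window_start", "n_events", "total_amount"]
    return pd.DataFrame(rows, columns=columns)


def build_training_set(events_df, out_path):
    """Group by customer, window each history, shuffle, and write TFRecords."""
    from pyspark.sql.functions import rand

    schema = "customer string, window_start long, n_events long, total_amount double"
    windows = events_df.groupBy("customer").applyInPandas(
        windows_for_customer, schema=schema
    )
    windows.orderBy(rand()).repartition(100).write.format("tfrecords").save(out_path)
```

Because each group is loaded into memory as a single Pandas DataFrame, very long customer histories may need to be split or truncated before this step.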
  • 10. Scaling Up Data Prep Example 2: Speech Recognition
    ◦ Use deep learning to recognize speech from audio data
    ◦ Data is in the form of audio files in WAV format. Large-volume training requires hundreds of thousands of clips which, together with data augmentation, can result in millions of training examples
    ◦ Processing is compute-intensive and relies on audio libraries
    ◦ Flow: WAV files are read with spark.read.format("binaryFile"); Pandas UDF 1 processes the core binary content using librosa and extracts a spectrogram as features; Pandas UDF 2 takes the input file path, extracts the file name, and looks up its index position in a label list; the output feeds ML training
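A rough sketch of the two UDFs in this flow. To stay dependency-free it swaps librosa's spectrogram for a much simpler per-frame log energy computed with the standard-library `wave` module and NumPy, so it only shows the shape of the pipeline. The label list, the `<label>_<clip id>.wav` naming scheme, the frame size, and the 16-bit mono PCM assumption are all hypothetical.

```python
import io
import os
import wave

import numpy as np
import pandas as pd

LABELS = ["yes", "no", "stop"]  # hypothetical label list
FRAME = 400                     # assumed samples per frame


def frame_energy(content: pd.Series) -> pd.Series:
    """UDF 1: decode 16-bit mono PCM WAV bytes, return log energy per frame."""

    def one(blob: bytes):
        with wave.open(io.BytesIO(blob), "rb") as wav:
            pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype="<i2")
        samples = pcm.astype(np.float64) / 32768.0
        n_frames = len(samples) // FRAME
        frames = samples[: n_frames * FRAME].reshape(n_frames, FRAME)
        return np.log1p((frames ** 2).sum(axis=1)).tolist()

    return content.apply(one)


def label_index(path: pd.Series) -> pd.Series:
    """UDF 2: map '<label>_<clip id>.wav' file names to an index in LABELS."""
    names = path.apply(lambda p: os.path.basename(p).split("_")[0])
    return names.apply(lambda name: LABELS.index(name) if name in LABELS else -1)


def featurize_clips(spark, path_glob):
    """Apply both UDFs to a binaryFile DataFrame of WAV clips."""
    from pyspark.sql.functions import pandas_udf

    features = pandas_udf(frame_energy, returnType="array<double>")
    label = pandas_udf(label_index, returnType="long")
    clips = spark.read.format("binaryFile").load(path_glob)
    return clips.select(
        features("content").alias("features"), label("path").alias("label")
    )
```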
  • 11. Using Spark for large scale batch inference
    ◦ Spark is very good at regular map-reduce style processing. The same advantage can apply to ML batch inference
    ◦ Flow: big dataset, distributed data preprocessing, distributed scoring, result dataset
    ◦ Distributed scoring options: calling externally hosted APIs (hosted ML service), or loading the model and scoring within Spark (ML model)
  • 12. Load Model and Score within Spark
    ◦ Model distribution: SparkContext.addFile() or store the model file on shared storage
    ◦ Pandas scalar UDFs: model loading and scoring; input is a PD Series, output is a PD Series
    ◦ Pandas scalar iterator UDF (recommended for deep learning): model loading and scoring; input is an iterator of Series, output is an iterator of Series (or yielded Series)
    ◦ The model can be loaded from a model file cached on the worker machine by the addFile() method, or from shared cloud storage
    ◦ The Pandas scalar iterator UDF flavor reduces how often the deep learning model is loaded, which can be an expensive operation
    ◦ A deep learning model is large and is not serializable, so broadcast won’t work
  • 13. Load Model and Score within Spark: Code Example
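The code from this slide is not present in the transcript. The following is a hedged sketch of the pattern described on the previous slide, not the original example: a scalar iterator UDF that loads a model once per task and then scores every batch. The JSON "model" with `weight` and `bias` is a stand-in for a real deep learning model, and `load_model`, `make_scorer`, and the input column `x` are hypothetical names.

```python
import json
import os
from typing import Callable, Iterator

import pandas as pd


def load_model(path: str) -> Callable[[pd.Series], pd.Series]:
    """Stand-in loader: reads {"weight": w, "bias": b} and returns a scorer."""
    with open(path) as handle:
        params = json.load(handle)
    return lambda x: x * params["weight"] + params["bias"]


def make_scorer(model_path_fn: Callable[[], str]):
    """Build a scalar iterator function that resolves the model path lazily."""

    def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model(model_path_fn())  # once per task, not per batch
        for batch in batches:
            yield model(batch)

    return score


def score_with_spark(spark, df, local_model_file):
    """Ship the model file to workers with addFile and score column x."""
    from pyspark import SparkFiles
    from pyspark.sql.functions import pandas_udf

    spark.sparkContext.addFile(local_model_file)
    name = os.path.basename(local_model_file)
    # SparkFiles.get runs on the worker and returns the local cached path.
    scorer = pandas_udf(make_scorer(lambda: SparkFiles.get(name)), returnType="double")
    return df.withColumn("score", scorer("x"))
```

Resolving the path inside the UDF matters because the file location differs between the driver and each worker.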
  • 14. Calling External APIs in a UDF
    ◦ The input Spark DataFrame is passed to a Pandas scalar UDF or a Pandas scalar iterator UDF
    ◦ In both flavors, the UDF takes a PD Series as input, sends it as a batch input to the hosted ML service with an HTTP POST, and returns the batch output as a PD Series
    ◦ The diagram also carries a Model Loading label in each UDF
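A sketch of the batch HTTP POST pattern using only the standard library. The request and response shapes (`{"inputs": [...]}` in, `{"scores": [...]}` out), the URL, and the function names are assumptions made for illustration; a real hosted service defines its own contract, authentication, and retry policy.

```python
import json
import urllib.request
from typing import Callable, Iterator

import pandas as pd


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def make_remote_scorer(url: str, post: Callable[[str, dict], dict] = http_post_json):
    """Build a scalar iterator function that sends one request per batch."""

    def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for batch in batches:
            reply = post(url, {"inputs": batch.tolist()})  # assumed contract
            yield pd.Series(reply["scores"], index=batch.index, dtype="float64")

    return score


def score_remotely(df, url):
    """Score column x of a Spark DataFrame against the hosted service."""
    from pyspark.sql.functions import pandas_udf

    scorer = pandas_udf(make_remote_scorer(url), returnType="double")
    return df.withColumn("score", scorer("x"))
```

Passing the `post` callable as a parameter lets the batching logic be tested with a fake service, as in `make_remote_scorer(url, post=fake_post)`.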
  • 16. References
    ◦ Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team, Distributed Deep Learning on Big-Data Clusters, 2017.
    ◦ Databricks, Spark Deep Learning Pipeline, 2017.
    ◦ Apache Spark Org, Pandas UDF, 2017.
    ◦ Databricks & Apache Spark Org, Pandas UDF Scalar Iterator, 2019.
    ◦ Databricks & Apache Spark Org, Spark binaryFiles DataFrame, 2019.
    ◦ Tensorflow team, Spark Tensorflow connector, 2016.