SlideShare uma empresa Scribd logo
1 de 24
Baixar para ler offline
#bigdata #deeplearning #optimization
Yanlei Diao
Towards a Unified Data Analytics
Optimizer
Ecole Polytechnique, France
University of Massachusetts Amherst, USA
1
Today’s Data Analytics Service
Cloud Computing (EC2 ~60 instance types)
-t2.nano –t2.micro –t2.small –t2.medium –
t2.large –m4.large –m4.xlarge –m4.2xlarge -
m4.4xlarge –m4.10xlarge -m3.medium -
m3.large –m3.xlarge –m3.2xlarge –c4.large
–c4.2xlarge …
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
2
Today’s Data Analytics Service
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
D
M
M
M
M
C
R
R
R
RQD
QMIQRIQMOQMM
M M M M
D
C
R R R R
Degree of ||
Memory size
Granularity of
scheduling
Latency Throughput Monetary cost
Cloud Computing: ~60 instance types
-t2.nano –t2.micro –t2.small –t2.medium –
t2.large –m4.large –m4.xlarge –m4.2xlarge -
m4.4xlarge –m4.10xlarge -m3.medium -
m3.large –m3.xlarge –m3.2xlarge –c4.large –
c4.2xlarge …
3
Today’s Data Analytics Service
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
D
M
M
M
M
C
R
R
R
RQD
QMIQRIQMOQMM
M M M M
D
C
R R R R
Degree of ||
Memory size
Granularity of
scheduling
Latency Throughput Monetary cost
Cloud Computing: ~60 instance types
-t2.nano –t2.micro –t2.small –t2.medium –
t2.large –m4.large –m4.xlarge –m4.2xlarge -
m4.4xlarge –m4.10xlarge -m3.medium -
m3.large –m3.xlarge –m3.2xlarge –c4.large –
c4.2xlarge …
4
Today’s Data Analytics Service
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
Today’s Cloud Service guarantees
availability (SLA)
But too many configuraTon issues
are leU to the user…
5
A New Data Analytics Service
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
Latency Throughput Monetary cost
D
M
M
M
M
C
R
R
R
RQD
QMIQRIQMOQMM
M M M M
D
C
R R R R
Degree of ||
Memory size
Granularity of
scheduling
Cloud Computing: ~60 instance types
-t2.nano –t2.micro –t2.small –t2.medium –
t2.large –m4.large –m4.xlarge –m4.2xlarge -
m4.4xlarge –m4.10xlarge -m3.medium -
m3.large –m3.xlarge –m3.2xlarge –c4.large –
c4.2xlarge …
6
Dataflow Op*mizer
Bring database optimization
to a broader class of analytics
job configuration
cloud instance
A New Data Analytics Service
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
Latency Throughput Monetary cost
7
Dataflow Optimizer: dataflow programs
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Predication
Training of a
classifier
Sessionization
SessionizaTon
Trainingaclassifier
Prediction
logical optimization of the query plan
Latency Throughput Monetary cost …
8
Dataflow Optimizer: from Logical to Physical Plans
Sessionization
Trainingaclassifier
Prediction
D
M
M
M
M
C
R
R
R
RQD
QMI QRIQMOQMM
Hadoop Streaming [Li et al 2011,2012,2015]
Degree of
parallelism θp
Granularity of
scheduling θs
Memory per
executor…
Logical Plans Phyiscal Plans
9
Challenge 1: Cost Models for the Dataflow Optimizer
SQL Optimizer Dataflow Optimizer
• Cost model for relational algebra
• Tend to run homogenous hardware
• Cost model for arbitrary dataflow
programs (in java, scala, python, …)
• Cloud computing may employ
heterogeneous hardware
Building a general cost model for any
user objective is hard!
• System behaviors are often centered
on CPU and IO
• Modeling diverse system behaviors,
including CPU, IO, shuffling, queuing,
data skew, stragglers, failures…
• Cost model for resource consumption
(CPU and IO counters)
• Cost models for any user-defined
objectives (latency, throughput, cost, …)
10
Challenge 2: Need A New Multi-Objective Optimizer
1M tuples/sec with 1 sec latency 1K tuple/sec with 0.1 sec latency
Latency versus Throughput
Latency versus Monetary Cost
1 sec latency at $800/day 5 sec latency at $300/day
Mul5-Objec5ve Op5miza5on
• Solu5on is a (Pareto) set
• Reveals interes5ng tradeoffs
Monetary cost
Throughput
Latency
<θp, θs, λ, …>
11
Overview: Cost Modeling for Dataflow Optimization
Dataflow Optimizer
• Cost model for arbitrary data
dataflow programs (in java, scala, ...)
• Cloud computing may employ
heterogeneous hardware
• Modeling diverse system behaviors
including CPU, IO, shuffling, queuing,
data skew, stragglers, failures…
• Cost models for any user-defined
objectives
q Deep Learning
• Representation learning for an
arbitrary program
• A (non-linear) function for any
objective
q In-situ modeling
• Learn a model as complex as
necessary for the current
computing environment
12
Trace Collection for Cost Modeling & Optimization
HDFS
Dataflow Optimizer
Spark
Listener
System
Monitor
Input
Data
Physical
Execution Plan
User Objectives
(related measures)
Application
(Spark related metrics)
System
(CPU, memory, IO,
network, …)
Traces
Dataflow program,
Objectives {F1,…, Fk}
Θ(0)
Θ(1)
Θ(i):
<batch interval,
block interval,
parallelism,
input rate>
13
Trace Collection for Cost Modeling & Optimization
HDFS
Dataflow Optimizer
Spark
Listener
System
Monitor
Input
Data
Physical
Execution Plan
User Objectives
(related measures)
Application
(Spark related metrics)
System
(CPU, memory, IO,
network, …)
Traces
Dataflow program,
Objectives {F1,…, Fk}
Θ(0)
Θ(1)
Significant change of user preference,
e.g. switching to the replay mode
Θ(2)
Optimization time: TΘ(i) – Tith_preference_change
• TΘ(1) : job first seen, workload encoding and optimization
• TΘ(i),i>1 : job already seen, optimization
14
Inside the Dataflow Optimizer
Multi-Objective
Optimizer
Change
Plan?
New job
configuration
Y
Θ (1)
Objective models
Ψ1, …, Ψk
Dataflow program,
ObjecGves {F1,…, Fk}
In-situ Modeling
No training
training
Pareto Frontier
History
of {"}
Trace Collector
Observations
No manual feature engineering
No user labels
16
1. Building a Cost Model for each User Objective
Given a dataflow, we assume that there is a job-specific encoding Wj (of its
key characteristics) which is invariant to time and runtime parameters.
Definition: For a job j, the workload encoding Wj is a real-valued vector
satisfying three conditions:
1. Invariance: For the same logical dataflow program, Wj should be
(approximately) an invariant during a period of time
2. Reconstruction: Wj should carry all the information for reconstructing the
observations (traces), given a specific job configuration Θj
i
3. Similarity Preserving: For similar programs i and j, their encodings Wi and
Wj should also be similar
17
Formal Description of the Predictive Model
Under our assumption, the model for a user objective (e.g., latency) can be
abstracted as a deterministic function:
Interpreted as an optimization problem, our goal is to find a function s.t.
where average is taken among all training instances ( j, Wj ).
Challenge: Building the Ψ function is easy if both Wj and Θj
i are known,
but the workload encoding Wj is unknown in practice!
Neutral network architectures that can
(1) extract the workload encoding for each new job;
(2) predict on a user objective given a new job configuration.
Wj should be (approximately) an invariant.
3) Reconstruction. Wj should carry all the information
for reconstructing the observation Oi
j, given ✓i
j.
Given the notion of workload encodings, we next define
the learning problem. Without loss of generality, let us con-
sider latency l as a user objective. The model for latency
can be formulated as a deterministic function s.t.
(✓i
j, Wj) = li
j
The learning problem is to choose the function that mini-
mizes the distance between the predicted latency and the ac-
tual latency for all (job, configuration) combinations, called
training instances, in the training data:
⇤
= arg min avg(distance( (✓i
j, Wj), li
j)
where the average is taken among all the training instances.
If both ✓i
j and Wj were given, the problem would be a classic
learning problem. Since Wj is unknown, we need to extend
existing learning tools to extract Wj. In the following, we
propose two deep learning architectures that simultaneously
extract the workload encoding from a job and build a pre-
dictive model on the configuration and workload encoding.
4.2 Embedding Approach
Jtrain
total
How
latenc
is only
ral ne
observ
fore, w
h(✓i
j)
Tra
traini
the em
and ca
Never
menta
with a
time t
only t
beddi
more
Given the notion of workload encodings, we next define
the learning problem. Without loss of generality, let us con-
sider latency l as a user objective. The model for latency
can be formulated as a deterministic function s.t.
(✓i
j, Wj) = li
j
The learning problem is to choose the function that mini-
mizes the distance between the predicted latency and the ac-
tual latency for all (job, configuration) combinations, called
training instances, in the training data:
⇤
= arg min avg(distance( (✓i
j, Wj), li
j)
where the average is taken among all the training instances.
If both ✓i
j and Wj were given, the problem would be a classic
learning problem. Since Wj is unknown, we need to extend
existing learning tools to extract Wj. In the following, we
propose two deep learning architectures that simultaneously
extract the workload encoding from a job and build a pre-
dictive model on the configuration and workload encoding.
4.2 Embedding Approach
The embedding approach extracts the workload encod-
18
Two Types of Architecture
18
1. Embedding
2. Autoencoder
Regressor
21
Results on Modeling
2-Inc
2.5%
4.8%
0.9%
5.2%
8.4%
2-Inc
3.6%
5.6%
0.6%
0.5%
5.3%
codings.
SQL-like Jobs ML Jobs
T1 T2 T1 T2
Embedding 9.7% 33.3% 16.5% 16.2%
Autoencoder 11.6 % 22.0% 24.9% 53.9%
Autoencoder + Opt2 11.6 % 13.4% 24.9% 25.0%
Autoencoder + Opt1 8.5% 10.3% 2.9% 2.6%
Autoencoder + Opt1/2 8.5% 8.4% 2.9% 3.6%
HadoopStreaming [17] 31.9% 31.9% 84.6% 84.6%
Table 4: Summary of T1 and T2 errors for latency.
The major results are: (1) Very low values of  do not
permit effective optimization in many cases. (2)  around
5% seems to be a reasonable threshold. (3) ML jobs benefit
from the optimization more than SQL jobs because ML jobs
52 SQL-like jobs from 5 parameterized
templates for click stream analysis, which
use different group-by’s, joins, global or
windowed aggregates, and user-defined
functions. 6 intensive jobs (each with 455
conf.), 46 regular jobs (each with 25 conf.).
12 ML jobs based on binary
classification using Stochastic
Gradient Descent. Each has 25
configurations.
22
Results on Modeling
2-Inc
2.5%
4.8%
0.9%
5.2%
8.4%
2-Inc
3.6%
5.6%
0.6%
0.5%
5.3%
codings.
SQL-like Jobs ML Jobs
T1 T2 T1 T2
Embedding 9.7% 33.3% 16.5% 16.2%
Autoencoder 11.6 % 22.0% 24.9% 53.9%
Autoencoder + Opt2 11.6 % 13.4% 24.9% 25.0%
Autoencoder + Opt1 8.5% 10.3% 2.9% 2.6%
Autoencoder + Opt1/2 8.5% 8.4% 2.9% 3.6%
HadoopStreaming [17] 31.9% 31.9% 84.6% 84.6%
Table 4: Summary of T1 and T2 errors for latency.
The major results are: (1) Very low values of  do not
permit effective optimization in many cases. (2)  around
5% seems to be a reasonable threshold. (3) ML jobs benefit
from the optimization more than SQL jobs because ML jobs
Predication accuracy on
familiar jobs
Prediction accuracy on new
(seen first time) jobs
23
2. A New Mul+-Objec+ve Op+mizer
v Given a set of user objectives, a Multi-Objective Optimizer uses
the predicted performance measures to construct a Pareto
optimal set (frontier).
v The skyline offers insights on tradeoffs:
• e.g., two configurations consume the same amount of resources, but one
achieves 20% higher throughput with only 1% loss of latency
v It finally chooses one optimal configuration to set the system
parameters (e.g., degree of parallelism, granularity of
scheduling, input rate) for execution.
27
Integration Results
Improvement over
configurations set by
engineers, dominance in
10/15, tradeoffs in 5/15
Improving throughput in
the replay mode, reducing
running time by 50%
28
Messages
² In-situ modeling based on deep learning has the potential to
model any user objective in a given computing environment
² Mul4-objec4ve op4miza4on has the poten7al to explore
tradeoffs and move in direc7on of be;er performance
automa7cally
29
Your comments are very
welcome. Thank you!
Acknowledgements

Mais conteúdo relacionado

Mais procurados

NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
Mohid Nabil
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
r-kor
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 

Mais procurados (19)

NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Super Resolution with OCR Optimization
Super Resolution with OCR OptimizationSuper Resolution with OCR Optimization
Super Resolution with OCR Optimization
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
Metric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible packageMetric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible package
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Performance Optimization of Clustering On GPU
 Performance Optimization of Clustering On GPU Performance Optimization of Clustering On GPU
Performance Optimization of Clustering On GPU
 
[251] implementing deep learning using cu dnn
[251] implementing deep learning using cu dnn[251] implementing deep learning using cu dnn
[251] implementing deep learning using cu dnn
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into value
 
Lec08 optimizations
Lec08 optimizationsLec08 optimizations
Lec08 optimizations
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Exploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RExploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in R
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 

Semelhante a Towards a Unified Data Analytics Optimizer with Yanlei Diao

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
University of Washington
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
AIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdfAIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdf
ssuserb4d806
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
Jason Hearne-McGuiness
 

Semelhante a Towards a Unified Data Analytics Optimizer with Yanlei Diao (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning Service
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and Kaggle
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
GENERATIVE SCHEDULING OF EFFECTIVE MULTITASKING WORKLOADS FOR BIG-DATA ANALYT...
GENERATIVE SCHEDULING OF EFFECTIVE MULTITASKING WORKLOADS FOR BIG-DATA ANALYT...GENERATIVE SCHEDULING OF EFFECTIVE MULTITASKING WORKLOADS FOR BIG-DATA ANALYT...
GENERATIVE SCHEDULING OF EFFECTIVE MULTITASKING WORKLOADS FOR BIG-DATA ANALYT...
 
AIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdfAIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdf
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Lecture-6-7.pptx
Lecture-6-7.pptxLecture-6-7.pptx
Lecture-6-7.pptx
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
 

Mais de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Último (20)

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

Towards a Unified Data Analytics Optimizer with Yanlei Diao

  • 1. #bigdata #deeplearning #optimization Yanlei Diao Towards a Unified Data Analytics Optimizer Ecole Polytechnique, France University of Massachusetts Amherst, USA
  • 2. 1 Today’s Data Analytics Service Cloud Computing (EC2 ~60 instance types) -t2.nano –t2.micro –t2.small –t2.medium – t2.large –m4.large –m4.xlarge –m4.2xlarge - m4.4xlarge –m4.10xlarge -m3.medium - m3.large –m3.xlarge –m3.2xlarge –c4.large –c4.2xlarge … SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization
  • 3. 2 Today’s Data Analytics Service SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization D M M M M C R R R RQD QMIQRIQMOQMM M M M M D C R R R R Degree of || Memory size Granularity of scheduling Latency Throughput Monetary cost Cloud Computing: ~60 instance types -t2.nano –t2.micro –t2.small –t2.medium – t2.large –m4.large –m4.xlarge –m4.2xlarge - m4.4xlarge –m4.10xlarge -m3.medium - m3.large –m3.xlarge –m3.2xlarge –c4.large – c4.2xlarge …
  • 4. 3 Today’s Data Analytics Service SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization D M M M M C R R R RQD QMIQRIQMOQMM M M M M D C R R R R Degree of || Memory size Granularity of scheduling Latency Throughput Monetary cost Cloud Computing: ~60 instance types -t2.nano –t2.micro –t2.small –t2.medium – t2.large –m4.large –m4.xlarge –m4.2xlarge - m4.4xlarge –m4.10xlarge -m3.medium - m3.large –m3.xlarge –m3.2xlarge –c4.large – c4.2xlarge …
  • 5. 4 Today’s Data Analytics Service SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization Today’s Cloud Service guarantees availability (SLA) But too many configuraTon issues are leU to the user…
  • 6. 5 A New Data Analytics Service SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization Latency Throughput Monetary cost D M M M M C R R R RQD QMIQRIQMOQMM M M M M D C R R R R Degree of || Memory size Granularity of scheduling Cloud Computing: ~60 instance types -t2.nano –t2.micro –t2.small –t2.medium – t2.large –m4.large –m4.xlarge –m4.2xlarge - m4.4xlarge –m4.10xlarge -m3.medium - m3.large –m3.xlarge –m3.2xlarge –c4.large – c4.2xlarge …
  • 7. 6 Dataflow Op*mizer Bring database optimization to a broader class of analytics job configuration cloud instance A New Data Analytics Service SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization Latency Throughput Monetary cost
  • 8. 7 Dataflow Optimizer: dataflow programs SELECT C.uid, avg(P.pagerank) FROM Clicks C, Pages P WHERE C.url = P.url GROUP BY C.uid HAVING avg(P.pagerank) > 0.5 ORDER BY avg(P.pagerank) Predication Training of a classifier Sessionization SessionizaTon Trainingaclassifier Prediction logical optimization of the query plan Latency Throughput Monetary cost …
  • 9. 8 Dataflow Optimizer: from Logical to Physical Plans Sessionization Trainingaclassifier Prediction D M M M M C R R R RQD QMI QRIQMOQMM Hadoop Streaming [Li et al 2011,2012,2015] Degree of parallelism θp Granularity of scheduling θs Memory per executor… Logical Plans Phyiscal Plans
  • 10. 9 Challenge 1: Cost Models for the Dataflow Optimizer SQL Optimizer Dataflow Optimizer • Cost model for relational algebra • Tend to run homogenous hardware • Cost model for arbitrary dataflow programs (in java, scala, python, …) • Cloud computing may employ heterogeneous hardware Building a general cost model for any user objective is hard! • System behaviors are often centered on CPU and IO • Modeling diverse system behaviors, including CPU, IO, shuffling, queuing, data skew, stragglers, failures… • Cost model for resource consumption (CPU and IO counters) • Cost models for any user-defined objectives (latency, throughput, cost, …)
  • 11. 10 Challenge 2: Need A New Multi-Objective Optimizer 1M tuples/sec with 1 sec latency 1K tuple/sec with 0.1 sec latency Latency versus Throughput Latency versus Monetary Cost 1 sec latency at $800/day 5 sec latency at $300/day Mul5-Objec5ve Op5miza5on • Solu5on is a (Pareto) set • Reveals interes5ng tradeoffs Monetary cost Throughput Latency <θp, θs, λ, …>
  • 12. 11 Overview: Cost Modeling for Dataflow Optimization Dataflow Optimizer • Cost model for arbitrary data dataflow programs (in java, scala, ...) • Cloud computing may employ heterogeneous hardware • Modeling diverse system behaviors including CPU, IO, shuffling, queuing, data skew, stragglers, failures… • Cost models for any user-defined objectives q Deep Learning • Representation learning for an arbitrary program • A (non-linear) function for any objective q In-situ modeling • Learn a model as complex as necessary for the current computing environment
  • 13. 12 Trace Collection for Cost Modeling & Optimization HDFS Dataflow Optimizer Spark Listener System Monitor Input Data Physical Execution Plan User Objectives (related measures) Application (Spark related metrics) System (CPU, memory, IO, network, …) Traces Dataflow program, Objectives {F1,…, Fk} Θ(0) Θ(1) Θ(i): <batch interval, block interval, parallelism, input rate>
  • 14. 13 Trace Collection for Cost Modeling & Optimization HDFS Dataflow Optimizer Spark Listener System Monitor Input Data Physical Execution Plan User Objectives (related measures) Application (Spark related metrics) System (CPU, memory, IO, network, …) Traces Dataflow program, Objectives {F1,…, Fk} Θ(0) Θ(1) Significant change of user preference, e.g. switching to the replay mode Θ(2) Optimization time: TΘ(i) – Tith_preference_change • TΘ(1) : job first seen, workload encoding and optimization • TΘ(i),i>1 : job already seen, optimization
  • 15. 14 Inside the Dataflow Optimizer Multi-Objective Optimizer Change Plan? New job configuration Y Θ (1) Objective models Ψ1, …, Ψk Dataflow program, ObjecGves {F1,…, Fk} In-situ Modeling No training training Pareto Frontier History of {"} Trace Collector Observations No manual feature engineering No user labels
  • 16. 16 1. Building a Cost Model for each User Objective Given a dataflow, we assume that there is a job-specific encoding Wj (of its key characteristics) which is invariant to time and runtime parameters. Definition: For a job j, the workload encoding Wj is a real-valued vector satisfying three conditions: 1. Invariance: For the same logical dataflow program, Wj should be (approximately) an invariant during a period of time 2. Reconstruction: Wj should carry all the information for reconstructing the observations (traces), given a specific job configuration Θj i 3. Similarity Preserving: For similar programs i and j, their encodings Wi and Wj should also be similar
  • 17. 17 Formal Description of the Predictive Model Under our assumption, the model for a user objective (e.g., latency) can be abstracted as a deterministic function: Interpreted as an optimization problem, our goal is to find a function s.t. where average is taken among all training instances ( j, Wj ). Challenge: Building the Ψ function is easy if both Wj and Θj i are known, but the workload encoding Wj is unknown in practice! Neutral network architectures that can (1) extract the workload encoding for each new job; (2) predict on a user objective given a new job configuration. Wj should be (approximately) an invariant. 3) Reconstruction. Wj should carry all the information for reconstructing the observation Oi j, given ✓i j. Given the notion of workload encodings, we next define the learning problem. Without loss of generality, let us con- sider latency l as a user objective. The model for latency can be formulated as a deterministic function s.t. (✓i j, Wj) = li j The learning problem is to choose the function that mini- mizes the distance between the predicted latency and the ac- tual latency for all (job, configuration) combinations, called training instances, in the training data: ⇤ = arg min avg(distance( (✓i j, Wj), li j) where the average is taken among all the training instances. If both ✓i j and Wj were given, the problem would be a classic learning problem. Since Wj is unknown, we need to extend existing learning tools to extract Wj. In the following, we propose two deep learning architectures that simultaneously extract the workload encoding from a job and build a pre- dictive model on the configuration and workload encoding. 4.2 Embedding Approach Jtrain total How latenc is only ral ne observ fore, w h(✓i j) Tra traini the em and ca Never menta with a time t only t beddi more Given the notion of workload encodings, we next define the learning problem. Without loss of generality, let us con- sider latency l as a user objective. The model for latency can be formulated as a deterministic function s.t. (✓i j, Wj) = li j The learning problem is to choose the function that mini- mizes the distance between the predicted latency and the ac- tual latency for all (job, configuration) combinations, called training instances, in the training data: ⇤ = arg min avg(distance( (✓i j, Wj), li j) where the average is taken among all the training instances. If both ✓i j and Wj were given, the problem would be a classic learning problem. Since Wj is unknown, we need to extend existing learning tools to extract Wj. In the following, we propose two deep learning architectures that simultaneously extract the workload encoding from a job and build a pre- dictive model on the configuration and workload encoding. 4.2 Embedding Approach The embedding approach extracts the workload encod-
  • 18. 18 Two Types of Architecture 18 1. Embedding 2. Autoencoder Regressor
  • 19. 21 Results on Modeling 2-Inc 2.5% 4.8% 0.9% 5.2% 8.4% 2-Inc 3.6% 5.6% 0.6% 0.5% 5.3% codings. SQL-like Jobs ML Jobs T1 T2 T1 T2 Embedding 9.7% 33.3% 16.5% 16.2% Autoencoder 11.6 % 22.0% 24.9% 53.9% Autoencoder + Opt2 11.6 % 13.4% 24.9% 25.0% Autoencoder + Opt1 8.5% 10.3% 2.9% 2.6% Autoencoder + Opt1/2 8.5% 8.4% 2.9% 3.6% HadoopStreaming [17] 31.9% 31.9% 84.6% 84.6% Table 4: Summary of T1 and T2 errors for latency. The major results are: (1) Very low values of  do not permit effective optimization in many cases. (2)  around 5% seems to be a reasonable threshold. (3) ML jobs benefit from the optimization more than SQL jobs because ML jobs 52 SQL-like jobs from 5 parameterized templates for click stream analysis, which use different group-by’s, joins, global or windowed aggregates, and user-defined functions. 6 intensive jobs (each with 455 conf.), 46 regular jobs (each with 25 conf.). 12 ML jobs based on binary classification using Stochastic Gradient Descent. Each has 25 configurations.
  • 20. 22 Results on Modeling 2-Inc 2.5% 4.8% 0.9% 5.2% 8.4% 2-Inc 3.6% 5.6% 0.6% 0.5% 5.3% codings. SQL-like Jobs ML Jobs T1 T2 T1 T2 Embedding 9.7% 33.3% 16.5% 16.2% Autoencoder 11.6 % 22.0% 24.9% 53.9% Autoencoder + Opt2 11.6 % 13.4% 24.9% 25.0% Autoencoder + Opt1 8.5% 10.3% 2.9% 2.6% Autoencoder + Opt1/2 8.5% 8.4% 2.9% 3.6% HadoopStreaming [17] 31.9% 31.9% 84.6% 84.6% Table 4: Summary of T1 and T2 errors for latency. The major results are: (1) Very low values of  do not permit effective optimization in many cases. (2)  around 5% seems to be a reasonable threshold. (3) ML jobs benefit from the optimization more than SQL jobs because ML jobs Predication accuracy on familiar jobs Prediction accuracy on new (seen first time) jobs
  • 21. 23 2. A New Mul+-Objec+ve Op+mizer v Given a set of user objectives, a Multi-Objective Optimizer uses the predicted performance measures to construct a Pareto optimal set (frontier). v The skyline offers insights on tradeoffs: • e.g., two configurations consume the same amount of resources, but one achieves 20% higher throughput with only 1% loss of latency v It finally chooses one optimal configuration to set the system parameters (e.g., degree of parallelism, granularity of scheduling, input rate) for execution.
  • 22. 27 Integration Results Improvement over configurations set by engineers, dominance in 10/15, tradeoffs in 5/15 Improving throughput in the replay mode, reducing running time by 50%
  • 23. 28 Messages ² In-situ modeling based on deep learning has the potential to model any user objective in a given computing environment ² Mul4-objec4ve op4miza4on has the poten7al to explore tradeoffs and move in direc7on of be;er performance automa7cally
  • 24. 29 Your comments are very welcome. Thank you! Acknowledgements