Today’s big data analytics systems are best-effort only: despite their wide adoption, they still lack the ability to take user monetary constraints and performance goals and automatically configure an analytic job to achieve those goals. Our work takes a step toward a new data analytics optimizer that works for arbitrary dataflow programs and determines the job configuration automatically, based on user objectives such as latency, throughput, and monetary cost.
At the core of the optimizer are a principled multi-objective optimization framework that enables one to explore the tradeoffs between different objectives, and a deep-learning-based modeling approach that can learn a model for each user objective, as complex as necessary for the user’s computing environment. Using both SQL-like and machine learning jobs in Spark, we show that our techniques learn a model of each objective with high accuracy, and that the multi-objective optimizer automatically recommends new configurations that significantly outperform the configurations manually set by engineers.
2. 1
Today’s Data Analytics Service
Cloud Computing (EC2, ~60 instance types):
t2.nano, t2.micro, t2.small, t2.medium, t2.large, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, m3.medium, m3.large, m3.xlarge, m3.2xlarge, c4.large, c4.2xlarge, …
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Prediction
Training of a classifier
Sessionization
3. 2
Today’s Data Analytics Service
[Figure: two physical execution plans, drawn as DAGs of data sources (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Degree of parallelism
Memory size
Granularity of scheduling
Latency Throughput Monetary cost
5. 4
Today’s Data Analytics Service
Today’s cloud service guarantees availability (SLA),
but too many configuration issues are left to the user…
6. 5
A New Data Analytics Service
Latency Throughput Monetary cost
[Figure: two physical execution plans, drawn as DAGs of data sources (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Degree of parallelism
Memory size
Granularity of scheduling
7. 6
Dataflow Optimizer
Bring database optimization
to a broader class of analytics
job configuration
cloud instance
A New Data Analytics Service
Latency Throughput Monetary cost
8. 7
Dataflow Optimizer: dataflow programs
Sessionization
Training a classifier
Prediction
logical optimization of the query plan
Latency Throughput Monetary cost …
9. 8
Dataflow Optimizer: from Logical to Physical Plans
Sessionization
Training a classifier
Prediction
[Figure: a physical execution plan, a DAG of data source (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Hadoop Streaming [Li et al 2011,2012,2015]
Degree of parallelism θp
Granularity of scheduling θs
Memory per executor …
Logical Plans → Physical Plans
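As a concrete (hedged) illustration of the knobs above: in Spark they correspond to real configuration settings, shown below with arbitrary example values; the mapping to θp and θs is ours, not prescribed by the talk.

```
# spark-defaults.conf (illustrative values, not recommendations)
spark.default.parallelism      200      # degree of parallelism (θp)
spark.executor.memory          4g       # memory per executor
spark.streaming.blockInterval  200ms    # block granularity for Spark Streaming tasks (θs)
```

The streaming batch interval is set in code rather than in spark-defaults.conf, e.g. `StreamingContext(sc, 1)` in PySpark for 1-second batches.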
10. 9
Challenge 1: Cost Models for the Dataflow Optimizer
SQL Optimizer:
• Cost model for relational algebra
• Tends to run on homogeneous hardware
• System behaviors are often centered on CPU and IO
• Cost model for resource consumption (CPU and IO counters)

Dataflow Optimizer:
• Cost model for arbitrary dataflow programs (in Java, Scala, Python, …)
• Cloud computing may employ heterogeneous hardware
• Modeling diverse system behaviors, including CPU, IO, shuffling, queuing, data skew, stragglers, failures…
• Cost models for any user-defined objectives (latency, throughput, cost, …)

Building a general cost model for any user objective is hard!
11. 10
Challenge 2: Need A New Multi-Objective Optimizer
Latency versus Throughput: 1M tuples/sec with 1 sec latency vs. 1K tuples/sec with 0.1 sec latency
Latency versus Monetary Cost: 1 sec latency at $800/day vs. 5 sec latency at $300/day
Multi-Objective Optimization
• Solution is a (Pareto) set
• Reveals interesting tradeoffs
[Figure: Pareto frontier over latency, throughput, and monetary cost for configurations <θp, θs, λ, …>]
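The Pareto set above can be computed with a simple dominance check. A minimal sketch (our illustration, not the optimizer’s actual algorithm), assuming every objective is expressed as a quantity to minimize (e.g., negate throughput):

```python
# Minimal Pareto-frontier computation over candidate configurations.
# All objective values are assumed to be "smaller is better".

def pareto_frontier(candidates):
    """Return the configurations not dominated by any other candidate.

    candidates: list of (config, objectives) pairs, where objectives is a
    tuple such as (latency_sec, dollars_per_day).
    """
    def dominates(a, b):
        # a dominates b if a is no worse in every objective and strictly
        # better in at least one.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    return [(cfg, obj) for cfg, obj in candidates
            if not any(dominates(other, obj) for _, other in candidates)]

# Hypothetical measurements: (configuration name, (latency_sec, dollars_per_day))
points = [
    ("A", (1.0, 800.0)),   # low latency, expensive
    ("B", (5.0, 300.0)),   # higher latency, cheap
    ("C", (5.0, 900.0)),   # dominated: A is better on both objectives
]
print([cfg for cfg, _ in pareto_frontier(points)])  # -> ['A', 'B']
```

Configuration C is dominated (A is no slower and cheaper), so only A and B survive on the frontier, exposing the latency/cost tradeoff.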
12. 11
Overview: Cost Modeling for Dataflow Optimization
Dataflow Optimizer
• Cost model for arbitrary dataflow programs (in Java, Scala, …)
• Cloud computing may employ heterogeneous hardware
• Modeling diverse system behaviors including CPU, IO, shuffling, queuing, data skew, stragglers, failures…
• Cost models for any user-defined objectives

❑ Deep Learning
• Representation learning for an arbitrary program
• A (non-linear) function for any objective
❑ In-situ modeling
• Learn a model as complex as necessary for the current computing environment
13. 12
Trace Collection for Cost Modeling & Optimization
[Figure: trace collection. A Spark Listener and a System Monitor observe the physical execution plan running over input data in HDFS, collecting traces at three levels: user objectives (related measures), application (Spark-related metrics), and system (CPU, memory, IO, network, …). The Dataflow Optimizer takes the dataflow program and objectives {F1, …, Fk} and issues configurations Θ(0), Θ(1), …, where each Θ(i) = <batch interval, block interval, parallelism, input rate>.]
14. 13
Trace Collection for Cost Modeling & Optimization
Significant change of user preference, e.g., switching to the replay mode: Θ(2)
Optimization time: T_Θ(i) − T_(ith preference change)
• T_Θ(1): job first seen; workload encoding and optimization
• T_Θ(i), i>1: job already seen; optimization
15. 14
Inside the Dataflow Optimizer
[Figure: inside the Dataflow Optimizer. The Trace Collector feeds observations and a history of {Θ} to In-situ Modeling, which trains objective models Ψ1, …, Ψk (or reuses them with no training). Given the dataflow program and objectives {F1, …, Fk}, the Multi-Objective Optimizer computes the Pareto frontier and, if the plan should change, emits a new job configuration Θ(1).]
Observations:
• No manual feature engineering
• No user labels
16. 16
1. Building a Cost Model for each User Objective
Given a dataflow, we assume that there is a job-specific encoding Wj (of its key characteristics) which is invariant to time and runtime parameters.
Definition: For a job j, the workload encoding Wj is a real-valued vector satisfying three conditions:
1. Invariance: for the same logical dataflow program, Wj should be (approximately) an invariant during a period of time.
2. Reconstruction: Wj should carry all the information for reconstructing the observations (traces), given a specific job configuration Θ_j^i.
3. Similarity preserving: for similar programs i and j, their encodings Wi and Wj should also be similar.
17. 17
Formal Description of the Predictive Model
Under our assumption, the model for a user objective (e.g., latency) can be abstracted as a deterministic function:

Ψ(Θ_j^i, W_j) = l_j^i

Interpreted as an optimization problem, our goal is to find a function Ψ s.t.

Ψ* = argmin_Ψ avg( distance( Ψ(Θ_j^i, W_j), l_j^i ) )

where the average is taken over all training instances, i.e., all (job, configuration) combinations.
Challenge: building the Ψ function is easy if both W_j and Θ_j^i are known,
but the workload encoding W_j is unknown in practice!
We propose neural network architectures that can
(1) extract the workload encoding for each new job;
(2) predict a user objective given a new job configuration.
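The learning problem above can be made concrete with a toy sketch: jointly fit per-job encodings W_j and a shared regressor Ψ by gradient descent on the prediction error. Everything here (the linear form of Ψ, the synthetic data, all variable names) is our simplification for illustration; the actual approach uses neural architectures.

```python
import numpy as np

# Toy sketch: learn a per-job workload encoding W_j jointly with a shared
# regressor Psi that predicts latency from (configuration theta, W_j).
# Psi(theta, W_j) = a*theta + W_j, kept linear purely for brevity.

rng = np.random.default_rng(0)

n_jobs = 3
true_job_effect = np.array([1.0, 4.0, 7.0])        # hidden per-job workload
thetas = rng.uniform(0.0, 1.0, size=(n_jobs, 20))  # 20 observed configs/job
latency = true_job_effect[:, None] + 2.0 * thetas  # synthetic ground truth

W = np.zeros(n_jobs)   # learned workload encodings (one scalar per job)
a = 0.0                # learned regressor weight on theta

lr = 0.1
for _ in range(2000):
    pred = W[:, None] + a * thetas     # Psi(theta, W_j)
    err = pred - latency
    W -= lr * err.mean(axis=1)         # gradient step on each encoding
    a -= lr * (err * thetas).mean()    # gradient step on the regressor

mse = float(((W[:, None] + a * thetas - latency) ** 2).mean())
print(mse)  # near zero: encodings and regressor are recovered jointly
```

Even though no job ever reveals its encoding W_j directly, minimizing the average prediction distance recovers both the encodings and the regressor, which is exactly the structure of the Ψ* objective above.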
18. 18
Two Types of Architecture
1. Embedding
2. Autoencoder + Regressor
19. 21
Results on Modeling
                       SQL-like Jobs        ML Jobs
                       T1       T2          T1       T2
Embedding              9.7%     33.3%       16.5%    16.2%
Autoencoder            11.6%    22.0%       24.9%    53.9%
Autoencoder + Opt2     11.6%    13.4%       24.9%    25.0%
Autoencoder + Opt1     8.5%     10.3%       2.9%     2.6%
Autoencoder + Opt1/2   8.5%     8.4%        2.9%     3.6%
HadoopStreaming [17]   31.9%    31.9%       84.6%    84.6%

Table 4: Summary of T1 and T2 errors for latency.
The major results are: (1) Very low values of do not permit effective optimization in many cases. (2) around 5% seems to be a reasonable threshold. (3) ML jobs benefit from the optimization more than SQL jobs because ML jobs…
52 SQL-like jobs from 5 parameterized templates for click stream analysis, which use different group-bys, joins, global or windowed aggregates, and user-defined functions: 6 intensive jobs (each with 455 configurations) and 46 regular jobs (each with 25 configurations).
12 ML jobs based on binary classification using Stochastic Gradient Descent, each with 25 configurations.
20. 22
Results on Modeling
Prediction accuracy on familiar jobs
Prediction accuracy on new (first-seen) jobs
21. 23
2. A New Multi-Objective Optimizer
❖ Given a set of user objectives, a Multi-Objective Optimizer uses the predicted performance measures to construct a Pareto-optimal set (frontier).
❖ The skyline offers insights on tradeoffs:
• e.g., two configurations consume the same amount of resources, but one achieves 20% higher throughput with only a 1% loss of latency.
❖ It finally chooses one optimal configuration to set the system parameters (e.g., degree of parallelism, granularity of scheduling, input rate) for execution.
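Choosing the final configuration from the Pareto frontier can be sketched (our illustration; the talk does not prescribe this selection policy) as minimizing a weighted sum of normalized objectives:

```python
# Pick one configuration from a Pareto frontier by minimizing a weighted
# sum of min-max-normalized objectives (all objectives "smaller is better").

def pick_configuration(frontier, weights):
    """frontier: list of (config, objectives); weights: one per objective."""
    objs = [obj for _, obj in frontier]
    # Normalize each objective to [0, 1] across the frontier so that the
    # weights express relative importance rather than units.
    lo = [min(vals) for vals in zip(*objs)]
    hi = [max(vals) for vals in zip(*objs)]

    def score(obj):
        return sum(w * (v - l) / (h - l if h > l else 1.0)
                   for w, v, l, h in zip(weights, obj, lo, hi))

    return min(frontier, key=lambda p: score(p[1]))[0]

# Hypothetical Pareto points: (name, (latency_sec, dollars_per_day))
frontier = [("A", (1.0, 800.0)), ("B", (5.0, 300.0))]
print(pick_configuration(frontier, weights=(0.9, 0.1)))  # latency-heavy -> A
print(pick_configuration(frontier, weights=(0.1, 0.9)))  # cost-heavy   -> B
```

Shifting the weights moves the chosen point along the frontier, which is how user preferences (e.g., "latency matters most") can be turned into one concrete configuration.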
23. 28
Messages
❖ In-situ modeling based on deep learning has the potential to model any user objective in a given computing environment.
❖ Multi-objective optimization has the potential to explore tradeoffs and automatically move in the direction of better performance.