Today’s big data analytics systems are best-effort only: despite their wide adoption, they still lack the ability to take user monetary constraints and performance goals and automatically configure an analytic job to achieve those goals. Our work takes a step toward a new data analytics optimizer that works for arbitrary dataflow programs and determines the job configuration automatically, based on user objectives such as latency, throughput, and monetary cost.
At the core of the optimizer are a principled multi-objective optimization framework that enables one to explore the tradeoffs between different objectives, and a deep-learning-based modeling approach that can learn a model for each user objective, as complex as necessary for the user’s computing environment. Using both SQL-like and machine learning jobs in Spark, we show that our techniques learn a model of each objective with high accuracy, and that the multi-objective optimizer automatically recommends new configurations that significantly outperform the configurations manually set by engineers.
2. 1
Today’s Data Analytics Service
Cloud Computing (EC2, ~60 instance types):
t2.nano, t2.micro, t2.small, t2.medium, t2.large, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, m3.medium, m3.large, m3.xlarge, m3.2xlarge, c4.large, c4.2xlarge, …
SELECT C.uid, avg(P.pagerank)
FROM Clicks C, Pages P
WHERE C.url = P.url
GROUP BY C.uid
HAVING avg(P.pagerank) > 0.5
ORDER BY avg(P.pagerank)
Prediction
Training of a classifier
Sessionization
3. 2
Today’s Data Analytics Service
[Figure: two physical execution plans, drawn as DAGs of data sources (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Degree of parallelism
Memory size
Granularity of scheduling
Latency Throughput Monetary cost
5. 4
Today’s Data Analytics Service
Today’s cloud service guarantees availability (SLA),
but too many configuration issues are left to the user…
6. 5
A New Data Analytics Service
Latency Throughput Monetary cost
[Figure: two physical execution plans, drawn as DAGs of data sources (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Degree of parallelism
Memory size
Granularity of scheduling
7. 6
Dataflow Optimizer
Bring database optimization
to a broader class of analytics
job configuration
cloud instance
A New Data Analytics Service
Latency Throughput Monetary cost
8. 7
Dataflow Optimizer: dataflow programs
Sessionization
Training a classifier
Prediction
logical optimization of the query plan
Latency Throughput Monetary cost …
9. 8
Dataflow Optimizer: from Logical to Physical Plans
Sessionization
Training a classifier
Prediction
[Figure: a physical execution plan, a DAG of data source (D), map (M), combine (C), and reduce (R) tasks connected by queues]
Hadoop Streaming [Li et al 2011,2012,2015]
Degree of parallelism θp
Granularity of scheduling θs
Memory per executor …
Logical Plans → Physical Plans
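As a concrete (hedged) illustration of the knobs above: in Spark they correspond to real configuration settings, shown below with arbitrary example values; the mapping to θp and θs is ours, not prescribed by the talk.

```
# spark-defaults.conf (illustrative values, not recommendations)
spark.default.parallelism      200      # degree of parallelism (θp)
spark.executor.memory          4g       # memory per executor
spark.streaming.blockInterval  200ms    # block granularity for Spark Streaming tasks (θs)
```

The streaming batch interval is set in code rather than in spark-defaults.conf, e.g. `StreamingContext(sc, 1)` in PySpark for 1-second batches.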
10. 9
Challenge 1: Cost Models for the Dataflow Optimizer
SQL Optimizer:
• Cost model for relational algebra
• Tends to run on homogeneous hardware
• System behaviors are often centered on CPU and IO
• Cost model for resource consumption (CPU and IO counters)

Dataflow Optimizer:
• Cost model for arbitrary dataflow programs (in Java, Scala, Python, …)
• Cloud computing may employ heterogeneous hardware
• Modeling diverse system behaviors, including CPU, IO, shuffling, queuing, data skew, stragglers, failures…
• Cost models for any user-defined objectives (latency, throughput, cost, …)

Building a general cost model for any user objective is hard!
11. 10
Challenge 2: Need A New Multi-Objective Optimizer
Latency versus Throughput: 1M tuples/sec with 1 sec latency vs. 1K tuples/sec with 0.1 sec latency
Latency versus Monetary Cost: 1 sec latency at $800/day vs. 5 sec latency at $300/day
Multi-Objective Optimization
• Solution is a (Pareto) set
• Reveals interesting tradeoffs
[Figure: Pareto frontier over latency, throughput, and monetary cost for configurations <θp, θs, λ, …>]
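The Pareto set above can be computed with a simple dominance check. A minimal sketch (our illustration, not the optimizer’s actual algorithm), assuming every objective is expressed as a quantity to minimize (e.g., negate throughput):

```python
# Minimal Pareto-frontier computation over candidate configurations.
# All objective values are assumed to be "smaller is better".

def pareto_frontier(candidates):
    """Return the configurations not dominated by any other candidate.

    candidates: list of (config, objectives) pairs, where objectives is a
    tuple such as (latency_sec, dollars_per_day).
    """
    def dominates(a, b):
        # a dominates b if a is no worse in every objective and strictly
        # better in at least one.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    return [(cfg, obj) for cfg, obj in candidates
            if not any(dominates(other, obj) for _, other in candidates)]

# Hypothetical measurements: (configuration name, (latency_sec, dollars_per_day))
points = [
    ("A", (1.0, 800.0)),   # low latency, expensive
    ("B", (5.0, 300.0)),   # higher latency, cheap
    ("C", (5.0, 900.0)),   # dominated: A is better on both objectives
]
print([cfg for cfg, _ in pareto_frontier(points)])  # -> ['A', 'B']
```

Configuration C is dominated (A is no slower and cheaper), so only A and B survive on the frontier, exposing the latency/cost tradeoff.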
12. 11
Overview: Cost Modeling for Dataflow Optimization
Dataflow Optimizer
• Cost model for arbitrary dataflow programs (in Java, Scala, …)
• Cloud computing may employ heterogeneous hardware
• Modeling diverse system behaviors including CPU, IO, shuffling, queuing, data skew, stragglers, failures…
• Cost models for any user-defined objectives

❑ Deep Learning
• Representation learning for an arbitrary program
• A (non-linear) function for any objective
❑ In-situ modeling
• Learn a model as complex as necessary for the current computing environment
13. 12
Trace Collection for Cost Modeling & Optimization
[Figure: trace collection. A Spark Listener and a System Monitor observe the physical execution plan running over input data in HDFS, collecting traces at three levels: user objectives (related measures), application (Spark-related metrics), and system (CPU, memory, IO, network, …). The Dataflow Optimizer takes the dataflow program and objectives {F1, …, Fk} and issues configurations Θ(0), Θ(1), …, where each Θ(i) = <batch interval, block interval, parallelism, input rate>.]
14. 13
Trace Collection for Cost Modeling & Optimization
Significant change of user preference, e.g., switching to the replay mode: Θ(2)
Optimization time: T_Θ(i) − T_(ith preference change)
• T_Θ(1): job first seen; workload encoding and optimization
• T_Θ(i), i>1: job already seen; optimization
15. 14
Inside the Dataflow Optimizer
[Figure: inside the Dataflow Optimizer. The Trace Collector feeds observations and a history of {Θ} to In-situ Modeling, which trains objective models Ψ1, …, Ψk (or reuses them with no training). Given the dataflow program and objectives {F1, …, Fk}, the Multi-Objective Optimizer computes the Pareto frontier and, if the plan should change, emits a new job configuration Θ(1).]
Observations:
• No manual feature engineering
• No user labels
16. 16
1. Building a Cost Model for each User Objective
Given a dataflow, we assume that there is a job-specific encoding Wj (of its key characteristics) which is invariant to time and runtime parameters.
Definition: For a job j, the workload encoding Wj is a real-valued vector satisfying three conditions:
1. Invariance: for the same logical dataflow program, Wj should be (approximately) an invariant during a period of time.
2. Reconstruction: Wj should carry all the information for reconstructing the observations (traces), given a specific job configuration Θ_j^i.
3. Similarity preserving: for similar programs i and j, their encodings Wi and Wj should also be similar.
17. 17
Formal Description of the Predictive Model
Under our assumption, the model for a user objective (e.g., latency) can be abstracted as a deterministic function:

Ψ(Θ_j^i, W_j) = l_j^i

Interpreted as an optimization problem, our goal is to find a function Ψ s.t.

Ψ* = argmin_Ψ avg( distance( Ψ(Θ_j^i, W_j), l_j^i ) )

where the average is taken over all training instances, i.e., all (job, configuration) combinations.
Challenge: building the Ψ function is easy if both W_j and Θ_j^i are known,
but the workload encoding W_j is unknown in practice!
We propose neural network architectures that can
(1) extract the workload encoding for each new job;
(2) predict a user objective given a new job configuration.
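The learning problem above can be made concrete with a toy sketch: jointly fit per-job encodings W_j and a shared regressor Ψ by gradient descent on the prediction error. Everything here (the linear form of Ψ, the synthetic data, all variable names) is our simplification for illustration; the actual approach uses neural architectures.

```python
import numpy as np

# Toy sketch: learn a per-job workload encoding W_j jointly with a shared
# regressor Psi that predicts latency from (configuration theta, W_j).
# Psi(theta, W_j) = a*theta + W_j, kept linear purely for brevity.

rng = np.random.default_rng(0)

n_jobs = 3
true_job_effect = np.array([1.0, 4.0, 7.0])        # hidden per-job workload
thetas = rng.uniform(0.0, 1.0, size=(n_jobs, 20))  # 20 observed configs/job
latency = true_job_effect[:, None] + 2.0 * thetas  # synthetic ground truth

W = np.zeros(n_jobs)   # learned workload encodings (one scalar per job)
a = 0.0                # learned regressor weight on theta

lr = 0.1
for _ in range(2000):
    pred = W[:, None] + a * thetas     # Psi(theta, W_j)
    err = pred - latency
    W -= lr * err.mean(axis=1)         # gradient step on each encoding
    a -= lr * (err * thetas).mean()    # gradient step on the regressor

mse = float(((W[:, None] + a * thetas - latency) ** 2).mean())
print(mse)  # near zero: encodings and regressor are recovered jointly
```

Even though no job ever reveals its encoding W_j directly, minimizing the average prediction distance recovers both the encodings and the regressor, which is exactly the structure of the Ψ* objective above.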
18. 18
Two Types of Architecture
1. Embedding
2. Autoencoder + Regressor
19. 21
Results on Modeling
                       SQL-like Jobs        ML Jobs
                       T1       T2          T1       T2
Embedding              9.7%     33.3%       16.5%    16.2%
Autoencoder            11.6%    22.0%       24.9%    53.9%
Autoencoder + Opt2     11.6%    13.4%       24.9%    25.0%
Autoencoder + Opt1     8.5%     10.3%       2.9%     2.6%
Autoencoder + Opt1/2   8.5%     8.4%        2.9%     3.6%
HadoopStreaming [17]   31.9%    31.9%       84.6%    84.6%

Table 4: Summary of T1 and T2 errors for latency.
The major results are: (1) Very low values of do not permit effective optimization in many cases. (2) around 5% seems to be a reasonable threshold. (3) ML jobs benefit from the optimization more than SQL jobs because ML jobs…
52 SQL-like jobs from 5 parameterized templates for click stream analysis, which use different group-bys, joins, global or windowed aggregates, and user-defined functions: 6 intensive jobs (each with 455 configurations) and 46 regular jobs (each with 25 configurations).
12 ML jobs based on binary classification using Stochastic Gradient Descent, each with 25 configurations.
20. 22
Results on Modeling
Prediction accuracy on familiar jobs
Prediction accuracy on new (first-seen) jobs
21. 23
2. A New Multi-Objective Optimizer
❖ Given a set of user objectives, a Multi-Objective Optimizer uses the predicted performance measures to construct a Pareto-optimal set (frontier).
❖ The skyline offers insights on tradeoffs:
• e.g., two configurations consume the same amount of resources, but one achieves 20% higher throughput with only a 1% loss of latency.
❖ It finally chooses one optimal configuration to set the system parameters (e.g., degree of parallelism, granularity of scheduling, input rate) for execution.
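Choosing the final configuration from the Pareto frontier can be sketched (our illustration; the talk does not prescribe this selection policy) as minimizing a weighted sum of normalized objectives:

```python
# Pick one configuration from a Pareto frontier by minimizing a weighted
# sum of min-max-normalized objectives (all objectives "smaller is better").

def pick_configuration(frontier, weights):
    """frontier: list of (config, objectives); weights: one per objective."""
    objs = [obj for _, obj in frontier]
    # Normalize each objective to [0, 1] across the frontier so that the
    # weights express relative importance rather than units.
    lo = [min(vals) for vals in zip(*objs)]
    hi = [max(vals) for vals in zip(*objs)]

    def score(obj):
        return sum(w * (v - l) / (h - l if h > l else 1.0)
                   for w, v, l, h in zip(weights, obj, lo, hi))

    return min(frontier, key=lambda p: score(p[1]))[0]

# Hypothetical Pareto points: (name, (latency_sec, dollars_per_day))
frontier = [("A", (1.0, 800.0)), ("B", (5.0, 300.0))]
print(pick_configuration(frontier, weights=(0.9, 0.1)))  # latency-heavy -> A
print(pick_configuration(frontier, weights=(0.1, 0.9)))  # cost-heavy   -> B
```

Shifting the weights moves the chosen point along the frontier, which is how user preferences (e.g., "latency matters most") can be turned into one concrete configuration.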
23. 28
Messages
❖ In-situ modeling based on deep learning has the potential to model any user objective in a given computing environment.
❖ Multi-objective optimization has the potential to explore tradeoffs and automatically move in the direction of better performance.