Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.
This talk will discuss algorithms that run in real-time streaming pipelines, as well as ML models built in batch, to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application; (ii) auto-tuning SLA-bound Spark streaming pipelines; (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables; (iv) picking cost-effective machine types and container sizes to run Spark workloads on the AWS, Azure, and Google clouds; and more.
2. Meet the speaker
• Cofounder/CTO at Unravel; Adjunct Professor of Computer Science at Duke University
• Focusing on ease-of-use and manageability of data-intensive systems
• Recipient of the US National Science Foundation CAREER Award, three IBM Faculty Awards, and the HP Labs Innovation Research Award
3. Why is it always a fight to run distributed applications in production?
4. My app often fails with Out of Memory…
DATA SCIENTIST
#AI7SAIS
5. How can I make it more reliable?
DATA SCIENTIST
7. I need to make it faster…
DATA ENGINEER
8. My app is missing SLA…
DATA PIPELINE OWNER
9. How can I tune my app to guarantee SLAs?
DATA PIPELINE OWNER
10. This rogue app is wasting resources and burning money on the cloud
OPERATIONS TEAMS
11. Can this app use fewer resources and cost less while finishing on time?
OPERATIONS TEAMS
12. Current approach: Find the needle in the hay STACKS

1 System, 1 User, 1 App:
1. Find the app: review the Spark/YARN UI to locate it
2. Review metrics on the web UI
3. Review the stages associated with the app
4. Identify the executors associated with those stages, and any outliers
5. Deep dive into the "outlier" stage
6. Identify the problematic "executors"
7. Review and debug the container logs
8. Rinse & repeat across other executor/container logs to identify the problem
13. Now imagine the problem at scale

1 System → 10x Systems
1 User → 100x Users
1 App → 1000x Apps
14. Imagine a new world

INPUTS:
1. App = Spark Query
2. Goal = Speedup

"I need to make this app faster"
15. Imagine a new world

(Chart: app duration dropping over time.) In the blink of an eye, the user gets recommendations to make the app 30% faster. By the time she finishes checking email, she has a verified run that is 60% faster. When she comes back from lunch, there is a verified run that is 90% faster.
16. We want to bring the new world to you today: introducing Unravel Sessions
21. Given: App + Goal

• Goal: find the setting of X that best meets the goal
• Challenge: the response surface y = ƒ(X) is unknown

(Chart: performance as a function of the setting X.)
23. Model the response surface as a Gaussian Process

Challenge: the response surface y = ƒ(X) is unknown. The Gaussian Process model captures the uncertainty in our current knowledge of the response surface:

    ŷ(X) = ƒ(X)ᵗβ + Z(X)

Here, ƒ(X)ᵗβ is a regression model, and Z(X) is the residual, captured as a Gaussian Process.

(Chart: performance as a function of the setting X, with uncertainty bands.)
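To make the model concrete, here is a minimal numeric sketch (not Unravel's implementation) of ŷ(X) = ƒ(X)ᵗβ + Z(X), with the regression term reduced to a constant mean β and the residual Z(X) modeled as a zero-mean Gaussian Process with a squared-exponential kernel. The knob X, the probe data, and the kernel length scale are all invented for illustration:

```python
import numpy as np

# Hypothetical probe data: X is one tunable knob (say, executor memory
# in GB) and y is the measured app duration in minutes. Values invented.
X_obs = np.array([2.0, 4.0, 8.0, 12.0])
y_obs = np.array([9.0, 6.5, 4.0, 5.5])

def rbf(a, b, length_scale=2.0):
    """Squared-exponential kernel: nearby settings have correlated outcomes."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

# "Fit": beta is the constant regression term; the GP models the residual.
beta = y_obs.mean()
K = rbf(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))   # jitter for stability
K_inv = np.linalg.inv(K)
alpha = K_inv @ (y_obs - beta)

def gp_predict(X_new):
    """Posterior mean and standard deviation of y-hat at new settings X."""
    k_star = rbf(X_new, X_obs)
    mean = beta + k_star @ alpha
    var = 1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

At a probed setting the model reproduces the observation with near-zero uncertainty; between probes the standard deviation grows, which is exactly the uncertainty the probe-selection step exploits.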
24. The Gaussian Process model helps estimate EIP(X)

We can now estimate the expected improvement EIP(X) from doing a probe at any setting X:

    EIP(X) = ∫_{p=−∞}^{ŷ(X*)} ( ŷ(X*) − p ) · pdf(p) dp

Here, ŷ(X*) − p is the improvement at setting X over the best performance seen so far, and pdf(p) is the probability density function (the uncertainty estimate).

(Chart: performance as a function of X, with the improvement opportunity highlighted.)
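For a Gaussian predictive distribution, the EIP(X) integral above has a well-known closed form. A stdlib-only sketch, assuming lower performance numbers are better (e.g. app duration), with ŷ(X*) passed in as best_y:

```python
import math

def expected_improvement(best_y, mean, std):
    """EIP(X) = integral_{-inf}^{best_y} (best_y - p) pdf(p) dp in closed
    form, when the prediction at X is Gaussian N(mean, std^2). best_y is
    the best (lowest) performance seen so far, e.g. shortest duration."""
    if std <= 0.0:
        # No uncertainty: improvement is deterministic (or zero).
        return max(best_y - mean, 0.0)
    z = (best_y - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (best_y - mean) * cdf + std * pdf
```

EIP grows both when the predicted mean beats the best run so far (exploitation) and when the prediction is uncertain (exploration).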
25. Probe Algorithm

1. Bootstrap: get an initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Select the next probe Xnext based on all history and probe data available so far, by calculating the setting with maximum expected improvement EIP(X)
3. Repeat step 2 until the stopping condition is reached
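The bootstrap-then-probe loop above can be sketched end to end. This toy is not Unravel's algorithm: the app being tuned is a made-up quadratic duration(x), the stopping condition is a fixed probe budget, and for brevity the surrogate model is a crude nearest-neighbor stand-in for the Gaussian Process (mean taken from the nearest probe, uncertainty growing with distance to it):

```python
import math

def toy_duration(x):
    """Made-up app duration (minutes) as a function of one knob x."""
    return (x - 8.0) ** 2 / 4.0 + 3.0

def expected_improvement(best_y, mean, std):
    """Closed-form EIP for a Gaussian N(mean, std^2), lower-is-better."""
    if std <= 0.0:
        return 0.0
    z = (best_y - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_y - mean) * cdf + std * pdf

def probe_loop(bootstrap_xs, budget=6):
    # Step 1 (bootstrap): initial <X, y> pairs from history or probes.
    history = [(x, toy_duration(x)) for x in bootstrap_xs]
    candidates = [1.0 + 0.5 * i for i in range(31)]   # grid over the knob
    # Steps 2-3: probe the setting with maximum EIP(X), repeating until
    # the stopping condition (here a fixed probe budget) is reached.
    for _ in range(budget):
        best_y = min(y for _, y in history)
        def eip(x):
            # Crude surrogate: mean = y of the nearest probe so far,
            # uncertainty grows with distance from the nearest probe.
            dist, mean = min((abs(x - hx), hy) for hx, hy in history)
            return expected_improvement(best_y, mean, 0.5 * dist)
        x_next = max(candidates, key=eip)
        history.append((x_next, toy_duration(x_next)))  # run the probe
    return min(history, key=lambda pair: pair[1])       # best <X, y> found
```

Starting from two bootstrap probes at the edges of the range, the first EIP-selected probe lands in the unexplored middle, where uncertainty is highest.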
26. Exploration vs. Exploitation

(Charts: four snapshots of the Gaussian Process fit over x1, and of EIP(X), as probes accumulate.)

Xnext: do the next probe where EIP(X) is maximum. This approach balances exploration (probing settings with high uncertainty) vs. exploitation (probing near the best settings seen so far).
29. Probe Algorithm: Has to deal with a diverse space of data & uncertainty

• What is the goal? Make the app faster / make the app reliable / make the app meet its SLA / make the app use fewer resources
• What is the app? SQL / program / DAG / workload
• Are trained models available? Yes / no / partially
• Is similar historic data available? No / a little / a lot
• How soon does the user need the answer? Now / soon / by a deadline
31. Lessons Learned Building the Unravel Sessions Architecture

• Probe Algorithm: use model ensembles
• Orchestrator: use cheap probes when possible
• Monitoring data: use full-stack data collection
• Recommendation Algorithm: keep the user in the loop
32. Lessons Learned: Cheap Probes

(Diagram: the Orchestrator picks Xnext using historic data, probe data, and monitoring data collected from cluster services, on-premises and in the cloud.)

The Orchestrator can leverage cost-based optimizers like Catalyst as an approximate, but very fast, performance model in some cases.
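A hypothetical sketch of the idea (every function here is an invented stub, not a real Catalyst or Unravel API): the orchestrator asks a fast approximate cost model first, and spends a real probe run only on candidates the cheap model scores as promising, or cannot score at all:

```python
def cheap_cost_estimate(setting):
    """Fast, approximate cost model (stub standing in for a cost-based
    optimizer). Returns None when it cannot score this kind of setting."""
    if "shuffle_partitions" not in setting:
        return None
    return 100.0 / setting["shuffle_partitions"]

def expensive_probe(setting):
    """Actually run the app with this setting and measure it (stub)."""
    return 100.0 / setting.get("shuffle_partitions", 200) + 0.1

def orchestrate(candidates, threshold=0.4):
    """Probe only candidates the cheap model rates promising (estimated
    cost below the threshold) or cannot rate at all; skip the rest."""
    results = []
    for setting in candidates:
        estimate = cheap_cost_estimate(setting)
        if estimate is not None and estimate > threshold:
            continue            # cheap model says: not worth a real run
        results.append((setting, expensive_probe(setting)))
    return results
```

The design point is the fallback: an approximate model prunes the search cheaply, but candidates outside its scope still get a real probe rather than being dropped.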
36. In Summary

There are exciting opportunities to apply AI to manage the performance of modern data apps.

Sign up for a free trial; we value your feedback!
https://unraveldata.com/free-trial

And yes, we are hiring.