Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.
This talk will discuss algorithms that run in real-time streaming pipelines, as well as ML models built in batch, to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application; (ii) auto-tuning SLA-bound Spark streaming pipelines; (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables; (iv) picking cost-effective machine types and container sizes to run Spark workloads on the AWS, Azure, and Google clouds; and more.
2. Meet the speaker
• Cofounder/CTO at Unravel; Adjunct Professor of Computer Science at Duke University
• Focusing on ease-of-use and manageability of data-intensive systems
• Recipient of the US National Science Foundation CAREER Award, three IBM Faculty Awards, and the HP Labs Innovation Research Award
3. Why is it always a fight to run distributed applications in production?
4. My app often fails with Out of Memory…
DATA SCIENTIST
#AI7SAIS
5. How can I make it more reliable?
DATA SCIENTIST
7. I need to make it faster…
DATA ENGINEER
8. My app is missing SLA…
DATA PIPELINE OWNER
9. How can I tune my app to guarantee SLAs?
DATA PIPELINE OWNER
10. This rogue app is wasting resources and burning money on the cloud
OPERATIONS TEAMS
11. Can this app use fewer resources and cost less while finishing on time?
OPERATIONS TEAMS
12. Current approach: Find the needle in the hay STACKS

1 System, 1 User, 1 App:
1. Find the app: review the Spark/YARN UI to locate it
2. Review metrics on the web UI
3. Review the stages associated with the app
4. Identify the executors associated with those stages, and any outliers
5. Deep dive into the "outlier" stage
6. Identify the problematic "executors"
7. Review and debug the container logs
8. Rinse & repeat across other executor/container logs to identify the problem
13. Now imagine the problem at scale

1 System → 10x Systems
1 User → 100x Users
1 App → 1000x Apps
14. Imagine a new world

INPUTS:
1. App = Spark Query
2. Goal = Speedup

"I need to make this app faster"
15. Imagine a new world

(Chart: app duration dropping over time.) In the blink of an eye, the user gets recommendations to make the app 30% faster. By the time she finishes checking email, she has a verified run that is 60% faster. When she comes back from lunch, there is a verified run that is 90% faster.
16. We want to bring the new world to you today: introducing Unravel Sessions
21. Given: App + Goal

• Goal: find the setting of X that best meets the goal
• Challenge: the response surface y = ƒ(X) is unknown

(Chart: performance as a function of the setting X.)
23. Model the response surface as a Gaussian Process

Challenge: the response surface y = ƒ(X) is unknown. The Gaussian Process model captures the uncertainty in our current knowledge of the response surface:

    ŷ(X) = ƒ(X)ᵗβ + Z(X)

Here, ƒ(X)ᵗβ is a regression model, and Z(X) is the residual, captured as a Gaussian Process.

(Chart: performance as a function of the setting X, with uncertainty bands.)
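To make the model concrete, here is a minimal numeric sketch (not Unravel's implementation) of ŷ(X) = ƒ(X)ᵗβ + Z(X), with the regression term reduced to a constant mean β and the residual Z(X) modeled as a zero-mean Gaussian Process with a squared-exponential kernel. The knob X, the probe data, and the kernel length scale are all invented for illustration:

```python
import numpy as np

# Hypothetical probe data: X is one tunable knob (say, executor memory
# in GB) and y is the measured app duration in minutes. Values invented.
X_obs = np.array([2.0, 4.0, 8.0, 12.0])
y_obs = np.array([9.0, 6.5, 4.0, 5.5])

def rbf(a, b, length_scale=2.0):
    """Squared-exponential kernel: nearby settings have correlated outcomes."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

# "Fit": beta is the constant regression term; the GP models the residual.
beta = y_obs.mean()
K = rbf(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))   # jitter for stability
K_inv = np.linalg.inv(K)
alpha = K_inv @ (y_obs - beta)

def gp_predict(X_new):
    """Posterior mean and standard deviation of y-hat at new settings X."""
    k_star = rbf(X_new, X_obs)
    mean = beta + k_star @ alpha
    var = 1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

At a probed setting the model reproduces the observation with near-zero uncertainty; between probes the standard deviation grows, which is exactly the uncertainty the probe-selection step exploits.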
24. The Gaussian Process model helps estimate EIP(X)

We can now estimate the expected improvement EIP(X) from doing a probe at any setting X:

    EIP(X) = ∫_{p=−∞}^{ŷ(X*)} ( ŷ(X*) − p ) · pdf(p) dp

Here, ŷ(X*) − p is the improvement at setting X over the best performance seen so far, and pdf(p) is the probability density function (the uncertainty estimate).

(Chart: performance as a function of X, with the improvement opportunity highlighted.)
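For a Gaussian predictive distribution, the EIP(X) integral above has a well-known closed form. A stdlib-only sketch, assuming lower performance numbers are better (e.g. app duration), with ŷ(X*) passed in as best_y:

```python
import math

def expected_improvement(best_y, mean, std):
    """EIP(X) = integral_{-inf}^{best_y} (best_y - p) pdf(p) dp in closed
    form, when the prediction at X is Gaussian N(mean, std^2). best_y is
    the best (lowest) performance seen so far, e.g. shortest duration."""
    if std <= 0.0:
        # No uncertainty: improvement is deterministic (or zero).
        return max(best_y - mean, 0.0)
    z = (best_y - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (best_y - mean) * cdf + std * pdf
```

EIP grows both when the predicted mean beats the best run so far (exploitation) and when the prediction is uncertain (exploration).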
25. Probe Algorithm

1. Bootstrap: get an initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Select the next probe Xnext based on all history and probe data available so far, by calculating the setting with maximum expected improvement EIP(X)
3. Repeat step 2 until the stopping condition is reached
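The bootstrap-then-probe loop above can be sketched end to end. This toy is not Unravel's algorithm: the app being tuned is a made-up quadratic duration(x), the stopping condition is a fixed probe budget, and for brevity the surrogate model is a crude nearest-neighbor stand-in for the Gaussian Process (mean taken from the nearest probe, uncertainty growing with distance to it):

```python
import math

def toy_duration(x):
    """Made-up app duration (minutes) as a function of one knob x."""
    return (x - 8.0) ** 2 / 4.0 + 3.0

def expected_improvement(best_y, mean, std):
    """Closed-form EIP for a Gaussian N(mean, std^2), lower-is-better."""
    if std <= 0.0:
        return 0.0
    z = (best_y - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_y - mean) * cdf + std * pdf

def probe_loop(bootstrap_xs, budget=6):
    # Step 1 (bootstrap): initial <X, y> pairs from history or probes.
    history = [(x, toy_duration(x)) for x in bootstrap_xs]
    candidates = [1.0 + 0.5 * i for i in range(31)]   # grid over the knob
    # Steps 2-3: probe the setting with maximum EIP(X), repeating until
    # the stopping condition (here a fixed probe budget) is reached.
    for _ in range(budget):
        best_y = min(y for _, y in history)
        def eip(x):
            # Crude surrogate: mean = y of the nearest probe so far,
            # uncertainty grows with distance from the nearest probe.
            dist, mean = min((abs(x - hx), hy) for hx, hy in history)
            return expected_improvement(best_y, mean, 0.5 * dist)
        x_next = max(candidates, key=eip)
        history.append((x_next, toy_duration(x_next)))  # run the probe
    return min(history, key=lambda pair: pair[1])       # best <X, y> found
```

Starting from two bootstrap probes at the edges of the range, the first EIP-selected probe lands in the unexplored middle, where uncertainty is highest.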
26. Exploration vs. Exploitation

(Charts: four snapshots of the Gaussian Process fit over x1, and of EIP(X), as probes accumulate.)

Xnext: do the next probe where EIP(X) is maximum. This approach balances exploration (probing settings with high uncertainty) vs. exploitation (probing near the best settings seen so far).
29. Probe Algorithm: Has to deal with a diverse space of data & uncertainty

• What is the goal? Make the app faster / make the app reliable / make the app meet its SLA / make the app use fewer resources
• What is the app? SQL / program / DAG / workload
• Are trained models available? Yes / no / partially
• Is similar historic data available? No / a little / a lot
• How soon does the user need the answer? Now / soon / by a deadline
31. Lessons Learned Building the Unravel Sessions Architecture

• Probe Algorithm: use model ensembles
• Orchestrator: use cheap probes when possible
• Monitoring data: use full-stack data collection
• Recommendation Algorithm: keep the user in the loop
32. Lessons Learned: Cheap Probes

(Diagram: the Orchestrator picks Xnext using historic data, probe data, and monitoring data collected from cluster services, on-premises and in the cloud.)

The Orchestrator can leverage cost-based optimizers like Catalyst as an approximate, but very fast, performance model in some cases.
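A hypothetical sketch of the idea (every function here is an invented stub, not a real Catalyst or Unravel API): the orchestrator asks a fast approximate cost model first, and spends a real probe run only on candidates the cheap model scores as promising, or cannot score at all:

```python
def cheap_cost_estimate(setting):
    """Fast, approximate cost model (stub standing in for a cost-based
    optimizer). Returns None when it cannot score this kind of setting."""
    if "shuffle_partitions" not in setting:
        return None
    return 100.0 / setting["shuffle_partitions"]

def expensive_probe(setting):
    """Actually run the app with this setting and measure it (stub)."""
    return 100.0 / setting.get("shuffle_partitions", 200) + 0.1

def orchestrate(candidates, threshold=0.4):
    """Probe only candidates the cheap model rates promising (estimated
    cost below the threshold) or cannot rate at all; skip the rest."""
    results = []
    for setting in candidates:
        estimate = cheap_cost_estimate(setting)
        if estimate is not None and estimate > threshold:
            continue            # cheap model says: not worth a real run
        results.append((setting, expensive_probe(setting)))
    return results
```

The design point is the fallback: an approximate model prunes the search cheaply, but candidates outside its scope still get a real probe rather than being dropped.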
36. In Summary

There are exciting opportunities to apply AI to manage the performance of modern data apps.

Sign up for a free trial; we value your feedback!
https://unraveldata.com/free-trial

And yes, we are hiring.