This document summarizes Holden Karau's presentation on building recoverable pipelines with Apache Spark. The presentation explored ways that Spark jobs can fail late, presented initial attempts to make a WordCount job recoverable, and discussed improvements to the approach using non-blocking saves and the Spark DAG. The presentation concluded with recommendations to replace WordCount with a real pipeline and clean up files, as well as links for learning more about Spark.
2. Some links (slides & recordings will be at):
http://bit.ly/2QMUaRc
^ Slides & Code
(only after the talk because early is hard)
Shkumbin Saneja
3. Holden:
▪ Preferred pronouns are she/her
▪ Developer Advocate at Google
▪ Apache Spark PMC/Committer, contribute to many other projects
▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
▪ co-author of Learning Spark & High Performance Spark
▪ Twitter: @holdenkarau
▪ Slide share http://www.slideshare.net/hkarau
▪ Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
▪ Spark Talk Videos http://bit.ly/holdenSparkVideos
5. Who y’all are?
▪ Nice folk
▪ Like databases of a certain kind
▪ Occasionally have big data jobs on your big data fail
mxmstryo
6. What are we going to explore?
▪ Brief: what is Spark and why it’s related to this conference
▪ Also brief: Some of the ways Spark can fail in hour 23
▪ Less brief: a first stab at making it recoverable
▪ How that goes boom
▪ Repeat ? times until it stops going boom
▪ Summary and github link
Stuart
7. What is Spark?
• General purpose distributed system
• With a really nice API including Python :)
• Apache project (one of the most active)
• Much faster than Hadoop Map/Reduce
• Good when too big for a single machine
• Built on top of two abstractions for distributed data: RDDs & Datasets
8. The different pieces of Spark (layered diagram)
▪ Apache Spark core
▪ SQL, DataFrames & Datasets
▪ Structured Streaming
▪ Spark ML & MLLib
▪ Streaming
▪ bagel & GraphX
▪ GraphFrames
▪ Language bindings: Scala, Java, Python, & R
Paul Hudson
9. Why people come to Spark:
Well this MapReduce job
is going to take 16 hours -
how long could it take to
learn Spark?
dougwoods
10. Why people come to Spark:
My DataFrame won’t fit in
memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
11. Big Data == Wordcount
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Chris
12. Big Data != Wordcount
▪ ETL (keeping your databases in sync)
▪ SQL on top of non-SQL (hey, what if we added a SQL engine to this?)
▪ ML - everyone’s doing it, we should too
▪ DL - VCs won’t give us money for ML anymore, so we changed its name
▪ But for this talk we’re just looking at Wordcount, because it fits on a slide
14. Why Spark fails & fails late
▪ Lazy evaluation can make predicting behaviour difficult
▪ Out of memory errors (from JVM heap to container limits)
▪ Errors in our own code
▪ Driver failure
▪ Data size increases without required tuning changes
▪ Key-skew (consistent partitioning is a great idea right? Oh wait…)
▪ Serialization
▪ Limited type checking in non-JVM languages with ML pipelines
▪ etc.
16. Why isn’t it recoverable?
▪ Separate jobs - no files, no VMs, only sadness
▪ If it’s the same job (e.g. a notebook failure and retry), cache & files
can help us recover
Jennifer C.
17. “Recoverable” Wordcount: Take 1
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw
# Continue with previous code
KLMircea
18. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
▪ We don’t have any clean up on success
▪ sc._jvm is weird
▪ Functions -- the future!
▪ Not async
Jennifer C.
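The “sc._jvm is weird” complaint can at least be contained behind small helpers. A minimal sketch — the names `hadoop_fs_path` and `get_hadoop_fs` are assumptions here, chosen to match how the later slides use them:

```python
# Sketch: hide the Py4J plumbing behind two tiny helpers.
# Names (hadoop_fs_path, get_hadoop_fs) are illustrative, not from the talk.

def hadoop_fs_path(path_str):
    """Build a Hadoop Path object from a plain string."""
    return sc._jvm.org.apache.hadoop.fs.Path(path_str)

def get_hadoop_fs():
    """Fetch the Hadoop FileSystem using the job's configuration."""
    conf = sc._jsc.hadoopConfiguration()
    return sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# Usage inside a live Spark job:
# fs = get_hadoop_fs()
# if fs.exists(hadoop_fs_path("words/_SUCCESS")): ...
```

It’s still `sc._jvm` underneath, but at least the weirdness only lives in one place.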
19. “Recoverable” Wordcount: Take 2
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
success_path = words_path + "/SUCCESS.txt"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(success_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw
# Continue with previous code
Susanne Nilsson
20. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
• Fixed
▪ We don’t have any clean up on success
• ….
▪ sc._jvm is weird
• Yeah we’re not fixing this one unless we use scala
▪ Functions -- the future!
• sure!
▪ Have to wait to finish writing file
• Hold your horses
ivva
21. “Recoverable” [X]: Take 3
def non_blocking_df_save_or_load(df, target):
    success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
    if any(fs.exists(hadoop_fs_path(s.format(target)))
           for s in success_files):
        print("Reusing")
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.write.save(target)
        return df
Jennifer C.
22. So what can we do better?
▪ Try and not slow down our code on the happy path
• async?
▪ Cleanup on success (damn meant to do that earlier)
hkase
24. What could go wrong?
▪ Turns out… a lot
▪ Multiple executions on the DAG are not super great
(getting better but)
▪ How do we work around this?
25. Spark’s (core) magic: the DAG
▪ In Spark most of our work is done by transformations
• Things like map
▪ Transformations return new RDDs or DataFrames representing
this data
▪ The RDD or DataFrame however doesn’t really “exist”
▪ RDD & DataFrames are really just “plans” of how to make the
data show up if we force Spark’s hand
▪ tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
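A rough plain-Python analogy for this laziness: generators, like RDDs and DataFrames, describe a computation without running it until something forces the result — the analogy below is mine, not a slide from the talk:

```python
# Plain-Python analogy for Spark's lazy DAG: building the pipeline
# runs nothing; only "forcing" it (like a Spark action) does work.
log = []

def read_lines():
    for line in ["a b", "b c"]:
        log.append("read")  # record when the "source" is actually read
        yield line

# "Transformations": just a plan -- nothing has been read yet.
words = (w for line in read_lines() for w in line.split(" "))
assert log == []

# "Action": forcing the result executes the whole chain.
result = list(words)
assert log == ["read", "read"]
assert result == ["a", "b", "b", "c"]
```

The same shape holds in Spark: `flatMap` builds the plan; `count()` or `saveAsTextFile()` makes the data show up.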
27. cache + sync count + async save
def non_blocking_df_save_or_load(df, target):
    success_file = "{0}/SUCCESS.txt"
    if fs.exists(hadoop_fs_path(success_file.format(target))):
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.cache()
        df.count()
        non_blocking_df_save(df, target)
        return df
28. Well that was “fun”?
▪ Replace wordcount with your back-fill operation and it
becomes less fun
▪ You also need to clean up the files
▪ Use job IDs to avoid stomping on other jobs
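One way to sketch those last two bullets — scratch paths scoped by a job ID, deleted on success. The names and layout below are illustrative, not from the talk:

```python
import uuid

def checkpoint_path(base, job_id, stage):
    """Scope each checkpoint under a per-run job ID so concurrent
    runs of the same pipeline don't stomp on each other's files."""
    return "{0}/{1}/{2}".format(base, job_id, stage)

job_id = str(uuid.uuid4())  # or a run ID from your scheduler
path = checkpoint_path("/tmp/checkpoints", job_id, "words")

# On success, delete the whole job's scratch directory, e.g.:
# fs.delete(hadoop_fs_path("/tmp/checkpoints/" + job_id), True)
```

Deleting by job ID rather than per-stage means one cleanup call at the end, and a failed run’s files stay around for the retry to reuse.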
29. Spark Videos
▪ Apache Spark Youtube Channel
▪ My Spark videos on YouTube -
• http://bit.ly/holdenSparkVideos
▪ Spark Summit 2014 training
▪ Paco’s Introduction to Apache Spark
Paul Anderson
30. Learning Spark
▪ Fast Data Processing with Spark (out of date)
▪ Fast Data Processing with Spark (2nd edition)
▪ Advanced Analytics with Spark
▪ Spark in Action
▪ High Performance Spark
▪ Learning PySpark
31. I also have a book...
High Performance Spark, it’s available today & the gift of
the season.
Unrelated to this talk, but if you have a corporate credit
card (and or care about distributed systems)….
http://bit.ly/hkHighPerfSpark
32. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 Spark testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me?
Promise not to be creepy? Ok:
holden@pigscanfly.ca