This document summarizes Holden Karau's presentation on building recoverable pipelines with Apache Spark. The presentation explored ways that Spark jobs can fail late, presented initial attempts to make a WordCount job recoverable, and discussed improvements to the approach using non-blocking saves and the Spark DAG. The presentation concluded with recommendations to replace WordCount with a real pipeline and clean up files, as well as links for learning more about Spark.
2. Some links (slides & recordings will be at):
http://bit.ly/2QMUaRc
^ Slides & Code
(only after the talk because early is hard)
Shkumbin Saneja
3. Holden:
▪ Preferred pronouns are she/her
▪ Developer Advocate at Google
▪ Apache Spark PMC/Committer, contribute to many other projects
▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
▪ co-author of Learning Spark & High Performance Spark
▪ Twitter: @holdenkarau
▪ Slide share http://www.slideshare.net/hkarau
▪ Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
▪ Spark Talk Videos http://bit.ly/holdenSparkVideos
5. Who y’all are?
▪ Nice folk
▪ Like databases of a certain kind
▪ Occasionally have big data jobs on your big data fail
mxmstryo
6. What are we going to explore?
▪ Brief: what is Spark and why it’s related to this conference
▪ Also brief: Some of the ways Spark can fail in hour 23
▪ Less brief: a first stab at making it recoverable
▪ How that goes boom
▪ Repeat ? times until it stops going boom
▪ Summary and github link
Stuart
7. What is Spark?
• General purpose distributed system
• With a really nice API including Python :)
• Apache project (one of the most active)
• Much faster than Hadoop Map/Reduce
• Good when too big for a single machine
• Built on top of two abstractions for distributed data: RDDs & Datasets
8. The different pieces of Spark (layered diagram)
▪ Apache Spark core
▪ SQL, DataFrames & Datasets
▪ Structured Streaming
▪ Spark ML & MLLib
▪ Streaming
▪ bagel & GraphX
▪ GraphFrames
▪ Language bindings: Scala, Java, Python, & R
Paul Hudson
9. Why people come to Spark:
Well this MapReduce job
is going to take 16 hours -
how long could it take to
learn Spark?
dougwoods
10. Why people come to Spark:
My DataFrame won’t fit in
memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
11. Big Data == Wordcount
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Chris
12. Big Data != Wordcount
▪ ETL (keeping your databases in sync)
▪ SQL on top of non-SQL (hey, what if we added a SQL engine to this?)
▪ ML - everyone’s doing it, we should too
▪ DL - VCs won’t give us money for ML anymore, so we changed its name
▪ But for this talk we’re just looking at Wordcount, because it fits on a slide
14. Why Spark fails & fails late
▪ Lazy evaluation can make predicting behaviour difficult
▪ Out of memory errors (from JVM heap to container limits)
▪ Errors in our own code
▪ Driver failure
▪ Data size increases without required tuning changes
▪ Key-skew (consistent partitioning is a great idea right? Oh wait…)
▪ Serialization
▪ Limited type checking in non-JVM languages with ML pipelines
▪ etc.
16. Why isn’t it recoverable?
▪ Separate jobs - no files, no VMs, only sadness
▪ If it’s the same job (e.g. a notebook failure and retry), cache & files
can help us recover
Jennifer C.
17. “Recoverable” Wordcount: Take 1
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw
# Continue with previous code
KLMircea
18. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
▪ We don’t have any clean up on success
▪ sc._jvm is weird
▪ Functions -- the future!
▪ Not async
Jennifer C.
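The “sc._jvm is weird” complaint can at least be contained behind small helpers. A minimal sketch — the names `hadoop_fs_path` and `get_hadoop_fs` are assumptions here, chosen to match how the later slides use them:

```python
# Sketch: hide the Py4J plumbing behind two tiny helpers.
# Names (hadoop_fs_path, get_hadoop_fs) are illustrative, not from the talk.

def hadoop_fs_path(path_str):
    """Build a Hadoop Path object from a plain string."""
    return sc._jvm.org.apache.hadoop.fs.Path(path_str)

def get_hadoop_fs():
    """Fetch the Hadoop FileSystem using the job's configuration."""
    conf = sc._jsc.hadoopConfiguration()
    return sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# Usage inside a live Spark job:
# fs = get_hadoop_fs()
# if fs.exists(hadoop_fs_path("words/_SUCCESS")): ...
```

It’s still `sc._jvm` underneath, but at least the weirdness only lives in one place.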
19. “Recoverable” Wordcount: Take 2
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
success_path = words_path + "/SUCCESS.txt"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(success_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw
# Continue with previous code
Susanne Nilsson
20. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
• Fixed
▪ We don’t have any clean up on success
• ….
▪ sc._jvm is weird
• Yeah we’re not fixing this one unless we use scala
▪ Functions -- the future!
• sure!
▪ Have to wait to finish writing file
• Hold your horses
ivva
21. “Recoverable” [X]: Take 3
def non_blocking_df_save_or_load(df, target):
    success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
    if any(fs.exists(hadoop_fs_path(s.format(target)))
           for s in success_files):
        print("Reusing")
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.write.save(target)
        return df
Jennifer C.
22. So what can we do better?
▪ Try and not slow down our code on the happy path
• async?
▪ Cleanup on success (damn meant to do that earlier)
hkase
24. What could go wrong?
▪ Turns out… a lot
▪ Multiple executions on the DAG are not super great
(getting better but)
▪ How do we work around this?
25. Spark’s (core) magic: the DAG
▪ In Spark most of our work is done by transformations
• Things like map
▪ Transformations return new RDDs or DataFrames representing
this data
▪ The RDD or DataFrame however doesn’t really “exist”
▪ RDD & DataFrames are really just “plans” of how to make the
data show up if we force Spark’s hand
▪ tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
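A rough plain-Python analogy for this laziness: generators, like RDDs and DataFrames, describe a computation without running it until something forces the result — the analogy below is mine, not a slide from the talk:

```python
# Plain-Python analogy for Spark's lazy DAG: building the pipeline
# runs nothing; only "forcing" it (like a Spark action) does work.
log = []

def read_lines():
    for line in ["a b", "b c"]:
        log.append("read")  # record when the "source" is actually read
        yield line

# "Transformations": just a plan -- nothing has been read yet.
words = (w for line in read_lines() for w in line.split(" "))
assert log == []

# "Action": forcing the result executes the whole chain.
result = list(words)
assert log == ["read", "read"]
assert result == ["a", "b", "b", "c"]
```

The same shape holds in Spark: `flatMap` builds the plan; `count()` or `saveAsTextFile()` makes the data show up.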
27. cache + sync count + async save
def non_blocking_df_save_or_load(df, target):
    success_file = "{0}/SUCCESS.txt"
    if fs.exists(hadoop_fs_path(success_file.format(target))):
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.cache()
        df.count()
        non_blocking_df_save(df, target)
        return df
28. Well that was “fun”?
▪ Replace wordcount with your back-fill operation and it
becomes less fun
▪ You also need to clean up the files
▪ Use job IDs to avoid stomping on other jobs
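One way to sketch those last two bullets — scratch paths scoped by a job ID, deleted on success. The names and layout below are illustrative, not from the talk:

```python
import uuid

def checkpoint_path(base, job_id, stage):
    """Scope each checkpoint under a per-run job ID so concurrent
    runs of the same pipeline don't stomp on each other's files."""
    return "{0}/{1}/{2}".format(base, job_id, stage)

job_id = str(uuid.uuid4())  # or a run ID from your scheduler
path = checkpoint_path("/tmp/checkpoints", job_id, "words")

# On success, delete the whole job's scratch directory, e.g.:
# fs.delete(hadoop_fs_path("/tmp/checkpoints/" + job_id), True)
```

Deleting by job ID rather than per-stage means one cleanup call at the end, and a failed run’s files stay around for the retry to reuse.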
29. Spark Videos
▪ Apache Spark Youtube Channel
▪ My Spark videos on YouTube -
• http://bit.ly/holdenSparkVideos
▪ Spark Summit 2014 training
▪ Paco’s Introduction to Apache Spark
Paul Anderson
30. Learning Spark
▪ Fast Data Processing with Spark (out of date)
▪ Fast Data Processing with Spark (2nd edition)
▪ Advanced Analytics with Spark
▪ Spark in Action
▪ High Performance Spark
▪ Learning PySpark
31. I also have a book...
High Performance Spark, it’s available today & the gift of
the season.
Unrelated to this talk, but if you have a corporate credit
card (and or care about distributed systems)….
http://bit.ly/hkHighPerfSpark
32. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 Spark testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me?
Promise not to be creepy? Ok:
holden@pigscanfly.ca