5. @holdenkarau
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Maybe somewhat familiar with Tensorflow?
● Maybe somewhat familiar with Beam or Spark or Flink?
Lori Erickson
6. @holdenkarau
What will be covered?
● Why we need big data for deep learning
● The state of Java/Python integration
● And why this matters for Tensorflow
● Tools to simplify this (TFT, TFMA, TFDV, etc.)
● Pipelining & validation
Then choose your own demo or Q&A:
● TensorFlowOnSpark
● Tensorflow Transform on Apache Beam on {Apache Flink, Dataflow}
● Kubeflow w/Spark
7. Part of what led to the success of Spark
● Integrated different tools which traditionally required different systems
○ Mahout, Hive, etc.
● e.g. can use the same system to do ML and SQL
*Often written in Python!
[Diagram: the Apache Spark stack - SQL, DataFrames & Datasets, Structured Streaming, Spark ML, MLlib, Streaming, Bagel & GraphX, GraphFrames; APIs in Scala, Java, Python, & R]
Paul Hudson
8. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
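For a sense of what those two abstractions look like from Python, a minimal sketch (it assumes an existing SparkSession named spark; the sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])          # RDD: low-level distributed collection
df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "animal"])  # DataFrame/Dataset: structured & optimized
print(rdd.sum(), df.count())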
9. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
10. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
11. Why does this all matter?
https://twitter.com/wesmckinn/status/1060623695553683456
cuatrok77
12. What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together from Python?
● Pickling, strings, JSON, XML, oh my!
● ...over Unix pipes, sockets, files, and mmapped files (sometimes in the same program)
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
13. @holdenkarau
The "state" of TF + Big Data
● TensorFlowOnSpark w/basic Apache Arrow
○ Still needs more work
○ New scheduler, with improvements in Spark 3
● Basic TF Transform on Apache Flink via Apache Beam
● New* Beam architecture allowing for better portability & handling dependencies (like Tensorflow)
● feed_dict + scheduler luck
Vladimir Pustovit
14. @holdenkarau
So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● You need to clean your large datasets
● You also (probably)* need some feature preparation
○ Even if you’re looking at mnist.csv you probably have _some_ feature prep
● You need to transform your datasets into the formats your DL framework wants
● Even if you’re just trying to raise some VC money, it's going to go a lot better if you add some keywords about a large proprietary dataset
15. @holdenkarau
TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform TF.Serving (with more coming)
● Alternative 1: Data prep in an "exportable" format and serve with Seldon
○ Yay extra RPCs?
● Alternative 2: piles of custom code re-created at serving time.
○ Yay job security?
Jennifer C.
16. @holdenkarau
How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool
○ I like Apache Spark, some folks like Apache Beam or Flink.
○ So long as it’s not feature_prep.sh
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
17. @holdenkarau
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Written in Python
● Runs on top of Apache Beam
○ Works really well on Dataflow
○ On master this can run on Flink, but has bugs currently.
○ Please don’t use this in production today unless you’re on GCP/Dataflow
○ Python 2 only for now
Kathryn Yengel
18. @holdenkarau
Defining a Transform processing function
def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}
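A hedged sketch of how a preprocessing_fn like this actually gets run on Beam (module paths follow the TF.Transform getting-started guide of this era and vary between releases; raw_data and raw_data_metadata are hypothetical placeholders for your input records and their schema):

import tempfile
import tensorflow_transform.beam as tft_beam

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # Analyze does the full passes over the data (means, vocabularies, quantiles, ...)
    # and Transform applies them; transform_fn is what you later attach at serving time.
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset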
21. @holdenkarau
Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
● Bucketization: tft.apply_buckets, tft.quantiles
● Feature Crosses: tft.string_to_int, tf.string_join, ...
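As a hedged example of the bucketization case (the feature name 'trip_miles' is made up, and tft.quantiles / tft.apply_buckets signatures vary a bit across TFT releases):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    miles = inputs['trip_miles']
    # Full-pass quantile boundaries over the whole dataset, then map each value to its bucket
    boundaries = tft.quantiles(miles, num_buckets=10, epsilon=0.01)
    return {'trip_miles_bucket': tft.apply_buckets(miles, boundaries)}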
22. @holdenkarau
BEAM Beyond the JVM: Current release
● Works pretty well on Dataflow
● non-JVM BEAM on Apache Flink is relatively early stages
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
Emma
23. @holdenkarau
BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary languages to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello container hell)
○ Also, for now, rolling your own containers leaves something to be desired (e.g. edit the Dockerfile by hand)
● Hacked up Python SDK works with the new interface
● Go SDK talks to the new interface, still missing some features
Nick
26. @holdenkarau
Sample of the chicago taxi data:
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])

for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE,
        num_oov_buckets=taxi.OOV_SIZE)

for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
                                       taxi.FEATURE_BUCKET_COUNT)
28. @holdenkarau
This seems complicated, options?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
30. @holdenkarau
PySpark
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less of a Pythonrific API
● Has some serious performance hurdles from the design
33. @holdenkarau
So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● Features aren’t automatically exposed, but exposing them is normally simple
35. @holdenkarau
The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
37. @holdenkarau
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
38. @holdenkarau
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
39. @holdenkarau
Rewriting your code because why not
spark.catalog.registerFunction(
    "add", lambda x, y: x + y, IntegerType())

=>

add = pandas_udf(lambda x, y: x + y, IntegerType())
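Roughly how the vectorized version gets used - a sketch assuming Spark 2.3+ with PyArrow installed; df and its columns x & y are hypothetical:

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

add = pandas_udf(lambda x, y: x + y, IntegerType())
# x and y arrive as whole pandas Series batches (shipped via Arrow) rather than pickled rows
df.withColumn("sum", add(df.x, df.y)).show()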
Jennifer C.
40. @holdenkarau
And we can do this in TFOnSpark*:
unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info,
                                            self.cluster_meta, qname))
Will Transform Into something magical (aka fast but unreliable) on the next slide!
Delaina Haslam
41. @holdenkarau
Which becomes
import pandas

train_func = TFSparkNode.train(self.cluster_info,
                               self.cluster_meta, qname)

@pandas_udf("int")
def do_train(inputSeries1, inputSeries2):
    # Sad hack for now
    modified_series = map(lambda x: (x[0], x[1]),
                          zip(inputSeries1, inputSeries2))
    train_func(modified_series)
    return pandas.Series([0] * len(inputSeries1))
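And a hedged sketch of kicking it off (featuresDF and its two columns are hypothetical; the count() is only there to force the training UDF to run over every batch of data):

featuresDF.select(do_train(featuresDF["input1"], featuresDF["input2"])).count()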
ljmacphee
42. @holdenkarau
And this now looks like:
Logos trademarks of their respective projects
Juha Kettunen
*ish
43. @holdenkarau
So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● At some point you want to get the data from your big data tool into Tensorflow
● Worst case: you can write out a bunch of files and read them back in
● Possibly better case: you use the things we talked about
44. Dask: a new beginning?
● Pure* python implementation
● Provides a real enough DataFrame interface for distributed data (quick sketch below)
○ Much more like a pandas DataFrame than Spark’s DataFrames
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends make this better, but it’s still a bit rough
● There is a proof-of-concept to bootstrap a dask cluster on Spark
● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html
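A quick sketch of that pandas-ish interface (the file glob and column names are made up):

import dask.dataframe as dd

df = dd.read_csv("taxi-2019-*.csv")              # lazily partitioned across files
by_type = df.groupby("payment_type").fare.mean()
print(by_type.compute())                         # compute() triggers the distributed work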
Lisa Zins
45. @holdenkarau
Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Check out TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but the steps are similar
○ But you’re not using this in production today anyways?
○ Right?
● Automate your pipeline so you don't have to run it every week by hand
● Validate that your models aren't getting worse
Nick Perla
49. @holdenkarau
What are those pipelines?
“Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.” - kubeflow.org
Directed Acyclic Graph (DAG) of “pipeline components” (read “docker containers”), each performing a function.
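A hedged sketch of what such a DAG looks like with the kfp v1-style SDK; the step names and container images are made up, and the ContainerOp-style API has changed across kfp versions:

from kfp import dsl, compiler

@dsl.pipeline(name="feature-prep-and-train",
              description="TFT feature prep then TF training")
def taxi_pipeline():
    prep = dsl.ContainerOp(name="feature-prep",
                           image="gcr.io/my-project/tft-prep:latest")
    train = dsl.ContainerOp(name="train",
                            image="gcr.io/my-project/tf-train:latest")
    train.after(prep)  # the DAG edge: training waits for feature prep

compiler.Compiler().compile(taxi_pipeline, "taxi_pipeline.tar.gz")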
54. @holdenkarau
So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you didn’t, you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can’t do that every time; our pipelines are no longer write-once run-once, they are often write-once, run-forever, and debug-forever.
55. @holdenkarau
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; you can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
56. @holdenkarau
So what does that look like?
val parsed = data.flatMap { x =>
  try {
    val result = Some(parse(x))
    happyCounter.add(1)
    result
  } catch {
    case _: Exception =>
      sadCounter.add(1)
      None // What is it? It's JSON
  }
}
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
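A rough PySpark equivalent of the same pattern (parse() and the input RDD data are hypothetical, and a SparkContext sc is assumed; the counters are plain Spark accumulators):

happy_counter = sc.accumulator(0)
sad_counter = sc.accumulator(0)

def parse_or_count(x):
    try:
        result = parse(x)
        happy_counter.add(1)
        return [result]
    except Exception:
        sad_counter.add(1)
        return []  # flatMap drops the record; the counter remembers it

parsed = data.flatMap(parse_or_count)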
Pager photo by Vitachao CC-SA 3
Phoebe Baker
57. @holdenkarau
General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but an interesting proof of concept
○ I have an updated variant of it that is going through our OSS release process internally
● Sometimes your rules will misfire and you’ll need to manually approve a job
● Do you have property tests? Those could be validation rules
● Historical data
○ what did your counters look like yesterday? (sketch below)
● Domain specific solutions
○ The best, but also the most work
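One illustrative flavour of a "historical data" rule (not the spark-validator API, just a sketch): compare today's counter values against yesterday's and refuse to publish if they moved too far.

def within_historical_range(today, yesterday, max_drop=0.5, max_growth=2.0):
    # e.g. today = {"records_written": 9500}, yesterday = {"records_written": 10000}
    return all(yesterday[k] * max_drop <= v <= yesterday[k] * max_growth
               for k, v in today.items())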
Photo by:
Paul Schadler
58. @holdenkarau
% of data change
● Not just invalid records; if a field’s value changes everywhere it could still be “valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see the number of rows different on each side (sketch below)
● Expensive operation, but OK if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as-applicable to roll back to, right?
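A hedged sketch of that join-and-diff check in PySpark (the today / yesterday DataFrames, the key column id, and the value column v are all hypothetical):

from pyspark.sql.functions import col

joined = today.alias("t").join(yesterday.alias("y"), "id")
changed = joined.where(col("t.v") != col("y.v")).count()
pct_changed = changed / float(today.count())  # alert if this jumps unexpectedly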
59. @holdenkarau
TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
import tensorflow_data_validation as tfdv

# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
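Where does that schema come from? Typically inferred once from the training data and then curated by hand; a minimal sketch using the same tfdv module (TRAIN_DATA is a hypothetical CSV path):

train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)
schema = tfdv.infer_schema(train_stats)  # starting-point schema you then tweak by hand
tfdv.display_schema(schema)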
60. @holdenkarau
Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Even if you don’t expect the results to change: side-by-side run + diff
● Have an ML model?
○ Welcome to new params - or old params with different default values.
○ We’ll talk more about that later
● Excellent PyData London talk about how this can impact ML models
○ Done with sklearn; shows vast differences in CVE results from only changing the version number
Francesco
61. @holdenkarau
Optional Demos: (or early Q&A)
● Go on Beam on Flink wordcount
● Spark on Kubeflow?
● Tensorflow Transform on Beam on Flink
● TensorflowOnSpark
● Tensorflow Data Validation on Beam On Dataflow
62. @holdenkarau
References
● TFMA + TFT example guide - https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi
● Apache Beam github repo (w/early alpha portable Flink support) - https://beam.apache.org/
● TFMA example fork for use w/Beam on Flink -
● TensorFlowOnSpark - https://github.com/yahoo/TensorFlowOnSpark
● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-learning
● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow
● TF.Transform - https://github.com/tensorflow/transform
● Beam portability design - https://beam.apache.org/contribute/portability/
● Beam on Flink + portability - https://issues.apache.org/jira/browse/BEAM-2889
R. Crap Mariner
63. @holdenkarau
And some upcoming talks:
● April
○ Spark Summit
○ Strata London
● May
○ KiwiCoda Mania
○ KubeCon Barcelona
● June
○ Scala Days EU
○ Berlin Buzzwords
● July
○ OSCON Portland
○ Skills Matter meetup in London
● August
○ ScalaWorld
64. @holdenkarau
k thnx bye :)
Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)
Give feedback on this presentation: http://bit.ly/holdenTalkFeedback
I have some free books on Spark if anyone wants :)
Q&A session this afternoon