5. @holdenkarau
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Maybe somewhat familiar with Tensorflow?
● Maybe somewhat familiar with Beam or Spark or Flink?
Lori Erickson
6. @holdenkarau
What will be covered?
● Why we need big data for deep learning
● The state of Java/Python integration
● And why this matters for Tensorflow
● Tools to simplify this (TFT, TFMA, TFDV, etc.)
● Pipelining & validation
Then choose your own demo or Q&A:
● TensorFlowOnSpark
● Tensorflow Transform on Apache Beam on {Apache Flink, Dataflow}
● Kubeflow w/Spark
7. Part of what led to the success of Spark
● Integrated different tools which traditionally required different systems
○ Mahout, Hive, etc.
● e.g. can use the same system to do ML and SQL
*Often written in Python!
[Diagram: the Apache Spark stack - SQL, DataFrames & Datasets, Structured Streaming, Spark ML, MLlib, Streaming, Bagel & GraphX, GraphFrames; APIs in Scala, Java, Python, & R]
Paul Hudson
8. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
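For a sense of what those two abstractions look like from Python, a minimal sketch (it assumes an existing SparkSession named spark; the sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])          # RDD: low-level distributed collection
df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "animal"])  # DataFrame/Dataset: structured & optimized
print(rdd.sum(), df.count())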
9. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
10. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
11. Why does this all matter?
https://twitter.com/wesmckinn/status/1060623695553683456
cuatrok77
12. What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together from Python?
● Pickling, strings, JSON, XML, oh my!
● ...over Unix pipes, sockets, files, and mmapped files (sometimes in the same program)
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
13. @holdenkarau
The "state" of TF + Big Data
● TensorFlowOnSpark w/basic Apache Arrow
○ Still needs more work
○ New scheduler, with improvements in Spark 3
● Basic TF Transform on Apache Flink via Apache Beam
● New* Beam architecture allowing for better portability & handling dependencies (like Tensorflow)
● feed_dict + scheduler luck
Vladimir Pustovit
14. @holdenkarau
So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● You need to clean your large datasets
● You also (probably)* need some feature preparation
○ Even if you’re looking at mnist.csv you probably have _some_ feature prep
● You need to transform your datasets into the formats your DL framework wants
● Even if you’re just trying to raise some VC money, it's going to go a lot better if you add some keywords about a large proprietary dataset
15. @holdenkarau
TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform TF.Serving (with more coming)
● Alternative 1: Data prep in an "exportable" format and serve with Seldon
○ Yay extra RPCs?
● Alternative 2: piles of custom code re-created at serving time.
○ Yay job security?
Jennifer C.
16. @holdenkarau
How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool
○ I like Apache Spark, some folks like Apache Beam or Flink.
○ So long as it’s not feature_prep.sh
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
17. @holdenkarau
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Written in Python
● Runs on top of Apache Beam
○ Works really well on Dataflow
○ On master this can run on Flink, but has bugs currently.
○ Please don’t use this in production today unless you’re on GCP/Dataflow
○ Python 2 only for now
Kathryn Yengel
18. @holdenkarau
Defining a Transform processing function
def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}
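A hedged sketch of how a preprocessing_fn like this actually gets run on Beam (module paths follow the TF.Transform getting-started guide of this era and vary between releases; raw_data and raw_data_metadata are hypothetical placeholders for your input records and their schema):

import tempfile
import tensorflow_transform.beam as tft_beam

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # Analyze does the full passes over the data (means, vocabularies, quantiles, ...)
    # and Transform applies them; transform_fn is what you later attach at serving time.
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset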
21. @holdenkarau
Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
● Bucketization: tft.apply_buckets, tft.quantiles
● Feature Crosses: tft.string_to_int, tf.string_join, ...
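As a hedged example of the bucketization case (the feature name 'trip_miles' is made up, and tft.quantiles / tft.apply_buckets signatures vary a bit across TFT releases):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    miles = inputs['trip_miles']
    # Full-pass quantile boundaries over the whole dataset, then map each value to its bucket
    boundaries = tft.quantiles(miles, num_buckets=10, epsilon=0.01)
    return {'trip_miles_bucket': tft.apply_buckets(miles, boundaries)}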
22. @holdenkarau
BEAM Beyond the JVM: Current release
● Works pretty well on Dataflow
● non-JVM BEAM on Apache Flink is relatively early stages
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
Emma
23. @holdenkarau
BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary languages to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello container hell)
○ Also, for now, rolling your own containers leaves something to be desired (e.g. edit the Dockerfile by hand)
● Hacked up Python SDK works with the new interface
● Go SDK talks to the new interface, still missing some features
Nick
26. @holdenkarau
Sample of the chicago taxi data:
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])

for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE,
        num_oov_buckets=taxi.OOV_SIZE)

for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
                                       taxi.FEATURE_BUCKET_COUNT)
28. @holdenkarau
This seems complicated, options?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
30. @holdenkarau
PySpark
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less of a Pythonrific API
● Has some serious performance hurdles from the design
33. @holdenkarau
So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● Features aren’t automatically exposed, but exposing them is normally simple
35. @holdenkarau
The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
37. @holdenkarau
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
38. @holdenkarau
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
39. @holdenkarau
Rewriting your code because why not
spark.catalog.registerFunction(
    "add", lambda x, y: x + y, IntegerType())

=>

add = pandas_udf(lambda x, y: x + y, IntegerType())
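Roughly how the vectorized version gets used - a sketch assuming Spark 2.3+ with PyArrow installed; df and its columns x & y are hypothetical:

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

add = pandas_udf(lambda x, y: x + y, IntegerType())
# x and y arrive as whole pandas Series batches (shipped via Arrow) rather than pickled rows
df.withColumn("sum", add(df.x, df.y)).show()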
Jennifer C.
40. @holdenkarau
And we can do this in TFOnSpark*:
unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info,
                                            self.cluster_meta, qname))
Will Transform Into something magical (aka fast but unreliable) on the next slide!
Delaina Haslam
41. @holdenkarau
Which becomes
import pandas

train_func = TFSparkNode.train(self.cluster_info,
                               self.cluster_meta, qname)

@pandas_udf("int")
def do_train(inputSeries1, inputSeries2):
    # Sad hack for now
    modified_series = map(lambda x: (x[0], x[1]),
                          zip(inputSeries1, inputSeries2))
    train_func(modified_series)
    return pandas.Series([0] * len(inputSeries1))
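And a hedged sketch of kicking it off (featuresDF and its two columns are hypothetical; the count() is only there to force the training UDF to run over every batch of data):

featuresDF.select(do_train(featuresDF["input1"], featuresDF["input2"])).count()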
ljmacphee
42. @holdenkarau
And this now looks like:
Logos trademarks of their respective projects
Juha Kettunen
*ish
43. @holdenkarau
So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● At some point you want to get the data from your big data tool into Tensorflow
● Worst case: you can write out a bunch of files and read them back in
● Possibly better case: you use the things we talked about
44. Dask: a new beginning?
● Pure* python implementation
● Provides a real enough DataFrame interface for distributed data (quick sketch below)
○ Much more like a pandas DataFrame than Spark’s DataFrames
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends make this better, but it’s still a bit rough
● There is a proof-of-concept to bootstrap a dask cluster on Spark
● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html
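A quick sketch of that pandas-ish interface (the file glob and column names are made up):

import dask.dataframe as dd

df = dd.read_csv("taxi-2019-*.csv")              # lazily partitioned across files
by_type = df.groupby("payment_type").fare.mean()
print(by_type.compute())                         # compute() triggers the distributed work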
Lisa Zins
45. @holdenkarau
Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Check out TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but the steps are similar
○ But you’re not using this in production today anyways?
○ Right?
● Automate your pipeline so you don't have to run it every week by hand
● Validate that your models aren't getting worse
Nick Perla
49. @holdenkarau
What are those pipelines?
“Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.” - kubeflow.org
Directed Acyclic Graph (DAG) of “pipeline components” (read “docker containers”), each performing a function.
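A hedged sketch of what such a DAG looks like with the kfp v1-style SDK; the step names and container images are made up, and the ContainerOp-style API has changed across kfp versions:

from kfp import dsl, compiler

@dsl.pipeline(name="feature-prep-and-train",
              description="TFT feature prep then TF training")
def taxi_pipeline():
    prep = dsl.ContainerOp(name="feature-prep",
                           image="gcr.io/my-project/tft-prep:latest")
    train = dsl.ContainerOp(name="train",
                            image="gcr.io/my-project/tf-train:latest")
    train.after(prep)  # the DAG edge: training waits for feature prep

compiler.Compiler().compile(taxi_pipeline, "taxi_pipeline.tar.gz")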
54. @holdenkarau
So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you didn’t, you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can’t do that every time; our pipelines are no longer write-once run-once, they are often write-once, run-forever, and debug-forever.
55. @holdenkarau
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; you can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
56. @holdenkarau
So what does that look like?
val parsed = data.flatMap { x =>
  try {
    val result = Some(parse(x))
    happyCounter.add(1)
    result
  } catch {
    case _: Exception =>
      sadCounter.add(1)
      None // What is it? It's JSON
  }
}
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
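A rough PySpark equivalent of the same pattern (parse() and the input RDD data are hypothetical, and a SparkContext sc is assumed; the counters are plain Spark accumulators):

happy_counter = sc.accumulator(0)
sad_counter = sc.accumulator(0)

def parse_or_count(x):
    try:
        result = parse(x)
        happy_counter.add(1)
        return [result]
    except Exception:
        sad_counter.add(1)
        return []  # flatMap drops the record; the counter remembers it

parsed = data.flatMap(parse_or_count)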
Pager photo by Vitachao CC-SA 3
Phoebe Baker
57. @holdenkarau
General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but an interesting proof of concept
○ I have an updated variant of it that is going through our OSS release process internally
● Sometimes your rules will misfire and you’ll need to manually approve a job
● Do you have property tests? Those could be validation rules
● Historical data
○ what did your counters look like yesterday? (sketch below)
● Domain specific solutions
○ The best, but also the most work
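One illustrative flavour of a "historical data" rule (not the spark-validator API, just a sketch): compare today's counter values against yesterday's and refuse to publish if they moved too far.

def within_historical_range(today, yesterday, max_drop=0.5, max_growth=2.0):
    # e.g. today = {"records_written": 9500}, yesterday = {"records_written": 10000}
    return all(yesterday[k] * max_drop <= v <= yesterday[k] * max_growth
               for k, v in today.items())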
Photo by:
Paul Schadler
58. @holdenkarau
% of data change
● Not just invalid records; if a field’s value changes everywhere it could still be “valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see the number of rows different on each side (sketch below)
● Expensive operation, but OK if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as-applicable to roll back to, right?
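A hedged sketch of that join-and-diff check in PySpark (the today / yesterday DataFrames, the key column id, and the value column v are all hypothetical):

from pyspark.sql.functions import col

joined = today.alias("t").join(yesterday.alias("y"), "id")
changed = joined.where(col("t.v") != col("y.v")).count()
pct_changed = changed / float(today.count())  # alert if this jumps unexpectedly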
59. @holdenkarau
TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
import tensorflow_data_validation as tfdv

# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
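Where does that schema come from? Typically inferred once from the training data and then curated by hand; a minimal sketch using the same tfdv module (TRAIN_DATA is a hypothetical CSV path):

train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)
schema = tfdv.infer_schema(train_stats)  # starting-point schema you then tweak by hand
tfdv.display_schema(schema)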
60. @holdenkarau
Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Even if you don’t expect the results to change: side-by-side run + diff
● Have an ML model?
○ Welcome to new params - or old params with different default values.
○ We’ll talk more about that later
● Excellent PyData London talk about how this can impact ML models
○ Done with sklearn; shows vast differences in CVE results from only changing the version number
Francesco
61. @holdenkarau
Optional Demos: (or early Q&A)
● Go on Beam on Flink wordcount
● Spark on Kubeflow?
● Tensorflow Transform on Beam on Flink
● TensorflowOnSpark
● Tensorflow Data Validation on Beam On Dataflow
62. @holdenkarau
References
● TFMA + TFT example guide - https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi
● Apache Beam github repo (w/early alpha portable Flink support) - https://beam.apache.org/
● TFMA example fork for use w/Beam on Flink -
● TensorFlowOnSpark - https://github.com/yahoo/TensorFlowOnSpark
● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-learning
● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow
● TF.Transform - https://github.com/tensorflow/transform
● Beam portability design - https://beam.apache.org/contribute/portability/
● Beam on Flink + portability - https://issues.apache.org/jira/browse/BEAM-2889
R. Crap Mariner
63. @holdenkarau
And some upcoming talks:
● April
○ Spark Summit
○ Strata London
● May
○ KiwiCoda Mania
○ KubeCon Barcelona
● June
○ Scala Days EU
○ Berlin Buzzwords
● July
○ OSCON Portland
○ Skills Matter meetup in London
● August
○ ScalaWorld
64. @holdenkarau
k thnx bye :)
Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)
Give feedback on this presentation: http://bit.ly/holdenTalkFeedback
I have some free books on Spark if anyone wants :)
Q&A session this afternoon