Powering TensorFlow with
big data
With Apache Beam, Flink & Spark bonus KF
Slides will be at:
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share
● Code review livestreams: /
● Spark Talk Videos
● Talk feedback (if you are so inclined):
● Helping organize Data Track @ ITNEXT AMS - CFP Open!
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Maybe somewhat familiar with Tensorflow?
● Maybe somewhat familiar with Beam or Spark or Flink?
Lori Erickson
What will be covered?
● Why we need big data for deep learning
● The state of Java/Python integration
● And why this matters for Tensorflow
● Tools to simplify this (TFT, TFMA, TFDV, etc.)
● Pipelining & validation
Then choose your own demo or Q&A:
● TensorFlowOnSpark
● Tensorflow Transform on Apache Beam on {Apache Flink, Dataflow}
● Kubeflow w/Spark
Part of what led to the success of Spark
● Integrated different tools which traditionally required different systems
○ Mahout, hive, etc.
● e.g. can use same system to do ML and SQL
*Often written in Python!
[Diagram of the Apache Spark stack: SQL, DataFrames & Datasets; Spark ML; bagel & GraphX; language APIs including Python]
Paul Hudson
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most
● Much faster than Hadoop
● Good when too big for a single
● Built on top of two abstractions for
distributed data: RDDs & Datasets
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
Why this all matters?
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together from Python?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes, Sockets, files, and mmapped files (sometimes in the same
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
The "state" of TF + Big Data
● TensorFlowOnSpark w/basic Apache Arrow
○ Still needs more work
○ New scheduler, with improvements in Spark 3
● Basic TF Transform on Apache Flink via Apache Beam
● New* Beam architecture allowing for better portability &
handling dependencies (like Tensorflow)
● feed_dict + scheduler luck
Vladimir Pustovit
So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● You need to clean your large datasets
● You also (probably)* need some feature preparation
○ even if you’re looking at mnist.csv you probably have _some_ feature prep
● You need to transform your datasets into the formats your DL wants
● Even if you’re just trying to raise some VC money it's going to go a lot better if
you add some keywords about a large proprietary dataset
TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform TF.Serving (with more coming)
● Alternative 1: Data prep in an "exportable" format and serve with Seldon
○ Yay extra RPCs?
● Alternative 2: piles of custom code re-created at serving time.
○ Yay job security?
Jennifer C.
How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool
○ I like Apache Spark, some folks like Apache Beam or Flink.
○ So long as it not
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● Written in Python
● Runs on top of Apache Beam
○ Works really well on Dataflow
○ On master this can run on Flink, but has bugs currently.
○ Please don’t use this in production today unless you’re on GCP/Dataflow
○ Python 2 only for now
Kathryn Yengel
Defining a Transform processing function
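import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Full-pass statistics (mean, min/max, vocabulary) are computed for you.
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    # string_to_int is the name from the time of this talk; newer
    # tf.Transform releases call this compute_and_apply_vocabulary.
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}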
● Reduce (full pass), e.g. mean, stddev
○ Implemented as a distributed data pipeline
● Instance-to-instance (don’t change batch dimension)
○ Pure TensorFlow
Some common use-cases...
● Scale to ...
● Bag of Words / N-Grams
● Bucketization
● Feature Crosses
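To make the two kinds of work concrete (a full-pass reduce, then instance-to-instance transforms), here is a plain-Python sketch. It is not tf.Transform and uses none of its API: `analyze`, `transform`, and the sample rows are made-up names for illustration. The only point is that the statistics get computed once over the whole dataset, and the same per-row function is then reused at training and at serving time.

```python
from statistics import mean

def analyze(rows):
    """Full pass over the data: compute the statistics the transform needs."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    # Vocabulary: map each distinct string to an integer id (sorted for determinism).
    vocab = {s: i for i, s in enumerate(sorted({r["s"] for r in rows}))}
    return {"x_mean": mean(xs), "y_min": min(ys), "y_max": max(ys), "vocab": vocab}

def transform(row, stats):
    """Instance-to-instance: uses only this row plus the precomputed stats."""
    y_range = stats["y_max"] - stats["y_min"]
    return {
        "x_centered": row["x"] - stats["x_mean"],
        "y_normalized": (row["y"] - stats["y_min"]) / y_range if y_range else 0.0,
        # Strings never seen during analysis map to -1 in this sketch.
        "s_int": stats["vocab"].get(row["s"], -1),
    }

training_rows = [
    {"x": 1.0, "y": 10.0, "s": "cat"},
    {"x": 2.0, "y": 20.0, "s": "dog"},
    {"x": 3.0, "y": 30.0, "s": "cat"},
]
stats = analyze(training_rows)  # run once, at training time
prepared = [transform(r, stats) for r in training_rows]
# At serving time the same function and the same stats are reused.
served = transform({"x": 2.5, "y": 15.0, "s": "emu"}, stats)
```

Because `served` goes through exactly the code path used for `prepared`, there is no second copy of the feature prep logic to drift out of sync.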
BEAM Beyond the JVM: Current release
● Works pretty well on Dataflow
● non-JVM BEAM on Apache Flink is relatively early stages
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKs)
○ See
BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary languages to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello
container hell)
○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand)
● Hacked up Python SDK works with the new interface
● Go SDK talks to the new interface, still missing some features
BEAM Beyond the JVM: Master w/ experiments
So what does that look like?
Worker 1
Worker K
Sample of the Chicago taxi data:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE,
for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
BEAM Beyond the JVM: The “future”
E.g. not now
This seems complicated, options?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
A quick detour into PySpark’s internals
+ + JSON
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc.
interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
So what does that look like?
Worker 1
Worker K
And in flink….
Worker 1
Worker K
So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if
deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● features aren’t automatically exposed, but exposing them is normally simple
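A rough way to see why the serialization cost bullet hurts: the sketch below compares pickling values one at a time with shipping the whole column as a single contiguous buffer. This is only an illustration with the standard library. It is not how PySpark or Arrow are implemented, and the variable names are made up for this example.

```python
import pickle
from array import array

values = [float(i) for i in range(1000)]

# Row-at-a-time: every value is pickled (and later unpickled) on its own,
# paying the per-object overhead each time.
per_row_payloads = [pickle.dumps(v) for v in values]
per_row_bytes = sum(len(p) for p in per_row_payloads)
per_row_roundtrip = [pickle.loads(p) for p in per_row_payloads]

# Columnar batch: one contiguous buffer of 8-byte doubles for the whole column.
batch_payload = array("d", values).tobytes()
batch_bytes = len(batch_payload)
batch_roundtrip = array("d")
batch_roundtrip.frombytes(batch_payload)

print(f"per-row: {per_row_bytes} bytes, batch: {batch_bytes} bytes")
```

Both routes get the same numbers to the other side. The batch route moves fewer bytes and makes one call instead of a thousand, and that is before counting a second serialization hop between the JVM and Python.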
TensorFlowOnSpark, everyone loves mnist!
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                        args.cluster_size, num_ps, args.tensorboard,
                        TFCluster.InputMode.SPARK)
if args.mode == "train":
    cluster.train(dataRDD, args.epochs)
The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: Spark 2.3 and beyond & GPUs & R & Python & ….
What does the future look like?*
Trust but verify.
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
Rewriting your code because why not
# Before: a row-at-a-time Python UDF
spark.udf.register("add", lambda x, y: x + y, IntegerType())
# After: a vectorized pandas UDF
add = pandas_udf(lambda x, y: x + y, IntegerType())
Jennifer C.
And we can do this in TFOnSpark*:
self.cluster_meta, qname))
Will Transform Into something magical (aka fast but unreliable) on
the next slide!
Delaina Haslam
Which becomes
train_func = TFSparkNode.train(self.cluster_info,
                               self.cluster_meta, qname)
def do_train(inputSeries1, inputSeries2):
    # Sad hack for now
    modified_series = map(lambda x: (x[0], x[1]),
                          zip(inputSeries1, inputSeries2))
    return pandas.Series([0] * len(inputSeries1))
And this now looks like:
Logos trademarks of their respective projects
Juha Kettunen
So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● At some point you want to get the data from your big data tool into Tensorflow
● Worst case: you can write out a bunch of files and read them back in
● Possibly better case: you use the things we talked about
Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
○ Much more like a pandas DataFrame than Spark’s DataFrames
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends make this better, but it’s still a bit rough
● There is a proof-of-concept to bootstrap a dask cluster on Spark
● See &
Lisa Zins
Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Check out TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but steps are
○ But you’re not using this in production today anyways?
○ Right?
● Automate your pipeline so you don't have to run it every week by hand
● Validate that your models aren't getting worse
Nick Perla
(Optionally): Putting it together with Kubeflow
VIK hotels group
"The Machine Learning Toolkit
for Kubernetes"
- Kubeflow Website
Introducing* Kubeflow
VIK hotels group
Components Buffet
Paul Harrison
What are those pipelines?
“Kubeflow Pipelines is a platform for building and deploying portable, scalable
machine learning (ML) workflows based on Docker containers.” -
Directed Acyclic Graph (DAG) of “pipeline components” (read “docker containers”)
each performing a function.
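The "DAG of components" idea can be sketched in a few lines of plain Python. This is not the Kubeflow Pipelines SDK: the component names are hypothetical, and `run_component` is a stand-in for launching a container. It only shows that each step runs after everything it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a component, each value is the set of
# components it depends on.
pipeline = {
    "ingest": set(),
    "validate_data": {"ingest"},
    "prepare_features": {"ingest"},
    "train_model": {"validate_data", "prepare_features"},
    "deploy_model": {"train_model"},
}

def run_pipeline(dag, run_component):
    """Run every component once, after all of its dependencies have run."""
    order = []
    # static_order() yields a node only after all of its predecessors.
    for name in TopologicalSorter(dag).static_order():
        run_component(name)
        order.append(name)
    return order

execution_order = run_pipeline(pipeline, lambda name: print(f"running {name}"))
```

`validate_data` and `prepare_features` have no edge between them, so a real scheduler is free to run them in parallel. This sketch just runs them one after the other.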
Building that pipeline?
Running that pipeline
Ok cool, but… we need to validate
Results from: Testing with Spark survey
So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you don’t you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can’t do that every time, our pipelines are no longer write-once
run-once they are often write-once, run forever, and debug-forever.
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; you can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
So what does that look like?
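// parse, happyCounter and sadCounter are assumed to be defined elsewhere
val parsed = data.flatMap(x => try {
  val record = parse(x)
  happyCounter.add(1)
  Some(record)
} catch {
  case _: Exception =>
    sadCounter.add(1)
    None // What's it's JSON
})
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here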
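A Python sketch of the counter pattern, for folks who'd rather not read Scala. A local `collections.Counter` stands in for a Spark accumulator or Beam metric here, and the counter names are made up for this example. On a real cluster, retried tasks can bump counters more than once, so treat the numbers as signals rather than exact truth.

```python
import json
from collections import Counter

def parse_records(lines, counters):
    """Parse JSON lines, counting good and bad records instead of failing."""
    parsed = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
            counters["valid_records"] += 1
        except json.JSONDecodeError:
            counters["invalid_records"] += 1
    return parsed

counters = Counter()
lines = ['{"user": 1}', 'not json', '{"user": 2}', '']
records = parse_records(lines, counters)
print(dict(counters))
```

The job keeps going when it hits junk, and the validation step gets to decide later whether that much junk is acceptable.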
Pager photo by Vitachao CC-SA 3
Phoebe Baker
General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but interesting proof of concept
○ I have an updated variant of it that is going through our OSS release process internally
● Sometimes your rules will misfire and you’ll need to manually approve a job
● Do you have property tests? Could be Validation rules
● Historical data
○ what did your counters look like yesterday
● Domain specific solutions
○ The best, but also the most work
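As a sketch of what a couple of these rules could look like in code: one rule on the invalid-record ratio and one comparing today's volume with historical runs. The function name, the counter names, and both thresholds are assumptions picked for illustration, not values from spark-validator or any other tool.

```python
def validate_counters(today, history, max_invalid_ratio=0.05, max_count_drift=0.5):
    """Return a list of rule violations; an empty list means the run passes."""
    problems = []
    total = today["valid_records"] + today["invalid_records"]
    if total == 0:
        return ["no records processed"]

    invalid_ratio = today["invalid_records"] / total
    if invalid_ratio > max_invalid_ratio:
        problems.append(
            f"invalid ratio {invalid_ratio:.1%} is above {max_invalid_ratio:.1%}")

    # Historical rule: compare today's volume with the average of earlier runs.
    past_totals = [h["valid_records"] + h["invalid_records"] for h in history]
    if past_totals and sum(past_totals) > 0:
        baseline = sum(past_totals) / len(past_totals)
        drift = abs(total - baseline) / baseline
        if drift > max_count_drift:
            problems.append(
                f"record count {total} is {drift:.0%} away from baseline {baseline:.0f}")
    return problems

history = [
    {"valid_records": 990, "invalid_records": 10},
    {"valid_records": 1005, "invalid_records": 5},
]
ok_run = {"valid_records": 980, "invalid_records": 20}
bad_run = {"valid_records": 300, "invalid_records": 100}

ok_problems = validate_counters(ok_run, history)
bad_problems = validate_counters(bad_run, history)
```

A non-empty result is where the "manually approve a job" step comes in: block the publish, page a human, and let them decide.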
Photo by:
Paul Schadler
% of data change
● Not just invalid records, if a field’s value changes everywhere it could still be
“valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see number of rows different on each side
● Expensive operation, but if your data changes slowly / at a constant ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as applicable to roll back to right?
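Here is a small in-memory sketch of the "join and see what's different" check. The snapshots, the `id` key, and the `fraction_changed` helper are all hypothetical. On real data this would be a distributed join, not two dicts.

```python
def fraction_changed(previous, current, key="id"):
    """Join two snapshots on `key` and report how much of the data changed."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    shared = prev_by_key.keys() & curr_by_key.keys()
    changed = sum(1 for k in shared if prev_by_key[k] != curr_by_key[k])
    return {
        "added": len(curr_by_key.keys() - prev_by_key.keys()),
        "removed": len(prev_by_key.keys() - curr_by_key.keys()),
        "changed": changed,
        "changed_fraction": changed / len(shared) if shared else 0.0,
    }

yesterday = [
    {"id": 1, "rating": "PG"},
    {"id": 2, "rating": "R"},
    {"id": 3, "rating": "G"},
    {"id": 4, "rating": "PG"},
]
today = [
    {"id": 1, "rating": "pg"},  # still "valid", but the encoding changed
    {"id": 2, "rating": "R"},
    {"id": 3, "rating": "G"},
    {"id": 5, "rating": "PG"},
]
report = fraction_changed(yesterday, today)
```

If `changed_fraction` is normally a few percent and suddenly jumps, that is worth a look even when every individual record passes validation.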
TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
import tensorflow_data_validation as tfdv
# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
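To take a bit of the magic out of it, here is a toy stand-in for the infer-a-schema-then-check-new-data loop. It is not TFDV and is far cruder than what TFDV does: `infer_schema` and `find_anomalies` here are hypothetical helpers written for this sketch, and they only track column types and the set of string values seen.

```python
def infer_schema(rows):
    """Record, per column, the Python type and (for strings) the values seen."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            entry = schema.setdefault(column, {"type": type(value), "domain": set()})
            if isinstance(value, str):
                entry["domain"].add(value)
    return schema

def find_anomalies(rows, schema):
    """Compare new rows to the schema and describe anything unexpected."""
    anomalies = {}
    for row in rows:
        for column in schema.keys() - row.keys():
            anomalies.setdefault(column, set()).add("missing column")
        for column, value in row.items():
            if column not in schema:
                anomalies.setdefault(column, set()).add("new column")
            elif not isinstance(value, schema[column]["type"]):
                anomalies.setdefault(column, set()).add("unexpected type")
            elif isinstance(value, str) and value not in schema[column]["domain"]:
                anomalies.setdefault(column, set()).add(f"unexpected value {value!r}")
    return anomalies

baseline = [
    {"payment_type": "Cash", "fare": 12.5},
    {"payment_type": "Credit Card", "fare": 7.0},
]
new_data = [
    {"payment_type": "Cash", "fare": 9.0},
    {"payment_type": "Bitcoin", "fare": "free"},
]
schema = infer_schema(baseline)
anomalies = find_anomalies(new_data, schema)
```

The shape is the same as the TFDV flow: learn what "normal" looks like from data you trust, then flag whatever in the new batch doesn't fit.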
Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Don’t expect the results to change - side-by-side run + diff
● Have an ML model?
○ Welcome to new params - or old params with different default values.
○ We’ll talk more about that later
● Excellent PyData London talk about how this can impact
ML models
○ Done with sklearn shows vast differences in CVE results only changing
the version number
Optional Demos: (or early Q&A)
● Go on beam on Flink Wordcount
● Spark on Kubeflow?
● Tensorflow Transform on Beam on Flink
● TensorflowOnSpark
● Tensorflow Data Validation on Beam On Dataflow
● TFMA + TFT example guide -
● Apache Beam github repo (w/early alpha portable Flink support)-
● TFMA Example fork for use w/Beam on Flink -
● TensorFlowOnSpark -
● Spark Deep Learning Pipelines -
● flink-tensorflow -
● TF.Transform -
● Beam portability design:
● Beam on Flink + portability
R. Crap Mariner
And some upcoming talks:
● April
○ Spark Summit
○ Strata London
● May
○ KiwiCoda Mania
○ KubeCon Barcelona
● June
○ Scala Days EU
○ Berlin Buzzwords
● July
○ OSCON Portland
○ Skills Matter meetup in London
● August
○ ScalaWorld
k thnx bye :)
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
Pssst: Have feedback on the presentation? Give me a
shout ( if you feel comfortable doing
so :)
Give feedback on this presentation
I have some free books on
Spark if anyone wants :)
Q&A session this afternoon

Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup

Powering tensor flow with big data using apache beam, flink, and spark cern 2019 (3)

  • 1. @holdenkarau Powering TensorFlow with big data With Apache Beam, Flink & Spark bonus KF @holdenkarau
  • 2. @holdenkarau Slides will be at: CatLoversShow
  • 3. @holdenkarau Holden: ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share ● Code review livestreams: / ● Spark Talk Videos ● Talk feedback (if you are so inclined): ● Helping organize Data Track @ ITNEXT AMS - CFP Open!
  • 5. @holdenkarau Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Maybe somewhat familiar with Tensorflow? ● Maybe somewhat familiar with Beam or Spark or Flink? Lori Erickson
  • 6. @holdenkarau What will be covered? ● Why we need big data for deep learning ● The state of Java/Python integration ● And why this matters for Tensorflow ● Tools to simplify this (TFT, TFMA, TFDV, etc.) ● Pipelining & validation Then choose your own demo or Q&A: ● TensorFlowOnSpark ● Tensorflow Transform on Apache Beam on {Apache Flink, Dataflow} ● Kubeflow w/Spark
  • 7. Part of what led to the success of Spark ● Integrated different tools which traditionally required different systems ○ Mahout, Hive, etc. ● e.g. can use same system to do ML and SQL *Often written in Python! [Diagram labels: Apache Spark; SQL, DataFrames & Datasets; Structured Streaming; Scala, Java, Python, & R; Spark ML; Bagel & GraphX; MLlib; Scala, Java, Python; Streaming; GraphFrames] Paul Hudson
  • 8. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 9. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 10. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 11. Why this all matters? cuatrok77
  • 12. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together from Python? ● Pickling, Strings, JSON, XML, oh my! Over ● Unix pipes, Sockets, files, and mmapped files (sometimes in the same program) What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 13. @holdenkarau The "state" of TF + Big Data ● TensorFlowOnSpark w/basic Apache Arrow ○ Still needs more work ○ New scheduler, with improvements in Spark 3 ● Basic TF Transform on Apache Flink via Apache Beam ● New* Beam architecture allowing for better portability & handling dependencies (like Tensorflow) ● feed_dict + scheduler luck Vladimir Pustovit
  • 14. @holdenkarau So why do I need to power DL w/Big Data? ● Deep learning is most effective with large sample sets for training ● You need to clean your large datasets ● You also (probably)* need some feature preparation ○ even if you’re looking at mnist.csv you probably have _some_ feature prep ● You need to transform your datasets into the formats your DL wants ● Even if you’re just trying to raise some VC money it's going to go a lot better if you add some keywords about a large proprietary dataset
  • 15. @holdenkarau TensorFlow isn’t enough on its own ● Enter TFX & friends like Kubeflow ○ Current related TFX OSS components: TF.Transform, TF.Serving (with more coming) ● Alternative 1: Data prep in an "exportable" format and serve with Seldon ○ Yay extra RPCs? ● Alternative 2: piles of custom code re-created at serving time. ○ Yay job security? Jennifer C.
  • 16. @holdenkarau How do I do feature prep? (old skool) ● Write custom preparation jobs in your favourite big data tool ○ I like Apache Spark, some folks like Apache Beam or Flink. ○ So long as it not ● Run it, train on the prepared data ● Rewrite your feature prep code to run at serving time ○ Error prone and sad
  • 17. @holdenkarau Enter: TF.Transform ● For pre-processing of your data ○ e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Written in Python ● Runs on top of Apache Beam ○ Works really well on Dataflow ○ On master this can run on Flink, but has bugs currently. ○ Please don’t use this in production today unless you’re on GCP/Dataflow ○ Python 2 only for now Kathryn Yengel
  • 18. @holdenkarau Defining a Transform processing function def preprocessing_fn(inputs): x = inputs['x'] y = inputs['y'] s = inputs['s'] x_centered = x - tft.mean(x) y_normalized = tft.scale_to_0_1(y) s_int = tft.string_to_int(s) return { 'x_centered': x_centered, 'y_normalized': y_normalized, 's_int': s_int}
  • 19. @holdenkarau mean stddev normalize multiply quantiles bucketize Analyzers Reduce (full pass) Implemented as a distributed data pipeline Transforms Instance-to-instance (don’t change batch dimension) Pure TensorFlow
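
The analyzer/transform split on the slide above can be sketched in plain Python. This is only an illustration of the idea, not the TF.Transform API: `analyze` and `transform` are hypothetical names, and the real library runs the analyzer pass as a distributed Beam pipeline and emits the transform as a TensorFlow graph.

```python
from statistics import mean

def analyze(rows):
    """Analyzer phase (hypothetical helper): one full pass to collect global statistics."""
    xs = [row["x"] for row in rows]
    ys = [row["y"] for row in rows]
    return {"x_mean": mean(xs), "y_min": min(ys), "y_max": max(ys)}

def transform(row, stats):
    """Transform phase (hypothetical helper): instance-to-instance, needs only the row and the stats."""
    span = stats["y_max"] - stats["y_min"]
    return {
        "x_centered": row["x"] - stats["x_mean"],
        "y_normalized": (row["y"] - stats["y_min"]) / span if span else 0.0,
    }

training_rows = [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 20.0}, {"x": 3.0, "y": 30.0}]

# The full pass happens once, at training time.
stats = analyze(training_rows)
prepared = [transform(row, stats) for row in training_rows]

# At serving time the same transform is reused with the saved statistics,
# so the feature prep does not have to be rewritten by hand.
served = transform({"x": 5.0, "y": 15.0}, stats)
```

Reusing the same `transform` function with frozen statistics is what avoids the "rewrite your feature prep at serving time" step from the old-skool approach.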
  • 21. @holdenkarau Scale to ... Bag of Words / N-Grams Bucketization Feature Crosses tft.ngrams tft.string_to_int tf.string_split tft.scale_to_z_score tft.apply_buckets tft.quantiles tft.string_to_int tf.string_join ... Some common use-cases...
  • 22. @holdenkarau BEAM Beyond the JVM: Current release ● Works pretty well on Dataflow ● non-JVM BEAM on Apache Flink is in relatively early stages ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKs) ○ See Emma
  • 23. @holdenkarau BEAM Beyond the JVM: Master + Experiments ● Common interface for setting up jobs ● Portability framework allows SDK harnesses in arbitrary languages to be kicked off ● Runners ship in their own docker containers (goodbye dependency hell, hello container hell) ○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand) ● Hacked up Python SDK works with the new interface ● Go SDK talks to the new interface, still missing some features Nick
  • 24. @holdenkarau BEAM Beyond the JVM: Master w/ experiments *ish *ish *ish Nick portability
  • 25. @holdenkarau So what does that look like? Driver Worker 1 Docker grpc Worker K Docker grpc
  • 26. @holdenkarau Sample of the chicago taxi data: for key in taxi.DENSE_FLOAT_FEATURE_KEYS: # Preserve this feature as a dense float, setting nan's to the mean. outputs[key] = transform.scale_to_z_score(inputs[key]) for key in taxi.VOCAB_FEATURE_KEYS: # Build a vocabulary for this feature. outputs[key] = transform.string_to_int( inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE) for key in taxi.BUCKET_FEATURE_KEYS: outputs[key] = transform.bucketize(inputs[key], taxi.FEATURE_BUCKET_COUNT)
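
To make the vocabulary step in the taxi sample more concrete, here is a rough plain-Python sketch of mapping strings to integers with a `top_k` vocabulary plus out-of-vocabulary (OOV) buckets. The helper names and the CRC32-based bucket choice are assumptions for illustration; TF.Transform's own hashing and ordering may well differ.

```python
import zlib
from collections import Counter

def build_vocab(values, top_k):
    """Full pass: keep the top_k most frequent values (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {value: index for index, (value, _) in enumerate(ordered[:top_k])}

def to_int(value, vocab, num_oov_buckets):
    """Per-record lookup: known values get their index, unknown values a hashed OOV bucket."""
    if value in vocab:
        return vocab[value]
    # Assumption: any stable hash works for the sketch; CRC32 is used for determinism.
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket

companies = ["acme", "acme", "blue", "acme", "blue", "city", "dash"]
vocab = build_vocab(companies, top_k=2)
encoded = [to_int(name, vocab, num_oov_buckets=3) for name in companies]
```

Values outside the vocabulary land in one of the buckets after the vocabulary indices, so the model sees a bounded integer range even for strings it never saw in training.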
  • 27. @holdenkarau BEAM Beyond the JVM: The “future” E.g. not now *ish *ish *ish Nick portability *ish *ish
  • 28. @holdenkarau This seems complicated, options? ● Spoiler: mostly it’s not better ○ Although it tends to be more finished ○ Sometimes it's different ● Different tradeoffs, maybe better for your use case but all tradeoffs Kate Neilan
  • 29. @holdenkarau A quick detour into PySpark’s internals + + JSON TimOve
  • 30. @holdenkarau PySpark ● The Python interface to Spark ● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API ● Has some serious performance hurdles from the design
  • 31. @holdenkarau So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 33. @holdenkarau So how does that impact Py[X] forall X in {Big Data}-{Native Python Big Data} ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Dependency management makes limited sense ● features aren’t automatically exposed, but exposing them is normally simple
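
The "double serialization" point can be pictured with a toy round trip. In this sketch `pickle` stands in for the wire format in both directions and the "JVM side" is just more Python, so it only shows the shape of the cost (every batch is encoded and decoded twice), not real PySpark internals. `run_in_python_worker` is a hypothetical name.

```python
import pickle

def run_in_python_worker(serialized_batch, func):
    """Hypothetical worker loop: decode the batch, apply the user function, re-encode."""
    records = pickle.loads(serialized_batch)      # decode what the "JVM" sent
    results = [func(record) for record in records]
    return pickle.dumps(results)                  # encode the results to send back

batch = list(range(5))

# Trip 1: the data is serialized to reach the Python worker.
wire_in = pickle.dumps(batch)
# Trip 2: the results are serialized again to go back.
wire_out = run_in_python_worker(wire_in, lambda value: value * 2)

result = pickle.loads(wire_out)
bytes_moved = len(wire_in) + len(wire_out)
```

Even for a trivial function, every record crosses the boundary twice, which is the overhead that Arrow-based interchange tries to shrink.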
  • 34. @holdenkarau TensorFlowOnSpark, everyone loves mnist! cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args, args.cluster_size, num_ps, args.tensorboard, TFCluster.InputMode.SPARK) if args.mode == "train": cluster.train(dataRDD, args.epochs) Lida
  • 35. @holdenkarau The “future”*: faster interchange ● By future I mean availability today but running it in production is “adventurous” ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 36. @holdenkarau Andrew Skudder *Arrow: Spark 2.3 and beyond & GPUs & R & Python & …. * *
  • 37. @holdenkarau What does the future look like?* *Source: *Vendor benchmark. Trust but verify.
  • 38. @holdenkarau Arrow (a poorly drawn big data view) Logos trademarks of their respective projects Juha Kettunen *ish
  • 39. @holdenkarau Rewriting your code because why not spark.catalog.registerFunction( "add", lambda x, y: x + y, IntegerType()) => add = pandas_udf(lambda x, y: x + y, IntegerType()) Jennifer C.
  • 40. @holdenkarau And we can do this in TFOnSpark*: unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info, self.cluster_meta, qname)) Will Transform Into something magical (aka fast but unreliable) on the next slide! Delaina Haslam
  • 41. @holdenkarau Which becomes train_func = TFSparkNode.train(self.cluster_info, self.cluster_meta, qname) @pandas_udf("int") def do_train(inputSeries1, inputSeries2): # Sad hack for now modified_series = map(lambda x: (x[0], x[1]), zip(inputSeries1, inputSeries2)) train_func(modified_series) return pandas.Series([0] * len(inputSeries1)) ljmacphee
  • 42. @holdenkarau And this now looks like: Logos trademarks of their respective projects Juha Kettunen *ish
  • 43. @holdenkarau So how TF does this relate to TF? ● Tensorflow is in Python (kind of) ● At some point you want to get the data from your big data tool into Tensorflow ● Worst case: you can write out a bunch of files and read them back in ● Possibly better case: you use the things we talked about
  • 44. Dask: a new beginning? ● Pure* Python implementation ● Provides real enough DataFrame interface for distributed data ○ Much more like a pandas DataFrame than Spark’s DataFrames ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends make this better, but it’s still a bit rough ● There is a proof-of-concept to bootstrap a dask cluster on Spark ● See & Lisa Zins
  • 45. @holdenkarau Ok now what? ● Integrate this into your model serving pipeline of choice ○ Don’t have one or open to change? Check out TFMA which can directly serve it ● There’s a guide (it doesn’t show Flink because not released yet) but steps are similar ○ But you’re not using this in production today anyways? ○ Right? ● Automate your pipeline so you don't have to run it every week by hand ● Validate that your models aren't getting worse Nick Perla
  • 46. @holdenkarau (Optionally): Putting it together with Kubeflow VIK hotels group "The Machine Learning Toolkit for Kubernetes" - Kubeflow Website
  • 49. @holdenkarau What are those pipelines? “Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.” - Directed Acyclic Graph (DAG) of “pipeline components” (read “docker containers”) each performing a function.
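
As a rough picture of "a DAG of pipeline components", the sketch below orders a few steps by their dependencies using the standard library. The component names are made up, and `run_component` only records the call order where a real pipeline would launch a container; this is not the Kubeflow Pipelines SDK.

```python
from graphlib import TopologicalSorter

# Hypothetical component names; each maps to the set of components it depends on.
dependencies = {
    "prepare_features": set(),
    "train_model": {"prepare_features"},
    "validate_model": {"train_model"},
    "deploy_model": {"validate_model"},
}

def run_component(name, log):
    """Stand-in for launching a container; here we only record the order."""
    log.append(name)

execution_log = []
# static_order() yields each component only after all of its dependencies.
for component in TopologicalSorter(dependencies).static_order():
    run_component(component, execution_log)
```

Because the graph is acyclic, every component runs exactly once and only after its inputs exist, which is what lets the pipeline be re-run automatically instead of by hand.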
  • 52. @holdenkarau Ok cool, but… we need to validate Results from: Testing with Spark survey
  • 54. @holdenkarau So how do we validate our jobs? ● The idea is, at some point, you made software which worked. ○ If you don’t you probably want to run it a few times and manually validate it ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can’t do that every time: our pipelines are no longer write-once, run-once; they are often write-once, run-forever, and debug-forever.
  • 55. @holdenkarau Counters* to the rescue**! ● Both BEAM & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ In UI can also register a listener from spark validator project ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters are your friends, but the kind of friends who steal your lunch money ** In a similar way to how regular expressions can solve problems…. Miguel Olaya
  • 56. @holdenkarau So what does that look like? val parsed = data.flatMap(x => try { val record = parse(x); happyCounter.add(1); Some(record) } catch { case _: Exception => sadCounter.add(1); None // What's it's JSON } ) // Special business data logic (aka wordcount) // Much much later* business error logic goes here Pager photo by Vitachao CC-SA 3 Phoebe Baker
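
The same happy/sad counter pattern can be sketched in plain Python without a cluster. Here a `collections.Counter` plays the role of the accumulators and `json.loads` stands in for `parse`; on Spark or Beam you would use their own counter or accumulator types instead.

```python
import json
from collections import Counter

counters = Counter()

def parse_or_skip(line):
    """Return a one-element list on success and an empty list on failure (flatMap-style)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        counters["sad"] += 1
        return []
    counters["happy"] += 1
    return [record]

lines = ['{"user": 1}', "not json", '{"user": 2}']

# Flatten the per-line results, dropping the records that failed to parse.
parsed = [record for line in lines for record in parse_or_skip(line)]
```

The bad record is dropped instead of failing the job, and the counters are what let you notice afterwards that it happened.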
  • 57. @holdenkarau General Rules for making Validation rules ● According to a sad survey most people check execution time & record count ● spark-validator is still in early stages but interesting proof of concept ○ I have an updated variant of it that is going through our OSS release process internally ● Sometimes your rules will misfire and you’ll need to manually approve a job ● Do you have property tests? Could be Validation rules ● Historical data ○ what did your counters look like yesterday ● Domain specific solutions ○ The best, but also the most work Photo by: Paul Schadler
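
A validation rule built on historical counters might look roughly like the following. The function name, the counter keys, and both thresholds are assumptions picked for the example; this is not the spark-validator API.

```python
def validate_counters(today, yesterday, max_invalid_fraction=0.05, max_relative_change=0.5):
    """Return a list of human-readable problems; an empty list means the run looks OK."""
    total = today["valid"] + today["invalid"]
    if total == 0:
        return ["no records processed"]

    problems = []
    invalid_fraction = today["invalid"] / total
    if invalid_fraction > max_invalid_fraction:
        problems.append(f"invalid fraction {invalid_fraction:.2f} is too high")

    # Historical rule: compare the record count against the previous run.
    previous_total = yesterday["valid"] + yesterday["invalid"]
    if previous_total:
        change = abs(total - previous_total) / previous_total
        if change > max_relative_change:
            problems.append(f"record count changed by {change:.0%}")
    return problems

yesterday = {"valid": 1000, "invalid": 10}
good_run = validate_counters({"valid": 980, "invalid": 20}, yesterday)
bad_run = validate_counters({"valid": 100, "invalid": 100}, yesterday)
```

Returning the reasons instead of a bare boolean makes the "manually approve a job" case easier, since a human can see which rule fired.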
  • 58. @holdenkarau % of data change ● Not just invalid records, if a field’s value changes everywhere it could still be “valid” but have a different meaning ○ Remember that example about almost recommending illegal content? ● Join and see number of rows different on each side ● Expensive operation, but if your data changes slowly / at a constant ish rate ○ Sometimes done as a separate parallel job ● Can also be used on output if applicable ○ You do have a table/file/as applicable to roll back to right?
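
The "join and see the number of rows different" check can be sketched on two small in-memory snapshots. `fraction_changed` is a hypothetical helper; on real data this would be a (costly) distributed join, as the slide notes.

```python
def fraction_changed(old_rows, new_rows, key):
    """Fraction of keys whose row was added, removed, or modified between snapshots."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    all_keys = old_by_key.keys() | new_by_key.keys()
    if not all_keys:
        return 0.0
    # A key counts as changed if it is missing on one side or the rows differ.
    changed = sum(1 for k in all_keys if old_by_key.get(k) != new_by_key.get(k))
    return changed / len(all_keys)

old_snapshot = [
    {"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}, {"id": 4, "v": "d"},
]
new_snapshot = [
    {"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}, {"id": 5, "v": "e"},
]

change_ratio = fraction_changed(old_snapshot, new_snapshot, key="id")
```

If the data normally drifts at a steady rate, a sudden jump in this ratio is a cheap signal that a field changed meaning even though every record still parses.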
  • 59. @holdenkarau TFDV: Magic* ● Counters, schema inference, anomaly detection, oh my! # Compute statistics over a new set of data new_stats = tfdv.generate_statistics_from_csv(NEW_DATA) # Compare how new data conforms to the schema anomalies = tfdv.validate_statistics(new_stats, schema) # Display anomalies inline tfdv.display_anomalies(anomalies)
  • 60. @holdenkarau Not just data changes: Software too ● Things change! Yay! Often for the better. ○ Especially with handling edge cases like NA fields ○ Don’t expect the results to change - side-by-side run + diff ● Have an ML model? ○ Welcome to new params - or old params with different default values. ○ We’ll talk more about that later ● Excellent PyData London talk about how this can impact ML models ○ Done with sklearn shows vast differences in CVE results only changing the version number Francesco
  • 61. @holdenkarau Optional Demos: (or early Q&A) ● Go on beam on Flink Wordcount ● Spark on Kubeflow? ● Tensorflow Transform on Beam on Flink ● TensorflowOnSpark ● Tensorflow Data Validation on Beam On Dataflow
  • 62. @holdenkarau References ● TFMA + TFT example guide - ● Apache Beam github repo (w/early alpha portable Flink support) - ● TFMA Example fork for use w/Beam on Flink - ● TensorFlowOnSpark - ● Spark Deep Learning Pipelines - ● flink-tensorflow - ● TF.Transform - ● Beam portability design: ● Beam on Flink + portability R. Crap Mariner
  • 63. @holdenkarau And some upcoming talks: ● April ○ Spark Summit ○ Strata London ● May ○ KiwiCoda Mania ○ KubeCon Barcelona ● June ○ Scala Days EU ○ Berlin Buzzwords ● July ○ OSCON Portland ○ Skills Matter meetup in London ● August ○ ScalaWorld
  • 64. @holdenkarau k thnx bye :) Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! Pssst: Have feedback on the presentation? Give me a shout ( if you feel comfortable doing so :) Give feedback on this presentation I have some free books on Spark if anyone wants :) Q&A session this afternoon