Powering TensorFlow with
big data
With Apache Beam, Flink & Spark bonus
@holdenkarau
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Maybe somewhat familiar with TensorFlow?
● Maybe somewhat familiar with Beam or Spark or Flink?
Lori Erickson
What we did get to:
● TensorFlowOnSpark w/basic Apache Arrow
● TF Transform demo on Apache Flink via Apache Beam
● Python & Go on Beam on Flink prototype
○ Everyone loves wordcount right? right?....
● New Beam architecture allowing for better portability &
handling dependencies (like Tensorflow)
● DO NOT USE non-JVM BEAM on Flink IN PRODUCTION
Vladimir Pustovit
DO NOT USE THIS* IN PRODUCTION
TODAY
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By Today I mean July 18th 2018, but it’s probably going
to not be great for at least a “little while”
*Portable Beam on Flink (including Python, so TFT)
Vladimir Pustovit
Tambako The Jaguar
What will be covered?
Most likely:
● Where we are today in non-JVM support in Beam
● And why this matters for Tensorflow
● What the rest of Big Data ecosystem looks like going outside the JVM
● A partial TF Transform demo + TFMA links + possible system crash
If there is time (e.g. demo runs quickly, wifi works, several other happy unlikely
things):
● TensorFlowOnSpark
● Apache Arrow - How this changes “everything”*
So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● As much as some may say that no feature prep is required even if you’re
looking at mnist.csv you probably have _some_ feature prep
● Since we need big data for training, we need to do our feature prep on it
● Even if you're just trying to raise some VC money it's going to go a lot better if
you add some keywords about a large proprietary dataset
TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform, TF.Serving (with
more coming)
● Alternatives: piles of custom code re-created at serving time.
○ Yay job security?
Jennifer C.
How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool (mine is Apache
Spark, but maybe yours is something else)
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Written in Python
● Runs on top of Apache Beam, but the current release does not yet support Python
outside of GCP
○ On master this can run on Flink, but it currently has bugs.
○ Please don’t use this in production today unless you’re on
GCP/Dataflow
Kathryn Yengel
Defining a Transform processing function
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Analyzers (mean, min/max, vocabulary) need a full pass over the data.
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}
Analyzers (e.g. mean, stddev, quantiles)
● Reduce (full pass)
● Implemented as a distributed data pipeline
Transforms (e.g. normalize, multiply, bucketize)
● Instance-to-instance (don’t change batch dimension)
● Pure TensorFlow
[Diagram: Analyze runs the analyzers (mean, stddev, quantiles) over the data and hands their results to the transforms (normalize, multiply, bucketize) as constant tensors]
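The analyzer/transform split can be sketched in plain Python. This is only an illustration of the two phases, not how TF.Transform is implemented: the `analyze` and `transform` helpers, the `-1` value for unseen strings, and the sample rows are all assumptions made up for this example.

```python
from collections import Counter


def analyze(rows):
    """Analyzer phase: one full pass over the data to compute constants."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    # Vocabulary ordered by descending frequency, ties broken alphabetically.
    counts = Counter(r["s"] for r in rows)
    vocab = sorted(counts, key=lambda term: (-counts[term], term))
    return {
        "x_mean": sum(xs) / len(xs),
        "y_min": min(ys),
        "y_max": max(ys),
        "s_index": {term: i for i, term in enumerate(vocab)},
    }


def transform(row, stats):
    """Transform phase: uses only this row plus the precomputed constants."""
    y_range = stats["y_max"] - stats["y_min"]
    return {
        "x_centered": row["x"] - stats["x_mean"],
        "y_normalized": (row["y"] - stats["y_min"]) / y_range if y_range else 0.0,
        # Strings never seen during analysis map to -1 in this sketch.
        "s_int": stats["s_index"].get(row["s"], -1),
    }


training_rows = [
    {"x": 1.0, "y": 10.0, "s": "cat"},
    {"x": 2.0, "y": 20.0, "s": "dog"},
    {"x": 3.0, "y": 30.0, "s": "cat"},
]
stats = analyze(training_rows)                          # full pass, done once
prepared = [transform(r, stats) for r in training_rows]
# The same function and the same constants are reused at serving time.
serving_row = transform({"x": 4.0, "y": 25.0, "s": "fish"}, stats)
```

Because `transform` only needs one row and the saved constants, the exact same code path can run at training and at serving time, which is the copy-and-paste problem this approach is meant to avoid.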
Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
● Bucketization: tft.apply_buckets, tft.quantiles
● Feature Crosses: tft.string_to_int, tf.string_join
● ...
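For anyone who hasn't met these feature-prep patterns before, here is a rough plain-Python sketch of three of them (n-grams, quantile bucketization, feature crosses). The helper names, the `_x_` separator, and the simple quantile rule are assumptions for illustration; they mirror the ideas, not the tft function signatures.

```python
from bisect import bisect_right


def ngrams(tokens, n):
    """Bag of words / n-grams: join each run of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bucket_boundaries(values, num_buckets):
    """Analyzer-style full pass: pick approximate quantile boundaries."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // num_buckets]
            for k in range(1, num_buckets)]


def apply_buckets(value, boundaries):
    """Per-row transform: index of the bucket this value falls into."""
    return bisect_right(boundaries, value)


def feature_cross(*parts):
    """Feature cross: combine several features into one string key."""
    return "_x_".join(str(p) for p in parts)


boundaries = bucket_boundaries([1, 2, 3, 4, 5, 6, 7, 8], num_buckets=4)
bigrams = ngrams("the quick brown fox".split(), 2)
crossed = feature_cross("mon", apply_buckets(6, boundaries))
```

Note the same split as before: `bucket_boundaries` needs the whole data set, while the other helpers work one row at a time.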
BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
Emma
BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary languages to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello
container hell)
○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand)
● Hacked up Python SDK to sort of talk to the new interface
● Go SDK talks to the new interface, still missing some features
● Need permissions? Run on GKE or plumb through permissions file :(
Nick
BEAM Beyond the JVM: Master w/ experiments
[Diagram labelled “portability”; several parts marked *ish]
Nick
So what does that look like?
[Diagram: a Driver plus Worker 1 through Worker K, each worker paired with a Docker container over grpc]
So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● Once we finish the Python SDK on Beam on Flink adventure you can use all
sorts of cool libraries (like TFT/TFX) to do your tensorflow work
○ You can use them today too if your use case is on Dataflow
○ If you don’t mind bugs you can experiment with them on Flink too
● You will be able to manage your dependencies
● You will be able to (in theory) re-use dataprep code at serving time
○ 80% less copy n’ paste code with slight mistakes that get out of date!**
● No that doesn’t work today
● Or tomorrow
● But… eventually
○ Standard OSS excuse “patches welcome” (sort of if you can find the branch :p)
**Not a guarantee, see your vendor for details.
Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE,
        num_oov_buckets=taxi.OOV_SIZE)
for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
Beam Demo Time!!!!!
● TFT!!!!!! So amazing!!!!!!!!!
○ Want to move to SF and raise a series A? Pay attention :p
● Based on my testing at noon there is a 50% chance it will finish without
crashing
● That’s your friendly reminder not to run any of this in production (yet)
● The demo code (holdenk/model-analysis) is forked from axelmagn/model-analysis,
which is forked from tensorflow/model-analysis, and runs on master
of Apache Beam w/Flink
BEAM Beyond the JVM: The “future”
E.g. not now
[Diagram labelled “portability”; several parts marked *ish]
Nick
Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Check out TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but steps are
similar
○ But you’re not using this in production today anyways?
○ Right?
Nick Perla
(Optional) Second Beam Demo Time!!!!!
● Word count!!!!!! So amazing!!!!!!!!!
● Based on my testing on Saturday there is a 2 in 3 chance this will hard lock
my computer
● That’s your friendly reminder not to run any of this in production
● Demo shell script of fun (go only) & python + go
What do the rest of the systems do?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
A quick detour into PySpark’s internals
+ + JSON
TimOve
PySpark
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc.
interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
So what does that look like?
[Diagram: the Driver is connected via py4j; Worker 1 through Worker K are each connected via a pipe]
And in flink….
[Diagram: the Driver is connected via a custom mechanism; Worker 1 through Worker K are each connected via mmap]
So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● features aren’t automatically exposed, but exposing
them is normally simple
TensorFlowOnSpark, everyone loves mnist!
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                        args.cluster_size, num_ps, args.tensorboard,
                        TFCluster.InputMode.SPARK)
if args.mode == "train":
    cluster.train(dataRDD, args.epochs)
Lida
The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: Spark 2.3 and beyond & GPUs & R & Python & ….
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
*Vendor benchmark. Trust but verify.
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
Rewriting your code because why not
spark.catalog.registerFunction(
"add", lambda x, y: x + y, IntegerType())
=>
add = pandas_udf(lambda x, y: x + y, IntegerType())
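Why the batch version tends to be cheaper can be sketched without Spark at all. In the toy below, `pickle` stands in for whatever encoding crosses the JVM/Python boundary (an assumption for illustration, as are the helper names and the call counter); the only point is how many serialize/deserialize steps each style needs.

```python
import pickle

codec_calls = {"per_row": 0, "batched": 0}


def _roundtrip(obj, path):
    """Serialize then deserialize, as happens each time data crosses a pipe."""
    codec_calls[path] += 2  # one dumps + one loads
    return pickle.loads(pickle.dumps(obj))


def per_row_udf(rows, func):
    """Row-at-a-time: every row is encoded on the way in and on the way out."""
    out = []
    for row in rows:
        args = _roundtrip(row, "per_row")                # "JVM" -> Python
        out.append(_roundtrip(func(*args), "per_row"))   # Python -> "JVM"
    return out


def batched_udf(rows, func):
    """Batch-at-a-time: one columnar blob in, one blob of results out."""
    columns = _roundtrip(list(zip(*rows)), "batched")
    results = [func(*args) for args in zip(*columns)]
    return _roundtrip(results, "batched")


def add(x, y):
    return x + y


rows = [(i, i * 2) for i in range(1000)]
per_row_result = per_row_udf(rows, add)
batched_result = batched_udf(rows, add)
print(codec_calls)
```

Both paths return the same answers; in this sketch the row-at-a-time path pays four codec calls per row while the batched path pays four per batch. Real Arrow-backed UDFs also avoid going through Python native types per value, which this toy does not model.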
Jennifer C.
And we can do this in TFOnSpark*:
unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info,
                                            self.cluster_meta, qname))
Will Transform Into something magical (aka fast but unreliable)
on the next slide!
Delaina Haslam
Which becomes
train_func = TFSparkNode.train(self.cluster_info,
                               self.cluster_meta, qname)

@pandas_udf("int")
def do_train(inputSeries1, inputSeries2):
    # Sad hack for now
    modified_series = map(lambda x: (x[0], x[1]),
                          zip(inputSeries1, inputSeries2))
    train_func(modified_series)
    return pandas.Series([0] * len(inputSeries1))
ljmacphee
And this now looks like:
Logos trademarks of their respective projects
Juha Kettunen
*ish
TFOnSpark Possible vNext+1?
● Avoid funneling the data through Python native types
○ For now the Spark Arrow UDFs aren’t perfect for this
○ But we can (and are) improving them
● mmapped Arrow?
● Skip Python on the workers handling data entirely (idk I’m lazy so probably
not)
Renars
References
● TFMA + TFT example guide -
https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi
● Apache Beam github repo (w/early alpha portable Flink support)-
https://beam.apache.org/
● TFMA Example fork for use w/Beam on Flink -
● TensorFlowOnSpark - https://github.com/yahoo/TensorFlowOnSpark
● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-learning
● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow
● TF.Transform - https://github.com/tensorflow/transform
● Beam portability design: https://beam.apache.org/contribute/portability/
● Beam on Flink + portability https://issues.apache.org/jira/browse/BEAM-2889
R. Crap Mariner
And some upcoming talks:
● August
○ Tentative Ottawa meetup!
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
● November
○ Big Data Spain
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
What does the rest of big data outside the JVM
look like?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
Hadoop “streaming” (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways...
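For reference, the Unix-pipes model looks roughly like this: the mapper and the reducer are ordinary programs that read lines and write tab-separated lines, and the framework sorts by key in between. The sketch below fakes that locally with `sorted`; the function names and sample text are made up for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line per word, the way a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


text = ["the cat", "The dog"]
shuffled = sorted(mapper(text))   # stand-in for the framework's sort/shuffle
counts = list(reducer(shuffled))
```

In a real streaming job each function would read from `sys.stdin` and print to stdout in its own process, so every record gets turned into text and parsed back at least once per stage. That is the data copy and the "formats get sad" part.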
Lisa Larsson
Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow
(solution, FFI bindings to C instead of Java)
● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams)
and features depend on client libraries implementing them (easy to slip below
parity)
Smokey Combs
Dask: a new beginning?
● Pure* python implementation
● Provides a real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
Lisa Zins

Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON PDX 2018

  • 1. Powering TensorFlow with big data With Apache Beam, Flink & Spark bonus @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Maybe somewhat familiar with TensorFlow? ● Maybe somewhat familiar with Beam or Spark or Flink? Lori Erickson
  • 5. What we did get to: ● TensorFlowOnSpark w/basic Apache Arrow ● TF Transform demo on Apache Flink via Apache Beam ● Python & Go on Beam on Flink prototype ○ Everyone loves wordcount right? right?.... ● New Beam architecture allowing for better portability & handling dependencies (like Tensorflow) ● DO NOT USE non-JVM BEAM on Flink IN PRODUCTION Vladimir Pustovit
  • 6. DO NOT USE THIS* IN PRODUCTION TODAY ● I’m serious, I don’t want to die or cause the next financial meltdown with software I’m a part of ● By Today I mean July 18th 2018, but it’s probably going to not be great for at least a “little while” *Portable Beam on Flink (including Python, so TFT) Vladimir Pustovit Tambako The Jaguar
  • 7. What will be covered? Most likely: ● Where we are today in non-JVM support in Beam ● And why this matters for Tensorflow ● What the rest of Big Data ecosystem looks like going outside the JVM ● A partial TF Transform demo + TFMA links + possible system crash If there is time (e.g. demo runs quickly, wifi works, several other happy unlikely things): ● TensorFlowOnSpark ● Apache Arrow - How this changes “everything”*
  • 8. So why do I need to power DL w/Big Data? ● Deep learning is most effective with large sample sets for training ● As much as some may say that no feature prep is required, even if you’re looking at mnist.csv you probably have _some_ feature prep ● Since we need big data for training we need to do our feature prep on it ● Even if you're just trying to raise some VC money it's going to go a lot better if you add some keywords about a large proprietary dataset
  • 9. TensorFlow isn’t enough on its own ● Enter TFX & friends like Kubeflow ○ Current related TFX OSS components: TF.Transform TF.Serving (with more coming) ● Alternatives: piles of custom code re-created at serving time. ○ Yay job security? Jennifer C.
  • 10. How do I do feature prep? (old skool) ● Write custom preparation jobs in your favourite big data tool (mine is Apache Spark, but maybe yours is something else) ● Run it, train on the prepared data ● Rewrite your feature prep code to run at serving time ○ Error prone and sad
  • 11. Enter: TF.Transform ● For pre-processing of your data ○ e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Written in Python ● Runs on top of Apache Beam, but the current release does not yet support Python outside of GCP ○ On master this can run on Flink, but has bugs currently. ○ Please don’t use this in production today unless you're on GCP/Dataflow Kathryn Yengel
  • 12. Defining a Transform processing function
        import tensorflow_transform as tft

        def preprocessing_fn(inputs):
            x = inputs['x']
            y = inputs['y']
            s = inputs['s']
            # Analyzers (tft.mean, scale/vocab statistics) need a full pass over the data.
            x_centered = x - tft.mean(x)
            y_normalized = tft.scale_to_0_1(y)
            s_int = tft.string_to_int(s)
            return {
                'x_centered': x_centered,
                'y_normalized': y_normalized,
                's_int': s_int}
  • 13. Analyzers (mean, stddev, quantiles): Reduce (full pass), implemented as a distributed data pipeline. Transforms (normalize, multiply, bucketize): Instance-to-instance (don’t change batch dimension), pure TensorFlow.
  • 15. Scale to ... Bag of Words / N-Grams Bucketization Feature Crosses tft.ngrams tft.string_to_int tf.string_split tft.scale_to_z_score tft.apply_buckets tft.quantiles tft.string_to_int tf.string_join ... Some common use-cases...
  • 16. BEAM Beyond the JVM: Current release ● Non JVM BEAM doesn’t work outside of Google’s environment yet ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer & you can come join us on BEAM-2874 :D Emma
  • 17. BEAM Beyond the JVM: Master + Experiments ● Common interface for setting up jobs ● Portability framework allows SDK harnesses in arbitrary languages to be kicked off ● Runners ship in their own docker containers (goodbye dependency hell, hello container hell) ○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand) ● Hacked up Python SDK to sort of talk to the new interface ● Go SDK talks to the new interface, still missing some features ● Need permissions? Run on GKE or plumb through permissions file :( Nick
  • 18. BEAM Beyond the JVM: Master w/ experiments *ish *ish *ish Nick portability *ish
  • 19. So what does that look like? Driver Worker 1 Docker grpc Worker K Docker grpc
  • 20. So how TF does this relate to TF? ● Tensorflow is in Python (kind of) ● Once we finish the Python SDK on Beam on Flink adventure you can use all sorts of cool libraries (like TFT/TFX) to do your tensorflow work ○ You can use them today too if your use case is on Dataflow ○ If you don’t mind bugs you can experiment with them on Flink too ● You will be able to manage your dependencies ● You will be able to (in theory) re-use dataprep code at serving time ○ 80% less copy n’ paste code with slight mistakes that get out of date!** ● No that doesn’t work today ● Or tomorrow ● But… eventually ○ Standard OSS excuse “patches welcome” (sort of if you can find the branch :p) **Not a guarantee, see your vendor for details.
  • 21. Ooor from the chicago taxi data...
        for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
            # Preserve this feature as a dense float, setting nan's to the mean.
            outputs[key] = transform.scale_to_z_score(inputs[key])
        for key in taxi.VOCAB_FEATURE_KEYS:
            # Build a vocabulary for this feature.
            outputs[key] = transform.string_to_int(
                inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)
        for key in taxi.BUCKET_FEATURE_KEYS:
            outputs[key] = transform.bucketize(inputs[key],
  • 22. Beam Demo Time!!!!! ● TFT!!!!!! So amazing!!!!!!!!! ○ Want to move to SF and raise a series A? Pay attention :p ● Based on my testing at noon there is a 50% chance it will finish without crashing ● That’s your friendly reminder not to run any of this in production (yet) ● The demo code (holdenk/model-analysis) is forked from axelmagn/model-analysis which is forked from tensorflow/model-analysis and runs on master of Apache Beam w/Flink
  • 23. BEAM Beyond the JVM: The “future” E.g. not now *ish *ish *ish Nick portability *ish *ish
  • 24. Ok now what? ● Integrate this into your model serving pipeline of choice ○ Don’t have one or open to change? Checkout TFMA which can directly serve it ● There’s a guide (it doesn’t show Flink because not released yet) but steps are similar ○ But you’re not using this in production today anyways? ○ Right? Nick Perla
  • 25. (Optional) Second Beam Demo Time!!!!! ● Word count!!!!!! So amazing!!!!!!!!! ● Based on my testing on Saturday there is a 2 in 3 chance this will hard lock my computer ● That’s your friendly reminder not to run any of this in production ● Demo shell script of fun (go only) & python + go
  • 26. What do the rest of the systems do? ● Spoiler: mostly it’s not better ○ Although it tends to be more finished ○ Sometimes it's different ● Different tradeoffs, maybe better for your use case but all tradeoffs Kate Neilan
  • 27. A quick detour into PySpark’s internals + + JSON TimOve
  • 28. PySpark ● The Python interface to Spark ● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less of a Pythonrific API ● Has some serious performance hurdles from the design
  • 29. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 31. So how does that impact Py[X] forall X in {Big Data}-{Native Python Big Data} ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Dependency management makes limited sense ● features aren’t automatically exposed, but exposing them is normally simple
  • 32. TensorFlowOnSpark, everyone loves mnist!
        cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                                args.cluster_size, num_ps, args.tensorboard,
                                TFCluster.InputMode.SPARK)
        if args.mode == "train":
            cluster.train(dataRDD, args.epochs)
    Lida
  • 33. The “future”*: faster interchange ● By future I mean availability today but running it in production is “adventurous” ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 34. Andrew Skudder *Arrow: Spark 2.3 and beyond & GPUs & R & Python & …. * *
  • 35. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  • 36. Arrow (a poorly drawn big data view) Logos trademarks of their respective projects Juha Kettunen *ish
  • 37. Rewriting your code because why not
        spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
        =>
        add = pandas_udf(lambda x, y: x + y, IntegerType())
    Jennifer C.
  • 38. And we can do this in TFOnSpark*: unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info, self.cluster_meta, qname)) Will Transform Into something magical (aka fast but unreliable) on the next slide! Delaina Haslam
  • 39. Which becomes
        train_func = TFSparkNode.train(self.cluster_info, self.cluster_meta, qname)

        @pandas_udf("int")
        def do_train(inputSeries1, inputSeries2):
            # Sad hack for now
            modified_series = map(lambda x: (x[0], x[1]), zip(inputSeries1, inputSeries2))
            train_func(modified_series)
            return pandas.Series([0] * len(inputSeries1))
    ljmacphee
  • 40. And this now looks like: Logos trademarks of their respective projects Juha Kettunen *ish
  • 41. TFOnSpark Possible vNext+1? ● Avoid funneling the data through Python native types ○ For now the Spark Arrow UDFS aren’t perfect for this ○ But we can (and are) improving them ● mmapped Arrow? ● Skip Python on the workers handling data entirely (idk I’m lazy so probably not) Renars
  • 42. References ● TFMA + TFT example guide - https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi ● Apache Beam github repo (w/early alpha portable Flink support) - https://beam.apache.org/ ● TFMA Example fork for use w/Beam on Flink - ● TensorFlowOnSpark - https://github.com/yahoo/TensorFlowOnSpark ● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-learning ● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow ● TF.Transform - https://github.com/tensorflow/transform ● Beam portability design: https://beam.apache.org/contribute/portability/ ● Beam on Flink + portability https://issues.apache.org/jira/browse/BEAM-2889 R. Crap Mariner
  • 43. And some upcoming talks: ● August ○ Tentative Ottawa meetup! ○ JupyterCon NYC ● September ○ Strata NYC ○ Strangeloop STL ● October ○ Spark Summit London ○ Reversim Tel Aviv ● November ○ Big Data Spain
  • 44. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :) Give feedback on this presentation http://bit.ly/holdenTalkFeedback
  • 45. What’s the rest of big data outside the JVM look like? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 46. Hadoop “streaming” (Python/R) ● Unix pipes! ● Involves a data copy, formats get sad ● But the overhead of a Map/Reduce task is pretty high anyways... Lisa Larsson
  • 47. Kafka: re-implement all the things ● Multiple options for connecting to Kafka from outside of the JVM (yay!) ● They implement the protocol to talk to Kafka (yay!) ● This involves duplicated client work, and sometimes the clients can be slow (solution, FFI bindings to C instead of Java) ● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams) and features depend on client libraries implementing them (easy to slip below parity) Smokey Combs
  • 48. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends might make this better with time too, buuut…. ● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html Lisa Zins
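
The Hadoop "streaming" approach from slide 46 (Unix pipes, one data copy, text formats) can be sketched with a wordcount, since everyone loves wordcount. This is a minimal local stand-in, not the Hadoop API: `mapper`, `reducer`, and `run_locally` are hypothetical names, and the in-memory `sorted` call plays the role of the shuffle that a real job would do between the two processes.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; assumes records arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_locally(text):
    """Stand-in for `cat input | mapper | sort | reducer` using in-memory streams."""
    mapped = mapper(io.StringIO(text))
    return dict(reducer(sorted(mapped)))


counts = run_locally("the cat sat\nthe cat ran\n")
print(counts)
```

Every record crosses the pipe as text, which is where the "formats get sad" complaint comes from: counts are turned into strings by the mapper and parsed back into integers by the reducer.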

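The analyzer/transform split from slides 12 and 13 can be sketched without TensorFlow at all. The code below is an illustration under stated assumptions, not how TF.Transform is implemented: `analyze` and `transform` are hypothetical helper names, the vocabulary is ordered alphabetically for simplicity, and unseen strings map to -1.

```python
def analyze(rows):
    """Full pass over the dataset: compute the statistics the transforms need."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    vocab = sorted({r["s"] for r in rows})
    return {
        "x_mean": sum(xs) / len(xs),
        "y_min": min(ys),
        "y_max": max(ys),
        "s_index": {token: i for i, token in enumerate(vocab)},
    }


def transform(row, stats):
    """Instance-to-instance: uses only this row plus the frozen statistics."""
    y_range = stats["y_max"] - stats["y_min"]
    return {
        "x_centered": row["x"] - stats["x_mean"],
        "y_normalized": (row["y"] - stats["y_min"]) / y_range if y_range else 0.0,
        "s_int": stats["s_index"].get(row["s"], -1),  # -1 marks an unseen string
    }


training_rows = [
    {"x": 1.0, "y": 10.0, "s": "cat"},
    {"x": 2.0, "y": 20.0, "s": "dog"},
    {"x": 3.0, "y": 30.0, "s": "cat"},
]
stats = analyze(training_rows)                          # training time, full pass
prepared = [transform(r, stats) for r in training_rows]
served = transform({"x": 4.0, "y": 15.0, "s": "emu"}, stats)  # serving time, one row
```

Because `transform` depends only on the row and the frozen `stats`, the same function can run at training and at serving time, which is the "no rewritten feature prep" property the talk is after.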
Editor's Notes

  1. Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
  2. introduce spark rdds, purple blog diagrams https://www.flickr.com/photos/pustovit/15867520885/in/photolist-qbac9i-9XLrR4-74scWq-bpnxfN-qYAD3D-e6u5Ej-oztsCu-qJMG4L-7b4y4a-gu8Wa-8MzgVR-b5gHki-djzdH3-82TowY-qJc99b-pC6yth-ifAkvP-mju1Ce-3ACPG6-F9aWR2-5QQL1U-4Hav3S-dGHvJj-jxQLth-djzdgd-dL24wn-8znjgb-aZxA6H-gDkWNo-djzcZF-22NYH5r-9amo58-apqLdG-fZhoSH-cDjpEQ-nLbBuK-6EuEN2-dAN5KN-asBjbL-Vx2zFR-djzdki-SRkhd1-djzcnF-Tc9FAf-qsduzQ-djzd4P-9wCiZT-8JALTP-eqbpop-R2S2FR
  3. Remind people not to use this in production
  4. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  5. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  6. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  7. As just discussed, TFT provides utility functions that run analyzers as needed. These are data processing jobs that can be run in arbitrary environments using the Beam SDK.
  8. What happens behind the scene is that the analyzers run as a distributed data processing graph, and the result is put into the output graph as constants.
  9. Some of the common use cases. Just talked about the ones on the left: scale scores, bucketize. Text features: apply bag of words or n-grams. For feature crosses: cross strings and generate vocabs of the result of those crosses. As mentioned before, tf.Transform is powerful in that you can chain these transformations.
  10. SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. On the executors, the Java subprocess launches a Python subprocess via a pipe. The data is serialized using cPickle and sent to the Python subprocess
  11. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  12. https://www.flickr.com/photos/timove/2873619269/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  13. SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. On the executors, the Java subprocess launches a Python subprocess via a pipe. The data is serialized using cPickle and sent to the Python subprocess
  14. What does "Python memory isn’t controlled by the JVM" mean? Double serialization from the language transform
  15. https://www.flickr.com/photos/juhakettunen/17218305870/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  16. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  17. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  18. https://www.flickr.com/photos/juhakettunen/17218305870/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  19. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  20. Is debugging also really hard?
  21. Or this one … also should we put these below python part?
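
The double serialization mentioned in notes 10, 13, and 14 can be sketched in a single process. This is only an illustration: the function names are hypothetical, both "sides" here are plain Python, and the real executor is a JVM talking to the Python worker over a pipe.

```python
import pickle


def jvm_side_send(records):
    """Stand-in for the executor serializing a partition for the Python worker."""
    return pickle.dumps(records)


def python_worker(payload, func):
    """Deserialize, apply the user function, and serialize the results to send back."""
    records = pickle.loads(payload)
    return pickle.dumps([func(r) for r in records])


def jvm_side_receive(payload):
    """Stand-in for the executor reading the worker's results back."""
    return pickle.loads(payload)


partition = list(range(5))
outbound = jvm_side_send(partition)                 # serialize on the way in
inbound = python_worker(outbound, lambda x: x * 2)  # deserialize, compute, serialize
result = jvm_side_receive(inbound)                  # deserialize on the way out
bytes_moved = len(outbound) + len(inbound)
print(result, bytes_moved)
```

Every record is copied and encoded in both directions just to apply a one-line function, which is the cost that the Arrow-based interchange on slides 33 to 37 aims to reduce.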