Using Spark ML on Spark Errors
What Do the Clusters Tell Us?
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Normally I’d introduce my co-speaker
● However she was organizing the Apache Beam Summit and is just too
drained to be able to make it.
● I did have to cut a few corners (and re-use a few cat pictures) as a result
Sylvie burr
Some links (slides & recordings will be at):
Today’s talk:
http://bit.ly/2QoZuKz
Yesterday’s talk (Validating Pipelines):
https://bit.ly/2QqQUea
CatLoversShow
Who do I think you all are?
● Nice people*
● Familiar-ish to very familiar with Spark
● Possibly a little bit jaded (but also maybe not)
Amanda
What we are going to explore together!
● The Spark Mailing Lists
○ Yes, even user@
● My desire to be lazy
● The suspicion that srowen has a robot army to help
● A look at how much work it would be to build that robot
army
● The depressing realization “heuristics” are probably
better anyways (and some options)
Some of the reasons my employer cares*
● We have a hosted Spark/Hadoop solution (called Dataproc)
● We also have hosted pipeline management tools (based on Airflow called
Cloud Composer)
● Being good open source community members
*Probably, it’s not like I go to all of the meetings I’m invited to.
Khairil Zhafri
The Spark Mailing Lists & friends
● user@
○ Where people go to ask questions about using Spark
● dev@
○ Discussion about developing Spark
○ Also where people sometimes go when no one answers user@
● Stackoverflow
○ Some very active folks here as well
● Books/etc.
Petful
~8536 unanswered Spark posts :(
Richard J
Stack Overflow growth over time
Petful
Khalid Abduljaleel
*Done with BigQuery. Sorry!
Discoverability might matter
Petful
Anyone have an outstanding PR? koi ko
So how do we handle this?
● Get more community volunteers
○ (hard & burn out)
● Answer more questions
○ (hard & burn out)
● Answer fewer questions?
○ (idk maybe someone will buy a support contract)
● Make robots!
○ Hard, and doesn’t entirely work
Helen Olney
How many of you have had?
● Java OOM
● Application memory overhead exceeded
● Serialization exception
● Value is bigger than integer exception
● etc.
Helen Olney
Maaaaybe robots could help?
● It certainly seems like some folks have common issues
● Everyone loves phone trees right?
○ Press 1 if you’ve had an out-of-memory exception, press 2 if
you’re running Python
● Although more seriously some companies are building
recommendation systems on top of Spark to solve this
for their customers
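As a sketch of the phone-tree idea: a tiny rule-based first pass that maps common Spark error signatures to canned starting points. The patterns and suggestions here are illustrative guesses, not anything from the talk or any shipped product:

```python
# Hypothetical rule table for triaging Spark error reports: each entry
# pairs a regex for a well-known error signature with canned guidance.
import re

RULES = [
    (re.compile(r"java\.lang\.OutOfMemoryError", re.I),
     "Try increasing executor memory (spark.executor.memory)."),
    (re.compile(r"memory overhead exceeded", re.I),
     "Try raising spark.executor.memoryOverhead."),
    (re.compile(r"NotSerializableException|Task not serializable", re.I),
     "Check for non-serializable objects captured in closures."),
    (re.compile(r"bigger than.*integer|exceeds? Integer\.MAX_VALUE", re.I),
     "A partition is too large; repartition into smaller pieces."),
]

def triage(message: str) -> str:
    """Return the first matching canned suggestion, or a fallback."""
    for pattern, suggestion in RULES:
        if pattern.search(message):
            return suggestion
    return "No rule matched; route to a human."
```

The fallback branch is the important part: a phone tree only works if unmatched messages still reach a person.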
Matthew Hurst
Ok well, let’s try and build some clusters?
● Not those clusters :p
● Let’s try k=4; we had 4 common errors, right?
_torne
I’m lazy so let’s use Spark:
body_hashing = HashingTF(inputCol="body_tokens",
    outputCol="raw_body_features", numFeatures=10000)
body_idf = IDF(inputCol="raw_body_features",
    outputCol="body_features")
assembler = VectorAssembler(
    inputCols=["body_features", "contains_python_stack_trace",
               "contains_java_stack_trace", "contains_exception_in_task",
               "is_thread_start", "domain_features"],
    outputCol="features")
kmeans = KMeans(featuresCol="features", k=4, seed=42)
_torne
Damn not quite one slide :(
dataprep_pipeline = Pipeline(stages=[tokenizer, body_hashing,
body_idf, domains_hashing, domains_idf, assembler])
pipeline = Pipeline(stages=[dataprep_pipeline, kmeans])
_torne
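The VectorAssembler above consumes boolean columns like contains_python_stack_trace that the slides never show being built. A minimal stdlib-Python sketch of how such indicator features could be derived from raw message text; the regexes are my guesses at plausible rules, not the talk's actual feature code:

```python
# Illustrative indicator features for a mailing-list message body.
# Column names match the slides; the detection rules are assumptions.
import re

def extract_features(body: str) -> dict:
    """Compute simple boolean features from an e-mail body."""
    return {
        # CPython always prints this header before a stack trace.
        "contains_python_stack_trace":
            bool(re.search(r"Traceback \(most recent call last\)", body)),
        # JVM stack frames look like "  at pkg.Class.method(File.java:123)".
        "contains_java_stack_trace":
            bool(re.search(r"^\s*at [\w.$]+\(.*\)", body, re.M)),
        # Spark executor log lines like "Exception in task 3.0 in stage ...".
        "contains_exception_in_task":
            bool(re.search(r"Exception in task \d+", body)),
        # Crude guess: a reply usually starts with quoted ("> ") text.
        "is_thread_start":
            not body.lstrip().startswith(">"),
    }
```

Each feature becomes a 0/1 column that the assembler can concatenate alongside the TF-IDF vectors.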
f ford Pinto by Morven
ayphen
“Northern Rock”
Let’s see what the results look like
● Let’s grab an email or two from each cluster and take a
peek
Rikki's Refuge
Waiiiit…. Rikki's Refuge
Oh hmm. Well maybe 4*4 right?
● 159 non group-zero messages…
Sherrie Thai
Well when do we start to see something?
*Not actually a great way to pick K
w.vandervet
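The slides are right that "bump K until something interesting appears" is not a great way to pick K. A common alternative is the elbow heuristic: compute the within-cluster cost for a range of K values and look for the bend. A toy, stdlib-only, 1-D sketch of the idea (recent Spark versions expose the equivalent cost on a fitted KMeansModel via summary.trainingCost):

```python
import random

def kmeans_cost(points, k, iters=20, seed=42):
    """Naive 1-D k-means; returns the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers (keep the old one if a cluster empties out).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Sweep K and eyeball where the cost curve flattens.
points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 20.1, 19.9, 20.0]
costs = {k: kmeans_cost(points, k) for k in range(1, 5)}
```

Picking the K where the curve stops dropping sharply is still a heuristic, but it is at least a principled one.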
Let’s look at some of the records - 73
1 {plain=Hi All,
 Greetings ! I needed some help to read a Hive table
via Pyspark for which the transactional property is set to 'True' (In
other words ACID property is enabled). Following is the entire stacktrace and
the description of the hive table. Would you please be able to help me
resolve the error:

18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager
18/03/01 11:06:22 INFO EventLoggingListener: Logging events to
hdfs:///spark-history/local-1519923982155
Welcome to
[Spark ASCII-art banner] version 1.6.3

Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> hive_context.sql("select count(*)
Susan Young
Let’s look at some of the records - 133
5 {plain=Hi Gourav,

My answers are below.

Cheers,
Ben

> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
>
> Can I ask where are you running your CDH? Is it on premise or have you created a cluster for yourself in AWS?
Our cluster is on premise in our data center.
>
> Also I have really never seen use s3a before, that was used way long before when writing s3 files took a long time, but I think that you are reading it.
>
> Any ideas why you are not migrating to Spark 2.1, besides speed, there are lots of apis which are new and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the SPARK community right now.
We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.
>
> And besides that would you not want to work on a platform which is at least 10 times faster
What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuild11@gmail.com <mailto:bbuild11@gmail.com>> wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.
>
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at
nagy.tamas
Let’s look at some of the records - 133
6 {plain=I see, that’s quite interesting. For problem 2, I think the issue is that Akka 2.0.5 *always* kept TCP connections open between nodes, so these messages didn’t get lost. It looks like Akka 2.2 occasionally disconnects them and loses messages. If this is the case, and this behavior can’t be disabled with a flag, then it’s a problem for other parts of the code too. Most of our code assumes that messages will make it through unless the destination node dies, which is what you’d usually hope for TCP.

Matei

On Oct 31, 2013, at 1:33 PM, Imran Rashid <imran@quantifind.com> wrote:

> pretty sure I found the problem -- two problems actually. And I think one of them has been a general lurking problem w/ spark for a while.
>
> 1) we should ignore disassociation events, as you suggested earlier. They seem to just indicate a temporary problem, and can generally be ignored. I've found that they're regularly followed by AssociatedEvents, and it seems communication really works fine at that point.
>
> 2) Task finished messages get lost. When this message gets sent, we dont' know it actually gets there:
>
> https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala#L90
>
> (this is so incredible, I feel I must be overlooking something -- but there is no ack somewhere else that I'm overlooking, is there??) So, after the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent. It hangs b/c the executor has sent some taskFinished messages that never get received by the driver. So the driver is waiting for some tasks to finish, but the executors think they are all done.
>
> I'm gonna add the reliable proxy pattern for this particular interaction and see if its fixes the problem
> http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxy
>
> imran
>
> On Thu, Oct 31, 2013 at 1:17 PM, Imran Rashid <imran@quantifind.com> wrote:
> Hi Prashant,
>
> thanks for looking into this. I don't have any answers yet, but just wanted to send you an update. I finally figured out how to get all the akka logging turned on, so I'm looking at those for more info. One thing immediately jumped out at me -- the Disassociation is actually immediatley followed by an Association! so maybe I came to the wrong conclusion of our test of ignoring the DisassociatedEvent. I'm going to try it again -- hopefully w/ the logging on, I can find out more about what is going on. I might ask on akka list for help w/ what to look for. also this thread makes me think that it really should just re-associate:
> https://groups.google.com/forum/#!searchin/akka-user/Disassociated/akka-user/SajwwbyTriQ/8oxjbZtawxoJ
>
> also, I've
翮郡 陳
Let’s look at some of the records - 155
*Problem Description*:

The program running in stand-alone spark cluster (1 master, 6 workers with 8g ram and 2 cores).
Input: a 468MB file with 133433 records stored in HDFS.
Output: just 2MB file will stored in HDFS
The program has two map operations and one reduceByKey operation.
Finally I save the result to HDFS using "*saveAsTextFile*".
*Problem*: if I don't add "saveAsTextFile", the program runs very fast (a few seconds), otherwise extremely slow until about 30 mins.

*My program (is very Simple)*
	public static void main(String[] args) throws IOException{
		/**Parameter Setting***********/
		String localPointPath = "/home/hduser/skyrock/skyrockImageFeatures.csv";
		String remoteFilePath = "hdfs://HadoopV26Master:9000/user/skyrock/skyrockImageIndexedFeatures.csv";
		String outputPath = "hdfs://HadoopV26Master:9000/user/sparkoutput/";
		final int row =
Марья
Let’s look at some of the records - 183
{plain=I'm glad that I could help :)
19 sie 2015 8:52 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com> napisał(a):

> +1
>
> I wish I have read this blog earlier. I am using Java and have just implemented a singleton producer per executor/JVM during the day.
> Yes, I did see that NonSerializableException when I was debugging the code ...
>
> Thanks for sharing.
>
> On Tue, Aug 18, 2015 at 10:59 PM, Tathagata Das <tdas@databricks.com> wrote:
>
>> Its a cool blog post! Tweeted it!
>> Broadcasting the configuration necessary for lazily instantiating the producer is a good idea.
>>
>> Nitpick: The first code example has an extra `}` ;)
>>
>> On Tue, Aug 18, 2015 at 10:49 PM, Marcin Kuthan <marcin.kuthan@gmail.com> wrote:
>>
>>> As long as Kafka producent is thread-safe you don't need any pool at all. Just share single producer on every executor. Please look at my blog post for more details. http://allegro.tech/spark-kafka-integration.html
>>> 19 sie 2015 2:00 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com> napisał(a):
>>>
>>>> All of you are right.
>>>>
>>>> I was trying to create too many producers. My idea was to create a pool (for now the pool contains only one producer) shared by all the executors.
>>>> After I realized it was related to the serializable issues (though I did not find clear clues in the source code to indicate the broacast template type parameter must be implement serializable), I followed spark cassandra connector design and created a singleton of Kafka producer pools. There is not exception noticed.
>>>>
>>>> Thanks for all your comments.
>>>>
>>>> On Tue, Aug 18, 2015 at 4:28 PM, Tathagata Das <tdas@databricks.com> wrote:
>>>>
>>>>> Why are you even trying to broadcast a producer? A broadcast variable is some immutable piece of serializable DATA that can be used for processing on the executors. A Kafka producer is neither DATA nor immutable, and definitely not serializable.
>>>>> The right way to do this is to create the producer in the executors. Please see the discussion in the programming guide
>>>>>
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>>>>>
>>>>> On Tue, Aug 18, 2015 at 3:08 PM, Cody Koeninger <cody@koeninger.org> wrote:
>>>>>
>>>>>> I wouldn't expect a kafka producer to be serializable at all... among other things, it has a background thread
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan <
>>>>>> wanshenghua@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Did anyone see
An idea of some of the clusters’ “meaning”
● 74 - (more or less) answers
● 53 - hive errors (more or less)
● 155 - non-hive stack traces (mostly map partitions)
● 126 - PR comments
● 183 - streaming
● etc.
● 0 - Everything else
w.vandervet
This probably isn’t good enough :(
● But maaaybe we don’t want to give folks an IVR response
● How about if we took unanswered questions and pointed
to similar questions (ideally ones with answers…)
● Human-in-the-loop?
○ Idk anyone want to volunteer for this?
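The "point unanswered questions at similar answered ones" idea can be prototyped with plain TF-IDF and cosine similarity, no Spark required at this scale. Everything below (function names, the toy corpus) is illustrative, not from the talk:

```python
# Find the answered question most similar to an unanswered one,
# using sparse TF-IDF dicts and cosine similarity (stdlib only).
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    df = Counter()  # document frequency per term
    for counts in tokenized:
        df.update(counts.keys())
    n = len(docs)
    return [{t: c * math.log((1 + n) / (1 + df[t]))
             for t, c in counts.items()} for counts in tokenized]

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(question, answered):
    """Index of the answered question most similar to `question`."""
    vecs = tfidf_vectors(answered + [question])
    q = vecs[-1]
    return max(range(len(answered)), key=lambda i: cosine(q, vecs[i]))
```

A human-in-the-loop version would surface the top few matches for a volunteer to approve rather than replying automatically.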
Lisa Zins
What else could we do?
● Transfer learning with the TF github summarizer?
● Explore elasticsearch
● Label some data with fiverr or similar
● Give up
● Go for a drink
● Explore network graphs on the Spark Mailing list
○ Like Paco did -
https://www.slideshare.net/pacoid/graph-analytics-in-spark
Oooor
● Sparklint
● Dr. Elephant
● etc.
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
High Performance Spark!
You can buy it today! Several copies!
Really not that much on ML
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
And some upcoming talks:
● October
○ Reversim
○ ScyllaDB Summit
○ Twilio’s Signal in SF
● November
○ PyCon Canada
○ Big Data Spain
● December
○ ScalaX
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I want us to build better testing
support in Spark 3+
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
I’m always trying to get better at giving talks so feedback is
welcome: http://bit.ly/holdenTalkFeedback

Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 4
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Democratizing Data Quality Through a Centralized Platform
Learn to Use Databricks for Data Science
Why APM Is Not the Same As ML Monitoring
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Stage Level Scheduling Improving Big Data and AI Integration
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Sawtooth Windows for Feature Aggregations
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Re-imagine Data Monitoring with whylogs and Spark
Raven: End-to-end Optimization of ML Prediction Queries
Processing Large Datasets for ADAS Applications using Apache Spark
Massive Data Processing in Adobe Using Delta Lake



Spark ML on Spark Errors: What Clusters Can Tell Us

  • 1. Using Spark ML on Spark Errors What Do the Clusters Tell Us?
  • 2. Who am I? ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC (think committer with tenure) ● Contributor to a lot of other projects ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Related Spark Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Normally I’d introduce my co-speaker ● However she was organizing the Apache Beam Summit and is just too drained to be able to make it. ● I did have to cut a few corners (and re-use a few cat pictures) as a result Sylvie burr
  • 5. Some links (slides & recordings will be at): Today’s talk: http://bit.ly/2QoZuKz Yesterday’s talk (Validating Pipelines): https://bit.ly/2QqQUea CatLoversShow
  • 6. Who do I think you all are? ● Nice people* ● Familiar-ish to very familiar with Spark ● Possibly a little bit jaded (but also maybe not) Amanda
  • 7. What we are going to explore together! ● The Spark Mailing Lists ○ Yes, even user@ ● My desire to be lazy ● The suspicion that srowen has a robot army to help ● A look at how much work it would be to build that robot army ● The depressing realization “heuristics” are probably better anyways (and some options)
  • 8. Some of the reasons my employer cares* ● We have a hosted Spark/Hadoop solution (called Dataproc) ● We also have hosted pipeline management tools (based on Airflow, called Cloud Composer) ● Being good open source community members *Probably, it’s not like I go to all of the meetings I’m invited to. Khairil Zhafri
  • 9. The Spark Mailing Lists & friends ● user@ ○ Where people go to ask questions about using Spark ● dev@ ○ Discussion about developing Spark ○ Also where people sometimes go when no one answers user@ ● Stackoverflow ○ Some very active folks here as well ● Books/etc. Petful
  • 10. ~8536 unanswered Spark posts :( Richard J
  • 11. Stack Overflow growth over time Petful Khalid Abduljaleel *Done with BigQuery. Sorry!
  • 13. Anyone have an outstanding PR? koi ko
  • 14. So how do we handle this? ● Get more community volunteers ○ (hard & burn out) ● Answer more questions ○ (hard & burn out) ● Answer fewer questions? ○ (idk, maybe someone will buy a support contract) ● Make robots! ○ Hard, and doesn’t work entirely Helen Olney
  • 15. How many of you have had? ● Java OOM ● Application memory overhead exceeded ● Serialization exception ● Value is bigger than integer exception ● etc. Helen Olney
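Slide 15's list of common failure modes hints at why the "heuristics" mentioned earlier can beat clustering: these errors are easy to match directly. A minimal sketch of such a rule-based matcher; the patterns, category names, and the `ERROR_RULES` table are my own illustration, not anything shipped with Spark:

```python
import re

# Hypothetical rules: one (label, pattern) pair per common failure mode
# from the slide. Real error text varies, so these patterns are guesses.
ERROR_RULES = [
    ("java_oom", re.compile(r"java\.lang\.OutOfMemoryError")),
    ("memory_overhead", re.compile(r"memory overhead.*exceeded", re.IGNORECASE)),
    ("serialization", re.compile(r"NotSerializableException|SerializationException")),
    ("int_overflow", re.compile(r"(bigger than|exceeds).*Integer\.MAX_VALUE")),
]

def classify(message: str) -> str:
    """Return the first matching error category, or 'unknown'."""
    for label, pattern in ERROR_RULES:
        if pattern.search(message):
            return label
    return "unknown"
```

The first-match-wins loop is the whole "phone tree": no training data needed, but every new failure mode means another hand-written rule.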
  • 16. Maaaaybe robots could help? ● It certainly seems like some folks have common issues ● Everyone loves phone trees right? ○ Press 1 if you’ve had an out-of-memory exception press 2 if you’re running Python ● Although more seriously some companies are building recommendation systems on top of Spark to solve this for their customers Matthew Hurst
  • 17. Ok well, let’s try and build some clusters? ● Not those clusters :p ● Let’s try k=4, we had 4 common errors right? _torne
  • 18. I’m lazy so let’s use Spark:
body_hashing = HashingTF(inputCol="body_tokens", outputCol="raw_body_features", numFeatures=10000)
body_idf = IDF(inputCol="raw_body_features", outputCol="body_features")
assembler = VectorAssembler(
    inputCols=["body_features", "contains_python_stack_trace",
               "contains_java_stack_trace", "contains_exception_in_task",
               "is_thread_start", "domain_features"],
    outputCol="features")
kmeans = KMeans(featuresCol="features", k=4, seed=42)
_torne
  • 19. Damn not quite one slide :(
dataprep_pipeline = Pipeline(stages=[tokenizer, body_hashing, body_idf,
                                     domains_hashing, domains_idf, assembler])
pipeline = Pipeline(stages=[dataprep_pipeline, kmeans])
_torne
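The HashingTF stage in the pipeline above is what keeps the feature vector a fixed width (numFeatures=10000) no matter how many distinct tokens the mailing-list bodies contain. A plain-Python sketch of that hashing trick; Spark's real implementation differs (different hash function, sparse vectors), this dense version just shows the idea:

```python
def hashing_tf(tokens, num_features=10000):
    """Hash each token into one of num_features buckets; the bucket
    counts form a fixed-width term-frequency vector. Distinct tokens
    can collide, which is the price of the fixed width."""
    vec = [0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1
    return vec
```

With only 10000 buckets for an open-ended vocabulary, collisions are expected; IDF weighting downstream softens but does not remove that noise.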
  • 20. f ford Pinto by Morven
  • 21. f ford Pinto by Morven ayphen
  • 23. Let’s see what the results tell us ● Let’s grab an email or two from each cluster and take a peek Rikki's Refuge
  • 25. Oh hmm. Well maybe 4*4 right? ● 159 non group-zero messages… Sherrie Thai
  • 26. Well when do we start to see something? *Not actually a great way to pick K w.vandervet
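As the footnote on slide 26 says, waiting to "see something" is not a great way to pick K. A common alternative is to sweep K and look for an elbow in the within-cluster sum of squared errors (Spark ML exposes a training cost on its KMeans summary). A tiny 1-D k-means sketch to illustrate the sweep; purely illustrative, not how you would do it at mailing-list scale:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal Lloyd's algorithm on a list of floats.
    Returns (centers, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost

# Three well-separated groups; sweep k and watch where the cost
# curve's drop flattens out.
emails = [0.0, 0.1, 0.2, 10.0, 10.1, 20.0, 20.1, 20.2]
costs = {k: kmeans(emails, k)[1] for k in range(1, 5)}
```

The elbow is a heuristic too, and on skewed text data (one giant "group zero" cluster, as above) it can be just as misleading as eyeballing.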
  • 27. Let’s look at some of the records - 73 1 {plain=Hi All,n Greetings ! I needed some help to read a Hive tablenvia Pyspark for which the transactional property is set to 'True' (In othernwords ACID property is enabled). Following is the entire stacktrace and thendescription of the hive table. Would you please be able to help me resolventhe error:nn18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManagern18/03/01 11:06:22 INFO EventLoggingListener: Logging events tonhdfs:///spark-history/local-1519923982155nWelcome ton ____ __n / __/__ ___ _____/ /__n _ / _ / _ `/ __/ '_/n /__ / .__/_,_/_/ /_/_ version 1.6.3n /_/nnUsing Python version 2.7.12 (default, Jul 2 2016 17:42:40)nSparkContext available as sc, HiveContext available as sqlContext.n>>> from pyspark.sql import HiveContextn>>> hive_context = HiveContext(sc)n>>> hive_context.sql("select count(*) Susan Young
  • 28. Let’s look at some of the records - 133 5 {plain=Hi Gourav,nnMy answers are below.nnCheers,nBennnn> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:n> n> Can I ask where are you running your CDH? Is it on premise or have you created a cluster for yourself in AWS? Our cluster in on premise in our data center.n> n> Also I have really never seen use s3a before, that was used way long before when writing s3 files took a long time, but I think that you are reading it. n> n> Anyideas why you are not migrating to Spark 2.1, besides speed, there are lots of apis which are new and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the SPARK community right now. We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.n> n> And besides that would you not want to work on a platform which is at least 10 times faster What would that be?n> n> Regards,n> Gourav Senguptan> n> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuild11@gmail.com <mailto:bbuild11@gmail.com>> wrote:n> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. 
We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.n> n> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chainn> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)n> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)n> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)n> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)n> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)n> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)n> at nagy.tamas
  • 29. Let’s look at some of the records - 133 6 {plain=I see, that’s quite interesting. For problem 2, I think the issue is that Akka 2.0.5 *always* kept TCP connections open between nodes, so these messages didn’t get lost. It looks like Akka 2.2 occasionally disconnects them and loses messages. If this is the case, and this behavior can’t be disabled with a flag, then it’s a problem for other parts of the code too. Most of our code assumes that messages will make it through unless the destination node dies, which is what you’d usually hope for TCP.nnMateinnOn Oct 31, 2013, at 1:33 PM, Imran Rashid <imran@quantifind.com> wrote:nn> pretty sure I found the problem -- two problems actually. And I think one of them has been a general lurking problem w/ spark for a while.n> n> 1) we should ignore disassociation events, as you suggested earlier. They seem to just indicate a temporary problem, and can generally be ignored. I've found that they're regularly followed by AssociatedEvents, and it seems communication really works fine at that point.n> n> 2) Task finished messages get lost. When this message gets sent, we dont' know it actually gets there:n> n> https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala #L90n> n> (this is so incredible, I feel I must be overlooking something -- but there is no ack somewhere else that I'm overlooking, is there??) So, after the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent. It hangs b/c the executor has sent some taskFinished messages that never get received by the driver. 
So the driver is waiting for some tasks to finish, but the executors think they are all done.n> n> I'm gonna add the reliable proxy pattern for this particular interaction and see if its fixes the problemn> http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxyn> n> imrann> n> n> n> On Thu, Oct 31, 2013 at 1:17 PM, Imran Rashid <imran@quantifind.com> wrote:n> Hi Prashant,n> n> thanks for looking into this. I don't have any answers yet, but just wanted to send you an update. I finally figured out how to get all the akka logging turned on, so I'm looking at those for more info. One thing immediately jumped out at me -- the Disassociation is actually immediatley followed by an Association! so maybe I came to the wrong conclusion of our test of ignoring the DisassociatedEvent. I'm going to try it again -- hopefully w/ the logging on, I can find out more about what is going on. I might ask on akka list for help w/ what to look for. also this thread makes me think that it really should just re-associate:n> https://groups.google.com/forum/#!searchin/akka-user/Disassociated/akka-user/SajwwbyTriQ/8oxjbZtawxoJn> n> also, I've 翮郡 陳
  • 30. Let’s look at some of the records - 155 *Problem Description*:nnThe program running in stand-alone spark cluster (1 master, 6 workers withn8g ram and 2 cores).nInput: a 468MB file with 133433 records stored in HDFS.nOutput: just 2MB file will stored in HDFSnThe program has two map operations and one reduceByKey operation.nFinally I save the result to HDFS using "*saveAsTextFile*".n*Problem*: if I don't add "saveAsTextFile", the program runs very fast(a fewnseconds), otherwise extremely slow until about 30 mins.nn*My program (is very Simple)*ntpublic static void main(String[] args) throws IOException{ntt/**Parameter Setting***********/ntt String localPointPath = "/home/hduser/skyrock/skyrockImageFeatures.csv";ntt String remoteFilePath =n"hdfs://HadoopV26Master:9000/user/skyrock/skyrockImageIndexedFeatures.cs v";ntt String outputPath = "hdfs://HadoopV26Master:9000/user/sparkoutput/";ntt final int row = Марья
• 31. Let’s look at some of the records - 183
{plain=I'm glad that I could help :)
19 sie 2015 8:52 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com>
napisał(a):

> +1
>
> I wish I have read this blog earlier. I am using Java and have just
> implemented a singleton producer per executor/JVM during the day.
> Yes, I did see that NonSerializableException when I was debugging the code
> ...
>
> Thanks for sharing.
>
> On Tue, Aug 18, 2015 at 10:59 PM, Tathagata Das <tdas@databricks.com>
> wrote:
>
>> Its a cool blog post! Tweeted it!
>> Broadcasting the configuration necessary for lazily instantiating the
>> producer is a good idea.
>>
>> Nitpick: The first code example has an extra `}` ;)
>>
>> On Tue, Aug 18, 2015 at 10:49 PM, Marcin Kuthan <marcin.kuthan@gmail.com>
>> wrote:
>>
>>> As long as Kafka producent is thread-safe you don't need any pool at
>>> all. Just share single producer on every executor. Please look at my blog
>>> post for more details. http://allegro.tech/spark-kafka-integration.html
>>> 19 sie 2015 2:00 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com>
>>> napisał(a):
>>>
>>>> All of you are right.
>>>>
>>>> I was trying to create too many producers. My idea was to create a
>>>> pool (for now the pool contains only one producer) shared by all the
>>>> executors.
>>>> After I realized it was related to the serializable issues (though I
>>>> did not find clear clues in the source code to indicate the broacast
>>>> template type parameter must be implement serializable), I followed spark
>>>> cassandra connector design and created a singleton of Kafka producer pools.
>>>> There is not exception noticed.
>>>>
>>>> Thanks for all your comments.
>>>>
>>>> On Tue, Aug 18, 2015 at 4:28 PM, Tathagata Das <tdas@databricks.com>
>>>> wrote:
>>>>
>>>>> Why are you even trying to broadcast a producer? A broadcast variable
>>>>> is some immutable piece of serializable DATA that can be used for
>>>>> processing on the executors. A Kafka producer is neither DATA nor
>>>>> immutable, and definitely not serializable.
>>>>> The right way to do this is to create the producer in the executors.
>>>>> Please see the discussion in the programming guide
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>>>>>
>>>>> On Tue, Aug 18, 2015 at 3:08 PM, Cody Koeninger <cody@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> I wouldn't expect a kafka producer to be serializable at all... among
>>>>>> other things, it has a background thread
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan <
>>>>>> wanshenghua@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Did anyone see
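The thread keeps converging on the same fix: don't broadcast the producer, lazily create one per executor and share it across tasks. A minimal runnable sketch of that singleton pattern, with a hypothetical `DummyProducer` standing in for a real (non-serializable) Kafka producer so no broker is needed:

```python
class DummyProducer:
    """Stand-in for kafka.KafkaProducer; only the singleton logic matters."""
    instances = 0

    def __init__(self):
        DummyProducer.instances += 1

    def send(self, topic, value):
        return (topic, value)

_producer = None  # module-level: one per Python worker process (per JVM in Scala)

def get_producer():
    # Lazily create the producer the first time a task on this worker needs
    # it; later tasks reuse it instead of deserializing one from the driver
    # (which fails, since producers are not serializable).
    global _producer
    if _producer is None:
        _producer = DummyProducer()
    return _producer

def send_partition(records):
    producer = get_producer()
    for r in records:
        producer.send("events", r)

# On a real cluster this would be rdd.foreachPartition(send_partition);
# here we simulate two partitions landing on the same worker.
send_partition([1, 2])
send_partition([3, 4])
print(DummyProducer.instances)  # → 1: both partitions shared one producer
```

This mirrors the "share a single producer on every executor" advice from the thread; the per-process global plays the role a Scala `object` (or Broadcast-of-config plus lazy val) plays on the JVM.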
• 32. An idea of some of the clusters “meaning”
● 74 - (more or less) answers
● 53 - hive errors (more or less)
● 155 - non-hive stack traces (mostly map partitions)
● 126 - PR comments
● 183 - streaming
● etc.
● 0 - Everything else
w.vandervet
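Labels like these come from eyeballing members of each cluster; a cheap first guess can be automated by pulling the most frequent terms out of each cluster's documents. A stdlib-only sketch on toy data (not the talk's actual Spark ML pipeline):

```python
from collections import Counter

# Toy stand-ins for two clusters of mailing-list messages.
clusters = {
    53: ["hive metastore error", "hive query failed error"],
    183: ["streaming kafka producer", "streaming batch kafka"],
}

def top_terms(docs, n=2):
    """Return the n most frequent whitespace tokens across a cluster's docs."""
    counts = Counter(word for d in docs for word in d.split())
    return [w for w, _ in counts.most_common(n)]

summary = {cid: top_terms(docs) for cid, docs in clusters.items()}
print(summary)  # → {53: ['hive', 'error'], 183: ['streaming', 'kafka']}
```

In practice you would use per-cluster TF-IDF (terms frequent in the cluster but rare overall) rather than raw counts, or the words of the centroid itself, but the idea is the same.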
• 33. This probably isn’t good enough :(
● But maaaybe we don’t want to give folks an IVR response
● How about if we took un-answered questions and pointed to similar questions (ideally ones with answers…)
● Human-in-the-loop?
○ Idk anyone want to volunteer for this?
Lisa Zins
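Pointing an unanswered question at its most similar answered one is essentially a nearest-neighbour lookup over question text. A bag-of-words cosine-similarity sketch (stdlib only, toy data — a real version would use TF-IDF or embeddings):

```python
import math
from collections import Counter

# Hypothetical corpus of already-answered questions.
answered = {
    "q1": "kafka producer not serializable in streaming job",
    "q2": "hive metastore connection refused",
}

def cosine(a, b):
    """Cosine similarity between two texts as raw term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def most_similar(question):
    # Return the id of the answered question closest to the new one.
    return max(answered, key=lambda q: cosine(question, answered[q]))

print(most_similar("streaming kafka producer serializable error"))  # → q1
```

The same lookup keeps a human in the loop naturally: surface the top match as a suggestion ("this looks similar, did it help?") rather than auto-replying.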
• 34. What else could we do?
● Transfer learning with the TF github summarizer?
● Explore elasticsearch
● Label some data with fiverr or similar
● Give up
● Go for a drink
● Explore network graphs on the Spark Mailing list
○ Like Paco did - https://www.slideshare.net/pacoid/graph-analytics-in-spark
• 35. Oooor
● Sparklint
● Dr. Elephant
● etc.
  • 36. Sign up for the mailing list @ http://www.distributedcomputing4kids.com
• 37. High Performance Spark!
You can buy it today! Several copies!
Really not that much on ML
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.
• 38. And some upcoming talks:
● October
○ Reversim
○ ScyllaDB Summit
○ Twilio’s Signal in SF
● November
○ PyCon Canada
○ Big Data Spain
● December
○ ScalaX
• 39. k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
I want us to build better Spark testing support in 3+
Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
I’m always trying to get better at giving talks so feedback is welcome: http://bit.ly/holdenTalkFeedback