SlideShare uma empresa Scribd logo
1 de 46
Baixar para ler offline
Problem Solving Recipes
Learned from Supporting Spark
Justin Pihony & Stavros Kontopolous
Lightbend
Table of Contents
1. OOM1. OOM
2. NoSuchMethod
5. Strategizing
Your Joins
6. Safe Stream
Recovery
3. Perplexities of
Size
6. Safe Stream
Recovery
4. Struggles in
Speculation
5. Strategizing
Your Joins
7. Handling S3 w/o
Hanging
2. NoSuchMethod
3. Perplexities of
Size
4. Struggles in
Speculation
Prologue
Memory Problems?
Out Of Memory (OOM)
“Thrown when the Java Virtual Machine cannot allocate an
object because it is out of memory, and no more memory
could be made available by the garbage collector.”
spark.memory.fraction vs spark.memory.storageFraction
• Don’t jump straight to parameter tuning
– But if you do => Sparklint
• Be aware of execution time object creation
• Don’t jump straight to parameter tuning
– But if you do => Sparklint
OOM Tips
• Don’t jump straight to parameter tuning
rdd.mapPartitions{ iterator => // fetch remote file }
OOM Tips
• Plan the resources needed from your cluster manager
before deploying - when possible
• Plan the resources needed from your cluster manager
before deploying - when possible
YARN
– Cluster vs client mode
– yarn.nodemanager.resource.memory-mb
– yarn.scheduler.minimum-allocation-
mb
– spark.yarn.driver.memoryOverhead
EXAMPLE
OOM
//Re-partitioning may cause issues...
//Here target is to have one file as output but...
private def saveToFile(dataFrame: DataFrame,
filePath: String): Unit = {
dataFrame.repartition(1).write.
format("com.databricks.spark.csv")...save(filePath)
}
NoSuchMethod
NoSuchMethod
“Thrown if an application tries to call a specified method of a
class (either static or instance), and that class no longer
has a definition of that method. Normally, this error is
caught by the compiler; this error can only occur at run time
if the definition of a class has incompatibly changed.”
java.lang.NoSuchMethodError:
org.apache.commons.math3.util.MathArrays.natural
“...spark 1.4.1 uses math3 3.1.1 which, as it turns out,
doesn't have the natural method.”
Dependency Collision
Compiled against version A Runtime used version B
Version B did not have the method!
• Enforce lib load order
• Upgrade Spark or downgrade your library
– If you’re lucky...
• Enforce lib load order
• Upgrade Spark or downgrade your library
– If you’re lucky...
• Shade your lib
– sbt: https://github.com/sbt/sbt-assembly#shading
– Maven: https://maven.apache.org/plugins/maven-shade-plugin/
spark-submit --class "MAIN_CLASS"
--driver-class-path commons-math3-3.3.jar YOURJAR.jar
spark-submit --class "MAIN_CLASS"
--driver-class-path commons-math3-3.3.jar YOURJAR.jar
Solutions
Perplexities of Size
https://josephderosa.files.wordpress.com/2016/04/too-big-by-half.jpg
Perplexities of Size
Perplexities of Size
Struggles in Speculation
• spark.speculation.multiplier
Struggles in Speculation
• spark.speculation.interval• spark.speculation.interval
• spark.speculation.quantile
• spark.speculation.multiplier
Strategizing Your Joins
Common Issues
• Slow Joins
• Unavoidable Joins
Slow Joins
val df = largeDF.join(broadcast(smallDF), “key”)
• Avoid shuffling if one side of the join is small enough
val df = largeDF.join(broadcast(smallDF), “key”)
• Avoid shuffling if one side of the join is small enough
df.explain
df.queryExecution.executedPlan
df.explain
df.queryExecution.executedPlan
== Physical Plan ==
BroadcastHashJoin ... BuildRight
...
• Check which strategy is actually used
• For broadcast you should see :
• Check which strategy is actually used
– size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD– size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD
Join Strategies
• Shuffle hash join• Shuffle hash join
• Broadcast
• Sort merge
– spark.sql.join.preferSortMergeJoin (default = true)
• Broadcast
val df = spark.sparkContext.parallelize(
List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London"))
).toDF("GroupId", "Amount","City")
val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City"))
grouped.collect().foreach(println)
val df = spark.sparkContext.parallelize(
List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London"))
).toDF("GroupId", "Amount","City")
val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City"))
grouped.collect().foreach(println)
Unavoidable Joins
Joining is the only way to retain all related columns for max after groupBy
[Id1,10,London]
[Id2,20,Paris]
val joined = df.as("a").join(grouped.as("b"),
$"a.GroupId" === $"b.GroupId" && $"a.Amount" === $"b.max(Amount)", "inner")
joined.collect().foreach(println)
[Id1,10,London,Id1,10,London]
[Id2,20,Paris,Id2,20,Paris]
[Id2,20,London,Id2,20,Paris]
Safe Stream Recovery
val products = ssc.cassandraTable[(Int, String, Float)]
(...)
.map{ case (id, name, price) => (id, (name, price)) }
.cache
orders.transform(rdd => {
rdd.join(products)
})
.... ....
....
ERROR JobScheduler: Error running job streaming job
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations; for
example, rdd1.map(x => rdd2.values.count() * x) is invalid because the
values transformation and count action cannot be performed inside of
the rdd1.map transformation. For more information, see SPARK-5063.
ERROR JobScheduler: Error running job streaming job
org.apache.spark.SparkException: RDD transformations and actions
can only be invoked by the driver, not inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
val products = ssc.cassandraTable[(Int, String, Float)]
(...)
.map{ case (id, name, price) => (id, (name, price)) }
.cache
orders.transform(rdd => {
rdd.join(products)
})
val products = ssc.cassandraTable[(Int, String, Float)]
(...)
.map{ case (id, name, price) => (id, (name, price)) }
.cache
orders.transform(rdd => {
rdd.join(products)
})
orders.transform(rdd => {
val products = rdd.sparkContext.cassandraTable(…).map(…)
rdd.join(products)
})
orders.transform(rdd => {
val products = rdd.sparkContext.cassandraTable(…).map(…)
rdd.join(products)
})
val products = ssc.cassandraTable[(Int, String, Float)]
(...)
.map{ case (id, name, price) => (id, (name, price)) }
.cache
orders.transform(rdd => {
rdd.join(products)
})
val products = ssc.cassandraTable[(Int, String, Float)]
(...)
.map{ case (id, name, price) => (id, (name, price)) }
.cache
orders.transform(rdd => {
rdd.join(products)
})
orders.transform{ rdd => {
val sc = rdd.sparkContext
val productsOption = sc.getPersistentRDDs
.values.filter(rdd => rdd.name == "foo").headOption
val products = productsOption match {
case Some(persistedRDD) =>
persistedRDD.asInstanceOf[CassandraTableScanRDD[…]]
case None => {
val productsRDD = sc.cassandraTable(…).map(…)
productsRDD.setName("foo")
productsRDD.cache
}
}
rdd.join(products)
}
Safe Stream Recovery
orders.transform{ rdd => {
val sc = rdd.sparkContext
val productsOption = sc.getPersistentRDDs
.values.filter(rdd => rdd.name == "foo").headOption
val products = productsOption match {
case Some(persistedRDD) =>
persistedRDD.asInstanceOf[CassandraTableScanRDD[…]]
case None => {
val productsRDD = sc.cassandraTable(…).map(…)
productsRDD.setName("foo")
productsRDD.cache
}
}
rdd.join(products)
}
Handling S3 w/o Hanging
Simple Streaming
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("s3n://some/folder")
lines.print
Simple Streaming, Right???
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("s3n://some/folder")
lines.print
Wrong!!!
• Custom s3n handler• Custom s3n handler
• Increase the heap• Increase the heap
• Explicit file handling
– http://stackoverflow.com/a/34681877/779513
• Explicit file handling
– http://stackoverflow.com/a/34681877/779513
Solutions
• Pace it with a script
• Complain
• Pace it with a script
…... …...
THANK YOU.
justin.pihony@lightbend.com
stavros.kontopolous@lightbend.com

Mais conteúdo relacionado

Mais procurados

Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauSpark Summit
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarDatabricks
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Jaroslav Bachorik
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 

Mais procurados (20)

Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 

Semelhante a Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk by Justin Pihony and Stavros Kontopoulos

Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)Stavros Kontopoulos
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachAlexandre Rafalovitch
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Lucidworks
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogVadim Semenov
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 

Semelhante a Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk by Justin Pihony and Stavros Kontopoulos (20)

Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Big Search with Big Data Principles
Big Search with Big Data PrinciplesBig Search with Big Data Principles
Big Search with Big Data Principles
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark 101
Spark 101Spark 101
Spark 101
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 

Mais de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 

Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk by Justin Pihony and Stavros Kontopoulos

  • 1. Problem Solving Recipes Learned from Supporting Spark Justin Pihony & Stavros Kontopolous Lightbend
  • 2. Table of Contents 1. OOM1. OOM 2. NoSuchMethod 5. Strategizing Your Joins 6. Safe Stream Recovery 3. Perplexities of Size 6. Safe Stream Recovery 4. Struggles in Speculation 5. Strategizing Your Joins 7. Handling S3 w/o Hanging 2. NoSuchMethod 3. Perplexities of Size 4. Struggles in Speculation
  • 5. Out Of Memory (OOM) “Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector.”
  • 6.
  • 8. • Don’t jump straight to parameter tuning – But if you do => Sparklint • Be aware of execution time object creation • Don’t jump straight to parameter tuning – But if you do => Sparklint OOM Tips • Don’t jump straight to parameter tuning rdd.mapPartitions{ iterator => // fetch remote file }
  • 9. OOM Tips • Plan the resources needed from your cluster manager before deploying - when possible • Plan the resources needed from your cluster manager before deploying - when possible YARN – Cluster vs client mode – yarn.nodemanager.resource.memory-mb – yarn.scheduler.minimum-allocation- mb – spark.yarn.driver.memoryOverhead EXAMPLE
  • 10. OOM //Re-partitioning may cause issues... //Here target is to have one file as output but... private def saveToFile(dataFrame: DataFrame, filePath: String): Unit = { dataFrame.repartition(1).write. format("com.databricks.spark.csv")...save(filePath) }
  • 12. NoSuchMethod “Thrown if an application tries to call a specified method of a class (either static or instance), and that class no longer has a definition of that method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of a class has incompatibly changed.”
  • 13. java.lang.NoSuchMethodError: org.apache.commons.math3.util.MathArrays.natural “...spark 1.4.1 uses math3 3.1.1 which, as it turns out, doesn't have the natural method.” Dependency Collision Compiled against version A Runtime used version B Version B did not have the method!
  • 14. • Enforce lib load order • Upgrade Spark or downgrade your library – If you’re lucky... • Enforce lib load order • Upgrade Spark or downgrade your library – If you’re lucky... • Shade your lib – sbt: https://github.com/sbt/sbt-assembly#shading – Maven: https://maven.apache.org/plugins/maven-shade-plugin/ spark-submit --class "MAIN_CLASS" --driver-class-path commons-math3-3.3.jar YOURJAR.jar spark-submit --class "MAIN_CLASS" --driver-class-path commons-math3-3.3.jar YOURJAR.jar Solutions
  • 19. • spark.speculation.multiplier Struggles in Speculation • spark.speculation.interval• spark.speculation.interval • spark.speculation.quantile • spark.speculation.multiplier
  • 21. Common Issues • Slow Joins • Unavoidable Joins
  • 22. Slow Joins val df = largeDF.join(broadcast(smallDF), “key”) • Avoid shuffling if one side of the join is small enough val df = largeDF.join(broadcast(smallDF), “key”) • Avoid shuffling if one side of the join is small enough df.explain df.queryExecution.executedPlan df.explain df.queryExecution.executedPlan == Physical Plan == BroadcastHashJoin ... BuildRight ... • Check which strategy is actually used • For broadcast you should see : • Check which strategy is actually used
  • 23. – size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD– size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD Join Strategies • Shuffle hash join• Shuffle hash join • Broadcast • Sort merge – spark.sql.join.preferSortMergeJoin (default = true) • Broadcast
  • 24. val df = spark.sparkContext.parallelize( List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London")) ).toDF("GroupId", "Amount","City") val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City")) grouped.collect().foreach(println) val df = spark.sparkContext.parallelize( List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London")) ).toDF("GroupId", "Amount","City") val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City")) grouped.collect().foreach(println) Unavoidable Joins Joining is the only way to retain all related columns for max after groupBy [Id1,10,London] [Id2,20,Paris] val joined = df.as("a").join(grouped.as("b"), $"a.GroupId" === $"b.GroupId" && $"a.Amount" === $"b.max(Amount)", "inner") joined.collect().foreach(println) [Id1,10,London,Id1,10,London] [Id2,20,Paris,Id2,20,Paris] [Id2,20,London,Id2,20,Paris]
  • 26. val products = ssc.cassandraTable[(Int, String, Float)] (...) .map{ case (id, name, price) => (id, (name, price)) } .cache orders.transform(rdd => { rdd.join(products) })
  • 28. ....
  • 29. ERROR JobScheduler: Error running job streaming job org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
  • 30. ERROR JobScheduler: Error running job streaming job org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
  • 31. val products = ssc.cassandraTable[(Int, String, Float)] (...) .map{ case (id, name, price) => (id, (name, price)) } .cache orders.transform(rdd => { rdd.join(products) })
  • 32. val products = ssc.cassandraTable[(Int, String, Float)] (...) .map{ case (id, name, price) => (id, (name, price)) } .cache orders.transform(rdd => { rdd.join(products) })
  • 33. orders.transform(rdd => { val products = rdd.sparkContext.cassandraTable(…).map(…) rdd.join(products) })
  • 34. orders.transform(rdd => { val products = rdd.sparkContext.cassandraTable(…).map(…) rdd.join(products) })
  • 35. val products = ssc.cassandraTable[(Int, String, Float)] (...) .map{ case (id, name, price) => (id, (name, price)) } .cache orders.transform(rdd => { rdd.join(products) })
  • 36. val products = ssc.cassandraTable[(Int, String, Float)] (...) .map{ case (id, name, price) => (id, (name, price)) } .cache orders.transform(rdd => { rdd.join(products) })
  • 37. orders.transform{ rdd => { val sc = rdd.sparkContext val productsOption = sc.getPersistentRDDs .values.filter(rdd => rdd.name == "foo").headOption val products = productsOption match { case Some(persistedRDD) => persistedRDD.asInstanceOf[CassandraTableScanRDD[…]] case None => { val productsRDD = sc.cassandraTable(…).map(…) productsRDD.setName("foo") productsRDD.cache } } rdd.join(products) }
  • 38. Safe Stream Recovery orders.transform{ rdd => { val sc = rdd.sparkContext val productsOption = sc.getPersistentRDDs .values.filter(rdd => rdd.name == "foo").headOption val products = productsOption match { case Some(persistedRDD) => persistedRDD.asInstanceOf[CassandraTableScanRDD[…]] case None => { val productsRDD = sc.cassandraTable(…).map(…) productsRDD.setName("foo") productsRDD.cache } } rdd.join(products) }
  • 39. Handling S3 w/o Hanging
  • 40. Simple Streaming val ssc = new StreamingContext(sc, Seconds(10)) val lines = ssc.textFileStream("s3n://some/folder") lines.print
  • 41. Simple Streaming, Right??? val ssc = new StreamingContext(sc, Seconds(10)) val lines = ssc.textFileStream("s3n://some/folder") lines.print
  • 43.
  • 44. • Custom s3n handler• Custom s3n handler • Increase the heap• Increase the heap • Explicit file handling – http://stackoverflow.com/a/34681877/779513 • Explicit file handling – http://stackoverflow.com/a/34681877/779513 Solutions • Pace it with a script • Complain • Pace it with a script