Thanks to Spark, writing big data applications has never been easier…at least until they stop being easy! At Lightbend we’ve helped our customers out of a number of hidden Spark pitfalls. Some crop up often: the ever-persistent OutOfMemoryError, the confusing NoSuchMethodError, shuffle and partition management, and so on. Others occur less frequently: an obscure configuration affecting SQL broadcasts, struggles with speculative execution, a failing stream recovery due to RDD joins, S3 file reads that hang, and more. All are intriguing! In this session we will provide insights into their origins and show how you can avoid making the same mistakes. Whether you are a seasoned Spark developer or a novice, you should learn some new tips and tricks that could save you hours or even days of debugging.
5. Out Of Memory (OOM)
“Thrown when the Java Virtual Machine cannot allocate an
object because it is out of memory, and no more memory
could be made available by the garbage collector.”
8. OOM Tips
• Be aware of execution-time object creation

rdd.mapPartitions { iterator => // fetch remote file }

• Don’t jump straight to parameter tuning
– But if you do => Sparklint
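The mapPartitions snippet above is the key to taming execution-time object creation: build expensive objects once per partition rather than once per record. A minimal sketch, where HttpClient is a hypothetical stand-in for any costly resource (HTTP client, DB connection, parser):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical costly resource; stands in for an HTTP client, DB connection, etc.
class HttpClient { def fetch(url: String): String = s"GET $url" }

def fetchAll(urls: RDD[String]): RDD[String] =
  // One client is built per partition and reused for the whole iterator,
  // instead of one per record as `urls.map(u => new HttpClient().fetch(u))` would do.
  urls.mapPartitions { iterator =>
    val client = new HttpClient()
    iterator.map(url => client.fetch(url))
  }
```

Per-record allocation churns the heap and pressures the garbage collector; per-partition allocation keeps the object count proportional to the number of tasks, not the number of records.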
9. OOM Tips
• Plan the resources needed from your cluster manager before deploying - when possible
YARN
– Cluster vs client mode
– yarn.nodemanager.resource.memory-mb
– yarn.scheduler.minimum-allocation-mb
– spark.yarn.driver.memoryOverhead
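The YARN settings above interact with Spark's own memory flags at submission time. A hypothetical submission showing where each knob lives (the values are illustrative, not recommendations):

```shell
# Illustrative values only; size these from your cluster's NodeManager limits.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --conf spark.yarn.driver.memoryOverhead=512 \
  my-app.jar
```

Each container (executor memory plus overhead) must fit within yarn.nodemanager.resource.memory-mb, and YARN rounds every allocation up to a multiple of yarn.scheduler.minimum-allocation-mb, so undersized containers can silently waste memory.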
10. OOM EXAMPLE
// Re-partitioning may cause issues...
// Here the target is to have one file as output, but...
private def saveToFile(dataFrame: DataFrame, filePath: String): Unit = {
  dataFrame.repartition(1).write
    .format("com.databricks.spark.csv")...save(filePath)
}
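repartition(1) funnels the whole DataFrame through a single task, so one executor must buffer and write everything; on large data this is a classic OOM trigger. A safer sketch (using the same CSV writer as above) keeps the natural partitioning and merges the part files outside the job:

```scala
import org.apache.spark.sql.DataFrame

// Sketch: write one part file per partition instead of forcing a single partition.
private def saveToDir(dataFrame: DataFrame, dirPath: String): Unit =
  dataFrame.write
    .format("com.databricks.spark.csv")
    .save(dirPath) // produces part-* files, one per partition

// If a single file is really needed, merge afterwards outside the job, e.g.:
//   hadoop fs -getmerge <dirPath> output.csv
```

The merge then happens on one machine's disk rather than in one executor's heap.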
12. NoSuchMethod
“Thrown if an application tries to call a specified method of a
class (either static or instance), and that class no longer
has a definition of that method. Normally, this error is
caught by the compiler; this error can only occur at run time
if the definition of a class has incompatibly changed.”
22. Slow Joins
• Avoid shuffling if one side of the join is small enough

val df = largeDF.join(broadcast(smallDF), "key")

• Check which strategy is actually used

df.explain
df.queryExecution.executedPlan

• For broadcast you should see:

== Physical Plan ==
BroadcastHashJoin ... BuildRight
...
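Besides the explicit hint, broadcast behavior is steered by spark.sql.autoBroadcastJoinThreshold: tables below this size (in bytes) are broadcast automatically, and -1 disables auto-broadcast entirely. A sketch of checking the plan end to end, with made-up example data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Sketch, assuming a local SparkSession; app name and data are illustrative.
val spark = SparkSession.builder()
  .appName("broadcast-join-check")
  .master("local[*]")
  // Tables smaller than this (bytes) are auto-broadcast; -1 disables it.
  .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
  .getOrCreate()
import spark.implicits._

val largeDF = (1 to 1000000).toDF("key")
val smallDF = Seq((1, "a"), (2, "b")).toDF("key", "label")

// Explicit broadcast hint: requests BroadcastHashJoin regardless of size statistics.
val joined = largeDF.join(broadcast(smallDF), "key")
joined.explain() // look for "BroadcastHashJoin ... BuildRight" in the output
```

If the plan shows SortMergeJoin instead, the small side exceeded the threshold or its size statistics were missing; the explicit hint is the reliable way to force the strategy.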
29. ERROR JobScheduler: Error running job streaming job
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations; for
example, rdd1.map(x => rdd2.values.count() * x) is invalid because the
values transformation and count action cannot be performed inside of
the rdd1.map transformation. For more information, see SPARK-5063.
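SPARK-5063 in practice: the nested RDD reference has to be replaced by a value computed on the driver. A sketch of the broken pattern and two common fixes (the function and variable names are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def scaleByCount(sc: SparkContext,
                 rdd1: RDD[Long],
                 rdd2: RDD[(String, Long)]): RDD[Long] = {
  // Invalid: rdd2 would be referenced inside rdd1's map, which runs on executors:
  //   rdd1.map(x => rdd2.values.count() * x)

  // Fix 1: run the action on the driver, then capture a plain Long in the closure.
  val n = rdd2.values.count()
  rdd1.map(x => n * x)
}

// Fix 2, for lookup-style access: broadcast a collected map instead of the RDD.
//   val lookup = sc.broadcast(rdd2.collectAsMap())
//   keys.map(k => lookup.value.getOrElse(k, 0L))
```

Both fixes move the RDD operation to the driver so that only serializable plain values, never RDD handles, are captured by the executor-side closure.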
44. Solutions
• Custom s3n handler
• Increase the heap
• Explicit file handling
– http://stackoverflow.com/a/34681877/779513
• Pace it with a script
• Complain