22. OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC overhead limit exceeded (Too much garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- YARN: Container is running beyond physical memory limits. Killing container. (Increase memory overhead)
- There is insufficient memory for the Java Runtime Environment to continue (Add more memory, reduce memory consumption)
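Most of the fixes in parentheses come down to a couple of settings. A minimal sketch of the knobs involved, assuming YARN and Spark 2.x (the values are placeholders, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "8g")                 // heap: Java heap space OOMs
  .set("spark.yarn.executor.memoryOverhead", "2048")  // MB of off-heap headroom per container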
23. Off-heap OOMs
java.lang.OutOfMemoryError: Direct buffer memory
  at java.nio.Bits.reserveMemory(Bits.java:658)
  at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
  …
  at parquet.hadoop.codec.…
28. Process memory
Solution: have the Java agent read its own process's memory usage directly from procfs
https://github.com/DataDog/spark-jvm-profiler
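The linked profiler's code isn't reproduced here; a minimal sketch of the idea on Linux is to read the resident set size of the current JVM straight from /proc/self/status:

import scala.io.Source

def residentSetSizeKb(): Option[Long] = {
  val status = Source.fromFile("/proc/self/status")  // procfs view of this process
  try {
    status.getLines()
      .find(_.startsWith("VmRSS:"))                  // e.g. "VmRSS:   123456 kB"
      .map(_.split("\\s+")(1).toLong)                // numeric part, reported in kB
  } finally status.close()
}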
31. OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC overhead limit exceeded (Too much garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- There is insufficient memory for the Java Runtime Environment to continue (Add more memory, reduce memory consumption)
- YARN: Container is running beyond physical memory limits. Killing container. (Increase memory overhead)
34. Lessons
- Give the job more resources than you think it needs, then scale back
- Measure the memory usage of each executor
- Keep an eye on your GC metrics
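For the GC part, one low-effort option is to turn on GC logging on the executors; a sketch assuming JDK 8 flags and a writable log path (both assumptions, adjust for your cluster):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/executor-gc.log")

Per-task GC time is also exposed in the Spark UI and in TaskMetrics.jvmGCTime.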
35. Measure slow parts
val timer = new MaxAndTotalTimeAccumulator  // custom accumulator (see sketch below)
sc.register(timer, "per-record time")

rdd.map(key => {
  val startTime = System.nanoTime()
  val result = ...  // the per-record work being measured
  val endTime = System.nanoTime()
  val millisecondsPassed = ((endTime - startTime) / 1000000).toInt
  timer.add(millisecondsPassed)
  result
})
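MaxAndTotalTimeAccumulator is not shown in the talk; a minimal sketch of such an accumulator on top of Spark 2's AccumulatorV2 (the field names and value type are assumptions) could look like:

import org.apache.spark.util.AccumulatorV2

// Tracks the slowest single record and the total time spent, in milliseconds.
class MaxAndTotalTimeAccumulator extends AccumulatorV2[Int, (Int, Long)] {
  private var maxMs = 0
  private var totalMs = 0L

  override def isZero: Boolean = maxMs == 0 && totalMs == 0L
  override def copy(): MaxAndTotalTimeAccumulator = {
    val acc = new MaxAndTotalTimeAccumulator
    acc.maxMs = maxMs
    acc.totalMs = totalMs
    acc
  }
  override def reset(): Unit = { maxMs = 0; totalMs = 0L }
  override def add(v: Int): Unit = { maxMs = math.max(maxMs, v); totalMs += v }
  override def merge(other: AccumulatorV2[Int, (Int, Long)]): Unit = other match {
    case o: MaxAndTotalTimeAccumulator =>
      maxMs = math.max(maxMs, o.maxMs)
      totalMs += o.totalMs
  }
  override def value: (Int, Long) = (maxMs, totalMs)
}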
36. Watch skewed parts
.groupByKey().flatMap({ case (key, iter) =>
  val size = iter.size
  maxAccumulator.add(key, size)  // custom accumulator tracking the largest groups
  if (size >= 100000000) {       // 100 million values under one key: drop it as skew
    log.info(s"Key $key has $size values")
    None
  } else {
    Some((key, iter))            // keep the group
  }
})
42. Lessons
- Measure the slowest parts of your job
- Count records in the most skewed parts
- Keep track of how much CPU time your job actually consumes (see the sketch below)
- Have alerting on these metrics, so you know when your job is getting slower
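For the CPU-time bullet, one option (a sketch, assuming Spark 2.1+ where TaskMetrics exposes executorCpuTime, and a SparkContext sc in scope) is a listener that sums the CPU nanoseconds reported by finished tasks:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class CpuTimeListener extends SparkListener {
  var totalCpuNanos = 0L  // listener events arrive on a single bus thread

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) totalCpuNanos += metrics.executorCpuTime  // ns of CPU the task burned
  }
}

// sc.addSparkListener(new CpuTimeListener)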
47. Spot instances mitigation
- Break the job into smaller survivable pieces
- Use `rdd.checkpoint` instead of `rdd.persist` to save data to HDFS (see the sketch below)
- Helps dynamic allocation: executors don't hold any data, so they can leave the job and join other jobs
- Losing multiple executors won't result in recomputing partitions
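A minimal sketch of the checkpoint approach (the path, `input`, and `expensiveTransform` are made up for illustration; assumes a SparkContext sc):

// Durable location so the data survives lost executors and spot terminations.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val intermediate = input.map(expensiveTransform)  // a costly stage worth protecting
intermediate.checkpoint()                         // truncates the lineage, writes to HDFS
intermediate.count()                              // an action materializes the checkpoint
// Downstream stages read the checkpointed data instead of recomputing the lineage.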
57. ExternalShuffleService
SPARK-19753 Remove all shuffle files on a host in case of slave lost or fetch failure
SPARK-20832 Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
62. Lessons
- Keep all logs
- Spark isn't super-resilient, even when only one node dies
- Monitor the number of failed tasks/stages/lost nodes (see the sketch below)
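A sketch of counting those failures inside the job itself (not from the talk; assumes a SparkContext sc): a SparkListener that tallies failed tasks and removed executors, ready to be shipped to your metrics system:

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved, SparkListenerTaskEnd}

class FailureCountingListener extends SparkListener {
  var failedTasks = 0L
  var lostExecutors = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    if (taskEnd.reason != Success) failedTasks += 1  // anything but Success counts as a failure

  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    lostExecutors += 1
}

// sc.addSparkListener(new FailureCountingListener)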
63. Late arriving partitions
rddA.cogroup(rddB, rddC).map({ case (k, (iterA, iterB, iterC)) =>
  // We should always have a one-to-one join, but who knows …
  if (iterA.toSet.size > 1)
    throw new RuntimeException(s"Key $k received more than 1 A record")
  if (iterB.toSet.size > 1)
    throw new RuntimeException(…)
  if (iterC.toSet.size > 1) …
66. Late arriving partitions
.map({ case (key, values: Iterator[(Long, Int)]) =>
  values.toList.sortBy(_._1)
  // run 1: (1L, 10), (1L, 1), (2L, 1)
  // run 2: (1L, 1), (1L, 10), (2L, 1)
  // ties on the first field keep whatever order the recomputed partition
  // happened to produce, so reruns of the same task can disagree
})
SPARK-19263 DAGScheduler should avoid sending conflicting task set
67. Late arriving partitions
.map({ case (key, values: Iterator[(Long, Int)]) =>
  values.toList.sorted
  // (1L, 1), (1L, 10), (2L, 1): sorting on both fields gives a
  // deterministic order regardless of how the partition arrived
})
71. Lessons
- Trust, but add extra checks and log everything
- Add extra idempotency even where it should already be guaranteed
- Fail the job when an unexpected situation is encountered, but also think ahead about whether such situations can be handled
- Have retries at the pipeline-scheduler level
72. Migration to Spark 2
SPARK-13850 TimSort Comparison method violates its general contract
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-14363 Executor OOM due to a memory leak in Sorter
SPARK-22033 BufferHolder, other size checks should account for the specific VM array size limitations
79. In conclusion
- Log everything
- Measure everything
- Trust but be ready
- Smaller survivable pieces
80. Thanks!
Want to work with us on Spark, Kafka, ES, and more? Come to our booth!
jobs.datadoghq.com
twitter.com/@databuryat
vadim@datadoghq.com