22. OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC overhead limit exceeded (Too much garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- YARN: Container is running beyond physical memory limits. Killing container. (Increase memory overhead)
- There is insufficient memory for the Java Runtime Environment to continue (Add more memory, reduce memory consumption)
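Most of the fixes in parentheses come down to a couple of settings. A minimal sketch of the knobs involved, assuming YARN and Spark 2.x (the values are placeholders, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "8g")                 // heap: Java heap space OOMs
  .set("spark.yarn.executor.memoryOverhead", "2048")  // MB of off-heap headroom per container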
23. Off-heap OOMs
java.lang.OutOfMemoryError: Direct buffer memory
  at java.nio.Bits.reserveMemory(Bits.java:658)
  at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
  …
  at parquet.hadoop.codec.…
28. Process memory
Solution: have the Java agent read its own process's memory usage directly from procfs
https://github.com/DataDog/spark-jvm-profiler
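The linked profiler's code isn't reproduced here; a minimal sketch of the idea on Linux is to read the resident set size of the current JVM straight from /proc/self/status:

import scala.io.Source

def residentSetSizeKb(): Option[Long] = {
  val status = Source.fromFile("/proc/self/status")  // procfs view of this process
  try {
    status.getLines()
      .find(_.startsWith("VmRSS:"))                  // e.g. "VmRSS:   123456 kB"
      .map(_.split("\\s+")(1).toLong)                // numeric part, reported in kB
  } finally status.close()
}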
31. OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC overhead limit exceeded (Too much garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- There is insufficient memory for the Java Runtime Environment to continue (Add more memory, reduce memory consumption)
- YARN: Container is running beyond physical memory limits. Killing container. (Increase memory overhead)
34. Lessons
- Give the job more resources than you think it needs, then scale back
- Measure the memory usage of each executor
- Keep an eye on your GC metrics
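For the GC part, one low-effort option is to turn on GC logging on the executors; a sketch assuming JDK 8 flags and a writable log path (both assumptions, adjust for your cluster):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/executor-gc.log")

Per-task GC time is also exposed in the Spark UI and in TaskMetrics.jvmGCTime.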
35. Measure slow parts
val timer = new MaxAndTotalTimeAccumulator  // custom accumulator (see sketch below)
sc.register(timer, "per-record time")

rdd.map(key => {
  val startTime = System.nanoTime()
  val result = ...  // the per-record work being measured
  val endTime = System.nanoTime()
  val millisecondsPassed = ((endTime - startTime) / 1000000).toInt
  timer.add(millisecondsPassed)
  result
})
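MaxAndTotalTimeAccumulator is not shown in the talk; a minimal sketch of such an accumulator on top of Spark 2's AccumulatorV2 (the field names and value type are assumptions) could look like:

import org.apache.spark.util.AccumulatorV2

// Tracks the slowest single record and the total time spent, in milliseconds.
class MaxAndTotalTimeAccumulator extends AccumulatorV2[Int, (Int, Long)] {
  private var maxMs = 0
  private var totalMs = 0L

  override def isZero: Boolean = maxMs == 0 && totalMs == 0L
  override def copy(): MaxAndTotalTimeAccumulator = {
    val acc = new MaxAndTotalTimeAccumulator
    acc.maxMs = maxMs
    acc.totalMs = totalMs
    acc
  }
  override def reset(): Unit = { maxMs = 0; totalMs = 0L }
  override def add(v: Int): Unit = { maxMs = math.max(maxMs, v); totalMs += v }
  override def merge(other: AccumulatorV2[Int, (Int, Long)]): Unit = other match {
    case o: MaxAndTotalTimeAccumulator =>
      maxMs = math.max(maxMs, o.maxMs)
      totalMs += o.totalMs
  }
  override def value: (Int, Long) = (maxMs, totalMs)
}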
36. Watch skewed parts
.groupByKey().flatMap({ case (key, iter) =>
  val size = iter.size
  maxAccumulator.add(key, size)  // custom accumulator tracking the largest groups
  if (size >= 100000000) {       // 100 million values under one key: drop it as skew
    log.info(s"Key $key has $size values")
    None
  } else {
    Some((key, iter))            // keep the group
  }
})
42. Lessons
- Measure the slowest parts of your job
- Count records in the most skewed parts
- Keep track of how much CPU time your job actually consumes (see the sketch below)
- Have alerting on these metrics, so you know when your job is getting slower
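For the CPU-time bullet, one option (a sketch, assuming Spark 2.1+ where TaskMetrics exposes executorCpuTime, and a SparkContext sc in scope) is a listener that sums the CPU nanoseconds reported by finished tasks:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class CpuTimeListener extends SparkListener {
  var totalCpuNanos = 0L  // listener events arrive on a single bus thread

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) totalCpuNanos += metrics.executorCpuTime  // ns of CPU the task burned
  }
}

// sc.addSparkListener(new CpuTimeListener)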
47. Spot instances mitigation
- Break the job into smaller survivable pieces
- Use `rdd.checkpoint` instead of `rdd.persist` to save data to HDFS (see the sketch below)
- Helps dynamic allocation: executors don't hold any data, so they can leave the job and join other jobs
- Losing multiple executors won't result in recomputing partitions
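A minimal sketch of the checkpoint approach (the path, `input`, and `expensiveTransform` are made up for illustration; assumes a SparkContext sc):

// Durable location so the data survives lost executors and spot terminations.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val intermediate = input.map(expensiveTransform)  // a costly stage worth protecting
intermediate.checkpoint()                         // truncates the lineage, writes to HDFS
intermediate.count()                              // an action materializes the checkpoint
// Downstream stages read the checkpointed data instead of recomputing the lineage.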
57. ExternalShuffleService
SPARK-19753 Remove all shuffle files on a host in case of slave lost or fetch failure
SPARK-20832 Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
62. Lessons
- Keep all logs
- Spark isn't super-resilient, even when only one node dies
- Monitor the number of failed tasks/stages/lost nodes (see the sketch below)
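A sketch of counting those failures inside the job itself (not from the talk; assumes a SparkContext sc): a SparkListener that tallies failed tasks and removed executors, ready to be shipped to your metrics system:

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved, SparkListenerTaskEnd}

class FailureCountingListener extends SparkListener {
  var failedTasks = 0L
  var lostExecutors = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    if (taskEnd.reason != Success) failedTasks += 1  // anything but Success counts as a failure

  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    lostExecutors += 1
}

// sc.addSparkListener(new FailureCountingListener)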
63. Late arriving partitions
rddA.cogroup(rddB, rddC).map({ case (k, (iterA, iterB, iterC)) =>
  // We should always have a one-to-one join, but who knows …
  if (iterA.toSet.size > 1)
    throw new RuntimeException(s"Key $k received more than 1 A record")
  if (iterB.toSet.size > 1)
    throw new RuntimeException(…)
  if (iterC.toSet.size > 1) …
66. Late arriving partitions
.map({ case (key, values: Iterator[(Long, Int)]) =>
  values.toList.sortBy(_._1)
  // run 1: (1L, 10), (1L, 1), (2L, 1)
  // run 2: (1L, 1), (1L, 10), (2L, 1)
  // ties on the first field keep whatever order the recomputed partition
  // happened to produce, so reruns of the same task can disagree
})
SPARK-19263 DAGScheduler should avoid sending conflicting task set
67. Late arriving partitions
.map({ case (key, values: Iterator[(Long, Int)]) =>
  values.toList.sorted
  // (1L, 1), (1L, 10), (2L, 1): sorting on both fields gives a
  // deterministic order regardless of how the partition arrived
})
71. Lessons
- Trust, but add extra checks and log everything
- Add extra idempotency even where it should already be guaranteed
- Fail the job when an unexpected situation is encountered, but also think ahead about whether such situations can be handled
- Have retries at the pipeline-scheduler level
72. Migration to Spark 2
SPARK-13850 TimSort Comparison method violates its general contract
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-14363 Executor OOM due to a memory leak in Sorter
SPARK-22033 BufferHolder, other size checks should account for the specific VM array size limitations
79. In conclusion
- Log everything
- Measure everything
- Trust but be ready
- Smaller survivable pieces
80. Thanks!
Want to work with us on Spark, Kafka, ES, and more? Come to our booth!
jobs.datadoghq.com
twitter.com/@databuryat
vadim@datadoghq.com