Tuning Apache Spark for Large-Scale Workloads, by Gaoxiang Liu and Sital Kedia

Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you'll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You'll also learn about Facebook's new efforts toward automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.


  1. Tuning Apache Spark for large-scale workloads. Sital Kedia, Gaoxiang Liu (Facebook)
  2. Agenda
• Apache Spark at Facebook
• Scaling Spark Driver
• Scaling Spark Executor
• Scaling External Shuffle
• Application tuning
• Tools
  3. Apache Spark at Facebook
• Used for large-scale batch workloads
• Tens of thousands of jobs/day and growing
• Running on the order of thousands of nodes
• Job scalability:
• Processes hundreds of TBs of compressed input data and shuffle data
• Runs hundreds of thousands of tasks
  4. Spark Architecture
Diagram: the Driver connects to the Cluster Manager and to Executors; each worker node runs an Executor alongside a shuffle service.
  5. Scaling Spark Driver
  6. Dynamic Executor Allocation
Diagram: the Scheduler's Task Queue drives requests to the Cluster Manager to request and release executors on worker nodes, each running a shuffle service.
• Better resource utilization
• Good for multi-tenant environments
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.executorIdleTimeout = 2m
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.maxExecutors = 2000
  7. Multi-threaded Event Processor [SPARK-18838]
Before: single-threaded event processor architecture. The Task Scheduler feeds an Event Queue, drained by a single-threaded event processor that invokes all event listeners.
After: multi-threaded event processor architecture. The Task Scheduler feeds a single-threaded executor service per listener.
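The "single-threaded executor service per listener" design can be sketched as a toy Python model (illustrative only, not Spark's actual code): each listener gets its own queue and worker thread, so posting events stays cheap and one slow listener no longer stalls delivery to the others.

```python
import queue
import threading

class PerListenerEventBus:
    """Toy sketch of the SPARK-18838 idea: one private queue and one
    worker thread per listener, instead of a single shared processor."""

    def __init__(self, listeners):
        self.queues = []
        self.threads = []
        for listener in listeners:
            q = queue.Queue()
            t = threading.Thread(target=self._drain, args=(q, listener), daemon=True)
            t.start()
            self.queues.append(q)
            self.threads.append(t)

    def _drain(self, q, listener):
        # Single thread per listener preserves per-listener event order.
        while True:
            event = q.get()
            if event is None:  # shutdown sentinel
                return
            listener(event)

    def post(self, event):
        # Posting only enqueues; it never runs listener code inline.
        for q in self.queues:
            q.put(event)

    def stop(self):
        for q in self.queues:
            q.put(None)
        for t in self.threads:
            t.join()
```

A slow listener here only delays its own queue; the scheduler thread calling `post` is never blocked on listener work.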
  8. Better Fetch Failure handling
[SPARK-19753] Avoid multiple retries of stages in case of Fetch Failure
Diagram: before the fix, a single fetch failure causes multiple stage retries; after, it causes a single retry.
  9. Better Fetch Failure handling
• Avoid duplicate task run in case of Fetch Failure (SPARK-20163)
• Configurable max number of Fetch Failures (SPARK-13369)
• Ongoing effort (SPARK-20178)
spark.max.fetch.failures.per.stage = 10
  10. Scaling Spark Driver: tune RPC server threads
• Frequent driver OOM when running many tasks in parallel
• Huge backlog of RPC requests built up on the Netty server of the driver
• Increase RPC server threads to fix the OOM
spark.rpc.io.serverThreads = 64
  11. Scaling Spark Executor
  12. Executor memory layout
• Shuffle Memory: spark.memory.fraction * (spark.executor.memory - 300 MB)
• User Memory: (1 - spark.memory.fraction) * (spark.executor.memory - 300 MB)
• Reserved Memory: 300 MB
• Memory Buffer: spark.yarn.executor.memoryOverhead = 0.1 * (spark.executor.memory)
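As a worked example of the layout above, here is the split for a hypothetical 4 GB executor with spark.memory.fraction = 0.6 (the executor size and fraction are illustrative values, not figures from the talk):

```python
# Illustrative executor sizing; formulas match the slide above.
executor_memory_mb = 4096      # spark.executor.memory (assumed for this example)
reserved_mb = 300              # fixed Reserved Memory
memory_fraction = 0.6          # spark.memory.fraction (assumed default)

usable_mb = executor_memory_mb - reserved_mb               # 3796 MB
shuffle_memory_mb = memory_fraction * usable_mb            # Shuffle (execution/storage) Memory
user_memory_mb = (1 - memory_fraction) * usable_mb         # User Memory
overhead_mb = 0.1 * executor_memory_mb                     # spark.yarn.executor.memoryOverhead
```

With these inputs the executor ends up with roughly 2278 MB of shuffle memory, 1518 MB of user memory, and a 410 MB off-heap overhead buffer on top.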
  13. Tuning memory configurations: enable off-heap memory
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 3g
spark.executor.memory = 3g
spark.yarn.executor.memoryOverhead = 0.1 * (spark.executor.memory + spark.memory.offHeap.size)
  14. Tuning memory configurations: garbage collection tuning
• Large contiguous in-memory buffers are allocated by Spark's shuffle internals
• G1GC suffers from fragmentation due to Humongous Allocations if an object is larger than 32 MB (the maximum region size of G1GC)
• Use Parallel GC instead of G1GC
spark.executor.extraJavaOptions = -XX:ParallelGCThreads=4 -XX:+UseParallelGC
  15. Eliminating disk I/O bottleneck
Diagram: in-memory records are sorted and spilled to temporary spill files on disk, which are then merged per shuffle partition into the final shuffle files on disk.
  16. Eliminating disk I/O bottleneck: tune shuffle file buffer
• Disk access is 10 to 100K times slower than memory access
• Make write buffer sizes for disk I/O configurable (SPARK-20074)
• Amortize disk I/O cost by doing buffered read/write
spark.shuffle.file.buffer = 1 MB
spark.unsafe.sorter.spill.reader.buffer.size = 1 MB
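The buffered-write amortization described above can be demonstrated with a small Python sketch (a toy model with illustrative record counts and sizes, not Spark code): the larger the write buffer, the fewer raw disk writes are issued for the same amount of data.

```python
import io

class CountingDisk(io.RawIOBase):
    """Toy raw 'disk' that counts how many low-level write calls it receives."""
    def __init__(self):
        self.write_calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.write_calls += 1
        return len(b)

def spill_records(buffer_size):
    """Write 10,000 records of 100 bytes (~1 MB) through a buffer of the given size."""
    disk = CountingDisk()
    out = io.BufferedWriter(disk, buffer_size=buffer_size)
    for _ in range(10_000):
        out.write(b"x" * 100)
    out.flush()
    return disk.write_calls

small_buffer_writes = spill_records(4 * 1024)       # 4 KB buffer: hundreds of writes
large_buffer_writes = spill_records(1024 * 1024)    # 1 MB, as in spark.shuffle.file.buffer
```

The 1 MB buffer absorbs the whole spill and issues a single raw write, while the 4 KB buffer flushes hundreds of times; the same effect is what the larger shuffle file buffer buys on real disks.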

  17. Eliminating disk I/O bottleneck
[SPARK-20014] Optimize spill files merging
Diagram: temporary spill files on disk are streamed through in-memory spill file buffers and merged in memory, per shuffle partition, into the final shuffle file on disk.
spark.file.transferTo = false
spark.shuffle.file.buffer = 1 MB
spark.shuffle.unsafe.file.output.buffer = 5 MB
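The idea behind the optimized spill merging (buffered readers over each spill file, merged in memory into one output file) can be sketched as a k-way merge in Python. This is a simplified illustration of the technique, not Spark's implementation; file formats and buffer sizes here are toy choices.

```python
import heapq
import os
import tempfile

def write_spill(records, directory):
    """Write one sorted spill file (toy stand-in for a sorter spill)."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".spill")
    with os.fdopen(fd, "w") as f:
        for record in sorted(records):
            f.write(record + "\n")
    return path

def merge_spills(spill_paths, out_path, buffer_size=1 << 20):
    """K-way merge of sorted spill files into the final file, streaming
    through buffered readers and a buffered writer (one record per file
    held in memory at a time)."""
    readers = [open(p, buffering=buffer_size) for p in spill_paths]
    try:
        with open(out_path, "w", buffering=buffer_size) as out:
            # heapq.merge consumes the already-sorted inputs lazily.
            for line in heapq.merge(*readers):
                out.write(line)
    finally:
        for reader in readers:
            reader.close()
```

Because each input is already sorted, the merge reads every spill file exactly once, sequentially, through its buffer, which is what makes the disk access pattern cheap.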
  18. Eliminating disk I/O bottleneck: tune compression block size
Chart: size of shuffle files (roughly 130 to 230 TB) against compression block sizes of 32 kb, 128 kb, 512 kb, and 1 mb
• Default compression block size of 32 kb is sub-optimal
• Up to 20% reduction in shuffle/spill file size by increasing the block size
spark.io.compression.lz4.blockSize = 512KB
  19. Various memory leak fixes and improvements
• Memory leak fixes (SPARK-14363, SPARK-17113, SPARK-18208)
• Snappy optimization (SPARK-14277)
• Reduce update frequency of shuffle bytes written metrics (SPARK-15569)
• Configurable initial buffer size for Sorter (SPARK-15958)
  20. Scaling External Shuffle Service
  21. Cache Index files on Shuffle Server [SPARK-15074]
Before: for each shuffle fetch from a Reducer, the shuffle service re-reads the index file and then reads the requested partition.
After: the shuffle service reads and caches the index file once, so subsequent fetches only read the partitions.
spark.shuffle.service.index.cache.entries = 2048
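The index caching idea can be illustrated with a small Python sketch (a toy model, not the shuffle service's code; the index contents and names here are invented): memoize the index lookup so repeated fetches for the same map output skip the index file read.

```python
from functools import lru_cache

index_file_reads = {"count": 0}

@lru_cache(maxsize=2048)  # mirrors spark.shuffle.service.index.cache.entries
def load_index(map_output_id):
    """Load the index for one map output: partition id -> (offset, length).
    The disk read is simulated here; a real index file stores the byte
    offsets of each reduce partition within the shuffle data file."""
    index_file_reads["count"] += 1
    return tuple((p, (p * 100, 100)) for p in range(4))

def fetch_partition(map_output_id, partition):
    """Serve a shuffle fetch: consult the (cached) index, return the block's location."""
    offset, length = dict(load_index(map_output_id))[partition]
    return offset, length
```

With the cache in place, many reducers fetching different partitions of the same map output trigger only one index read, which is exactly the repeated work SPARK-15074 eliminates.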
  22. Scaling External Shuffle Service
• Tune shuffle service worker threads and backlog
• Configurable shuffle registration timeout and retry (SPARK-20640)
spark.shuffle.io.serverThreads = 128
spark.shuffle.io.backLog = 8192
spark.shuffle.registration.timeout = 2m
spark.shuffle.registration.maxAttempts = 5
  23. https://code.facebook.com/posts/1671373793181703
  24. Application tuning
  25. Motivation
• Improve job latency (with the same amount of resources)
• Improve usability: eliminate manual tuning as much as possible while achieving job performance comparable to manually tuned parameters
  26. Auto tuning of mapper and reducer count
• Heuristics-based approach based on table input size
• Max cap due to the scalability constraints of the shuffle service and drivers
• Min cap due to the minimum guarantee of resources to a user's job
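The heuristic above can be sketched as a clamp between the two caps (a minimal sketch; the target size per reducer and the cap values are illustrative placeholders, not the values Facebook uses):

```python
def auto_reducer_count(input_bytes,
                       target_bytes_per_reducer=512 * 1024 * 1024,  # assumed target
                       min_reducers=50,                             # min cap (resource guarantee)
                       max_reducers=5000):                          # max cap (shuffle/driver scalability)
    """Heuristic sketch: derive reducer parallelism from table input size,
    then clamp between the min and max caps described on the slide."""
    desired = -(-input_bytes // target_bytes_per_reducer)  # ceiling division
    return max(min_reducers, min(max_reducers, desired))
```

For example, a 1 TB input targets 2048 reducers, a tiny input is lifted to the 50-reducer floor, and a huge input is capped at 5000 to protect the shuffle service and driver.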
  27. Tools
  28. Tools: Spark UI metrics
  29. Tools: Flame Graph
Diagram: a jstack aggregator service on each worker periodically collects jstack/perf samples from the executors, filters executor threads, and feeds the result into flame graphs.
  30. Tools: analysis of task metrics using Facebook's Scuba
Diagram: the Task Scheduler feeds an Event Queue whose event listeners forward task metrics to Scuba.
• Enables us to do complex queries like:
  • How many jobs failed because of OOM in ExternalSorter in the last hour?
  • What percentage of total execution time is spent in shuffle read?
  • Did the fetch failure rate go up after the last Spark release?
• Set up monitoring and alerting to catch regressions
  31. Resources
• Scuba: Diving into Data at Facebook
• Apache Spark @Scale: A 60 TB+ production use case
  32. Questions?
