This document provides guidelines for tuning Hadoop for performance. It discusses key factors that influence Hadoop performance like hardware configuration, application logic, and system bottlenecks. It also outlines various configuration parameters that can be tuned at the cluster and job level to optimize CPU, memory, disk throughput, and task granularity. Sample tuning gains are shown for a webmap application where tuning multiple parameters improved job execution time by up to 22%.
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
1. Tuning Hadoop for Performance
Srigurunath Chakravarthi
Performance Engineering,
Yahoo! Bangalore
Doc Ver 1.0
March 5, 2010
Yahoo! Confidential 1
2. Outline
• Why worry about performance?
• Recap of Hadoop Design
– Control Flow (Map, Shuffle, Reduce phases)
• Key performance considerations
• Thumb rules for tuning Hadoop
– Cluster level
– Application level
• Wrap up
3. Why Worry About Performance?
Why Measure/Track Performance?
• Tells you your ROI on hardware.
• Surfaces silent performance regressions from
– Faulty and “slow” (malfunctioning) disks/NICs/CPUs
– Software/Configuration Upgrades, etc.
Why Improve Performance?
• Faster results and better ROI :-)
• There are non-obvious, yet simple ways to
– Push up cluster/app performance without adding hardware
– Unlock cluster/app performance by mitigating bottlenecks
And The Good News Is… Hadoop is designed to be tunable by users
– 25+ performance influencing tunable parameters
– Cluster-wide and Job-specific controls
4. Recap of Hadoop Design
[Figure: Hadoop control flow across three nodes. Each node runs a Task Tracker hosting Map Tasks and Reduce Tasks, with HDFS and Local Disk storage.]
5. Key Performance influencing factors
Multiple Orthogonal factors
• Cluster Hardware Configuration
– # cores, RAM, # disks per node, disk speeds, network topology, etc.
Example: If your app is data intensive, can you drive sufficiently good disk throughput?
Do you have sufficient RAM (to decrease # trips to disk)?
• Application logic related
– Degree of Parallelism: M-R favors embarrassingly parallel apps
– Load Balance: Slowest tasks impact M-R job completion time.
• System Bottlenecks
– Thrashing your CPU/memory/disks/network degrades performance severely
• Resource Under-utilization
– Your app may not be pushing system limits enough.
• Scale
– Bottlenecks from centralized components (Job Tracker and Name Node).
6. Key Performance influencing factors
Tuning Opportunities
• Cluster Hardware Configuration
– Hardware Purchase/Upgrade time decision. (Outside scope of this presentation.)
• Application logic related
– Tied to app logic. (Outside scope of this presentation.)
– Countering Load Imbalance:
• Typically mitigated by adapting user algorithm to avoid “long tails”.
• Examples: Re-partitioning; Imposing per-task hard-limits on input/output sizes.
– Handling Non-Parallelism:
• Run app as a pipeline of M-R jobs. Sequential portions as single reducers.
– Record Combining:
• Map-side and reduce-side combiners
• System Bottlenecks & Resource Under-utilization
– These can be mitigated by tuning Hadoop (discussed in the rest of this presentation).
• Scale
– Relevant to large (1000+ node) clusters. (Outside scope of this presentation.)
7. System Usage Characteristics
Resource Intensiveness
M-R Step              CPU   Memory  Network  Disk  Notes
Serve Map Input       -     -       Yes*     Yes   *For remote maps (minority)
Execute Map Function  Yes*  -       -        -     *Depends on App
Store Map Output      Yes*  Yes+    -        Yes   *If compression is ON; +Memory Sensitive
Shuffle               -     Yes+    Yes      Yes   +Memory Sensitive
Execute Reduce Func.  Yes*  -       -        -     *Depends on App
Store Reduce Output   Yes*  -       Yes+     Yes   *If compression is ON; +For replication factor > 1
8. Cluster Level Tuning – CPU & Memory
Map and Reduce task execution: Pushing Up CPU Utilization
Tunables
– mapred.tasktracker.map.tasks.maximum: The maximum number of map tasks that will
be run simultaneously by a task tracker (aka “map slots” / “M”).
– mapred.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks that
will be run simultaneously by a task tracker (aka “reduce slots” / “R”).
Thumb Rules for Tuning
– Over-subscribe cores (Set total “slots” > num cores)
– Throw more slots at the dominant phase.
– Don’t exceed mem limit and hit swap! (Adjust Java heap via mapred.child.java.opts)
– Example:
– 8 cores. Assume map tasks account for 75% of CPU time.
– Per Over-subscribing rule: Total Slots (M+R) = 10 (on 8 cores)
– Per Biasing rule: Create more Map Slots than Reduce Slots. E.g., M,R = (8, 2) or (7,3)
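A minimal Python sketch of the two thumb rules in the example above. The function names and the knobs oversubscribe_pct, map_share_pct and reserved_mb are hypothetical helpers for illustration, not Hadoop settings; the resulting M and R are the values you would put into mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The memory check is rough: it counts only Java heap, and a real JVM uses somewhat more.

```python
def plan_slots(cores, oversubscribe_pct=125, map_share_pct=75):
    """Split one node's task slots between maps (M) and reduces (R).

    oversubscribe_pct and map_share_pct are illustrative assumptions.
    """
    # Over-subscribing rule: total slots exceed the core count.
    total = max(2, cores * oversubscribe_pct // 100)
    # Biasing rule: give the dominant (map) phase its share, rounded half up.
    map_slots = (total * map_share_pct + 50) // 100
    # Keep at least one slot for each phase.
    map_slots = max(1, min(map_slots, total - 1))
    return map_slots, total - map_slots


def fits_in_memory(map_slots, reduce_slots, child_heap_mb, ram_mb, reserved_mb):
    """Rough swap check: all task heaps plus a reserve for daemons and the OS."""
    return (map_slots + reduce_slots) * child_heap_mb + reserved_mb <= ram_mb


m, r = plan_slots(8)  # 8 cores, maps ~75% of CPU time
print("M,R =", (m, r))
print("1 GB heaps fit in 16 GB:", fits_in_memory(m, r, 1024, 16384, 2048))
```

With 8 cores this yields M,R = (8, 2), the first option in the example. Raising the heap to 2 GB per task on the same hypothetical 16 GB node fails the check, which is the point where slots or heap size have to come down.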
9. Cluster Level Tuning – DFS Throughput
DFS Data Read/Write: Pushing up throughput
Tunables
– dfs.block.size: The default block size for new files (aka “DFS Block Size”).
Thumb Rules for Tuning
– The default of 128 MB is normally a good size. Lower it if disk space is tight.
– Size it to avoid serving multiple blocks to a map task. May forsake data locality.
– Alternately tailor the number of map tasks at the job level.
– Example:
– If your data sets that logically go to a single map are ~180-190 MB in size, set block
size to 196 MB.
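The block-size example can be checked with a little arithmetic. This is a sketch only: blocks_spanned and block_size_bytes are hypothetical helper names, and it assumes each logical data set is a single file starting on a block boundary. dfs.block.size is specified in bytes, so the second helper converts the MB figure.

```python
def blocks_spanned(input_mb, block_mb):
    """Number of DFS blocks that an input of input_mb occupies (ceiling division)."""
    return -(-input_mb // block_mb)


def block_size_bytes(block_mb):
    """Value to use for dfs.block.size, which is given in bytes."""
    return block_mb * 1024 * 1024


for block_mb in (128, 196):
    print(block_mb, "MB blocks:", blocks_spanned(190, block_mb), "block(s) per 190 MB input")
print("dfs.block.size =", block_size_bytes(196))
```

A 190 MB data set spans two blocks at 128 MB but one block at 196 MB, so a map task reading it is served a single block.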
10. Job Level Tuning – Task Granularity
Setting optimal number of Map and Reduce tasks
Tunables
– # map tasks in your job (“m”) – controlled via input splits.
– “mapred.reduce.tasks”: # reduce tasks in your job (“r”)
Thumb Rules for Tuning
– Set # map tasks so that each map reads approximately 1 DFS block worth of data.
– Use multiple “map waves”, to hide shuffle latency.
– Look for a “sweet range” of # of waves (this is empirical).
# Reduce tasks:
– Use a single reducer wave. Second wave adds extra shuffle latency.
– Use multiple reducer waves, iff reducer task can’t scale in memory.
Num “map waves” = Total # of map tasks / Total # of map slots in cluster
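The wave formula is easy to evaluate before submitting a job. The sketch below is illustrative: the helper name waves is hypothetical, and the numbers assume a 64-node cluster in which every node runs a task tracker with 8 map slots and 3 reduce slots.

```python
import math


def waves(num_tasks, slots_per_node, nodes):
    """Num waves = total # of tasks / total # of slots in the cluster."""
    return num_tasks / (slots_per_node * nodes)


# Map side: 800 maps over 64 nodes x 8 map slots.
print("map waves:", waves(800, 8, 64))

# Reduce side: compare reducer counts against 64 nodes x 3 reduce slots.
for r in (243, 200, 150):
    w = waves(r, 3, 64)
    print(r, "reducers ->", round(w, 2), "waves, i.e.", math.ceil(w), "pass(es)")
```

Under these assumptions 243 or 200 reducers need a second pass over the 192 reduce slots, while 150 reducers finish in a single wave.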
11. Job Level Tuning – io.sort.mb
Buffering to Minimize Disk Writes
Tunables
– io.sort.mb: Size of map-side buffer to store and merge map output before spilling to
disk. (Map-side buffer)
– fs.inmemory.size.mb: Size of reduce-side buffer for storing & merging multi-map
output before spilling to disk. (Reduce-side buffer)
Thumb Rules for Tuning
– Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80% RAM across
all processes (maps, reducers, TT, DN, other)
– Set it small enough to avoid swap activity, but
– Set it large enough to minimize disk spills.
– Ensure that io.sort.factor is set large enough to allow full use of buffer space.
– Balance space for output records (default 95%) & record meta-data (5%)
• Use io.sort.spill.percent and io.sort.record.percent
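The ~80% RAM and ~70% heap rules above can be turned into a quick sizing calculation. This is a sketch under stated assumptions: child_heap_mb and sort_buffer_mb are hypothetical helper names, daemon_mb is a guessed allowance for TT, DN and other processes, and the 16 GB node with 10 slots is illustrative only.

```python
def child_heap_mb(ram_mb, total_slots, daemon_mb, ram_pct=80):
    """Per-task Java heap so that all processes together use ~ram_pct of RAM."""
    budget = ram_mb * ram_pct // 100 - daemon_mb  # RAM left for task JVMs
    return budget // total_slots


def sort_buffer_mb(heap_mb, heap_pct=70):
    """Buffer size at ~heap_pct of the task heap."""
    return heap_mb * heap_pct // 100


heap = child_heap_mb(ram_mb=16384, total_slots=10, daemon_mb=2048)
print("heap per task (MB):", heap)
print("buffer size (MB):", sort_buffer_mb(heap))
```

The heap figure would be passed as -Xmx in the task JVM options and the buffer figure used for io.sort.mb; both should then be checked against actual swap and spill behaviour, since the percentages are rules of thumb rather than guarantees.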
12. Job Level Tuning – Compression
Compression: Trades off CPU cycles to reduce disk/network traffic.
Tunables
– mapred.compress.map.output: Should intermediate map output be compressed?
– mapred.output.compress: Should final (reducer) output be compressed?
Thumb Rules for Tuning
– Turn them on unless CPU is your bottleneck.
– Use BLOCK compression: Set mapred.(map).output.compression.type to BLOCK
– LZO does better than default (Zlib) – mapred.(map).output.compression.codec
– Try Intel® IPP libraries for even better compression speed on Intel platforms.
Turn map output compression ON cluster-wide. Compression invariably improves
performance of apps handling large data on modern multi-core systems.
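As an illustration, the compression tunables above can be written out as Hadoop <property> entries. The Python helper below is a hypothetical convenience (to_site_xml is not part of Hadoop); it only renders the three properties named on this slide. The codec property is left out because the codec class depends on which compression libraries are installed on the cluster.

```python
from xml.sax.saxutils import escape

# Property names as listed on the slide; values follow its thumb rules.
COMPRESSION_PROPS = {
    "mapred.compress.map.output": "true",
    "mapred.output.compress": "true",
    "mapred.output.compression.type": "BLOCK",
}


def to_site_xml(props):
    """Render a dict of settings in the <configuration>/<property> layout."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


print(to_site_xml(COMPRESSION_PROPS))
```

The same name/value pairs can also be set per job instead of cluster-wide, which is useful for trying compression on one app before changing the site configuration.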
13. Tuning multiple parameters
• Multiple tunables for memory, CPU, disk and network.
• Only the prominent ones were covered here.
• Inter-dependent. Can’t tune them independently.
• Meta rules to help multi-tune:
– Avoid swap. Cost of swapping is high.
– Minimize spills. Spilling is not as evil as swapping.
– It generally pays to compress and to over-subscribe cores.
• Several other tunable parameters exist. Look them up in config/
– core-default.xml, mapred-default.xml, hdfs-default.xml
– core-site.xml, mapred-site.xml, hdfs-site.xml
14. Sample tuning gains for a 60-job app pipeline
(“Mini Webmap on 64 node cluster”)
Setting         #Maps (m)                #Reduces (r)  M,R slots  io.sort.mb  Job exec time (sec)  Improvement over Baseline
Baseline        Two Heaviest Apps: 1215  243           4,4        500         7682                 -
                All Other Apps: 243
Tuned1          Two Heaviest Apps: 800   243           8,3        1000        7084                 7.78%
                All Other Apps: 243
Tuned2          Two Heaviest Apps: 800   200           8,3        1000        6496                 15.43%
                All Other Apps: 200
Tuned3          Two Heaviest Apps: 800   150           8,3        1000        5689                 22.42%
                All Other Apps: 150
Contribution    major                    moderate      moderate   minor
to improvement
15. Acknowledgements
Many of the observations presented here came as learnings and insights from
• Webmap Performance Engineers @ Y!
– Mahadevan Iyer, Arvind Murthy, Rohit Jalan
• Grid Performance Engineers @ Y!
– Rajesh Balamohan, Harish Mallipeddi, Janardhana Reddy
• Hadoop Dev Engineers @ Y!
– Devaraj Das, Jothi Padmanabhan, Hemanth Yamijala
Questions: sriguru@yahoo-inc.com