Hadoop Summit 2012 | Optimizing MapReduce Job Performance

2. Introductions
• Software Engineer at Cloudera since 2009
• Committer and PMC member on HDFS, MapReduce, and HBase
• Spend lots of time looking at full-stack performance
• This talk is to help you develop faster jobs
  – If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com
©2011 Cloudera, Inc. All Rights Reserved.
3. Aspects of Performance
• Algorithmic performance
  – big-O, join strategies, data structures, asymptotes
• Physical performance
  – Hardware (disks, CPUs, etc.)
• Implementation performance
  – Efficiency of code, avoiding extra work
  – Make good use of available physical performance
4. Performance fundamentals
• You can’t tune what you don’t understand
  – MR’s strength as a framework is its black-box nature
  – To get optimal performance, you have to understand the internals
• This presentation: understanding the black box
5. Performance fundamentals (2)
• You can’t improve what you can’t measure
  – Ganglia/Cacti/Cloudera Manager/etc. are a must
  – Top 4 metrics: CPU, memory, disk, network
  – MR job metrics: slot-seconds, CPU-seconds, task wall-clock times, and I/O
• Before you start: run jobs, gather data
6. Graphing bottlenecks
[Cluster metric graphs: CPU, memory, and network utilization over the course of a job]
• This job might be CPU-bound in the map phase, but most jobs are not CPU-bound
• Plenty of free RAM: perhaps we can make better use of it?
• Fairly flat-topped network graph: a bottleneck?
7. Performance tuning cycle
Run job → Identify bottleneck → Address bottleneck → (run again)
• Identify bottleneck: graphs, job counters, job logs, profiler results
• Address bottleneck: tune configs, improve code, rethink algorithms
In order to understand these metrics and make changes, you need to understand MR internals.
8. MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
9. MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
10. Map-side sort/spill overview
• Goal: when complete, the map task outputs one sorted file
• What happens when you call OutputCollector.collect()?
  1. An in-memory buffer (MapOutputBuffer) holds the serialized, unsorted key-value pairs
  2. The output buffer fills up: its contents are sorted, partitioned, and spilled to disk as an IFile
  3. The map task finishes: all IFiles are merged (map-side merge) into a single IFile per task
11. Zooming further: MapOutputBuffer (Hadoop 1.0)
• kvoffsets: one indirect-sort index per record (4 bytes/rec)
• kvindices: one (Partition, KOff, VOff) triple per record (12 bytes/rec)
• Together the metadata buffers occupy io.sort.record.percent * io.sort.mb
• kvbuffer: raw, serialized (Key, Val) pairs (R bytes/rec), occupying (1 - io.sort.record.percent) * io.sort.mb
12. MapOutputBuffer spill behavior
• Memory is limited: must spill
  – If either the kvbuffer or the metadata buffers fill up, “spill” to disk
  – In fact, we spill (in another thread) before the buffer is full: configure io.sort.spill.percent
• Performance impact
  – If we spill more than once, we must re-read and re-write all data: 3x the IO!
  – #1 goal for map task optimization: spill once!
13. Spill counters on map tasks
• Ratio of Spilled Records to Map Output Records
  – If unequal, you are doing more than one spill
• FILE: Number of bytes read/written
  – Gives a sense of the I/O amplification due to spilling
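A minimal sketch of the check these counters enable. The counter values here are hypothetical; real values come from the JobTracker UI or job history, and the helper name is mine, not a Hadoop API.

```java
public class SpillCheck {
    // Ratio of records spilled to records emitted by the mapper.
    // 1.0 means every record was spilled exactly once (the goal);
    // anything above 1.0 means a multi-spill, re-merged map output.
    static double spillRatio(long spilledRecords, long mapOutputRecords) {
        return (double) spilledRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        // Hypothetical counter values read from a finished job
        long mapOutputRecords = 1_500_000L;
        long spilledRecords = 3_000_000L;
        double ratio = spillRatio(spilledRecords, mapOutputRecords);
        System.out.println(ratio > 1.0
                ? "multiple spills, ratio = " + ratio + ": consider tuning"
                : "single spill: good");
    }
}
```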
14. Spill logs on map tasks
2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
• “record full = true” indicates that the metadata buffers filled up before the data buffer
• 3 spills total! Maybe we can do better?
15. Tuning to reduce spills
• Parameters:
  – io.sort.mb: total buffer space
  – io.sort.record.percent: proportion between metadata buffers and key/value data
  – io.sort.spill.percent: threshold at which spill is triggered
  – Total map output generated: can you use more compact serialization?
• Optimal settings depend on your data and available RAM!
16. Setting io.sort.record.percent
• Common mistake: metadata buffers fill up way before the kvdata buffer
• Optimal setting:
  – io.sort.record.percent = 16/(16 + R)
  – R = average record size: divide the “Map Output Bytes” counter by the “Map Output Records” counter
• Default (0.05) is usually too low (optimal for ~300-byte records)
• Hadoop 2.0: this is no longer necessary!
  – See MAPREDUCE-64 for gory details
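The formula above, sketched as code. The 16-byte constant is the per-record metadata from the MapOutputBuffer slide (12 bytes in kvindices + 4 in kvoffsets); the 100-byte average record size is a hypothetical input.

```java
public class RecordPercent {
    // Optimal io.sort.record.percent for an average record size of R bytes:
    // each record needs 16 bytes of metadata alongside its R data bytes.
    static double optimal(double avgRecordBytes) {
        return 16.0 / (16.0 + avgRecordBytes);
    }

    public static void main(String[] args) {
        // R = "Map Output Bytes" / "Map Output Records" from the job counters
        double r = 100.0; // hypothetical average record size
        System.out.printf("io.sort.record.percent = %.3f%n", optimal(r));
    }
}
```

Note that the default of 0.05 corresponds to R ≈ 304 bytes (16/0.05 - 16), matching the "~300-byte records" remark above.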
17. Tuning Example (terasort)
• Map input size = output size
  – 128MB block = 1,342,177 records, each 100 bytes
  – Metadata: 16 * 1,342,177 ≈ 20.5MB
• io.sort.mb
  – 128MB data + 20.5MB meta ≈ 148.5MB
• io.sort.record.percent
  – 16/(16+100) = 0.138
• io.sort.spill.percent = 1.0
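Re-running the slide's arithmetic as a sketch (the block size, record size, and 16-byte metadata overhead are from the deck; the helper names and rounding are mine):

```java
public class TerasortTuning {
    // One map task processes one HDFS block of fixed-size records.
    static long records(long blockBytes, int recordBytes) {
        return blockBytes / recordBytes;
    }

    // Per-record metadata: 12 bytes (kvindices) + 4 bytes (kvoffsets).
    static long metaBytes(long records) {
        return 16L * records;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;       // 128MB map input
        long recs = records(block, 100);       // 1,342,177 terasort records
        double totalMiB = (block + metaBytes(recs)) / (1024.0 * 1024.0);
        double recordPercent = 16.0 / (16 + 100);
        System.out.printf("io.sort.mb ~ %.1f MiB, io.sort.record.percent = %.3f%n",
                totalMiB, recordPercent);
    }
}
```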
18. More tips on spill tuning
• Biggest win is going from 2 spills to 1 spill
  – 3 spills is approximately the same speed as 2 spills (same IO amplification)
• Calculate whether it’s even possible, given your heap size
  – io.sort.mb has to fit within your Java heap (plus whatever RAM your Mapper needs, plus ~30% for overhead)
• Only bother if this is the bottleneck!
  – Look at map task logs: if the merge step at the end takes a fraction of a second, it’s not worth it!
  – Typically has the most impact on jobs with a big shuffle (sort/dedup)
19. MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
20. Reducer fetch tuning
• Reducers fetch map output via HTTP
• Tuning parameters:
  – Server side: tasktracker.http.threads
  – Client side: mapred.reduce.parallel.copies
• Turns out this is not so interesting
  – Follow the best practices from Hadoop: The Definitive Guide
21. Improving fetch bottlenecks
• Reduce intermediate data
  – Implement a Combiner: less data transfers faster
  – Enable intermediate compression: Snappy is easy to enable; trades off some CPU for less IO/network
• Double-check for network issues
  – Frame errors, NICs auto-negotiated to 100mbit, etc.: one or two slow hosts can bottleneck a job
  – Tell-tale sign: all maps are done, and reducers sit in the fetch stage for many minutes (look at logs)
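What a Combiner buys you, sketched in plain JDK code rather than the Hadoop API: partial aggregation on the map side means fewer records cross the network. This word-count example and its data are illustrative, not from the deck.

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // Collapse repeated keys into partial counts before the shuffle,
    // the way a word-count Combiner would run over one spill's output.
    static Map<String, Integer> combine(String[] mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : mapOutputKeys) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] out = {"hadoop", "mapreduce", "hadoop", "hadoop"};
        // 4 map output records shrink to 2 shuffled records
        System.out.println(combine(out));
    }
}
```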
22. MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
23. Reducer merge (Hadoop 1.0)
• Remote map outputs are fetched via HTTP; if a segment fits in RAM, the RAMManager fetches it to RAM, otherwise it is fetched to local disk
  1. Data accumulated in RAM is merged to disk IFiles (RAM-to-disk merges)
  2. If too many disk files accumulate, they are re-merged (disk-to-disk merges)
  3. Segments from RAM and disk are merged, through a merged IFile iterator, into the reducer code
24. Reducer merge triggers
• RAMManager
  – Total buffer size: mapred.job.shuffle.input.buffer.percent (default 0.70, percentage of reducer heap size)
• Mem-to-disk merge triggers:
  – RAMManager is mapred.job.shuffle.merge.percent % full (default 0.66)
  – Or mapred.inmem.merge.threshold segments accumulated (default 1000)
• Disk-to-disk merge
  – io.sort.factor on-disk segments pile up (fairly rare)
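The three triggers above as a job-configuration fragment, shown at the Hadoop 1.0 defaults quoted on the slide (so this fragment changes nothing; edit the values to tune):

```xml
<!-- Reducer merge triggers (values are the 1.0 defaults) -->
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value> <!-- RAMManager size, as a fraction of reducer heap -->
</property>
<property>
  <name>mapred.job.shuffle.merge.percent</name>
  <value>0.66</value> <!-- mem-to-disk merge when RAMManager this full -->
</property>
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>1000</value> <!-- or when this many in-RAM segments accumulate -->
</property>
```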
25. Final merge phase
• MR assumes that reducer code needs the full heap’s worth of RAM
  – Spills all in-RAM segments before running user code, to free memory
• This isn’t true if your reducer is simple
  – e.g. sort, simple aggregation, etc. with no state
• Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep reducer input data in RAM
26. Reducer merge counters
• FILE: number of bytes read/written
  – Ideally close to 0 if you can fit in RAM
• Spilled Records:
  – Ideally close to 0. If significantly more than reduce input records, the job is hitting a multi-pass merge, which is quite expensive
27. Tuning reducer merge
• Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep data in RAM if you don’t have any state in the reducer
• Experiment with setting mapred.inmem.merge.threshold to 0 to avoid spills
• Hadoop 2.0: experiment with mapreduce.reduce.merge.memtomem.enabled
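The two Hadoop 1.0 suggestions above as a job-configuration fragment (only appropriate for stateless reducers, per the previous slide):

```xml
<!-- Keep reduce input in RAM through the final merge -->
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.70</value> <!-- default is 0.0: spill everything before reducing -->
</property>
<!-- Experiment: disable the segment-count spill trigger -->
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>0</value>
</property>
```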
28. Rules of thumb for # maps/reduces
• Aim for map tasks running 1-3 minutes each
  – Too small: wasted startup overhead, less efficient shuffle
  – Too big: not enough parallelism, harder to share the cluster
• Reduce task count:
  – Large reduce phase: base it on cluster slot count (a few GB per reducer)
  – Small reduce phase: fewer reducers result in a more efficient shuffle phase
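A back-of-the-envelope sketch of the map-count rule of thumb. The 1TB input size and the per-mapper throughput figure are hypothetical assumptions, not from the deck:

```java
public class TaskCounts {
    // With FileInputFormat defaults, map count ~ input size / split size.
    static long mapTasks(long inputBytes, long splitBytes) {
        return inputBytes / splitBytes;
    }

    public static void main(String[] args) {
        long input = 1024L * 1024 * 1024 * 1024;  // 1 TB input (hypothetical)
        long split = 128L * 1024 * 1024;          // one 128MB block per map
        long maps = mapTasks(input, split);       // 8192 map tasks
        // If a mapper processes roughly 1-2 MB/s of input once CPU work is
        // included (an assumed figure), a 128MB split lands in the
        // 1-3 minute sweet spot above.
        System.out.println("map tasks = " + maps);
    }
}
```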
29. MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
30. Tuning Java code for MR
• Follow general Java best practices
  – String parsing and formatting is slow
  – Guard debug statements with isDebugEnabled()
  – StringBuffer.append vs repeated string concatenation
• For CPU-intensive jobs, make a test harness/benchmark outside MR
  – Then use your favorite profiler
• Check for GC overhead: -XX:+PrintGCDetails -verbose:gc
• Easiest profiler: add -Xprof to mapred.child.java.opts, then look at the stdout task log
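A small sketch of two of the tips above: a guarded debug statement (shown here with java.util.logging's isLoggable, the JDK analogue of log4j's isDebugEnabled) and builder-based string assembly instead of repeated concatenation in a hot loop.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class HotLoopTips {
    private static final Logger LOG = Logger.getLogger("HotLoopTips");

    static String join(String[] parts) {
        // One growable buffer, instead of O(n^2) copying from s += part
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            // Guard: skip the cost of building the debug message entirely
            // when debug logging is off
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("appending " + p);
            }
            sb.append(p).append(',');
        }
        if (sb.length() > 0) sb.setLength(sb.length() - 1); // drop trailing comma
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[]{"a", "b", "c"})); // a,b,c
    }
}
```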
31. Other tips for fast MR code
• Use the most compact and efficient data formats
  – LongWritable is way faster than parsing text
  – BytesWritable instead of Text for SHA1 hashes/dedup
  – Avro/Thrift/Protobuf for complex data, not JSON!
• Write a Combiner and RawComparator
• Enable intermediate compression (Snappy/LZO)
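The idea behind a RawComparator, sketched in plain JDK code rather than the Hadoop interface: compare the serialized bytes directly and skip deserialization in the sort. A long serializes as 8 big-endian bytes, so for non-negative values an unsigned lexicographic byte compare matches numeric order (this sketch ignores negative values for simplicity).

```java
import java.nio.ByteBuffer;

public class RawLongCompare {
    // Compare two 8-byte big-endian encodings without decoding them,
    // the way a RawComparator compares serialized keys in the sort buffer.
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff; // unsigned byte compare
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    public static void main(String[] args) {
        // 5 < 1000, decided from bytes alone
        System.out.println(compareRaw(encode(5L), encode(1000L)));
    }
}
```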
32. Summary
• Understanding MR internals helps you understand configurations and tuning
• Focus your tuning effort on things that are bottlenecks, following a scientific approach
• Don’t forget that you can always just add nodes!
  – Spending 1 month of engineer time to make your job 20% faster is not worth it if you have a 10-node cluster!
• We’re working on simplifying this where we can, but deep understanding will always allow more efficient jobs