The Performance of MapReduce: An In-depth Study

Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu
School of Computing, NUS

Presented by Tang Kai

 Introduction
 Factors affecting Performance of MR
 Pruning search space
 Implementation
 Benchmark

 MapReduce-based systems are increasingly
being used.
◦ Simple yet impressive interface
 Map() Reduce()
◦ Flexible
 Storage system independence
◦ Scalable
◦ Fine-grain fault tolerance

 Previous study
◦ Fundamental difference
 Schema support
 Data access
 Fault tolerance
◦ Benchmark
 Parallel DB >> MR-based

 Is it not possible to have a flexible, scalable
and efficient MapReduce-based system?

 This work
◦ Identifies several performance bottlenecks
◦ Manages the bottlenecks and tunes performance
 using well-known engineering and database techniques

 Conclusion
◦ Tuning improves overall Hadoop performance by 2.5x-3.5x

 7 steps of a MapReduce job

1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce
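
The seven steps can be traced in a toy, single-process sketch. This is only an illustration, not Hadoop code: the names `map_task` and `run_job`, the tab-separated record layout, the CRC32 partitioner, and the counting job are all assumptions made for the example.

```python
import heapq
import zlib
from itertools import groupby

def map_task(split, num_reducers):
    """Steps 1-4 for one input split; returns one sorted run per reducer."""
    runs = [[] for _ in range(num_reducers)]
    for raw in split:                          # 1) Map: read a raw record
        key, _value = raw.split("\t", 1)       # 2) Parse: raw text -> (k, v)
        pair = (key, 1)                        # 3) Process: user map logic
        # Hypothetical partitioner: CRC32 of the key modulo reducer count.
        runs[zlib.crc32(key.encode()) % num_reducers].append(pair)
    return [sorted(run) for run in runs]       # 4) Sort: by key, per partition

def run_job(splits, num_reducers=2):
    map_outputs = [map_task(s, num_reducers) for s in splits]
    result = {}
    for r in range(num_reducers):
        fetched = [out[r] for out in map_outputs]    # 5) Shuffle: fetch runs
        merged = heapq.merge(*fetched)               # 6) Merge: sorted runs
        for key, group in groupby(merged, key=lambda kv: kv[0]):
            result[key] = sum(v for _k, v in group)  # 7) Reduce: aggregate
    return result

splits = [["a\tx", "b\ty"], ["a\tz", "c\tw"]]
counts = run_job(splits)
```

Each map task hands over runs that are already sorted, so the reduce side only has to merge them before grouping by key.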

 I/O mode
 Indexing
 Parsing
 Sorting

 Direct I/O
◦ read data from the disk directly
◦ Local
 Streaming I/O
◦ streaming data from the storage system by an
inter-process communication scheme,
 such as TCP/IP or JDBC.
◦ Local and remote

 Direct I/O > Streaming I/O by 10%-15%

 Input of a MapReduce job
◦ a set of files stored in a distributed file system, e.g.
HDFS
 Range indexes
◦ input HDFS files are not sorted but each data chunk
in the files is indexed by keys
 Block-level indexes
◦ tables stored in database servers
 Database indexed tables
 Indexing boosts the selection task 2x-10x, depending on the selectivity
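
Why the gain depends on selectivity can be seen in a small sketch of a range index. It assumes the input is sorted by key and split into non-empty chunks; `build_range_index` and `select` are hypothetical names, and nothing here reflects the actual HDFS or Hadoop implementation.

```python
from bisect import bisect_left, bisect_right

def build_range_index(chunks):
    """Sparse index: the first key of every chunk of a key-sorted input."""
    return [chunk[0][0] for chunk in chunks]

def select(chunks, first_keys, lo, hi):
    """Return records with lo <= key <= hi and the number of chunks scanned."""
    # The chunk just before the first chunk starting at or after `lo` may
    # still hold matching keys, so the scan starts one chunk early.
    start = max(bisect_left(first_keys, lo) - 1, 0)
    # Chunks whose first key is greater than `hi` cannot contain matches.
    stop = bisect_right(first_keys, hi)
    scanned = chunks[start:stop]
    hits = [(k, v) for chunk in scanned for k, v in chunk if lo <= k <= hi]
    return hits, len(scanned)

chunks = [
    [(1, "a"), (3, "b")],
    [(5, "c"), (7, "d")],
    [(9, "e"), (11, "f")],
]
first_keys = build_range_index(chunks)
hits, chunks_scanned = select(chunks, first_keys, 6, 8)
```

A narrow key range touches one chunk out of three here, while a range covering all keys would scan every chunk and gain nothing from the index.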

 Raw data -> <k,v> pair

 Immutable decoding
◦ Read-only records (set once)
 Mutable decoding

 Mutable decoder is 10x faster.
◦ boost selection task 2x overall
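
The two decoding schemes differ in how record objects are allocated. The sketch below shows only the pattern: the `ImmutableVisit` and `MutableVisit` classes and the `|`-separated text layout are assumptions, and Python will not reproduce the 10x gap measured on Hadoop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImmutableVisit:
    """Read-only record: fields are set once, at construction."""
    source_ip: str
    ad_revenue: float

class MutableVisit:
    """Reusable record: fields are overwritten for every input line."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self):
        self.source_ip = ""
        self.ad_revenue = 0.0

def decode_immutable(lines):
    # One new object per record.
    for line in lines:
        ip, revenue = line.split("|")
        yield ImmutableVisit(ip, float(revenue))

def decode_mutable(lines):
    # A single object is allocated and refilled for each record, so the
    # caller must finish with a record before asking for the next one.
    record = MutableVisit()
    for line in lines:
        ip, revenue = line.split("|")
        record.source_ip = ip
        record.ad_revenue = float(revenue)
        yield record

lines = ["10.0.0.1|1.5", "10.0.0.2|2.0"]
total = sum(r.ad_revenue for r in decode_mutable(lines))
```

The mutable decoder trades safety for speed: collecting its records into a list would leave every entry pointing at the same, last-decoded object.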

 Map-side sorting affects performance of
aggregation
◦ Cost of key comparison is non-trivial.
 Example
◦ sourceIP in the UserVisits table
◦ Sort intermediate records.
◦ sourceIP is a variable-length string
 String compare (byte-by-byte)
 Fingerprint compare (integer)
 Fingerprint-based sort is 4x-5x faster.
◦ 20%-25% faster overall
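
A minimal sketch of fingerprint-based sorting follows, assuming CRC32 as the fingerprint function and a byte-by-byte comparison as the tie-breaker when two fingerprints collide; the slides do not name either choice. The resulting order is not lexicographic, but equal keys still end up adjacent, which is all an aggregation needs.

```python
import zlib
from itertools import groupby

def fingerprint(key: bytes) -> int:
    """Hypothetical fingerprint: a 32-bit CRC of the key bytes."""
    return zlib.crc32(key)

def sort_by_fingerprint(records):
    # Tuples compare element by element: the integer fingerprint decides
    # first, and the raw key bytes are compared only when fingerprints tie.
    return sorted(records, key=lambda kv: (fingerprint(kv[0]), kv[0]))

def aggregate(records):
    """Sum values per key after a fingerprint-ordered sort."""
    ordered = sort_by_fingerprint(records)
    return {
        key: sum(value for _key, value in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }

visits = [(b"10.0.0.1", 1.0), (b"10.0.0.2", 2.0), (b"10.0.0.1", 3.0)]
totals = aggregate(visits)
```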

 Why
◦ 4 factors
 Resulting in large search space (2*2*3*2)
◦ Budget limit on Amazon EC2
 Greedy
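
The size of the search space can be checked with a few lines. The option names below are placeholders chosen for illustration, and the greedy bound assumes one factor is tuned at a time while the others stay fixed, which is one reading of "greedy" and not a detail stated on the slide.

```python
from itertools import product

# Four factors with 2, 2, 3 and 2 options (option names are illustrative).
factors = {
    "io_mode": ["direct", "streaming"],
    "indexing": ["no_index", "index"],
    "parsing": ["hadoop_writable", "protocol_buffer", "berkeley_db"],
    "sorting": ["string_compare", "fingerprint_compare"],
}

# Exhaustive search benchmarks every combination: 2 * 2 * 3 * 2.
exhaustive = list(product(*factors.values()))

# Tuning one factor at a time needs at most one run per option.
greedy_upper_bound = sum(len(options) for options in factors.values())
```

With 24 full configurations against at most 9 runs, the greedy strategy fits a limited EC2 budget far more easily.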

 Greedy Strategy
◦ I/O mode: Direct I/O, Streaming I/O
◦ Parser: Hadoop Writable, Google's ProtocolBuffer, Berkeley DB
◦ Different sort schemes in various architectures
◦ Benchmark: 3 datasets, 4 queries

 Hadoop 0.19.2 as code base
 Direct I/O
◦ Modification of data node implementation
 Text decoder
◦ Immutable: same as DeWitt
◦ Mutable: implemented by ourselves
 Binary decoder
◦ Hadoop
 Immutable: Writable decoder
 Mutable: implemented by ourselves using the Hadoop API
◦ Google Protocol Buffers
 Built-in compiler -> mutable
 Immutable: implemented by ourselves
◦ Berkeley DB
 BDB binding API (mutable)

 Amazon EC2 (Elastic Compute Cloud)
◦ 7.5GB memory
◦ 2 virtual cores
◦ 64-bit Fedora 8
 Tuning EC2 disk I/O by shifting peak time.
 Hadoop settings
◦ Block size of HDFS: 512MB
◦ Heap size of JVM: 1024MB

 Results for different I/O mode
◦ Single node
◦ No-op job w/ map w/o reduce

 Results for record parsing
◦ Run in a Java process instead of a MapReduce job
◦ Timing starts after the data is loaded into memory
 Mutable > Immutable
◦ Mutable text > mutable binary

 Among Hadoop-based systems
◦ Cache factor
 Between Hadoop-based systems and Parallel DB
◦ Close

 Selection task -> scan -> Index
 Caching
 Indexing

UserVisits GROUP BY SUBSTR(so

 Parsing: 2x faster
 Sorting: 20%-25% faster
◦ Not significant in small size aggregation task

 On decoding scheme
 Comparison of tuned MR-based & Parallel DB

 Cons
◦ Changes need to be committed to, or forked from, the Hadoop source
code tree
◦ A complete framework is needed instead of
miscellaneous patches.
◦ Various APIs should be supported: CLI and Web, rather than Java only.
 Future work
◦ Provide a query parser, optimizer, etc. to build a
complete solution
◦ Elastic power-aware data intensive Cloud
 http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
