Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu
                School of Computing, NUS

                     Presented by Tang Kai
   Introduction
   Factors affecting Performance of MR
   Pruning search space
   Implementation
   Benchmark
   MapReduce-based systems are increasingly
    being used.
    ◦ Simple yet impressive interface
      Map() Reduce()
    ◦ Flexible
      Storage system independence
    ◦ Scalable
    ◦ Fine-grain fault tolerance
   Previous study
    ◦ Fundamental difference
      Schema support
      Data access
      Fault tolerance
    ◦ Benchmark
      Parallel DB >> MR-based
   Is it not possible to have a flexible, scalable,
    and efficient MapReduce-based system?

   This work
    ◦ Identify several performance bottlenecks
    ◦ Manage the bottlenecks and tune performance
      using well-known engineering and database techniques

   Conclusion
    ◦ Performance improves by 2.5x-3.5x after tuning
   Introduction
   Factors affecting Performance of MR
   Pruning search space
   Implementation
   Benchmark
   7 steps of a MapReduce job

                                 1)   Map
                                 2)   Parse
                                 3)   Process
                                 4)   Sort
                                 5)   Shuffle
                                 6)   Merge
                                 7)   Reduce
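
The seven steps above can be walked through with a toy, single-process pipeline. This is only a sketch under stated assumptions: `run_job`, `parse`, `map_fn`, and `reduce_fn` are hypothetical names, step 1 is interpreted as a map task reading its input split, and nothing here reflects Hadoop's actual code.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter

def run_job(splits, parse, map_fn, reduce_fn, num_reducers=2):
    # Steps 1-4 run once per input split (one "map task" per split).
    map_outputs = []
    for split in splits:                      # 1) map task reads its input split
        pairs = []
        for raw in split:
            record = parse(raw)               # 2) parse raw data into a record
            pairs.extend(map_fn(record))      # 3) process the record with the map function
        pairs.sort(key=itemgetter(0))         # 4) sort intermediate pairs by key
        map_outputs.append(pairs)

    # 5) shuffle: route every pair to the reducer that owns its key.
    partitions = defaultdict(list)            # reducer id -> list of sorted runs
    for pairs in map_outputs:
        runs = defaultdict(list)
        for key, value in pairs:
            runs[hash(key) % num_reducers].append((key, value))
        for reducer_id, run in runs.items():
            partitions[reducer_id].append(run)

    results = {}
    for runs in partitions.values():
        merged = merge(*runs, key=itemgetter(0))            # 6) merge the sorted runs
        for key, group in groupby(merged, key=itemgetter(0)):
            results[key] = reduce_fn(key, [v for _, v in group])  # 7) reduce
    return results

# Hypothetical input: two splits of "url|visits" lines.
splits = [["a.com|3", "b.com|5"], ["a.com|4"]]
totals = run_job(
    splits,
    parse=lambda raw: raw.split("|"),
    map_fn=lambda rec: [(rec[0], int(rec[1]))],
    reduce_fn=lambda key, values: sum(values),
)
```

Each stage is a separate line in the sketch, which makes it easier to see where the factors discussed next (I/O, parsing, sorting) would add cost.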
   I/O mode
   Indexing
   Parsing
   Sorting
   Direct I/O
    ◦ read data from the disk directly
    ◦ Local
   Streaming I/O
    ◦ streaming data from the storage system by an
      inter-process communication scheme,
       such as TCP/IP or JDBC.
    ◦ Local and remote

   Direct I/O outperforms streaming I/O by 10%-15%
   Input of a MapReduce job
    ◦ a set of files stored in a distributed file system, e.g.
      HDFS
      Range indexes
    ◦ input HDFS files are not sorted, but each data chunk
      in the files is indexed by keys
      Block-level indexes
    ◦ tables stored in database servers
      Database indexed tables
   Indexing boosts the selection task 2x-10x,
    depending on the selectivity
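
A minimal sketch of why a range index helps a selection task, assuming the data is already sorted by key and split into fixed-size chunks. The function names and the toy key range are hypothetical. The index stores the first key of every chunk, so a predicate such as `key > threshold` only has to scan the chunks that can contain matches.

```python
from bisect import bisect_right

def build_range_index(sorted_keys, chunk_size):
    """Return (first_keys, chunks) for data already sorted by key."""
    chunks = [sorted_keys[i:i + chunk_size]
              for i in range(0, len(sorted_keys), chunk_size)]
    first_keys = [chunk[0] for chunk in chunks]
    return first_keys, chunks

def select_greater_than(first_keys, chunks, threshold):
    """Scan only the chunks that can hold keys > threshold."""
    # Chunks that end before the last chunk starting at or below the
    # threshold contain only keys <= threshold, so they are skipped.
    start = max(bisect_right(first_keys, threshold) - 1, 0)
    scanned = chunks[start:]
    hits = [k for chunk in scanned for k in chunk if k > threshold]
    return hits, len(scanned)

# Hypothetical data: 100 keys in 10 chunks, highly selective predicate.
first_keys, chunks = build_range_index(list(range(100)), chunk_size=10)
hits, chunks_scanned = select_greater_than(first_keys, chunks, threshold=94)
```

In this toy case only one of the ten chunks is read. The less selective the predicate, the more chunks must be scanned, which is consistent with the gain depending on selectivity.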
   Raw data -> <k,v> pair

   Immutable decoding
    ◦ Read-only records (set once)
   Mutable decoding

   The mutable decoder is 10x faster.
    ◦ boosts the selection task 2x overall
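
The difference between the two decoding schemes can be sketched as follows. This only illustrates the object-reuse pattern: `Record` and the `url|rank` line format are hypothetical, and the 10x figure on the slide comes from the authors' Java measurements, not from this Python sketch.

```python
class Record:
    __slots__ = ("url", "rank")

    def __init__(self, url="", rank=0):
        self.url = url
        self.rank = rank

def decode_immutable(lines):
    """Allocate a fresh, set-once record for every input line."""
    for line in lines:
        url, rank = line.split("|")
        yield Record(url, int(rank))

def decode_mutable(lines):
    """Reuse one record object, overwriting its fields for each line."""
    record = Record()
    for line in lines:
        url, rank = line.split("|")
        record.url = url
        record.rank = int(rank)
        yield record  # the caller must use it before asking for the next one

# Hypothetical selection task: keep URLs whose rank exceeds 10.
lines = ["a.com|3", "b.com|12", "c.com|40"]
immutable_hits = [r.url for r in decode_immutable(lines) if r.rank > 10]
mutable_hits = [r.url for r in decode_mutable(lines) if r.rank > 10]
```

Both decoders return the same selection result. The mutable one never creates more than one record object, which is the allocation saving the slide refers to.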
   Map-side sorting affects performance of
    aggregation
    ◦ Cost of key comparison is non-trivial.
   Example
    ◦ SourceIP in UserVisits Table
    ◦ Sort intermediate records.
    ◦ sourceIP variable-length string
      String compare (byte-to-byte)
      Fingerprint compare (integer)
   Fingerprint-based comparison is 4x-5x faster.
    ◦ 20%-25% faster overall
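
A sketch of fingerprint-based sorting for aggregation, under stated assumptions: CRC32 stands in for whichever fingerprint function the authors used, and `aggregate` is a hypothetical helper. Keys are compared by their integer fingerprint first, and the string comparison runs only when two fingerprints are equal.

```python
import zlib
from functools import cmp_to_key
from itertools import groupby

def fingerprint(key: str) -> int:
    # Assumption: CRC32 is a stand-in for the actual fingerprint function.
    return zlib.crc32(key.encode("utf-8"))

def compare(a, b):
    """Compare (fingerprint, key, value) triples: integers first, strings on ties."""
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    if a[1] != b[1]:  # same fingerprint, different key: fall back to the strings
        return -1 if a[1] < b[1] else 1
    return 0

def aggregate(records):
    """Sum values per key after a fingerprint-ordered sort."""
    # The fingerprint is computed once per record, not once per comparison.
    tagged = [(fingerprint(k), k, v) for k, v in records]
    tagged.sort(key=cmp_to_key(compare))
    return {k: sum(v for _, _, v in group)
            for k, group in groupby(tagged, key=lambda t: t[1])}

# Hypothetical intermediate records: (sourceIP, value).
records = [("10.0.0.1", 5), ("10.0.0.2", 1), ("10.0.0.1", 2)]
totals = aggregate(records)
```

The resulting order follows the fingerprints rather than the strings. That is enough for aggregation, which only needs equal keys to end up next to each other.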
   Why
    ◦ 4 factors
      Resulting in large search space (2*2*3*2)
    ◦ Budget limit on Amazon EC2
   Greedy
   Greedy strategy
    ◦ I/O mode
      Direct I/O
      Stream I/O
    ◦ Parser
      Hadoop Writable
      Google's ProtocolBuffer
      Berkeley DB
    ◦ Different sort schemes in various architectures
    ◦ Benchmark
      3 datasets
      4 queries
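
The pruning idea can be sketched as below. The option names, the assignment of the 2*2*3*2 counts to the four factors, and the cost model in `fake_benchmark` are all illustrative assumptions. Real timings would come from cluster runs.

```python
from itertools import product

# Assumption: illustrative option names for the four factors.
FACTORS = {
    "io_mode": ["streaming", "direct"],
    "indexing": ["none", "range_index"],
    "parser": ["writable", "protocol_buffer", "berkeley_db"],
    "sorting": ["string_compare", "fingerprint"],
}

def exhaustive_size(factors):
    """Number of configurations a full search would have to benchmark."""
    return len(list(product(*factors.values())))

def greedy_tune(factors, run_benchmark):
    """Tune one factor at a time, freezing the best option before moving on."""
    config = {name: options[0] for name, options in factors.items()}
    runs = 0
    for name, options in factors.items():
        timings = {}
        for option in options:
            trial = dict(config, **{name: option})
            timings[option] = run_benchmark(trial)
            runs += 1
        config[name] = min(timings, key=timings.get)
    return config, runs

def fake_benchmark(cfg):
    # Hypothetical cost model, not measured data.
    cost = 100.0
    cost *= 0.9 if cfg["io_mode"] == "direct" else 1.0
    cost *= 0.5 if cfg["indexing"] == "range_index" else 1.0
    cost *= {"writable": 1.0, "protocol_buffer": 0.8, "berkeley_db": 0.9}[cfg["parser"]]
    cost *= 0.8 if cfg["sorting"] == "fingerprint" else 1.0
    return cost

full_space = exhaustive_size(FACTORS)
best, runs = greedy_tune(FACTORS, fake_benchmark)
```

With these assumptions the full search space has 24 configurations, while the greedy pass needs 9 benchmark runs. The trade-off is that a greedy pass can miss interactions between factors.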
   Introduction
   Factors affecting Performance of MR
   Pruning search space
   Implementation
   Benchmark
   Hadoop 0.19.2 as code base
   Direct I/O
    ◦ Modification of data node implementation
   Text decoder
    ◦ Immutable: same as DeWitt
    ◦ Mutable: written by ourselves
   Binary decoder
    ◦ Hadoop
      Immutable Writable decoder
      Mutable: written by ourselves using the Hadoop API
    ◦ Google Protocol Buffers
      Built-in compiler -> mutable
      Immutable: written by ourselves
    ◦ Berkeley DB
      BDB binding API (mutable)
   Amazon EC2 (Elastic Compute Cloud)
    ◦ 7.5GB memory
    ◦ 2 virtual cores
    ◦ 64-bit Fedora 8
   Tuning EC2 disk I/O by shifting peak time.
   Hadoop Setting
    ◦ Block size of HDFS: 512MB
    ◦ Heap size of JVM: 1024MB
   Introduction
   Factors affecting Performance of MR
   Pruning search space
   Implementation
   Benchmark
   Results for different I/O mode
    ◦ Single node
    ◦ No-op job w/ map w/o reduce
   Results for record parsing
    ◦ Run in Java process instead of MapReduce job
    ◦ Time start after loading into memory
   Mutable > Immutable
    ◦ Mutable text > mutable binary
   Among Hadoop-based systems
    ◦ Cache factor
   Between Hadoop-based systems and parallel DB
    ◦ Close
   Selection task -> scan -> Index
   Caching
   Indexing
UserVisits GROUP BY SUBSTR(so




   Parsing: 2x faster
   Sorting: 20%-25% faster
    ◦ Not significant in small size aggregation task
   On decoding scheme
   Comparison of tuned MR-based & Parallel DB
   Cons
    ◦ Need to be committed/forked to Hadoop source
      code tree
    ◦ A complete framework is needed instead of
      miscellaneous patches.
    ◦ Various API support: CLI, Web rather than Java.
   Future work
    ◦ Provide a query parser, optimizer, etc., to build a
      complete solution
    ◦ Elastic power-aware data intensive Cloud
      http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz

      Tenzing: A SQL Implementation On The MapReduce Framework

The Performance of MapReduce: An In-depth Study