SlideShare uma empresa Scribd logo
1 de 13
Baixar para ler offline
Weekly meeting
         Week 5


    Le Duc Tung

 PRL group, NII, Tokyo


    Nov 21, 2012




 Le Duc Tung   Weekly meeting
Last week




     Described Quicksort using MapReduce, however,
         Required many MapReduce tasks: (log n) to n steps.




                           Le Duc Tung   Weekly meeting
Terasort using Hadoop



     Written by Owen O’Malley and Arun Murthy at Yahoo Inc.
     Won the annual general purpose terabyte sort benchmark in 2008
     and 2009.
         In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB
         memory). Sorting 1TB data in 3.48 minutes.
         In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
         Sorting 100TB data in 173 minutes.
     Terasort includes 3 MapReduce applications:
         Teragen: generates the data.
         Terasort: samples the input data and uses them with MapReduce to
         sort the data.
         Teravalidate: validates the output is sorted.




                           Le Duc Tung   Weekly meeting
A closer look at MapReduce’s implementation model




  ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html”
                           Le Duc Tung   Weekly meeting
Teragen




     Input: The number of rows and the output directory.
     Output format:




                           Le Duc Tung   Weekly meeting
MapReduce for Teragen


     Preparation of data for Map task.
                                    InputSplit
         The number of rows            →         splits of (startid, num of rows).
                                      RecordReader
         (startid, num of rows)            →         key-value pairs of (rowid, null)
     Example:
         We need generate data of 100 rows
         We use 2 mappers
         We generate 2 splits for 2 mappers.
                Split 0: consists of two values: startid = 0, number of rows = 50.
                Split 1: consists of two values: startid = 50, number of rows = 50.
         Key-value pairs generated from split 0:
                (0, null), (1, null), ..., (49, null)
         Key-value pairs generated from split 1:
                (50, null), (51, null), ..., (99, null)




                                  Le Duc Tung      Weekly meeting
Map and Reduce task


     Map
           Input is a pair of (rowid, null)
           Output is a pair of (key, value), in which
                Step 1: Create a random generator based on ”rowid”.
                Step 2: Create a key that is a text of 10 random bytes.
                Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters +
                1 block of 8 characters”
                Characters are got from [’A’ .. ’Z’]
     Reduce task
           Does nothing.




                               Le Duc Tung   Weekly meeting
Terasort



     Step 1: (Before submitting job) Generate the sample keys by
     sampling the input. It is a sorted list of sampled keys.
     Step 2: (Using Partitioner) Distribute keys to the correspondent
     reducers using sample keys.
     Example: If we have N reducers.
           Generate (N-1) sample keys.
           All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to
           reduce i
           The above guarantees that the output of reduce i are all less than
           the output of reduce (i + 1)




                              Le Duc Tung   Weekly meeting
Generate sample keys

     From the input, create an array R containing all keys or the
     maximum of 100.000 keys (1MB). However, we can specify how
     many elements the array has.
     Using Quicksort to sort keys in the array.
     Create an array of (N-1) sample keys from the array R, in which N is
     the number of reducers, such that,
         (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the
         original array.
         stepSize is of (No. of elements of R)/(No. of reducers)
     List of sampled keys is written into HDFS.




                             Le Duc Tung   Weekly meeting
Partitioner

  public interface Partitioner<K2, V2> extends JobConfigurable {
    /**
     * Get the partition number for a given key (hence record)
     * given the total number of partitions
     * i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of
     * the key.</p>
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
  }


                        Le Duc Tung   Weekly meeting
Partitioner and Trie tree




    getPartition() function will
    return the position of k in the
    trie tree.                              ”source:
                                            http://en.wikipedia.org/wiki/Trie”


                              Le Duc Tung   Weekly meeting
Trie for sample keys
      Assume that we have three sample keys:
           :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120
           respectively.
           LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115
           respectively.
           le5awB.$sm, in which, ASCII code of l and e is 108 and 101
           respectively.
      Using the 2 first bytes of keys for building a 2-level trie tree.




                              Le Duc Tung   Weekly meeting
Default sort




    By default, data
    will be sorted
    before passing to
    reducers.
    Reduce task
    does nothing, it
    just outputs the
    (K, V) pairs to
    output files.




                        Le Duc Tung   Weekly meeting

Mais conteúdo relacionado

Mais procurados

기가박스 영화관 운영 시스템 구축마지막
기가박스 영화관 운영 시스템 구축마지막기가박스 영화관 운영 시스템 구축마지막
기가박스 영화관 운영 시스템 구축마지막ssuser5280ce
 
Introduction to Git / Github
Introduction to Git / GithubIntroduction to Git / Github
Introduction to Git / GithubPaige Bailey
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Getting Git Right
Getting Git RightGetting Git Right
Getting Git RightSven Peters
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
[Pgday.Seoul 2018] 이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG
[Pgday.Seoul 2018]  이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG[Pgday.Seoul 2018]  이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG
[Pgday.Seoul 2018] 이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PGPgDay.Seoul
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Git and GitHub
Git and GitHubGit and GitHub
Git and GitHubJames Gray
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryDataWorks Summit
 

Mais procurados (20)

기가박스 영화관 운영 시스템 구축마지막
기가박스 영화관 운영 시스템 구축마지막기가박스 영화관 운영 시스템 구축마지막
기가박스 영화관 운영 시스템 구축마지막
 
Introduction to Git / Github
Introduction to Git / GithubIntroduction to Git / Github
Introduction to Git / Github
 
Introduction to Git and Github
Introduction to Git and GithubIntroduction to Git and Github
Introduction to Git and Github
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Git 101
Git 101Git 101
Git 101
 
Git & Github for beginners
Git & Github for beginnersGit & Github for beginners
Git & Github for beginners
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Git Rebase vs Merge
Git Rebase vs MergeGit Rebase vs Merge
Git Rebase vs Merge
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Git & GitHub for Beginners
Git & GitHub for BeginnersGit & GitHub for Beginners
Git & GitHub for Beginners
 
Introduction to git
Introduction to gitIntroduction to git
Introduction to git
 
Getting Git Right
Getting Git RightGetting Git Right
Getting Git Right
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Git and Github Session
Git and Github SessionGit and Github Session
Git and Github Session
 
[Pgday.Seoul 2018] 이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG
[Pgday.Seoul 2018]  이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG[Pgday.Seoul 2018]  이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG
[Pgday.Seoul 2018] 이기종 DB에서 PostgreSQL로의 Migration을 위한 DB2PG
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Git and GitHub
Git and GitHubGit and GitHub
Git and GitHub
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
 
Docker by Example - Quiz
Docker by Example - QuizDocker by Example - Quiz
Docker by Example - Quiz
 

Semelhante a TeraSort

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structureShakil Ahmed
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Simplilearn
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0BG Java EE Course
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxjeremylockett77
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoreambikavenkatesh2
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.pptaliraza2732
 
Ida python intro
Ida python introIda python intro
Ida python intro小静 安
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxdudelover
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerKenta Sato
 

Semelhante a TeraSort (20)

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structure
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...
 
Using Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUGUsing Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUG
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0
 
Clojure intro
Clojure introClojure intro
Clojure intro
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Python Day1
Python Day1Python Day1
Python Day1
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 

Último

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

TeraSort

  • 1. Weekly meeting Week 5 Le Duc Tung PRL group, NII, Tokyo Nov 21, 2012 Le Duc Tung Weekly meeting
  • 2. Last week Described Quicksort using MapReduce, however, Required many MapReduce tasks: (log n) to n steps. Le Duc Tung Weekly meeting
  • 3. Terasort using Hadoop Written by Owen O’Malley and Arun Murthy at Yahoo Inc. Won the annual general purpose terabyte sort benchmark in 2008 and 2009. In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory). Sorting 1TB data in 3.48 minutes. In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). Sorting 100TB data in 173 minutes. Terasort includes 3 MapReduce applications: Teragen: generates the data. Terasort: samples the input data and uses them with MapReduce to sort the data. Teravalidate: validates the output is sorted. Le Duc Tung Weekly meeting
  • 4. A closer look at MapReduce’s implementation model ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html” Le Duc Tung Weekly meeting
  • 5. Teragen Input: The number of rows and the output directory. Output format: Le Duc Tung Weekly meeting
  • 6. MapReduce for Teragen Preparation of data for Map task. InputSplit The number of rows → splits of (startid, num of rows). RecordReader (startid, num of rows) → key-value pairs of (rowid, null) Example: We need generate data of 100 rows We use 2 mappers We generate 2 splits for 2 mappers. Split 0: consists of two values: startid = 0, number of rows = 50. Split 1: consists of two values: startid = 50, number of rows = 50. Key-value pairs generated from split 0: (0, null), (1, null), ..., (49, null) Key-value pairs generated from split 1: (50, null), (51, null), ..., (99, null) Le Duc Tung Weekly meeting
  • 7. Map and Reduce task Map Input is a pair of (rowid, null) Output is a pair of (key, value), in which Step 1: Create a random generator based on ”rowid”. Step 2: Create a key that is a text of 10 random bytes. Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters + 1 block of 8 characters” Characters are got from [’A’ .. ’Z’] Reduce task Does nothing. Le Duc Tung Weekly meeting
  • 8. Terasort Step 1: (Before submitting job) Generate the sample keys by sampling the input. It is a sorted list of sampled keys. Step 2: (Using Partitioner) Distribute keys to the correspondent reducers using sample keys. Example: If we have N reducers. Generate (N-1) sample keys. All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to reduce i The above guarantees that the output of reduce i are all less than the output of reduce (i + 1) Le Duc Tung Weekly meeting
  • 9. Generate sample keys From the input, create an array R containing all keys or the maximum of 100.000 keys (1MB). However, we can specify how many elements the array has. Using Quicksort to sort keys in the array. Create an array of (N-1) sample keys from the array R, in which N is the number of reducers, such that, (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the original array. stepSize is of (No. of elements of R)/(No. of reducers) List of sampled keys is written into HDFS. Le Duc Tung Weekly meeting
  • 10. Partitioner public interface Partitioner<K2, V2> extends JobConfigurable { /** * Get the partition number for a given key (hence record) * given the total number of partitions * i.e. number of reduce-tasks for the job. * * <p>Typically a hash function on a all or a subset of * the key.</p> * * @param key the key to be partitioned. * @param value the entry value. * @param numPartitions the total number of partitions. * @return the partition number for the <code>key</code>. */ int getPartition(K2 key, V2 value, int numPartitions); } Le Duc Tung Weekly meeting
  • 11. Partitioner and Trie tree getPartition() function will return the position of k in the trie tree. ”source: http://en.wikipedia.org/wiki/Trie” Le Duc Tung Weekly meeting
  • 12. Trie for sample keys Assume that we have three sample keys: :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120 respectively. LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115 respectively. le5awB.$sm, in which, ASCII code of l and e is 108 and 101 respectively. Using the 2 first bytes of keys for building a 2-level trie tree. Le Duc Tung Weekly meeting
  • 13. Default sort By default, data will be sorted before passing to reducers. Reduce task does nothing, it just outputs the (K, V) pairs to output files. Le Duc Tung Weekly meeting