SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
How to Implement MapReduce Using




                              Presented By
                               Jamie Pitts

Thursday, December 9, 2010
A Problem Seeking A Solution

                  Given a corpus of html-stripped financial filings:
                        Identify and count unique subjects.


                                    Possible Solutions:
                    1. Use the unix toolbox.
                    2. Create a set of unwieldy perl scripts.
                    3. Solve it with MapReduce, perl, and Gearman!




Thursday, December 9, 2010
What is MapReduce?

                             A programming model for processing
                                and generating large datasets.

                                   The original 2004 paper by Dean and Ghemawat




Thursday, December 9, 2010
M.R. Execution Model

                       A large set of Input key/value pairs is reduced
                        into a smaller set of Output key/value pairs

                             1.   A Master process splits a corpus of documents into sets.
                                  Each is assigned to a separate Mapper worker.

                             2. A Mapper worker iterates over the contents of the split. A
                                Map function processes each. Keys are extracted and key/
                                value pairs are stored as intermediate data.

                             3.   When the mapping for a split is complete, the Master process
                                  assigns a Reducer worker to process the intermediate data
                                  for that split.

                             4. A Reducer worker iterates over the intermediate data. A
                                Reduce function accepts a key and a set of values and
                                returns a value. Unique key/value pairs are stored as
                                output data.



Thursday, December 9, 2010
M.R. Execution Model Diagram




Thursday, December 9, 2010
Applying MapReduce
                             Counting Unique Subjects in a Corpus of Financial Filings



                                                     Mapper                       Reducer




                                                Intermediate Data:             Output Data:
            Corpus of Filings,
           Split by Company ID                Redundant records of            Unique Records
                                              Subjects and Counts         of Subjects and Counts



Thursday, December 9, 2010
What is Gearman?

                             A protocol and server for distributing work
                                       across multiple servers.

                                      Job Server implemented in pure
                                      perl and in C

                                      Client and Workers can be
                                      implemented in any language

                                      Not tied to any particular
                                      processing model

                                      No single point of failure


                               Current Maintainers: Brian Aker and Eric Day


Thursday, December 9, 2010
Gearman Architecture

                             Code that Organizes   Job Server is the intermediary
                                   the Work        between Clients and Workers

                                                   Clients organize the work

                                                   Workers define and register
                                                   work functions with Job Servers.

                                                   Servers are aware of all available
                                                   Workers.

                                                   Clients and Workers are aware
                                                   of all available Servers.

                                                   Multiple Servers can be run on
                                                   one machine.
                       Code that Does the Work



Thursday, December 9, 2010
Perl and Gearman
                                Use Gearman::XS to Create a Pipeline Process




                             http://search.cpan.org/~dschoen/Gearman-XS/



Thursday, December 9, 2010
Perl and Gearman
                                         Client-Ser ver-Worker Architecture


                                               Gearman Job Servers




                              Each computer has two Gearman Job Servers running:
                      one each for handing mapping and reducing jobs to available workers.




                             Gearman::XS::Client                   Gearman::XS::Worker
                                                             Mappers           Reducers
              Split > Do Mapping > Do Reducing


                The Master uses Gearman clients to            Mapper and Reducer workers are
                  control the MapReduce pipeline.            perl processes, ready to accept jobs.



Thursday, December 9, 2010
Example
                             Processing Filings from Two Companies




                                   Mapper                     Reducer


                                             Mapper                     Reducer


             Filings from
            Two Companies




                                   Intermediate Data:            Output Data:
                                  Redundant records of          Unique Records
                                  Subjects and Counts       of Subjects and Counts



Thursday, December 9, 2010
Example: The Master Process
                              Setting Up the Mapper and Reducer Gearman Clients


                                                                     Connect the mapper client to
                                                                     the gearman server listening
                                                                     on 4730. More than one can
                                                                     be defined here.

                                                                     Connect the reducer client to
                                                                     the gearman server listening
                                                                     on 4731. More than one can
                                                                     be defined here.

                                                                     A unique reducer id is
                                                                     generated for each reducer
                                                                     client. This is used to
                                                                     identify the output.




Thursday, December 9, 2010
Example: The Master Process
                                  Submitting a Mapper Job For Each Split


                                                                    The company IDs are pushed
                                                                    into an array of splits. A jobs
                                                                    hash for both mappers and
                                                                    reducers is defined.

                                                                    A background mapper job for
                                                                    each split is submitted to
                                                                    gearman.

                                                                    The job handle is a unique ID
                                                                    generated by gearman when
                                                                    the job is submitted. This is
                                                                    used to identify the job in the
                                                                    jobs hash.

                                                                    The split is passed to the job as
                                                                    frozen workload data.




Thursday, December 9, 2010
Example: The Master Process
                      Job Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done


                                                                     Mapper and Reducer jobs are
                                                                     monitored inside of a loop.

                                                                     The job status is queried from
                                                                     the client using the job
                                                                     handle.

                                                                     If the job is completed and if it
                                                                     is a mapper job, a reducer is
                                                                     now run (detailed in the next
                                                                     slide).

                                                                     Completed jobs are deleted
                                                                     from the jobs hash.




Thursday, December 9, 2010
Example: The Master Process
                                  Detail on Submitting the Reducer Job


                                                                   A background reducer job is
                                                                   submitted to gearman.

                                                                   The job handle is a unique ID
                                                                   generated by gearman when
                                                                   the job is submitted. This is
                                                                   used to identify the job in the
                                                                   jobs hash.

                                                                   The split and reducer_id is
                                                                   passed to the job as frozen
                                                                   workload data.




Thursday, December 9, 2010
Example
                               The Result of Processing Filings from Two Companies




                                           Mapper                      Reducer


                                                          Mapper                         Reducer


         Our AdWords agreements are
         generally terminable at any
         time by our advertisers.
         Google AdSense is the program
         through which we distribute
         our advertisers' AdWords ads
         for display on the web sites
         of our Google Network
         members. Our AdSense program
         includes AdSense for search
         and AdSense for content.
         AdSense for search is our
         service for distributing
         relevant ads from our             1   AdWords
         advertisers for display with      1   Google Adsense
         search results on our Google      1   Adwords
         Network members' sites.           1   Google Network
                                           1   AdSense                  1     Ad Words
               ... Filings Text ...        1   AdSense                  14    AdMob
                                           1   AdSense                  299   AdSense
                                           1   Google Network           152   AdWords
                                           ... Redundant Counts ...     ... Aggregated Counts ...




Thursday, December 9, 2010
That’s It For Now
                             To Be Continued...



                                   Presented By
                                    Jamie Pitts

Thursday, December 9, 2010
Credits

                             Diagrams and other content were used
                                  from the following sources:

                             MapReduce: Simplified Data Processing on Large Clusters
                                      Jeffrey Dean and Sanjay Ghemawat

                                         Gearman (http://gearman.org)
                                           Brian Aker and Eric Day




Thursday, December 9, 2010

Mais conteúdo relacionado

Mais procurados

Scalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingScalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingDigitalPreservationEurope
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developerVitor Tomaz
 
Lightspeed Preprint
Lightspeed PreprintLightspeed Preprint
Lightspeed Preprintjustanimate
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...kcitp
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Preventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenPreventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenGiuseppe Maxia
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
Solving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenSolving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenGiuseppe Maxia
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
What's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseWhat's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseDavid Bosschaert
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution ISSGC Summer School
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintijccsa
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mappingbasisspace
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Futurekarl.barnes
 

Mais procurados (19)

H04502048051
H04502048051H04502048051
H04502048051
 
Scalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingScalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross King
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
 
Lightspeed Preprint
Lightspeed PreprintLightspeed Preprint
Lightspeed Preprint
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Preventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenPreventing multi master conflicts with tungsten
Preventing multi master conflicts with tungsten
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Solving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenSolving MySQL replication problems with Tungsten
Solving MySQL replication problems with Tungsten
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
What's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseWhat's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise Release
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
Hadoop at JavaZone 2010
Hadoop at JavaZone 2010Hadoop at JavaZone 2010
Hadoop at JavaZone 2010
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mapping
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Future
 

Destaque

Gearman Introduction
Gearman IntroductionGearman Introduction
Gearman IntroductionGreen Wang
 
Scalability In PHP
Scalability In PHPScalability In PHP
Scalability In PHPIan Selby
 
Distributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanDistributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanIssac Goldstand
 
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
Scale like an ant, distribute the workload - DPC, Amsterdam,  2011Scale like an ant, distribute the workload - DPC, Amsterdam,  2011
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011Helgi Þormar Þorbjörnsson
 
Gearman work queue in php
Gearman work queue in phpGearman work queue in php
Gearman work queue in phpBo-Yi Wu
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 

Destaque (9)

Gearman Introduction
Gearman IntroductionGearman Introduction
Gearman Introduction
 
Gearman For Beginners
Gearman For BeginnersGearman For Beginners
Gearman For Beginners
 
Scalability In PHP
Scalability In PHPScalability In PHP
Scalability In PHP
 
Gearman and Perl
Gearman and PerlGearman and Perl
Gearman and Perl
 
Distributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanDistributed Applications with Perl & Gearman
Distributed Applications with Perl & Gearman
 
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
Scale like an ant, distribute the workload - DPC, Amsterdam,  2011Scale like an ant, distribute the workload - DPC, Amsterdam,  2011
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
 
Gearman work queue in php
Gearman work queue in phpGearman work queue in php
Gearman work queue in php
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 

Semelhante a MapReduce Using Perl and Gearman

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
Java Abs Grid Information Retrival System
Java Abs   Grid Information Retrival SystemJava Abs   Grid Information Retrival System
Java Abs Grid Information Retrival Systemncct
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data OverviewPrabhu Thukkaram
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftLee Stott
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? DataWorks Summit
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Where to Deploy Hadoop: Bare-metal or Cloud?
Where to Deploy Hadoop:  Bare-metal or Cloud?Where to Deploy Hadoop:  Bare-metal or Cloud?
Where to Deploy Hadoop: Bare-metal or Cloud?Mike Wendt
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Building Applications on YARN
Building Applications on YARNBuilding Applications on YARN
Building Applications on YARNChris Riccomini
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 

Semelhante a MapReduce Using Perl and Gearman (20)

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Java Abs Grid Information Retrival System
Java Abs   Grid Information Retrival SystemJava Abs   Grid Information Retrival System
Java Abs Grid Information Retrival System
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Where to Deploy Hadoop: Bare-metal or Cloud?
Where to Deploy Hadoop:  Bare-metal or Cloud?Where to Deploy Hadoop:  Bare-metal or Cloud?
Where to Deploy Hadoop: Bare-metal or Cloud?
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Hadoop
HadoopHadoop
Hadoop
 
Building Applications on YARN
Building Applications on YARNBuilding Applications on YARN
Building Applications on YARN
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 

Último

20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfAnna Loughnan Colquhoun
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 

Último (20)

20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 

MapReduce Using Perl and Gearman

  • 1. How to Implement MapReduce Using Presented By Jamie Pitts Thursday, December 9, 2010
  • 2. A Problem Seeking A Solution Given a corpus of html-stripped financial filings: Identify and count unique subjects. Possible Solutions: 1. Use the unix toolbox. 2. Create a set of unwieldy perl scripts. 3. Solve it with MapReduce, perl, and Gearman! Thursday, December 9, 2010
  • 3. What is MapReduce? A programming model for processing and generating large datasets. The original 2004 paper by Dean and Ghemawat Thursday, December 9, 2010
  • 4. M.R. Execution Model A large set of Input key/value pairs is reduced into a smaller set of Output key/value pairs 1. A Master process splits a corpus of documents into sets. Each is assigned to a separate Mapper worker. 2. A Mapper worker iterates over the contents of the split. A Map function processes each. Keys are extracted and key/ value pairs are stored as intermediate data. 3. When the mapping for a split is complete, the Master process assigns a Reducer worker to process the intermediate data for that split. 4. A Reducer worker iterates over the intermediate data. A Reduce function accepts a key and a set of values and returns a value. Unique key/value pairs are stored as output data. Thursday, December 9, 2010
  • 5. M.R. Execution Model Diagram Thursday, December 9, 2010
  • 6. Applying MapReduce Counting Unique Subjects in a Corpus of Financial Filings Mapper Reducer Intermediate Data: Output Data: Corpus of Filings, Split by Company ID Redundant records of Unique Records Subjects and Counts of Subjects and Counts Thursday, December 9, 2010
  • 7. What is Gearman? A protocol and server for distributing work across multiple servers. Job Server implemented in pure perl and in C Client and Workers can be implemented in any language Not tied to any particular processing model No single point of failure Current Maintainers: Brian Aker and Eric Day Thursday, December 9, 2010
  • 8. Gearman Architecture Code that Organizes Job Server is the intermediary the Work between Clients and Workers Clients organize the work Workers define and register work functions with Job Servers. Servers are aware of all available Workers. Clients and Workers are aware of all available Servers. Multiple Servers can be run on one machine. Code that Does the Work Thursday, December 9, 2010
  • 9. Perl and Gearman Use Gearman::XS to Create a Pipeline Process http://search.cpan.org/~dschoen/Gearman-XS/ Thursday, December 9, 2010
  • 10. Perl and Gearman Client-Ser ver-Worker Architecture Gearman Job Servers Each computer has two Gearman Job Servers running: one each for handing mapping and reducing jobs to available workers. Gearman::XS::Client Gearman::XS::Worker Mappers Reducers Split > Do Mapping > Do Reducing The Master uses Gearman clients to Mapper and Reducer workers are control the MapReduce pipeline. perl processes, ready to accept jobs. Thursday, December 9, 2010
  • 11. Example Processing Filings from Two Companies Mapper Reducer Mapper Reducer Filings from Two Companies Intermediate Data: Output Data: Redundant records of Unique Records Subjects and Counts of Subjects and Counts Thursday, December 9, 2010
  • 12. Example: The Master Process Setting Up the Mapper and Reducer Gearman Clients Connect the mapper client to the gearman server listening on 4730. More than one can be defined here. Connect the reducer client to the gearman server listening on 4731. More than one can be defined here. A unique reducer id is generated for each reducer client. This is used to identify the output. Thursday, December 9, 2010
  • 13. Example: The Master Process Submitting a Mapper Job For Each Split The company IDs are pushed into an array of splits. A jobs hash for both mappers and reducers is defined. A background mapper job for each split is submitted to gearman. The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash. The split is passed to the job as frozen workload data. Thursday, December 9, 2010
  • 14. Example: The Master Process Job Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done Mapper and Reducer jobs are monitored inside of a loop. The job status is queried from the client using the job handle. If the job is completed and if it is a mapper job, a reducer is now run (detailed in the next slide). Completed jobs are deleted from the jobs hash. Thursday, December 9, 2010
  • 15. Example: The Master Process Detail on Submitting the Reducer Job A background reducer job is submitted to gearman. The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash. The split and reducer_id is passed to the job as frozen workload data. Thursday, December 9, 2010
  • 16. Example The Result of Processing Filings from Two Companies Mapper Reducer Mapper Reducer Our AdWords agreements are generally terminable at any time by our advertisers. Google AdSense is the program through which we distribute our advertisers' AdWords ads for display on the web sites of our Google Network members. Our AdSense program includes AdSense for search and AdSense for content. AdSense for search is our service for distributing relevant ads from our 1 AdWords advertisers for display with 1 Google Adsense search results on our Google 1 Adwords Network members' sites. 1 Google Network 1 AdSense 1 Ad Words ... Filings Text ... 1 AdSense 14 AdMob 1 AdSense 299 AdSense 1 Google Network 152 AdWords ... Redundant Counts ... ... Aggregated Counts ... Thursday, December 9, 2010
  • 17. That’s It For Now To Be Continued... Presented By Jamie Pitts Thursday, December 9, 2010
  • 18. Credits Diagrams and other content were used from the following sources: MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Gearman (http://gearman.org) Brian Aker and Eric Day Thursday, December 9, 2010