



                     MapReduce in Cloud Computing


                                    Mohammad Mustaqeem
                                              M.Tech 2nd Year
                                             Reg No: 2011CS17




                           Computer Science and Engineering Department
                        Motilal Nehru National Institute of Technology Allahabad


                                          November 8, 2012




Outline
      1    Introduction
      2    Motivation
      3    Description of First Paper
              Issues
              Approach Used
                   HDFS
                    MapReduce Programming Model
             Example: Word Count
      4    Description of Second Paper
             Issues
             Approach Used
                   Architecture
                   System Mechanism
             Example
      5    Comparison
      6    Conclusion




Introduction



             MapReduce is a general-purpose programming model for
             data-intensive computing.
             It was introduced by Google in 2004 to construct its web
             index.
             It is also used at Yahoo, Facebook, etc.
             It uses a parallel computing model that distributes
             computational tasks to a large number of nodes
             (approximately 1,000-10,000 nodes).
             It is fault-tolerant: it can keep working even when 1,600 out
             of 1,800 nodes fail.




Introduction




             In the MapReduce model, the user has to write only two
             functions: map and reduce.
             A few examples that can be easily expressed as
             MapReduce computations:
                     Distributed Grep
                     Count of URL Access Frequency
                     Inverted Index
                     Mining




Motivation



             Cloud Computing refers to services that are offered by
             clusters having 1,000 to 10,000 machines [6],
             e.g. the services offered by Yahoo, Google, etc.

             Cloud computing delivers computing resources as a
             service, for example:
                     Infrastructure as a Service (IaaS)
                     Platform as a Service (PaaS)
                     Software as a Service (SaaS)
                     Storage as a Service (STaaS), etc.




Motivation cont..




             A cloud service differs from a traditional hosting service in
             the following ways [6]:
                     It is sold on demand, typically by the minute or the hour.
                     It is elastic: a user can have as much or as little of a
                     service as they want at any given time.
                     It is fully managed by the provider (the consumer needs
                     nothing but a personal computer and Internet access).
                            Amazon Web Services is the largest public cloud provider.




Motivation cont..

             MapReduce is a programming model for large-scale
             computing [3].
             It uses the distributed environment of the cloud to process
             large amounts of data in a reasonable amount of time.
             It was inspired by the map and reduce functions of functional
             programming languages (like LISP, Scheme, Racket) [3].
             Map and Reduce in Racket (a functional programming
             language) [4]:
                     Map:
                     (map f list1) → list2
                     e.g. (map square '(1 2 3 4 5)) → '(1 4 9 16 25)
                     Reduce:
                     (foldl f init list1) → any
                     e.g. (foldl + 0 '(1 2 3 4 5)) → 15
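             For comparison, a minimal sketch of the same two calls in Python,
             using only the standard library (not part of the original slides):

                 from functools import reduce

                 numbers = [1, 2, 3, 4, 5]

                 # map applies a function to every element of a list
                 squares = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16, 25]

                 # a left fold combines the elements with a binary function
                 total = reduce(lambda acc, x: acc + x, numbers, 0)  # 15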




Motivation cont..



             However, the map and reduce functions in the MapReduce
             model are not exactly the same as in functional programming.
             Map and Reduce functions in the MapReduce model:
                     Map: it processes a (key, value) pair and returns a list of
                     (intermediate key, value) pairs:
                     map(k1, v1) → list(k2, v2)
                     Reduce: it merges all intermediate values having the same
                     intermediate key:
                     reduce(k2, list(v2)) → list(v3)
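             As a concrete illustration of these signatures (a sketch, not code
             from either paper), the Inverted Index example mentioned earlier
             fits them directly:

                 from typing import Iterable, List, Tuple

                 # map(k1, v1) -> list(k2, v2): emit (word, document id) for every word
                 def map_fn(doc_id: str, text: str) -> List[Tuple[str, str]]:
                     return [(word, doc_id) for word in text.split()]

                 # reduce(k2, list(v2)) -> list(v3): merge all document ids for one word
                 def reduce_fn(word: str, doc_ids: Iterable[str]) -> List[str]:
                     return sorted(set(doc_ids))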


Issues


         Gaizhen Yang, "The Application of MapReduce in the
         Cloud Computing"
             It analyzes Hadoop.
             Hadoop is an implementation of the MapReduce model.
             It processes data in parallel in a distributed manner.
             It divides the data into logical blocks, processes these
             blocks in parallel on different machines, and finally
             combines all the partial results to produce the final result [1].
             It is fault-tolerant.
             One attractive feature of Hadoop is that the user can write
             the map and reduce functions in any programming language.


Approach Used




                Hadoop is an open-source Java framework for processing
                large amounts of data on clusters of machines [1].
                Hadoop is an implementation of Google's MapReduce
                programming model.
                Yahoo is the biggest contributor to Hadoop [5].
                Hadoop has two main components:
                     Hadoop Distributed File System (HDFS)
                     MapReduce


Approach Used

HDFS

                HDFS provides support for distributed storage [1].
                As in a traditional file system, files can be deleted,
                renamed, etc.
                HDFS has two types of nodes:
                     Name Node
                     Data Node




                                     Figure: HDFS Architecture


Approach Used

HDFS cont..


                Name Node:
                     The Name Node provides the main file system services.
                     It is a process that runs on a separate machine.
                     It stores only the metadata of the files and directories.
                     Programmers access files through it.
                     For reliability of the file system, HDFS keeps multiple
                     copies (replicas) of the same file blocks.
                Data Node:
                     A Data Node is a process that runs on each individual
                     machine of the cluster.
                     The file blocks are stored in the local file system of
                     these nodes.
                     It periodically sends the metadata of the stored blocks to
                     the Name Node.
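                A toy sketch of this division of labour (illustrative Python,
                not Hadoop code): the Name Node keeps only metadata, while Data
                Nodes hold the actual blocks and report them periodically.

                    from typing import Dict, List

                    class DataNode:
                        def __init__(self, node_id: str):
                            self.node_id = node_id
                            self.blocks: Dict[str, bytes] = {}   # block id -> block contents

                        def block_report(self) -> List[str]:
                            return list(self.blocks)             # metadata only, no data

                    class NameNode:
                        def __init__(self):
                            # block id -> ids of the Data Nodes holding a replica
                            self.block_locations: Dict[str, List[str]] = {}

                        def receive_report(self, node: DataNode) -> None:
                            for block_id in node.block_report():
                                self.block_locations.setdefault(block_id, []).append(node.node_id)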


Approach Used

MapReduce Programming Model



                MapReduce is the key concept behind Hadoop.
                It is a technique for dividing the work across a distributed
                system.
                The user has to define only two functions:
                     Map: it processes a (key, value) pair and returns a list of
                     (intermediate key, value) pairs:
                     map(k1, v1) → list(k2, v2)
                     Reduce: it merges all intermediate values having the same
                     intermediate key:
                     reduce(k2, list(v2)) → list(v3)


Approach Used

MapReduce Programming Model cont..

      Execution phases of a MapReduce application
       1. The MapReduce library splits the input files into M pieces and
          copies these pieces onto multiple machines.
       2. The master picks idle workers and assigns each one a map task.
       3. Each map worker reads the key-value pairs of its input split,
          passes each pair to the user-defined map function, and produces
          intermediate key-value pairs.
       4. The map worker buffers the output key-value pairs locally. It
          passes these locations to the master, which forwards them to
          the reducers.
       5. After reading the intermediate key-value pairs, the reducer
          sorts these pairs by the intermediate key.
       6. For each intermediate key, the user-defined reduce function is
          applied to the corresponding intermediate values.


Approach Used

MapReduce Progamming Model cont..

       7. When all map tasks and reduce tasks have been completed,
          the master returns the final output to the user.
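      A toy, single-process sketch of this flow (split/map, shuffle and sort
      by key, reduce); the function names are illustrative assumptions, not
      taken from the papers:

          from collections import defaultdict
          from typing import Callable, Dict, List, Tuple

          def run_mapreduce(inputs: Dict[str, str],
                            map_fn: Callable[[str, str], List[Tuple[str, int]]],
                            reduce_fn: Callable[[str, List[int]], List[int]]):
              # Map phase: apply map_fn to every (key, value) input pair.
              intermediate: List[Tuple[str, int]] = []
              for k1, v1 in inputs.items():
                  intermediate.extend(map_fn(k1, v1))

              # Shuffle/sort phase: group intermediate values by intermediate key.
              groups: Dict[str, List[int]] = defaultdict(list)
              for k2, v2 in intermediate:
                  groups[k2].append(v2)

              # Reduce phase: apply reduce_fn once per intermediate key.
              return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}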




             Figure: Execution phase of a generic MapReduce Application



Example: Word Count

      The pseudocode of the map and reduce functions for the word count
      problem is:

       Algorithm 3.1: MAPPER(filename, file-contents)

         for each word ∈ file-contents
           do EMIT(word, 1)


       Algorithm 3.2: REDUCER(word, values)

         sum ← 0
         for each value ∈ values
           do sum ← sum + value
         EMIT(word, sum)
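      The same logic as runnable Python, in the spirit of a Hadoop Streaming
      mapper/reducer pair but run in a single process for brevity (a sketch,
      not code from the paper):

          import sys
          from itertools import groupby

          def mapper(lines):
              # emit (word, 1) for every word in the input
              for line in lines:
                  for word in line.split():
                      yield word, 1

          def reducer(pairs):
              # pairs must be sorted by word (Hadoop sorts between map and reduce)
              for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
                  yield word, sum(count for _, count in group)

          if __name__ == "__main__":
              for word, total in reducer(list(mapper(sys.stdin))):
                  print(f"{word}\t{total}")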


Example: Word Count cont..




                                  Figure: Word Count Execution




Issues



         Fabrizio Marozzo, Domenico Talia, Paolo Trunfio,
         "P2P-MapReduce: Parallel data processing in dynamic
         Cloud environments"
             The MapReduce implementation discussed so far is centralized.
             It cannot deal with master failure.
             Since nodes join and leave the cloud dynamically, a
             P2P-MapReduce model is needed.
             This paper describes an adaptive P2P-MapReduce system
             that can handle master failures.


Approach Used

                P2P-MapReduce is a programming model in which nodes
                may join and leave the cluster dynamically.
                A node acts as either a master or a slave at any given time.
                Nodes switch between the master and slave roles dynamically
                such that the master/slave ratio remains constant.
                To prevent the loss of computation in case of master
                failure, each master has a set of backup masters.
                     The master responsible for a job J is referred to as the
                     primary master for J.
                     The primary master dynamically updates the job state on its
                     backup nodes, which are referred to as backup masters for J.
                     When a primary master fails, its place is taken by one of
                     its backup masters.


Approach Used

Architecture


                There are three types of nodes in the P2P-MapReduce
                architecture:
                     User
                     Master
                     Slave
                The master and slave nodes form two logical
                peer-to-peer networks, M-net and S-net respectively.
                The composition of M-net and S-net changes dynamically.
                A user node submits a MapReduce job to one of the
                available master nodes. The master is selected based on the
                current workload of the available masters (sketched below).
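                A minimal sketch of that selection rule (illustrative only;
                the paper does not prescribe this exact code):

                    from typing import Dict

                    def select_master(workloads: Dict[str, int]) -> str:
                        # pick the available master with the lowest current workload
                        return min(workloads, key=workloads.get)

                    # example: Node3 has the fewest jobs, so the user submits there
                    print(select_master({"Node1": 4, "Node2": 2, "Node3": 1}))  # Node3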


Approach Used

Architecture cont..



                Master nodes perform three types of operations [2]:
                     Management: a master node that is acting as the primary
                     master for one or more jobs executes the management
                     operation.
                     Recovery: a master node that is acting as a backup master
                     for one or more jobs executes the recovery operation.
                     Coordination: the coordination operation changes slaves
                     into masters and vice versa, so as to keep the desired
                     master/slave ratio.
                A slave executes the tasks that are assigned to it by one or
                more primary masters.


Approach Used

Architecture cont..

                For each managed job, the primary master runs one Job
                Manager.
                Backup masters run a Backup Job Manager.
                For each assigned task, a slave runs one Task Manager.
                The task manager keeps its job manager informed. The
                information includes the status of the slave (ACTIVE or
                DEAD) and how much of the computation has been done.
                If a master does not receive this signal from a task manager,
                it reschedules the assigned task on another idle slave.
                In addition, if a slave works slowly, the master also
                reschedules the assigned task on another idle slave, keeps
                whichever output arrives first, and discards the other
                (sketched below).
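                A toy sketch of that timeout-based rescheduling; the timeout
                value, class name, and data structures are assumptions made
                for illustration, not taken from the paper:

                    import time
                    from typing import Dict, List

                    HEARTBEAT_TIMEOUT = 30.0   # seconds of silence before a slave is presumed dead

                    class JobManager:
                        def __init__(self, idle_slaves: List[str]):
                            self.idle_slaves = idle_slaves
                            self.last_report: Dict[str, float] = {}   # task id -> last heartbeat time
                            self.assignment: Dict[str, str] = {}      # task id -> slave id

                        def heartbeat(self, task_id: str) -> None:
                            self.last_report[task_id] = time.time()

                        def check_and_reschedule(self) -> None:
                            now = time.time()
                            for task_id, last in self.last_report.items():
                                if now - last > HEARTBEAT_TIMEOUT and self.idle_slaves:
                                    # the task's slave looks dead: move the task to an idle slave
                                    self.assignment[task_id] = self.idle_slaves.pop(0)
                                    self.last_report[task_id] = now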


Approach Used

System Mechanism

      The behaviour of a generic node can be understood from the UML
      state diagram [2].




      Figure: Behaviour of a generic node described by a UML state
      diagram


Example




                                Figure: P2P-MapReduce example


Example cont..



             The following recovery procedure takes place when the
             primary master Node1 fails [2]:
                     Backup masters Node2 and Node3 detect the failure of
                     Node1 and start a distributed procedure to elect the new
                     primary master among themselves.
                     Assuming that Node3 is elected as the new primary master,
                     Node2 continues to play the backup role and, to keep
                     the desired number of backup masters active, another
                     backup node is chosen by Node3.
                     Node3 uses its local replica of the job state to proceed
                     from where Node1 failed.
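             A simplified, single-process sketch of this failover: the primary
             replicates job state to its backups and, on failure, one backup
             is elected (here, simply the lowest node id). All names and the
             election rule are illustrative assumptions, not the paper's
             protocol:

                 from typing import Dict, List

                 class BackupGroup:
                     def __init__(self, primary: str, backups: List[str]):
                         self.primary = primary
                         self.backups = backups
                         # every node (primary and backups) holds a replica of the job state
                         self.replicas: Dict[str, dict] = {n: {} for n in [primary] + backups}

                     def update_job_state(self, state: dict) -> None:
                         # the primary pushes each job-state change to all replicas
                         for node in self.replicas:
                             self.replicas[node] = dict(state)

                     def primary_failed(self) -> str:
                         # elect a new primary among the backups; it resumes from its replica
                         new_primary = min(self.backups)
                         self.backups.remove(new_primary)
                         self.primary = new_primary
                         return new_primary

                 group = BackupGroup("Node1", ["Node2", "Node3"])
                 group.update_job_state({"completed_map_tasks": 7})
                 print(group.primary_failed(), group.replicas[group.primary])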




Comparison between two Papers


        Issues
            First Paper: To perform data-intensive computation in a Cloud
            environment in a reasonable amount of time.
            Second Paper: To design a P2P-MapReduce system that can handle
            the failure of any node, including the master node.
        Approaches Used
            First Paper: The simple MapReduce implementation (presented by
            Google) is used. The implemented version is known as Hadoop,
            which is based on the master-slave model.
            Second Paper: A peer-to-peer architecture is used to handle all
            the dynamic churn in a cluster.
        Advantages
            First Paper: Hadoop is scalable, reliable and distributed, able
            to handle enormous amounts of data. It can process big data in
            real time.
            Second Paper: P2P-MapReduce can manage node churn, master
            failures and job recovery in an effective way.

                           Table: Comparison between the two papers.




Conclusion



             MapReduce is a scalable, reliable computing model that
             exploits the distributed environment of the cloud.
             MapReduce optimizes system performance by rescheduling
             slow tasks on multiple slaves.
             P2P-MapReduce has all the properties of simple
             MapReduce.
             Since P2P-MapReduce provides fault tolerance against
             master failures, it is more reliable.
             P2P-MapReduce prevents computation loss by keeping the
             job state at backup masters.




References



             Gaizhen Yang, "The Application of MapReduce in the Cloud Computing",
             International Symposium on Intelligence Information Processing and Trusted
             Computing (IPTC), October 2011, pp. 154-156,
             http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6103560.
             Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel
             data processing in dynamic Cloud environments", Journal of Computer and
             System Sciences, vol. 78, Issue 5, September 2012, pp. 1382-1402,
             http://dl.acm.org/citation.cfm?id=2240494.
             Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on
             large clusters", OSDI'04 Proceedings of the 6th Conference on Symposium on
             Operating Systems Design & Implementation, vol. 6, 2004, pp. 10-10,
             www.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf and
             http://dl.acm.org/citation.cfm?id=1251254.1251264.




References



             The Racket Guide, http://docs.racket-lang.org/guide/.

             Hadoop Tutorial - YDN,
             http://developer.yahoo.com/hadoop/tutorial/module4.html.
             http://readwrite.com/2012/10/15/why-the-future-of-software-and-apps-is-serverless.
             F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting
             MapReduce Applications in Dynamic Cloud Environments", In: N. Antonopoulos,
             L. Gillam (eds.), Cloud Computing: Principles, Systems and Applications,
             Springer, Chapter 7, pp. 113-125, 2010.
             IBM developerWorks, Using MapReduce and load balancing on the cloud,
             http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/.




                                        THANK YOU




