SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
Introduction to MapReduce
    Concept and Programming




                   
                   




Public 2009/5/13



                                1
Outline

       •     Overview and Brief History
       •     MapReduce Concept
       •     Introduction to MapReduce programming using Hadoop
       •     Reference




                                                            Copyright 2009 - Trend Micro Inc.
Classification

                                                                                                2
Brief history of MapReduce

•   A demand for large scale data processing in Google
•   The folks at Google discovered certain common themes for
    processing those large input sizes
    •   Many machines are needed
    •   Two basic operations on the input data: Map and Reduce
• Google’s idea is inspired by map and reduce functions
  commonly used in functional programming.
• Functional programming: MapReduce
– Map
    •   [1,2,3,4] – (*2)  [2,3,6,8]
– Reduce
    •   [1,2,3,4] – (sum)  10
– Divide and Conquer paradigm



                                                                 Copyright 2009 - Trend Micro Inc.



                                                                                                     3
MapReduce

     • MapReduce is a software framework introduced by Google
       to support distributed computing on large data sets on
       clusters of computers.
     • Users specify
             – a map function that processes a key/value pair to generate a set
               of intermediate key/value pairs
             – a reduce function that merges all intermediate values associated
               with the same intermediate key。
     • And MapReduce framework handles details of execution.




                                                                     Copyright 2009 - Trend Micro Inc.
Classification

                                                                                                         4
Application Area

       • MapReduce framework is for computing certain kinds of
         distributable problems using a large number of computers
         (nodes), collectively referred to as a cluster.

       • Examples
                 – Large data processing such as search, indexing and sorting
                 – Data mining and machine learning in large data set
                 – Huge access logs analysis in large portals
       • More applications
                 http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/




                                                                                    Copyright 2009 - Trend Micro Inc.
Classification

                                                                                                                        5
MapReduce Programming Model
•   User define two functions
        • map         (K1, V1)  list(K2, V2)
             – takes an input pair and produces a set of intermediate key/
               value
         • reduce      (K2, list(V2))  list(K3, V3)
             – accepts an intermediate key I and a set of values for that key. It
               merges together these values to form a possibly smaller set of
               values

•   MapReduce Logical Flow
                                  shuffle / sorting




                                                                       Copyright 2009 - Trend Micro Inc.



                                                                                                           6
MapReduce Execution Flow


                      shuffle / sorting




                             Shuffle (by key)


                                          Sorting (by key)




                                                                                                     7
Classification
                                                             Copyright 2009 - Trend Micro Inc.



                                                                                                 7
Word Count Example
•   Word count problem :
     – Consider the problem of counting the number of occurrences of
       each word in a large collection of documents.
•   Input:
     – A collection of (document name, document content)
•   Expected output
     – A collection of (word, count)
•   User define:
     – Map: (offset, line) ➝ [(word1, 1), (word2, 1), ... ]
     – Reduce: (word, [1, 1, ...]) ➝ [(word, count)]




                                                               Copyright 2009 - Trend Micro Inc.



                                                                                                   8
Word Count Execution Flow




                            Copyright 2009 - Trend Micro Inc.



                                                                9
More MapReduce Examples


•   Distributed Sort
     – Sorting large number of values
          • Map:
               – extracts the key from each record, and emits a <key, record> pair.
          • Reduce
               – emits all pairs unchanged.


•   Count of URL Access Frequency
     – Calculate the access frequency of URLs in web request logs especially when log file size and
       log file number are huge
          • Map:
               – processes logs and output <URL,1>
          • Reduce:
               – adds together all values for the same URL and emits a <URL, total count>
                 pair




                                                                                  Copyright 2009 - Trend Micro Inc.



                                                                                                                      10
The logical flow of a MapReduce Job in Hadoop




                                                 Copyright 2009 - Trend Micro Inc.
Classification
                                                                                     11
The execution flow of a MapReduce Job in Hadoop




                                              Copyright 2009 - Trend Micro Inc.
Classification
                                                                                  12
Introduction to
      MapReduce Programming using
      Hadoop




                                    Copyright 2009 - Trend Micro Inc.
Classification
                                                                        13
MapReduce from Hadoop

• Apache Hadoop project implements Google’s
  MapReduce
   – Provide an open-source MapReduce framework
   – Using Java as native language.
   – Using Hadoop distributed file system (HDFS) as backend.
• Yahoo! Is the main sponsor.
• Yahoo!, Google, IBM, Amazon.. use Hadoop.
• Trend Micro use Hadoop MapReduce for threat solution
  research.




                                                         Copyright 2009 - Trend Micro Inc.
                    Classifi
                    cation
                                                                                             14
The Architecture of Hadoop MapReduce Framework

 •     Map/Reduce framework consists of
         – one single master JobTracker
         – one slave TaskTracker per cluster node.
 •     JobTracker is responsible for
         – scheduling the jobs' component tasks on the slaves
         – monitoring them and re-executing the failed tasks.
 •     TaskTrackers execute the tasks




                                                                Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                    15
Minimal steps for MapReduce programming

  1. Supply map and reduce functions via implementations of
     appropriate interfaces and/or abstract-classes
  2. Comprise the job configuration
  3. Applications specify the input/output locations
  4. Job client submits the job
         1. JobTracker which then assumes the responsibility of distributing
            the software/configuration to the slaves
         2. scheduling tasks and monitoring them
         3. providing status and diagnostic information to the job-client.
  5. Application continue handling the output




                                                                    Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                        16
Simple skeleton of a MapReduce job
    class MyJob {

                 class Map {              // define Map function

                 }

                 class Reduce {          // define Reduce function
                 }
                 main() {
                            // prepare job configuration and run job
                            JobConf conf = new JobConf(“MyJob.class”);
                             conf.setInputPath(…);
                             conf.setOutputPath(…);
                             conf.setMapperClass(Map.class);
                             conf.setReduceClass(Reduce.class)
                            // prepare job configuration and run job
                             JobClient.runJob(conf);
                             // continue handling the output
                            …
                                                                         Copyright 2009 - Trend Micro Inc.
Classification   }
    }                                                                                                        17
Get started
•     Background
        –   In Trend Micro, threat research department receives large amount of users’ feedback log
            every day.
        –   A feedback contain information of threat-related events detected by productions of Trend
            Micro.
        –   Thread research folks analyze the feedback logs using data mining, correlation rules, and
            machine learning for detect more new viruses.
        –   The size of total feedback logs is possible to reach several peta bytes in a year, so we take
            advantage of HDFS to keep such a large amount of data and process them using
            MapReduce.


‧ Problem
        – Calculate the distribution of feedback logs number in a day
        – Log format
                 ‧ Each line contains a log record
                 ‧ Log format

                 No.,GUID, timestamp,module, behavior
                 1, 123, 131231231, VSAPI, open file
                 2, 456, 123123123, VSAPI, connect internet




                                                                                               Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                                                   18
Define Map function (0.19)
 •     Implements Mapper interface and define map() method




 •     Input of map method : <key, value>




 •     K1, K2 should implement WritableComparable
 •     V1, V2 should implement Writable
 •     For each input key pair, the map() function will be called once to process it
 •     The OutputCollector has collect() method to collect the immediate result
 •     Applications can use the Reporter to report progress

                                                                            Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                                19
Implements Mapper
 class MapClass extends MapReduceBase

 implements Mapper<LongWritable, Text, Text, IntWritable> {

 	               private final static IntWritable one = new IntWritable(1);

 	               private Text hour = new Text();

 	       public void map( LongWritable key, Text value,
 OutputCollector<Text,IntWritable> output, Reporter reporter) throws
 IOException {

 	                     String line = ((Text) value).toString();

                        String[] token = line.split(",");

                        String timestamp = token[1];

                        Calendar c = Calendar.getInstance();

                        c.setTimeInMillis(Long.parseLong(timestamp));

                        Integer h = c.get(Calendar.HOUR);

                        hour.set(h.toString());

                        output.collect(hour, one)

 }}}
                                                                              Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                                  20
Define Reduce function
 •     Implements Reducer interface and define reduce() method




 •     Input of reduce method : <key, a list of value>




 •     K2, K3 should implement WritableComparable
 •     V2, V3 should implement Writable
 •     The OutputCollector has collect() method to collect the final result
 •     The output of the Reducer is not re-sorted.
 •     Applications can use the Reporter to report progress
                                                                     Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                         21
Implements Reducer
class ReduceClass extends MapReduceBase implements Reducer< Text,
IntWritable, Text, IntWritable> {

	       IntWritable SumValue = new IntWritable();

	       public void reduce( Text key, Iterator<IntWritable> values,

	       OutputCollector<Text, IntWritable> output, Reporter reporter)

	       throws IOException {

	       	       int sum = 0;

	       	       while (values.hasNext())

	       	       	        sum += values.next().get();

	       	       SumValue.set(sum);

	       	       output.collect(key, SumValue);

}}




                                                                Copyright 2009 - Trend Micro Inc.



                                                                                                    22
Configuring and running your jobs

 • JobConf             class is used for setting
         – Mapper, Reducer, Inputformat, OutputFormat, Combiner,
           Partitioner and etc. .
         – input path
         – output path
         – Other job configurations
                 • Failure percentage of map and reduce tasks
                 • Retry number when errors arise
 • JobClient     class takes JobConf as configuration when
       running your job
         – JobClient.runJob(conf)
                 • Submits the job and returns only after the job has completed.
         – JobClient.submitJob(conf)
                 • Only submits the job, then poll the returned handle to the RunningJob to query
                   status and make scheduling decisions.
                 • JobClient.setJobEndNotificationURI(URI)
                      – Sets up a notification upon job-completion, thus avoiding polling.

                                                                                             Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                                                 23
More …
 •     Partitioner
         – Partitioner partitions the key space for “shuffle” phase




         – the total number of partitions i.e. number of reduce-tasks for the job


 • HashPartitioner is the default Partitioner.

                                                                            Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                                24
Main program
    Class MyJob{
    public static void main(String[] args) {
    	       JobConf conf = new JobConf(MyJob.class);
    	            conf.setJobName(”Caculate feedback log time distribution");
    	            // set path
    	            conf.setInputPath(new Path(args[0]));
    	            conf.setOutputPath(new Path(args[1]));
    	            // set map reduce
    	            conf.setOutputKeyClass(Text.class);    // set every word as key
    	            conf.setOutputValueClass(IntWritable.class); // set 1 as value
    	            conf.setMapperClass(MapClass.class);
    	            conf.setCombinerClass(Reduce.class);
    	            conf.setReducerClass(ReduceClass.class);
    	            onf.setInputFormat(TextInputFormat.class);
    	            conf.setOutputFormat(TextOutputFormat.class);
    	            // run
    	            JobClient.runJob(conf);
    }}


                                                                           Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                               25
Build
1.       compile
     –     javac -classpath hadoop-*-core.jar -d MyJava
     	     MyJob.java
‧        package
     –     jar –cvf MyJob.jar -C MyJava .
‧        execution
     –     bin/hadoop jar MyJob.jar MyJob input/ output/




                                                          Copyright 2009 - Trend Micro Inc.



                                                                                              26
Launch the program
 • bin/hadoop jar MyJob.jar MyJob input/ output/




                                                   Copyright 2009 - Trend Micro Inc.
Classification
                                                                                       27
Monitoring the execution from web console
http://172.16.203.132:50030/




                                            Copyright 2009 - Trend Micro Inc.
Classification
                                                                                28
How many mappers and reducers?
 • How many maps?
         – driven by the total size of the inputs, that is, the total number of
           blocks of the input files.
         – Can use JobConf’s setNumMapTasks(int) to provides a hint to
           the framework.


 • How many reducers?
         – The number of reduces for the job is set by the user via
           JobConf.setNumReduceTasks(int)
         – Increasing the number of reduces increases the framework
           overhead, but increases load balancing and lowers the cost of
           failures.




                                                                       Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                           29
Non-Java Interfaces from Hadoop
 • Hadoop Pipes
         – Provide C++ API and library for MapReduce
         – The C++ application runs as a sub-process of java task
 • Hadoop Streaming
         – A utility which allows users to create and run jobs with any
           executables (e.g. shell utilities) as the mapper and/or the reducer.




                                                                     Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                         30
Reference

  • Google MapReduce paper
         – http://labs.google.com/papers/mapreduce.html
  • Google: MapReduce in a Week
         – http://code.google.com/edu/submissions/mapreduce/listing.html
  • Google: Cluster Computing and MapReduce
         – http://code.google.com/edu/submissions/mapreduce-minilecture/
           listing.html
  • Hadoop project official site
         – http://hadoop.apache.org/core/




                                                                Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                    31
Reference

  • MapReduce Tools for Eclipse Plug-In (by IBM)
         – A robust plug-in that brings Hadoop support to the Eclipse
           platform
         – Download
  • Hadoop Virtual Image (by Google)
         – his VMware image contains a preconfigured single node
           instance of Hadoop. This provides the same interface as a full
           cluster without any of the overhead. It is suitable for educators
           exploring the platform and students working independently. The
           following Download and VMware Player links point to websites
           external to Google.
         – Document link




                                                                    Copyright 2009 - Trend Micro Inc.
Classification
                                                                                                        32

Mais conteúdo relacionado

Mais procurados

Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 

Mais procurados (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 

Destaque

Zh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingZh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingTrendProgContest13
 
Zh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsZh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsTrendProgContest13
 
Zh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceZh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceTrendProgContest13
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 

Destaque (7)

Zh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingZh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computing
 
Zh tw introduction_to_h_base
Zh tw introduction_to_h_baseZh tw introduction_to_h_base
Zh tw introduction_to_h_base
 
Zh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsZh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfs
 
Zh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceZh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduce
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 

Semelhante a Introduction to map reduce

David Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESDavid Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESSysFera
 
Performance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsPerformance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsMichael Kopp
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
10c introduction
10c introduction10c introduction
10c introductionInyoung Cho
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTijwscjournal
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 

Semelhante a Introduction to map reduce (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
David Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESDavid Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TES
 
Performance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsPerformance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ Applications
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
10c introduction
10c introduction10c introduction
10c introduction
 
10c introduction
10c introduction10c introduction
10c introduction
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Big data and cloud
Big data and cloudBig data and cloud
Big data and cloud
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Introduction to map reduce

  • 1. Introduction to MapReduce Concept and Programming   Public 2009/5/13 1
  • 2. Outline • Overview and Brief History • MapReduce Concept • Introduction to MapReduce programming using Hadoop • Reference Copyright 2009 - Trend Micro Inc. Classification 2
  • 3. Brief history of MapReduce • A demand for large scale data processing in Google • The folks at Google discovered certain common themes for processing those large input sizes • Many machines are needed • Two basic operations on the input data: Map and Reduce • Google’s idea is inspired by map and reduce functions commonly used in functional programming. • Functional programming: MapReduce – Map • [1,2,3,4] – (*2)  [2,3,6,8] – Reduce • [1,2,3,4] – (sum)  10 – Divide and Conquer paradigm Copyright 2009 - Trend Micro Inc. 3
  • 4. MapReduce • MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. • Users specify – a map function that processes a key/value pair to generate a set of intermediate key/value pairs – a reduce function that merges all intermediate values associated with the same intermediate key。 • And MapReduce framework handles details of execution. Copyright 2009 - Trend Micro Inc. Classification 4
  • 5. Application Area • MapReduce framework is for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. • Examples – Large data processing such as search, indexing and sorting – Data mining and machine learning in large data set – Huge access logs analysis in large portals • More applications http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/ Copyright 2009 - Trend Micro Inc. Classification 5
  • 6. MapReduce Programming Model • User define two functions • map (K1, V1)  list(K2, V2) – takes an input pair and produces a set of intermediate key/ value • reduce (K2, list(V2))  list(K3, V3) – accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values • MapReduce Logical Flow shuffle / sorting Copyright 2009 - Trend Micro Inc. 6
  • 7. MapReduce Execution Flow shuffle / sorting Shuffle (by key) Sorting (by key) 7 Classification Copyright 2009 - Trend Micro Inc. 7
  • 8. Word Count Example • Word count problem : – Consider the problem of counting the number of occurrences of each word in a large collection of documents. • Input: – A collection of (document name, document content) • Expected output – A collection of (word, count) • User define: – Map: (offset, line) ➝ [(word1, 1), (word2, 1), ... ] – Reduce: (word, [1, 1, ...]) ➝ [(word, count)] Copyright 2009 - Trend Micro Inc. 8
  • 9. Word Count Execution Flow Copyright 2009 - Trend Micro Inc. 9
  • 10. More MapReduce Examples • Distributed Sort – Sorting large number of values • Map: – extracts the key from each record, and emits a <key, record> pair. • Reduce – emits all pairs unchanged. • Count of URL Access Frequency – Calculate the access frequency of URLs in web request logs especially when log file size and log file number are huge • Map: – processes logs and output <URL,1> • Reduce: – adds together all values for the same URL and emits a <URL, total count> pair Copyright 2009 - Trend Micro Inc. 10
  • 11. The logical flow of a MapReduce Job in Hadoop Copyright 2009 - Trend Micro Inc. Classification 11
  • 12. The execution flow of a MapReduce Job in Hadoop Copyright 2009 - Trend Micro Inc. Classification 12
  • 13. Introduction to MapReduce Programming using Hadoop Copyright 2009 - Trend Micro Inc. Classification 13
  • 14. MapReduce from Hadoop • Apache Hadoop project implements Google’s MapReduce – Provide an open-source MapReduce framework – Using Java as native language. – Using Hadoop distributed file system (HDFS) as backend. • Yahoo! Is the main sponsor. • Yahoo!, Google, IBM, Amazon.. use Hadoop. • Trend Micro use Hadoop MapReduce for threat solution research. Copyright 2009 - Trend Micro Inc. Classifi cation 14
  • 15. The Architecture of Hadoop MapReduce Framework • Map/Reduce framework consists of – one single master JobTracker – one slave TaskTracker per cluster node. • JobTracker is responsible for – scheduling the jobs' component tasks on the slaves – monitoring them and re-executing the failed tasks. • TaskTrackers execute the tasks Copyright 2009 - Trend Micro Inc. Classification 15
  • 16. Minimal steps for MapReduce programming 1. Supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes 2. Comprise the job configuration 3. Applications specify the input/output locations 4. Job client submits the job 1. JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves 2. scheduling tasks and monitoring them 3. providing status and diagnostic information to the job-client. 5. Application continue handling the output Copyright 2009 - Trend Micro Inc. Classification 16
  • 17. Simple skeleton of a MapReduce job class MyJob { class Map { // define Map function } class Reduce { // define Reduce function } main() { // prepare job configuration and run job JobConf conf = new JobConf(“MyJob.class”); conf.setInputPath(…); conf.setOutputPath(…); conf.setMapperClass(Map.class); conf.setReduceClass(Reduce.class) // prepare job configuration and run job JobClient.runJob(conf); // continue handling the output … Copyright 2009 - Trend Micro Inc. Classification } } 17
  • 18. Get started • Background – In Trend Micro, threat research department receives large amount of users’ feedback log every day. – A feedback contain information of threat-related events detected by productions of Trend Micro. – Thread research folks analyze the feedback logs using data mining, correlation rules, and machine learning for detect more new viruses. – The size of total feedback logs is possible to reach several peta bytes in a year, so we take advantage of HDFS to keep such a large amount of data and process them using MapReduce. ‧ Problem – Calculate the distribution of feedback logs number in a day – Log format ‧ Each line contains a log record ‧ Log format No.,GUID, timestamp,module, behavior 1, 123, 131231231, VSAPI, open file 2, 456, 123123123, VSAPI, connect internet Copyright 2009 - Trend Micro Inc. Classification 18
  • 19. Define Map function (0.19) • Implements Mapper interface and define map() method • Input of map method : <key, value> • K1, K2 should implement WritableComparable • V1, V2 should implement Writable • For each input key pair, the map() function will be called once to process it • The OutputCollector has collect() method to collect the immediate result • Applications can use the Reporter to report progress Copyright 2009 - Trend Micro Inc. Classification 19
  • 20. Implements Mapper class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text hour = new Text(); public void map( LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { String line = ((Text) value).toString(); String[] token = line.split(","); String timestamp = token[1]; Calendar c = Calendar.getInstance(); c.setTimeInMillis(Long.parseLong(timestamp)); Integer h = c.get(Calendar.HOUR); hour.set(h.toString()); output.collect(hour, one) }}} Copyright 2009 - Trend Micro Inc. Classification 20
  • 21. Define Reduce function • Implements Reducer interface and define reduce() method • Input of reduce method : <key, a list of value> • K2, K3 should implement WritableComparable • V2, V3 should implement Writable • The OutputCollector has collect() method to collect the final result • The output of the Reducer is not re-sorted. • Applications can use the Reporter to report progress Copyright 2009 - Trend Micro Inc. Classification 21
  • 22. Implements Reducer class ReduceClass extends MapReduceBase implements Reducer< Text, IntWritable, Text, IntWritable> { IntWritable SumValue = new IntWritable(); public void reduce( Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) sum += values.next().get(); SumValue.set(sum); output.collect(key, SumValue); }} Copyright 2009 - Trend Micro Inc. 22
  • 23. Configuring and running your jobs • JobConf class is used for setting – Mapper, Reducer, Inputformat, OutputFormat, Combiner, Partitioner and etc. . – input path – output path – Other job configurations • Failure percentage of map and reduce tasks • Retry number when errors arise • JobClient class takes JobConf as configuration when running your job – JobClient.runJob(conf) • Submits the job and returns only after the job has completed. – JobClient.submitJob(conf) • Only submits the job, then poll the returned handle to the RunningJob to query status and make scheduling decisions. • JobClient.setJobEndNotificationURI(URI) – Sets up a notification upon job-completion, thus avoiding polling. Copyright 2009 - Trend Micro Inc. Classification 23
  • 24. More … • Partitioner – Partitioner partitions the key space for “shuffle” phase – the total number of partitions i.e. number of reduce-tasks for the job • HashPartitioner is the default Partitioner. Copyright 2009 - Trend Micro Inc. Classification 24
  • 25. Main program Class MyJob{ public static void main(String[] args) { JobConf conf = new JobConf(MyJob.class); conf.setJobName(”Caculate feedback log time distribution"); // set path conf.setInputPath(new Path(args[0])); conf.setOutputPath(new Path(args[1])); // set map reduce conf.setOutputKeyClass(Text.class); // set every word as key conf.setOutputValueClass(IntWritable.class); // set 1 as value conf.setMapperClass(MapClass.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(ReduceClass.class); onf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); // run JobClient.runJob(conf); }} Copyright 2009 - Trend Micro Inc. Classification 25
  • 26. Build 1. compile – javac -classpath hadoop-*-core.jar -d MyJava MyJob.java ‧ package – jar –cvf MyJob.jar -C MyJava . ‧ execution – bin/hadoop jar MyJob.jar MyJob input/ output/ Copyright 2009 - Trend Micro Inc. 26
  • 27. Launch the program • bin/hadoop jar MyJob.jar MyJob input/ output/ Copyright 2009 - Trend Micro Inc. Classification 27
  • 28. Monitoring the execution from web console http://172.16.203.132:50030/ Copyright 2009 - Trend Micro Inc. Classification 28
  • 29. How many mappers and reducers? • How many maps? – driven by the total size of the inputs, that is, the total number of blocks of the input files. – Can use JobConf’s setNumMapTasks(int) to provides a hint to the framework. • How many reducers? – The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int) – Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures. Copyright 2009 - Trend Micro Inc. Classification 29
  • 30. Non-Java Interfaces from Hadoop • Hadoop Pipes – Provide C++ API and library for MapReduce – The C++ application runs as a sub-process of java task • Hadoop Streaming – A utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Copyright 2009 - Trend Micro Inc. Classification 30
  • 31. Reference • Google MapReduce paper – http://labs.google.com/papers/mapreduce.html • Google: MapReduce in a Week – http://code.google.com/edu/submissions/mapreduce/listing.html • Google: Cluster Computing and MapReduce – http://code.google.com/edu/submissions/mapreduce-minilecture/ listing.html • Hadoop project official site – http://hadoop.apache.org/core/ Copyright 2009 - Trend Micro Inc. Classification 31
  • 32. Reference • MapReduce Tools for Eclipse Plug-In (by IBM) – A robust plug-in that brings Hadoop support to the Eclipse platform – Download • Hadoop Virtual Image (by Google) – his VMware image contains a preconfigured single node instance of Hadoop. This provides the same interface as a full cluster without any of the overhead. It is suitable for educators exploring the platform and students working independently. The following Download and VMware Player links point to websites external to Google. – Document link Copyright 2009 - Trend Micro Inc. Classification 32