What is Hadoop?
• The Apache Hadoop project develops open-
source software for reliable, scalable,
distributed computing.
In a nutshell
• Hadoop provides a reliable shared storage and
analysis system.
• The storage is provided by HDFS.
• The analysis is provided by MapReduce.
Map Reduce
• HDFS handles the Distributed Filesystem layer
• MapReduce is a programming model for data processing.
• MapReduce
– Framework for parallel computing
– Programmers get simple API
– Don’t have to worry about handling
• parallelization
• data distribution
• load balancing
• fault tolerance
• Allows one to process huge amounts of data
(terabytes and petabytes) on thousands of processors
Map Reduce Concepts (Hadoop 1.0)
[Architecture diagram: an HDFS cluster of Data Nodes, each paired with a Task
Tracker; the MapReduce Engine's Job Tracker runs on the admin node alongside
the Name Node.]
Map Reduce Concepts
 Job Tracker
The Job-Tracker is responsible for accepting jobs
from clients, dividing those jobs into tasks, and
assigning those tasks to be executed by worker
nodes.
 Task Tracker
Task-Tracker is a process that manages the
execution of the tasks currently assigned to that
node. Each Task Tracker has a fixed number of
slots for executing tasks (two maps and two
reduces by default).
Job Tracker
[Diagram: client-side job submission.
1. User copies the input files to the DFS
2. User submits the job to the client
3. Client gets the input files' info from the DFS
4. Client creates the input splits
5. Client uploads the job information (Job.xml, Job.jar) to the DFS
6. Client submits the job to the Job Tracker]
Job Tracker
[Diagram: job initialization on the Job Tracker.
6. Client submits the job to the Job Tracker
7. Job Tracker initializes the job and places it in the job queue
8. Job Tracker reads the job files (Job.xml, Job.jar) from the DFS
9. Job Tracker creates the map and reduce tasks (as many maps as there are
input splits)]
Job Tracker
[Diagram: task assignment.
10. Task Trackers on hosts H1-H4 send heartbeats to the Job Tracker
11. Job Tracker picks tasks from the job queue (data local if possible)
12. Job Tracker assigns tasks to the Task Trackers]
Understanding Data Transformations
• In order to write MapReduce applications you need to have an understanding of
how data is transformed as it executes in the MapReduce framework.
From start to finish, there are four fundamental transformations. Data is:
• Transformed from the input files and fed into the mappers
• Transformed by the mappers
• Sorted, merged, and presented to the reducer
• Transformed by the reducers and written to output files
Solving a Programming Problem using
MapReduce
There are a total of 10 fields of information in each line. Our
programming objective uses only the first and fourth fields, which
are arbitrarily called "year" and "delta" respectively. We will ignore
all the other fields of data.
Designing and Implementing the
Mapper Class
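The original slide shows the mapper only as a code screenshot. Based on the editor's notes at the end of this deck (the record key is the byte offset of the line, the value is tokenized on whitespace, and the output key is hard-coded to "summary"), a minimal sketch might look like the following. The class name, the "year delta" Text value layout, and the field-count check are illustrative assumptions, not the slide's exact code.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the "year"/"delta" problem described above.
public class DeltaMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Hard-coding the key forces a single reducer group (see the editor's notes).
  private static final Text SUMMARY_KEY = new Text("summary");
  private final Text outValue = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line on whitespace; only fields 1 (year) and 4 (delta) are used.
    StringTokenizer itr = new StringTokenizer(value.toString());
    String[] fields = new String[10];
    int count = 0;
    while (itr.hasMoreTokens() && count < fields.length) {
      fields[count++] = itr.nextToken();
    }
    if (count >= 4) {
      outValue.set(fields[0] + " " + fields[3]);   // "year delta"
      context.write(SUMMARY_KEY, outValue);
    }
  }
}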
Designing and Implementing the
Reducer Class
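This slide is likewise a screenshot. A hedged sketch of the reducer described in the editor's notes, which scans every "year delta" value for the single "summary" key and emits the year with the global minimum delta, could be:

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: finds the year with the smallest (most negative) delta.
public class DeltaReducer extends Reducer<Text, Text, Text, FloatWritable> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    float min = Float.MAX_VALUE;
    String minYear = "";
    for (Text value : values) {
      String[] parts = value.toString().split("\\s+");
      float delta = Float.parseFloat(parts[1]);
      if (delta < min) {   // track the global minimum delta and its year
        min = delta;
        minYear = parts[0];
      }
    }
    context.write(new Text(minYear), new FloatWritable(min));
  }
}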
Design and Implement The Driver
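A sketch of the driver, following the editor's notes (job output types Text and FloatWritable, map output types set explicitly because the mapper emits Text values, synchronous launch via waitForCompletion); all names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver wiring the mapper and reducer sketches together.
public class DeltaDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "min delta");
    job.setJarByClass(DeltaDriver.class);
    job.setMapperClass(DeltaMapper.class);
    job.setReducerClass(DeltaReducer.class);
    // Mapper and reducer emit different value types, so set the map output types explicitly.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion(true) launches the job synchronously and prints progress.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}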
Introduction to MapReduce
Framework
 A programming model for parallel data processing.
 Hadoop can run map reduce programs in multiple
languages like Java, Python, Ruby etc.
 Map function:
 Operate on set of key, value pairs
 Map is applied in parallel on input data set
 This produces output keys and list of values for each key
depending upon the functionality
 Mapper outputs are partitioned per reducer
 Reduce function:
 Operate on set of key, value pairs
 Reduce is then applied in parallel to each group, again
producing a collection of key, values.
 Total number of reducers can be set by the user.
Skeleton of a MapReduce Program
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
Skeleton of a MapReduce Program
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
Skeleton of a MapReduce program
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileSystem.get(conf).delete(new Path(args[1]), true);
    job.waitForCompletion(true);
  }
}
Executing an MR Job in Java
1) Compile all three Java files, which creates three
.class files
2) Package the three .class files into a single jar file
with this command:
jar -cvf file_name.jar *.class
3) Now execute the jar with this command:
bin/hadoop jar file_name.jar Basic
input_file_name output_file_name
(Basic is the name of the driver/main class)
Overall MR Word Count Process
Understanding processing in a
MapReduce framework
User runs a program on the client computer
Program submits a job to HDFS.
Job contains:
 Input data
 Map / Reduce program
 Configuration information
Two types of daemons that control job
execution:
 Job Tracker (master node)
 Task Trackers (slave nodes)
Understanding processing in a
MapReduce framework
Job sent to JobTracker.
JobTracker communicates with NameNode
and assigns parts of job to TaskTrackers
 Task is a single MAP or REDUCE
operation over piece of data.
 The JobTracker knows (from NameNode)
which node contains the data, and which
other machines are nearby.
Task processes send heartbeats to
TaskTracker
 TaskTracker sends heartbeats to the
JobTracker.
Understanding processing in a
MapReduce framework
Any task that does not report within a certain time
(default is 10 min) is assumed to have failed; its JVM
is killed by the TaskTracker and the failure is
reported to the JobTracker
The JobTracker reschedules any failed task
(on a different TaskTracker)
If the same task fails 4 times, the whole job fails
Any TaskTracker reporting a high number of failed
tasks on a particular node causes that node to be
blacklisted (its metadata is removed from the NameNode)
The JobTracker maintains and manages the status
of each job. Results from failed tasks are
ignored
MapReduce Job Submission Flow
[A sequence of build slides over an INPUT DATA / Node 1 / Node 2 diagram,
adding one step at a time:]
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored
MapReduce Flow - Mapper
[Diagram: an InputFormat divides the input files into InputSplits; one
RecordReader per split feeds a Mapper, and each Mapper produces its own
intermediates.]
MapReduce Flow – Shuffle and Sort
[Diagram: each Mapper's output passes through a Partitioner; the partitioned
intermediates are then shuffled and merged into the intermediates consumed by
the Reducers.]
MapReduce Flow - Reducer
[Diagram: each Reducer writes its results through a RecordWriter, supplied by
the OutputFormat, into its own output file.]
Map Reduce – A Closer Look
[Diagram, repeated for Node 1 and Node 2: files are loaded from the local HDFS
store and passed through the InputFormat into Splits; RecordReaders (RR) turn
each split into input (K, V) pairs for the Map tasks; the Partitioner routes
the intermediate (K, V) pairs, which are sorted and handed to Reduce; the
OutputFormat writes the final (K, V) pairs back to the local HDFS store.
Between the nodes, intermediate (K, V) pairs are exchanged by all nodes in the
shuffling process.]
MapReduce API - Overview
MapReduce API Data Types: Writable
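The slide content is an image. Commonly used built-in Writable wrappers include IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text, BytesWritable and NullWritable; key types additionally implement WritableComparable. A custom value type only needs write() and readFields(); the class below is an illustrative sketch, not part of the original deck:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal custom value type: serialize and deserialize the fields in the same order.
public class YearDelta implements Writable {
  private int year;
  private float delta;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeFloat(delta);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    delta = in.readFloat();
  }
}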
MapReduce - Input Format
 How the input files are split up and read is defined by
the InputFormat
 InputFormat is a class that does the following:
 Selects the files that should be used for input
 Defines the InputSplits that break up the file
 Provides a factory for RecordReader objects that
read the file
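As a small illustration (not on the original slide), the InputFormat is selected on the Job in the driver. TextInputFormat is the default; KeyValueTextInputFormat is one alternative. The class name and input path below are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  // Choose how the input files are split and read; TextInputFormat is the default.
  static Job buildJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "input format example");
    job.setInputFormatClass(KeyValueTextInputFormat.class);  // lines already of the form key<TAB>value
    FileInputFormat.addInputPath(job, new Path("/input"));   // illustrative path
    return job;
  }
}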
MapReduce API - InputFormats
Input Splits
 An input split describes a unit of work that comprises
a single map task in a MapReduce program
 By default, the InputFormat breaks a file into splits of
up to 64 MB (the default HDFS block size)
 By dividing the file into splits, we allow several map
tasks to operate on a single file in parallel
 If the file is very large, this can improve performance
significantly through parallelism
 Each map task corresponds to a single input split
RecordReader
 The input split defines a slice of work but does
not describe how to access it
 The RecordReader class actually loads data from
its source and converts it into (K, V) pairs suitable
for reading by Mappers
 The RecordReader is invoked repeatedly on the
input until the entire split is consumed
 Each invocation of the RecordReader leads to
another call of the map function defined by the
programmer
Mapper and Reducer
 The Mapper performs the user-defined work of
the first phase of the MapReduce program.
 A new instance of Mapper is created for each
split.
 The Reducer performs the user-defined work of
the second phase of the MapReduce program.
 A new instance of Reducer is created for each
partition.
 For each key in the partition assigned to a
Reducer, the Reducer is called once.
Combiner
• Apply reduce function to map output before it is sent to reducer
• Reduces the number of records emitted by the mapper!
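In the WordCount skeleton shown earlier, the reduce function is associative and commutative, so the same class can double as the combiner; a one-line driver addition (a sketch, not on the original slide) is:

// In the WordCount driver, alongside setMapperClass/setReducerClass:
job.setCombinerClass(IntSumReducer.class);   // run the reduce function on each mapper's local output first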
Partitioner
 Each mapper may produce (K, V) pairs to any
partition.
 Therefore, the map nodes must all agree on
where to send different pieces of intermediate
data.
 The partitioner class determines which
partition a given (K,V) pair will go to.
The default partitioner computes a hash value for a
given key and assigns it to a partition based on
this result.
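A custom partitioner that mirrors the default hash-based behaviour described above might look like this sketch (the class name and key/value types are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route a (key, value) pair to a partition based on the key's hash code.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
// Enabled in the driver with: job.setPartitionerClass(WordPartitioner.class);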
MapReduce Execution – Single
Reduced Task
MapReduce Execution – Multiple
Reduce tasks
MapReduce Execution – With No
Reduce Tasks
Shuffle and Sort
[Diagram: on the map side, Mapper output goes into an in-memory circular
buffer, is spilled to disk, and the spills are merged (with the Combiner
applied) into partitioned intermediate files on disk; on the reduce side, the
Reducer fetches and merges spills from this and other mappers, while other
reducers fetch their own partitions.]
Shuffle and Sort
• Probably the most complex aspect of MapReduce, and the heart of
the framework!
• Map side
 Map outputs are buffered in memory in a circular buffer.
 When buffer reaches threshold, contents are “spilled” to
disk.
 Spills merged in a single, partitioned file (sorted within each
partition): combiner runs here first.
• Reduce side
 First, map outputs are copied over to reducer machine.
 “Sort” is a multi-pass merge of map outputs (happens in
memory and on disk): combiner runs here again.
 Final merge pass goes directly into reducer.
Output Format
 The OutputFormat class defines the way (K,V) pairs
produced by Reducers are written to output files
 The instances of OutputFormat provided by Hadoop
write to files on the local disk or in HDFS
 Several OutputFormats are provided by Hadoop:
TextOutputFormat - Default; writes lines in "key \t value"
(tab-separated) format
SequenceFileOutputFormat - Writes binary files
suitable for reading into subsequent MR jobs
NullOutputFormat - Generates no output files
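Switching a job away from the default TextOutputFormat is a single driver call; for example (a sketch, assuming a Job variable named job):

// Write binary SequenceFiles instead of tab-separated text.
job.setOutputFormatClass(SequenceFileOutputFormat.class);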
Job Scheduling in MapReduce
Job Queue
Fault Tolerance
 MapReduce can guide jobs toward a successful completion even
when jobs are run on a large cluster where probability of failures
increases
 The primary way that MapReduce achieves fault tolerance is
through restarting tasks
 If a TT fails to communicate with the JT for a period of time
(by default, 1 minute in Hadoop), the JT will assume that the TT
in question has crashed
 If the job is still in the map phase, the JT asks another TT to
re-execute all Mappers that previously ran at the failed TT
 If the job is in the reduce phase, the JT asks another TT to
re-execute all Reducers that were in progress on the failed TT
Speculative Execution
 A MapReduce job is dominated by the slowest task
 MapReduce attempts to locate slow tasks (stragglers)
and run redundant (speculative) tasks that will
optimistically commit before the corresponding
stragglers
 This process is known as speculative execution
 Only one copy of a straggler is allowed to be
speculated
 Whichever of the two copies of a task commits first
becomes the definitive copy, and the other copy is
killed by the JT
Locating Stragglers
 How does Hadoop locate stragglers?
 Hadoop monitors each task progress using a
progress score between 0 and 1
 If a task’s progress score is less than (average –
0.2), and the task has run for at least 1 minute, it
is marked as a straggler
[Diagram example: over the same elapsed time, task T1 has a progress score of
2/3 and is not a straggler, while task T2 has a progress score of 1/12 and is
flagged as a straggler.]
MapReduce Execution - One Picture
Data Flow in a MapReduce Program
• InputFormat
• Map function
• Partitioner
• Sorting & Merging
• Combiner
• Shuffling
• Merging
• Reduce function
• OutputFormat
Counters
There are often things that you would like to know about the data you are analyzing
but that are peripheral to the analysis you are performing. Counters are a useful
channel for gathering statistics about the job: for quality control or for application-
level statistics.
Built-in Counters
Hadoop maintains some built-in counters for every job, and these report various
metrics. For example, there are counters for the number of bytes and records
processed, which allow you to confirm that the expected amount of input was
consumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in
counters,
• MapReduce task counters
• Filesystem counters
• FileInputFormat counters
• FileOutputFormat counters
Each group either contains task counters (which are updated as a task progresses) or
job counters (which are updated as a job progresses).
Counters
Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job.
The MAP_INPUT_RECORDS counter, for example, counts the input records read by
each map task and aggregates over all map tasks in a job, so that the final figure is the
total number of input records for the whole job.
Task counters are maintained by each task attempt, and periodically sent to the
application master so they can be globally aggregated.
Job counters
Job counters are maintained by the application master, so they don’t need to be sent
across the network, unlike all other counters, including user-defined ones. They
measure job-level statistics, not values that change while a task is running. For
example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were
launched over the course of a job (including tasks that failed).
User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then incremented
as desired in the mapper or reducer. Counters are defined by a Java enum.
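A minimal sketch of an enum-based counter (the enum name, mapper, and field-count check are illustrative, not from the deck):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that counts records with too few fields.
public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  enum Quality { MISSING_FIELDS }   // each enum constant becomes a named counter

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\\s+");
    if (fields.length < 10) {
      // Totals are aggregated across all task attempts by the framework.
      context.getCounter(Quality.MISSING_FIELDS).increment(1);
      return;
    }
    context.write(new Text(fields[0]), NullWritable.get());
  }
}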
JOINS
MAP-JOIN
• A map-side join between large inputs works by performing the join
before the data reaches the map function. For this to work, though,
the inputs to each map must be partitioned and sorted in a
particular way. Each input dataset must be divided into the same
number of partitions, and it must be sorted by the same key (the
join key) in each source. All the records for a particular key must
reside in the same partition. This may sound like a strict
requirement (and it is), but it actually fits the description of the
output of a MapReduce job.
Distributed Cache
• It is preferable to distribute datasets using Hadoop’s distributed
cache mechanism which provides a service for copying files to the
task nodes for the tasks to use them when they run. To save
network bandwidth, files are normally copied to any particular
node once per job.
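A hedged sketch of the pattern with the newer MapReduce API: the driver registers a small lookup file with job.addCacheFile(new URI("/lookup/table.txt")), and each task loads it once in setup(). The file name, the tab-separated format, and the join-key position are all assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads a cached lookup file once per task, then joins against it in map().
public class CachedLookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();   // files registered via job.addCacheFile(...)
    if (cacheFiles != null && cacheFiles.length > 0) {
      // Cached files are localized to the task and reachable by their base name.
      String localName = new Path(cacheFiles[0].getPath()).getName();
      try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      }
    }
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    context.write(new Text(fields[0]), new Text(lookup.getOrDefault(fields[0], "UNKNOWN")));
  }
}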
JOINS
Reducer side –JOIN
• Reduce-side join is more general than a map-side
join, in that the input datasets don’t have to be
structured in any particular way, but it is less
efficient because both datasets have to go
through the MapReduce shuffle. The basic idea is
that the mapper tags each record with its source
and uses the join key as the map output key, so
that the records with the same key are brought
together in the reducer.
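A minimal sketch of the tagging mapper described above; deriving the tag from the input file name, the tag values, and the tab-separated record layout are all illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tag each record with its source and emit the join key as the map output key.
public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String tag;

  @Override
  protected void setup(Context context) {
    // Derive the source tag from the name of the file this split came from.
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    tag = fileName.startsWith("orders") ? "O" : "C";   // illustrative source names
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    // fields[0] is assumed to be the join key in both datasets.
    context.write(new Text(fields[0]), new Text(tag + "\t" + value));
  }
}

In the reducer, records with the same join key arrive together and can be separated again by their tag before being combined.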
Secondary Sorting
• The MapReduce framework sorts the records by key
before they reach the reducers. For any particular key,
however, the values are not sorted. The order in which
the values appear is not even stable from one run to
the next, because they come from different map tasks,
which may finish at different times from run to run.
Generally, most MapReduce programs are written so as
not to depend on the order in which the values appear
to the reduce function. However, it is possible to
impose an order on the values by sorting and grouping
the keys in a particular way.
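One hedged way to do this is with a composite Text key of the form "year\tdelta": partition and group on the year only, so each reduce call sees its values in full-key order. The helper classes below are an illustrative sketch; the driver would also call job.setPartitionerClass(YearPartitioner.class) and job.setGroupingComparatorClass(YearGroupingComparator.class):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortHelpers {

  // Send every key with the same year (the natural key) to the same reducer.
  public static class YearPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      String year = key.toString().split("\t", 2)[0];
      return (year.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group reducer input by year only, ignoring the secondary part of the key.
  public static class YearGroupingComparator extends WritableComparator {
    protected YearGroupingComparator() {
      super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String yearA = a.toString().split("\t", 2)[0];
      String yearB = b.toString().split("\t", 2)[0];
      return yearA.compareTo(yearB);
    }
  }
}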
ToolRunner
public int run(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf);
  job.setJarByClass(multiInputFile.class);
  …..
  ……
  ……
  FileOutputFormat.setOutputPath(job, new Path(args[2]));
  FileSystem.get(conf).delete(new Path(args[2]), true);
  return (job.waitForCompletion(true) ? 0 : 1);
}

public static void main(String[] args) throws Exception {
  int ecode = ToolRunner.run(new multiInputFile(), args);
  System.exit(ecode);
}
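For ToolRunner.run() to accept it, the multiInputFile class is presumably declared along the lines of
public class multiInputFile extends Configured implements Tool { ... }
so that the run() method above implements the Tool interface; that declaration is not shown on the slide.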
Practice Session
Thank You
• Question?
• Feedback?
explorehadoop@gmail.com


Editor's Notes

  1. The key of the first record is the byte offset to the line in the input file (the 0th byte). The value of the first record includes the year, number of receipts, outlays, and the delta (receipts – outlays). Remember – we are interested only in the first and fourth fields of the record value. Since the record value is in Text format, we will use a StringTokenizer to break up the Text string into individual fields. Here we construct the StringTokenizer using white space as the delimiter.
  2. Since we hard-coded the key to always be the string “summary,” there will be only one partition (and therefore only one reducer) when this mapreduce program is launched.
  3. We determine if we’ve found a global minimum delta, and if so, assign the min and minYear accordingly. When we pop out of the loop, we have the global min delta and the year associated with the min. We emit the year and min delta.
  4. In the Driver class, we also define the types for the output key and value in the job as Text and FloatWritable respectively. If the mapper and reducer classes do NOT use the same output key and value types, we must specify the output types for the mapper explicitly. In this case, the output value type of the mapper is Text, while the output value type of the reducer is FloatWritable. There are two ways to launch the job: synchronously and asynchronously. The job.waitForCompletion() call launches the job synchronously; the driver code will block waiting for the job to complete at this line. The true argument informs the framework to write verbose output to the controlling terminal of the job. The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job. We then call the ToolRunner static run() method.