MapReduce is one of the major components of the Hadoop ecosystem. When a data set is very large, it is divided into smaller pieces, and MapReduce processes those pieces in parallel.
3. MapReduce is one of the main components of the Hadoop ecosystem.
MapReduce is designed to process large amounts of data in parallel by dividing
the work into smaller, independent tasks.
MapReduce programs take a list as input and produce a list as output.
4. The Map function takes a set of keys and values as input. The data may be in
structured or unstructured form
The keys are references to the input files and the values are the data set
The map task is applied to every input value
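
The map step described above can be sketched in plain Python. This is a minimal illustration, not Hadoop's actual API: the function name `word_count_mapper`, the record keys such as `"doc1:0"`, and the sample text are all assumptions made for the example.

```python
from typing import Iterator, Tuple


def word_count_mapper(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input record.

    key   -- hypothetical reference to where the record came from
    value -- the record itself, here one line of text
    """
    for word in value.lower().split():
        yield (word, 1)


# Hypothetical input: (reference, line) pairs.
records = [("doc1:0", "Deer Bear River"), ("doc1:1", "Car Car River")]

# The same mapper is applied to every input value independently.
intermediate = [
    pair
    for key, value in records
    for pair in word_count_mapper(key, value)
]
```

Because each call depends only on its own record, the calls could run on different machines without coordination, which is what makes the map phase parallel.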
5. The Reducer takes the key-value pairs created by the mapper as
input
The key-value pairs are sorted by key
In the Reducer, we perform sorting, aggregation, or summation-type
jobs
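
The reduce step can be sketched as follows, assuming a summation job. The names `sum_reducer` and `intermediate` are hypothetical, and the explicit `sorted` call stands in for the sorting that the framework performs before the reducer runs.

```python
from itertools import groupby
from operator import itemgetter


def sum_reducer(key, values):
    """Aggregate all values that share one key into a single total."""
    return (key, sum(values))


# Hypothetical mapper output, in arbitrary order.
intermediate = [("river", 1), ("car", 1), ("bear", 1), ("car", 1), ("river", 1)]

# Sort by key so that equal keys become adjacent, then group them.
sorted_pairs = sorted(intermediate, key=itemgetter(0))
results = [
    sum_reducer(key, (value for _, value in group))
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
]
```

`groupby` only merges adjacent items, so the sort by key is what guarantees that each key reaches the reducer exactly once.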
6. The given inputs are processed by user-defined methods.
The business logic is implemented in the mapper section. The mapper generates
intermediate data, and the Reducer takes that data as input. The data is
processed by a user-defined function in the Reducer section. The final output
is stored in HDFS (Hadoop Distributed File System).
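
The flow on this slide, where user-defined functions are plugged into a fixed pipeline, can be sketched in memory. This is an illustration under stated assumptions: `run_job`, `max_temp_mapper`, `max_temp_reducer`, and the `"year,temperature"` record layout are hypothetical, and the result is returned as a list rather than written to HDFS.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run a user-defined mapper and reducer over (key, value) records."""
    grouped = defaultdict(list)
    # Map phase: the mapper produces intermediate key-value pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: each key is reduced with all of its values.
    return [reducer(key, grouped[key]) for key in sorted(grouped)]


def max_temp_mapper(_key, line):
    """Parse a hypothetical 'year,temperature' record."""
    year, temp = line.split(",")
    yield year, int(temp)


def max_temp_reducer(year, temps):
    """Keep the highest temperature seen for a year."""
    return year, max(temps)


records = [(0, "1949,111"), (1, "1950,22"), (2, "1949,78"), (3, "1950,-11")]
output = run_job(records, max_temp_mapper, max_temp_reducer)
```

Only the two small functions carry job-specific logic. The pipeline in `run_job` stays the same for any job.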
7. When Mapper output is collected, it is partitioned, which means that it is
written to the output specified by the partitioner
Partitioning is responsible for dividing up the intermediate key space and
assigning intermediate key-value pairs to reducers
It aims to assign approximately the same number of keys to each reducer
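
A hash-based partitioner is one common way to do this, and it can be sketched as below. The names `partition` and `partition_all` are hypothetical, and `zlib.crc32` is used here only as a stable hash for the example. It is not the hash function Hadoop uses.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_all(pairs, num_reducers):
    """Distribute intermediate pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical intermediate pairs from the mappers.
pairs = [("car", 1), ("river", 1), ("car", 1), ("bear", 1)]
buckets = partition_all(pairs, 3)
```

Every pair with the same key lands in the same bucket, so one reducer sees all values for that key. Hashing tends to spread distinct keys evenly, but a key with very many values can still overload a single reducer.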
8. Combiners are an optimization in MapReduce that allow for local
aggregation before the shuffle and sort phase
If a Combiner is used, the map key-value pairs are not immediately
written to the output. Instead, they are collected in lists, one list per
key
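
Local aggregation by a combiner can be sketched as follows, assuming a summation job where partial sums are safe to merge later. The function name `combine` and the sample data are hypothetical.

```python
from collections import defaultdict


def combine(map_output):
    """Aggregate one mapper's output locally, before the shuffle."""
    collected = defaultdict(list)
    # Collect the values in lists, one list per key.
    for key, value in map_output:
        collected[key].append(value)
    # Emit one partial sum per key instead of one pair per occurrence.
    return [(key, sum(values)) for key, values in collected.items()]


# Hypothetical output of a single mapper.
map_output = [("car", 1), ("river", 1), ("car", 1), ("car", 1)]
combined = combine(map_output)
```

Here four pairs shrink to two before anything is sent over the network. This works for sums because adding partial sums gives the same total; an operation such as a mean would need a different intermediate form.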
9. Let us take a real-world example to comprehend the power of
MapReduce
Twitter receives around 500 million tweets per day, which is nearly 6,000
tweets per second