2. • Understanding MapReduce
• Map Reduce - An Introduction
• Word count – default
• Word count – custom
nagarjuna@outlook.com
3. ¡ Programming model to process large datasets
¡ Supported languages for MR
§ Java
§ Ruby
§ Python
§ C++
¡ MapReduce programs are inherently parallel.
§ More data → more machines to analyze.
§ No need to change anything in the code.
4. ¡ Start with WORDCOUNT example
§ “Do as I say, not as I do”

Word   Count
As     2
Do     2
I      2
Not    1
Say    1
5. define wordCount as Map<String,long>;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);

¡ This works as long as the number of documents to process is not very large
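The pseudocode above can be sketched as plain Java (class and method names are illustrative; the tokenizer here is a simple whitespace split):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count word occurrences across a set of documents, as in the pseudocode.
    public static Map<String, Long> count(Iterable<String> documentSet) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSet) {
            // tokenize(document): lower-case and split on whitespace
            for (String token : document.toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                wordCount.merge(token, 1L, Long::sum);   // wordCount[token]++
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(count(java.util.List.of("do as i say", "not as i do")));
    }
}
```

This is the single-machine version: one map, one loop, one display.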
6. ¡ Spam filter
§ Millions of emails
§ Word count for analysis
¡ Working from a single computer is time consuming
¡ Rewrite the program to count from multiple machines
7. ¡ How do we attain parallel computing?
1. All the machines compute a fraction of the documents
2. Combine the results from all the machines
8. STAGE 1
define wordCount as Map<String,long>;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
9. STAGE 2
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
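A minimal plain-Java sketch of the two stages (names are illustrative; in practice each stage-1 call would run on a different machine, and one machine would run stage 2):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageWordCount {
    // STAGE 1: each machine counts words in its own subset of the documents.
    public static Map<String, Long> stage1(List<String> documentSubset) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSubset) {
            for (String token : document.split("\\s+")) {
                wordCount.merge(token, 1L, Long::sum);
            }
        }
        return wordCount;
    }

    // STAGE 2: one machine merges the per-machine counts (the multisetAdd step).
    public static Map<String, Long> stage2(List<Map<String, Long>> partialCounts) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> wordCount : partialCounts) {
            wordCount.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }
}
```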
11. Problems in STAGE 1
(Diagram: Master distributing documents to Comp-1 … Comp-4)
• Document segregation has to be well defined
• Bottleneck in network transfer
• Data-intensive processing, not computationally intensive
• So, better to store the files on the processing machines
• BIGGEST FLAW: storing the words and counts in memory
• A disk-based hash-table implementation is needed
12. Problems in STAGE 2
(Diagram: Master distributing documents to Comp-1 … Comp-4)
• Phase 2 has only one machine: a bottleneck
• Phase 1 is highly distributed, though
• Make phase 2 also distributed
• This needs changes in phase 1: partition the phase-1 output (say, based on the first character of the word)
• We then have 26 machines in phase 2
• The single disk-based hash-table now becomes 26 disk-based hash-tables: wordCount-a, wordCount-b, wordCount-c, …
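The first-character partitioning can be sketched as follows (an illustrative helper, not Hadoop's actual Partitioner API):

```java
public class FirstCharPartitioner {
    // Map a word to one of 26 phase-2 machines by its first character.
    // Assumes ASCII words; anything not starting with a letter goes to partition 0.
    public static int partition(String word) {
        char c = Character.toLowerCase(word.charAt(0));
        return (c >= 'a' && c <= 'z') ? c - 'a' : 0;
    }
}
```

With this, every phase-1 machine sends the same word to the same phase-2 machine, so the partial counts for a given word always meet in one place.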
13. (Diagram: each phase-1 machine, Comp-1 … Comp-4, holds documents from the Master and emits counts for words starting with A–E; these per-letter counts are sent on to the phase-2 machines Comp-10 … Comp-40)
14. ¡ After phase-1
§ From comp-1
▪ WordCount-A → comp-10
▪ WordCount-B → comp-20
▪ …
¡ Each machine in phase 1 will shuffle its output to different machines in phase 2
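That shuffle step can be sketched in plain Java (illustrative, not Hadoop's implementation): each phase-1 machine splits its word counts into per-partition maps before sending them out, using the first-character scheme from the previous slides.

```java
import java.util.HashMap;
import java.util.Map;

public class Shuffle {
    // Split one machine's word counts into per-partition maps,
    // partitioning by first character (a = 0 … z = 25).
    // Assumes words start with an ASCII letter.
    public static Map<Integer, Map<String, Long>> byPartition(Map<String, Long> wordCount) {
        Map<Integer, Map<String, Long>> out = new HashMap<>();
        wordCount.forEach((word, n) -> {
            int p = Character.toLowerCase(word.charAt(0)) - 'a';
            out.computeIfAbsent(p, k -> new HashMap<>()).put(word, n);
        });
        return out;
    }
}
```

Partition 0 would go to comp-10, partition 1 to comp-20, and so on.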
15. ¡ This is getting complicated
§ Store files where they are being processed
§ Write a disk-based hash table, obviating RAM limitations
§ Partition the phase-1 output
§ Shuffle the phase-1 output and send it to the appropriate reducer
16. ¡ This is more than a lot for word count
¡ We haven’t even touched fault tolerance
§ What if comp-1 or comp-10 fails?
¡ So, a framework is needed to take care of all these things
§ We concentrate only on the business logic
17. MAPPER → interim output → REDUCER
(Diagram: documents from HDFS are processed by mapper machines Comp-1 … Comp-4; their per-letter counts A–E are partitioned and shuffled to reducer machines Comp-10 … Comp-40)
18. ¡ Mapper
¡ Reducer
The mapper filters and transforms the input; the reducer collects that output and aggregates on it.
Extensive research was done to arrive at the two-phase strategy.
19. ¡ Mapper, Reducer, Partitioner, Shuffling
§ Work together → common structure for data processing

           Input            Output
Mapper     <K1,V1>          list<K2,V2>
Reducer    <K2,list(V2)>    list<K3,V3>
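Those signatures can be written out as plain-Java interfaces (an illustrative sketch only; Hadoop's real API passes output collector/context objects rather than returning lists):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MRTypes {
    // Mapper: (K1, V1) -> list<K2, V2>
    public interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reducer: (K2, list(V2)) -> list<K3, V3>
    public interface Reducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Word-count instance: K1 = line offset, V1 = line of text,
    // K2 = word, V2 = 1, K3 = word, V3 = total count.
    public static final Mapper<Long, String, String, Long> WC_MAP = (offset, line) -> {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(new SimpleEntry<>(w, 1L));
        return out;
    };

    public static final Reducer<String, Long, String, Long> WC_REDUCE = (word, ones) ->
        List.of(new SimpleEntry<>(word, ones.stream().mapToLong(Long::longValue).sum()));
}
```

The framework's job is everything between the two: collecting the mapper's (K2, V2) pairs and grouping them into (K2, list(V2)) for the reducer.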
21. ¡ As said, don’t store the data in memory
§ So keys and values regularly have to be written to disk.
§ They must be serialized.
§ Hadoop provides its own way of serialization and deserialization.
§ Any class to be used as a key or value has to implement the WRITABLE interface.
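The idea behind Writable can be illustrated with a plain-Java round trip (a sketch using only stdlib streams, not Hadoop's actual Writable interface; the class name is made up): the object writes its fields to a binary stream in a fixed order and reads them back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WordCountRecord {
    public final String word;
    public final long count;

    public WordCountRecord(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // Analogous to Writable.write(DataOutput): serialize fields in a fixed order.
    public void write(DataOutputStream out) throws IOException {
        out.writeUTF(word);
        out.writeLong(count);
    }

    // Analogous to Writable.readFields(DataInput): deserialize in the same order.
    public static WordCountRecord read(DataInputStream in) throws IOException {
        return new WordCountRecord(in.readUTF(), in.readLong());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new WordCountRecord("hadoop", 42L).write(new DataOutputStream(buf));
        WordCountRecord r = WordCountRecord.read(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(r.word + " " + r.count);
    }
}
```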
22. Java Type    Hadoop Serialized Type
String           Text
Integer          IntWritable
Long             LongWritable
23. ¡ Let’s try to execute the following commands
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>
¡ What does this code do?