2. • Understanding MapReduce
• Map Reduce - An Introduction
• Word count – default
• Word count – custom
nagarjuna@outlook.com
3. ¡ Programming model to process large datasets
¡ Supported languages for MR
§ Java
§ Ruby
§ Python
§ C++
¡ MapReduce programs are inherently parallel.
§ More data → more machines to analyze.
§ No need to change anything in the code.
4. ¡ Start with WORDCOUNT example
§ “Do as I say, not as I do”

Word   Count
As     2
Do     2
I      2
Not    1
Say    1
5. define wordCount as Map<String,long>;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);

¡ This works as long as the number of documents to process is not very large
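The pseudocode above can be sketched as plain Java (class and method names are illustrative; the tokenizer here is a simple whitespace split):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count word occurrences across a set of documents, as in the pseudocode.
    public static Map<String, Long> count(Iterable<String> documentSet) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSet) {
            // tokenize(document): lower-case and split on whitespace
            for (String token : document.toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                wordCount.merge(token, 1L, Long::sum);   // wordCount[token]++
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(count(java.util.List.of("do as i say", "not as i do")));
    }
}
```

This is the single-machine version: one map, one loop, one display.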
6. ¡ Spam filter
§ Millions of emails
§ Word count for analysis
¡ Working from a single computer is time consuming
¡ Rewrite the program to count from multiple machines
7. ¡ How do we attain parallel computing?
1. All the machines compute a fraction of the documents
2. Combine the results from all the machines
8. STAGE 1
define wordCount as Map<String,long>;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
9. STAGE 2
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
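A minimal plain-Java sketch of the two stages (names are illustrative; in practice each stage-1 call would run on a different machine, and one machine would run stage 2):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageWordCount {
    // STAGE 1: each machine counts words in its own subset of the documents.
    public static Map<String, Long> stage1(List<String> documentSubset) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSubset) {
            for (String token : document.split("\\s+")) {
                wordCount.merge(token, 1L, Long::sum);
            }
        }
        return wordCount;
    }

    // STAGE 2: one machine merges the per-machine counts (the multisetAdd step).
    public static Map<String, Long> stage2(List<Map<String, Long>> partialCounts) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> wordCount : partialCounts) {
            wordCount.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }
}
```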
11. Problems in STAGE 1
(Diagram: Master distributing documents to Comp-1 … Comp-4)
• Document segregation has to be well defined
• Bottleneck in network transfer
• Data-intensive processing, not computationally intensive
• So, better to store the files on the processing machines
• BIGGEST FLAW: storing the words and counts in memory
• A disk-based hash-table implementation is needed
12. Problems in STAGE 2
(Diagram: Master distributing documents to Comp-1 … Comp-4)
• Phase 2 has only one machine: a bottleneck
• Phase 1 is highly distributed, though
• Make phase 2 also distributed
• This needs changes in phase 1: partition the phase-1 output (say, based on the first character of the word)
• We then have 26 machines in phase 2
• The single disk-based hash-table now becomes 26 disk-based hash-tables: wordCount-a, wordCount-b, wordCount-c, …
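The first-character partitioning can be sketched as follows (an illustrative helper, not Hadoop's actual Partitioner API):

```java
public class FirstCharPartitioner {
    // Map a word to one of 26 phase-2 machines by its first character.
    // Assumes ASCII words; anything not starting with a letter goes to partition 0.
    public static int partition(String word) {
        char c = Character.toLowerCase(word.charAt(0));
        return (c >= 'a' && c <= 'z') ? c - 'a' : 0;
    }
}
```

With this, every phase-1 machine sends the same word to the same phase-2 machine, so the partial counts for a given word always meet in one place.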
13. (Diagram: each phase-1 machine, Comp-1 … Comp-4, holds documents from the Master and emits counts for words starting with A–E; these per-letter counts are sent on to the phase-2 machines Comp-10 … Comp-40)
14. ¡ After phase-1
§ From comp-1
▪ WordCount-A → comp-10
▪ WordCount-B → comp-20
▪ …
¡ Each machine in phase 1 will shuffle its output to different machines in phase 2
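That shuffle step can be sketched in plain Java (illustrative, not Hadoop's implementation): each phase-1 machine splits its word counts into per-partition maps before sending them out, using the first-character scheme from the previous slides.

```java
import java.util.HashMap;
import java.util.Map;

public class Shuffle {
    // Split one machine's word counts into per-partition maps,
    // partitioning by first character (a = 0 … z = 25).
    // Assumes words start with an ASCII letter.
    public static Map<Integer, Map<String, Long>> byPartition(Map<String, Long> wordCount) {
        Map<Integer, Map<String, Long>> out = new HashMap<>();
        wordCount.forEach((word, n) -> {
            int p = Character.toLowerCase(word.charAt(0)) - 'a';
            out.computeIfAbsent(p, k -> new HashMap<>()).put(word, n);
        });
        return out;
    }
}
```

Partition 0 would go to comp-10, partition 1 to comp-20, and so on.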
15. ¡ This is getting complicated
§ Store files where they are being processed
§ Write a disk-based hash table, obviating RAM limitations
§ Partition the phase-1 output
§ Shuffle the phase-1 output and send it to the appropriate reducer
16. ¡ This is more than a lot for word count
¡ We haven’t even touched fault tolerance
§ What if comp-1 or comp-10 fails?
¡ So, a framework is needed to take care of all these things
§ We concentrate only on the business logic
17. MAPPER → interim output → REDUCER
(Diagram: documents from HDFS are processed by mapper machines Comp-1 … Comp-4; their per-letter counts A–E are partitioned and shuffled to reducer machines Comp-10 … Comp-40)
18. ¡ Mapper
¡ Reducer
The mapper filters and transforms the input; the reducer collects that output and aggregates on it.
Extensive research was done to arrive at the two-phase strategy.
19. ¡ Mapper, Reducer, Partitioner, Shuffling
§ Work together → common structure for data processing

           Input            Output
Mapper     <K1,V1>          list<K2,V2>
Reducer    <K2,list(V2)>    list<K3,V3>
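Those signatures can be written out as plain-Java interfaces (an illustrative sketch only; Hadoop's real API passes output collector/context objects rather than returning lists):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MRTypes {
    // Mapper: (K1, V1) -> list<K2, V2>
    public interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reducer: (K2, list(V2)) -> list<K3, V3>
    public interface Reducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Word-count instance: K1 = line offset, V1 = line of text,
    // K2 = word, V2 = 1, K3 = word, V3 = total count.
    public static final Mapper<Long, String, String, Long> WC_MAP = (offset, line) -> {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(new SimpleEntry<>(w, 1L));
        return out;
    };

    public static final Reducer<String, Long, String, Long> WC_REDUCE = (word, ones) ->
        List.of(new SimpleEntry<>(word, ones.stream().mapToLong(Long::longValue).sum()));
}
```

The framework's job is everything between the two: collecting the mapper's (K2, V2) pairs and grouping them into (K2, list(V2)) for the reducer.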
21. ¡ As said, don’t store the data in memory
§ So keys and values regularly have to be written to disk.
§ They must be serialized.
§ Hadoop provides its own way of serialization and deserialization.
§ Any class to be used as a key or value has to implement the WRITABLE interface.
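The idea behind Writable can be illustrated with a plain-Java round trip (a sketch using only stdlib streams, not Hadoop's actual Writable interface; the class name is made up): the object writes its fields to a binary stream in a fixed order and reads them back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WordCountRecord {
    public final String word;
    public final long count;

    public WordCountRecord(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // Analogous to Writable.write(DataOutput): serialize fields in a fixed order.
    public void write(DataOutputStream out) throws IOException {
        out.writeUTF(word);
        out.writeLong(count);
    }

    // Analogous to Writable.readFields(DataInput): deserialize in the same order.
    public static WordCountRecord read(DataInputStream in) throws IOException {
        return new WordCountRecord(in.readUTF(), in.readLong());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new WordCountRecord("hadoop", 42L).write(new DataOutputStream(buf));
        WordCountRecord r = WordCountRecord.read(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(r.word + " " + r.count);
    }
}
```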
22. Java Type    Hadoop Serialized Type
String           Text
Integer          IntWritable
Long             LongWritable
23. ¡ Let’s try to execute the following commands
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>
¡ What does this code do?