Njug presentation

© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop 101: Writing a Java MapReduce Program

Ian Wrigley
Sr. Curriculum Manager, Cloudera
ian@cloudera.com | @iwrigley

Why the World Needs Hadoop
And, by the way, what is Hadoop?

Volume

§ Every day…
  – More than 1.5 billion shares are traded on the NYSE
  – Facebook stores 2.7 billion comments and Likes
§ Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
§ And every second…
  – Banks process more than 10,000 credit card transactions

Velocity

§ We are generating data faster than ever
  – Processes are increasingly automated
  – People are increasingly interacting online
  – Systems are increasingly interconnected

Variety

§ We’re producing a variety of data, including
  – Audio
  – Video
  – Images
  – Log files
  – Web pages
  – Product rating comments
  – Social network connections
§ Not all of this maps cleanly to the relational model

Big Data Can Mean Big Opportunity

§ One tweet is an anecdote
  – But a million tweets may signal important trends
§ One person’s product review is an opinion
  – But a million reviews might uncover a design flaw
§ One person’s diagnosis is an isolated case
  – But a million medical records could lead to a cure

MapReduce
A Scalable Data Processing Framework

What is MapReduce?

§ MapReduce is a programming model
  – It’s a way of processing data
§ In Hadoop, you supply two functions to process data: Map and Reduce
  – Map: typically used to transform, parse, or filter data
  – Reduce: typically used to summarize results
§ The Map function always runs first
  – The Reduce function runs afterwards
  – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function
§ Each piece is simple, but can be powerful when combined
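
The flow on this slide can be sketched in plain Java, without Hadoop, as three explicit steps: a map step that parses each line into (word, 1) pairs, a shuffle and sort step that groups values by key in sorted key order, and a reduce step that sums each group. This is a single-process illustration only. The class MiniMapReduce and its method names are hypothetical and are not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical single-process illustration; not part of the Hadoop API.
class MiniMapReduce {

    // Map: parse one input line into zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // filter out empty tokens
                out.add(new SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group all values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: summarize all values seen for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every line, then shuffle and sort, then reduce per key.
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "one tweet is an anecdote",
                "a million tweets may signal trends")));
    }
}
```

In a real cluster the map and reduce steps run in parallel on many machines and the framework handles the grouping; only the two functions are written by the developer.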

MapReduce: An Example

§ … in which Ian waves his hands around and attempts to explain the MapReduce flow

MapReduce Code for Hadoop

§ MapReduce processing in Hadoop is batch-oriented
§ Usually written in Java
  – This uses Hadoop’s API directly
  – You can do basic MapReduce in other languages
    – Using the Hadoop Streaming wrapper program
  – Some advanced features require Java code

Basic Java API Concepts

§ Some (very) basic concepts:
  – Input and output data is typed
  – The framework passes each input record to the Mapper in turn
    – A record is a (key, value) pair
    – For text files:
      – The key is the byte offset of the start of the line
      – The value is the line itself
  – Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort
  – Reducers receive (key, Iterable of values) sets, in sorted key order
  – Job is configured and executed using a driver class
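
To make the text-file record format concrete, here is a minimal sketch that turns a block of text into (byte offset, line) records. It assumes UTF-8 text with "\n" line endings. The class LineRecords is hypothetical and is not Hadoop's own input-handling code, which also deals with file splits and other line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not Hadoop's input format code.
class LineRecords {

    // Returns (byte offset of the start of the line -> line text),
    // assuming UTF-8 encoding and '\n' as the line terminator.
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length(); // last line has no trailing newline
            }
            String line = text.substring(start, end);
            records.put(offset, line);
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "Nashville\tJ. Jones\t12.95\t2013-07-21\n"
                    + "Memphis\tS. Smith\t66.57\t2013-07-21\n";
        for (Map.Entry<Long, String> record : toRecords(text).entrySet()) {
            System.out.println(record.getKey() + " => " + record.getValue());
        }
    }
}
```

The key is a byte offset rather than a line number, so it depends on the encoded length of every preceding line. Most Mappers, including the one in this talk, ignore the key and work only with the value.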

Data Flow

Map input:
  Nashville   J. Jones      12.95   2013-07-21
  Memphis     S. Smith      66.57   2013-07-21
  Nashville   T. Harding    55.35   2013-07-22
  Knoxville   S. Warne      10.99   2013-07-22
  Kingsport   M. Thompson   99.95   2013-07-22

Map output:
  Nashville   12.95
  Memphis     66.57
  Nashville   55.35
  Knoxville   10.99
  Kingsport   99.95

Reduce input (after shuffle and sort):
  Kingsport   [99.95]
  Knoxville   [10.99]
  Memphis     [66.57]
  Nashville   [12.95, 55.35]

Reduce output:
  Kingsport   99.95
  Knoxville   10.99
  Memphis     66.57
  Nashville   68.30
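
The same data flow can be traced in memory with Java streams, as a rough sketch: parse each record, group by store name in sorted key order, and sum the sale amounts per store. It assumes the input fields are tab-separated, as the Mapper code later in the talk expects. The class StoreSalesTrace is hypothetical and involves no Hadoop code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory trace of the slide's data; no Hadoop involved.
class StoreSalesTrace {

    // Map input: the five records from the slide, assumed tab-separated.
    static final List<String> INPUT = Arrays.asList(
            "Nashville\tJ. Jones\t12.95\t2013-07-21",
            "Memphis\tS. Smith\t66.57\t2013-07-21",
            "Nashville\tT. Harding\t55.35\t2013-07-22",
            "Knoxville\tS. Warne\t10.99\t2013-07-22",
            "Kingsport\tM. Thompson\t99.95\t2013-07-22");

    static Map<String, Double> totals(List<String> lines) {
        return lines.stream()
                // "Map": split each record into its fields
                .map(line -> line.split("\t"))
                .collect(Collectors.groupingBy(
                        // "Shuffle and sort": group by store, keys kept sorted
                        (String[] fields) -> fields[0],
                        TreeMap::new,
                        // "Reduce": sum the sale amounts for each store
                        Collectors.summingDouble(
                                (String[] fields) -> Double.parseDouble(fields[2]))));
    }

    public static void main(String[] args) {
        totals(INPUT).forEach((store, total) ->
                System.out.printf("%s\t%.2f%n", store, total));
    }
}
```

Totals are accumulated as double values, so compare them with a small tolerance instead of exact equality.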

Java MR Job Example: Mapper

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: input key and value types (LongWritable, Text),
// then output key and value types (Text, DoubleWritable)
public class StoreSalesMapper extends Mapper<LongWritable, Text,
        Text, DoubleWritable> {

    /*
     * The map method is invoked once for each line of text in the
     * input data. The method receives a key of type LongWritable
     * (which corresponds to the byte offset in the current input
     * file), a value of type Text (representing the line of input
     * data), and a Context object (which allows us to print status
     * messages, among other things).
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // convert the value to a Java String
        String line = value.toString();

        // ignore empty lines (defensive programming!)
        if (line.trim().isEmpty()) {
            return;
        }

        // split the record into its tab-separated fields
        String[] fields = line.split("\t");

        // ensure this line is not malformed (even more defensive programming!)
        if (fields.length != 4) {
            return;
        }

        // extract the fields we need, based on position
        String storeName = fields[0];
        double saleValue = Double.parseDouble(fields[2]);

        // emit the output key (store name) and value (sale amount)
        context.write(new Text(storeName), new DoubleWritable(saleValue));
    }
}
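
One way to test the Mapper's parsing rules without a cluster is to pull them into a plain static method. The sketch below mirrors the checks above (skip empty lines, require exactly four tab-separated fields). It also treats a non-numeric amount as malformed, which is an assumption that goes beyond the slide code: the Mapper as shown would throw a NumberFormatException in that case. The class SaleParser and its parse method are hypothetical names.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper mirroring the Mapper's validation; not part of the talk's code.
class SaleParser {

    // Returns (store name, sale value) for a well-formed line, or empty otherwise.
    static Optional<Map.Entry<String, Double>> parse(String line) {
        // ignore missing or empty lines
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();
        }
        // a well-formed record has exactly four tab-separated fields
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            return Optional.empty();
        }
        try {
            double saleValue = Double.parseDouble(fields[2]);
            Map.Entry<String, Double> entry =
                    new SimpleImmutableEntry<>(fields[0], saleValue);
            return Optional.of(entry);
        } catch (NumberFormatException e) {
            // assumption: a non-numeric amount is skipped, like any malformed line
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Nashville\tJ. Jones\t12.95\t2013-07-21"));
        System.out.println(parse("not a valid record"));
    }
}
```

A map method could call such a helper and write to the Context only when a value is present, which keeps the Hadoop-specific code to a few lines.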

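Java MR Job Example: Reducer

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: input key and value types (Text, DoubleWritable),
// then output key and value types (Text, DoubleWritable)
public class SumReducer extends Reducer<Text, DoubleWritable,
        Text, DoubleWritable> {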

    /*
     * The reduce method is invoked once for each key received from
     * the shuffle and sort phase of the MapReduce framework.
     * The method receives a key of type Text (representing the key),
     * a set of values of type DoubleWritable, and a Context object.
     */
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {

        // used to sum up the store sales
        double sum = 0;

        // add to it for each new value received
        for (DoubleWritable value : values) {
            sum += value.get();
        }

        // our output key and value are the store name (key) and the sum (value)
        context.write(key, new DoubleWritable(sum));
    }
}

Java MR Job Example: Driver

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class StoreSales {

    public static void main(String[] args) throws Exception {

        // validate command line arguments (we require the user
        // to specify the HDFS paths to use for the job; see below)
        if (args.length != 2) {
            System.out.printf("Usage: StoreSales <input dir> <output dir>\n");
            System.exit(-1);
        }

        // Instantiate a Job object for our job's configuration.
        // (In Hadoop 2 and later, this constructor is deprecated
        // in favor of Job.getInstance().)
        Job job = new Job();

        // configure input and output paths based on supplied arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // tells Hadoop to copy the JAR containing this class
        // to cluster nodes, as required to run this job
        job.setJarByClass(StoreSales.class);

        // give the job a descriptive name. This is optional, but
        // helps us identify this job on a busy cluster
        job.setJobName("Store Sale Aggregator");

        // Specify which classes to use for the Mapper and Reducer
        job.setMapperClass(StoreSalesMapper.class);
        job.setReducerClass(SumReducer.class);

        // specify the Mapper's output key and value classes
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // specify the job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // start the MapReduce job and wait for it to finish.
        // if it finishes successfully, return 0; otherwise 1.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}

Conclusion

§ Obviously there’s much more to the Hadoop API than this
  – Partitioners
  – Combiners
  – Custom Writables, custom WritableComparables
  – DistributedCache
  – Counters
  – Etc., etc., etc.
§ …but even with just this amount of knowledge, you could write real-world Hadoop applications
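
Of the features listed on this slide, the partitioner is the easiest to sketch in isolation: it decides which Reducer receives each key. The plain-Java sketch below imitates the hash-based scheme of Hadoop's default HashPartitioner (hash code, masked to be non-negative, modulo the number of reduce tasks). The class HashPartitionSketch is hypothetical and works on String keys. A real job hashes Writable keys such as Text, whose hash codes differ from String's, so the partition numbers printed here are illustrative only.

```java
// Hypothetical illustration of hash partitioning; not Hadoop's own class.
class HashPartitionSketch {

    // Maps a key to a partition in the range [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a
        // negative hash code cannot produce a negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] stores = {"Kingsport", "Knoxville", "Memphis", "Nashville"};
        for (String store : stores) {
            System.out.println(store + " -> reducer " + partitionFor(store, 2));
        }
    }
}
```

Because the partition depends only on the key, every value for a given store reaches the same Reducer, which is what lets a Reducer compute a complete total for that store.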

About Cloudera

§ Helps companies profit from all their data
  – Founded by experts from Facebook, Google, Oracle, and Yahoo
§ We offer products and services for large-scale data analysis
  – Software (CDH distribution and Cloudera Manager)
  – Consulting and support services
  – Training and certification
§ Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class
