Njug presentation
- 1. Hadoop 101: Writing a Java MapReduce Program
Ian Wrigley, Sr. Curriculum Manager, Cloudera
ian@cloudera.com | @iwrigley
© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
- 2. Why the World Needs Hadoop
And, by the way, what is Hadoop?
- 3. Volume
§ Every day…
  – More than 1.5 billion shares are traded on the NYSE
  – Facebook stores 2.7 billion comments and Likes
§ Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
§ And every second…
  – Banks process more than 10,000 credit card transactions
- 4. Velocity
§ We are generating data faster than ever
  – Processes are increasingly automated
  – People are increasingly interacting online
  – Systems are increasingly interconnected
- 5. Variety
§ We’re producing a variety of data, including
  – Audio
  – Video
  – Images
  – Log files
  – Web pages
  – Product rating comments
  – Social network connections
§ Not all of this maps cleanly to the relational model
- 6. Big Data Can Mean Big Opportunity
§ One tweet is an anecdote
  – But a million tweets may signal important trends
§ One person’s product review is an opinion
  – But a million reviews might uncover a design flaw
§ One person’s diagnosis is an isolated case
  – But a million medical records could lead to a cure
- 7. MapReduce
A Scalable Data Processing Framework
- 8. What is MapReduce?
§ MapReduce is a programming model
  – It’s a way of processing data
§ In Hadoop, you supply two functions to process data: Map and Reduce
  – Map: typically used to transform, parse, or filter data
  – Reduce: typically used to summarize results
§ The Map function always runs first
  – The Reduce function runs afterwards
  – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function
§ Each piece is simple, but can be powerful when combined
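The Map, shuffle and sort, and Reduce steps can be imitated in plain Java without Hadoop. The following is only a sketch under stated assumptions: the class MiniMapReduce and its methods are hypothetical names (not part of the Hadoop API), everything runs in one JVM on an in-memory list, and the job shown (counting words) is chosen purely for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM imitation of Map -> shuffle and sort -> Reduce.
// None of these names come from Hadoop; they exist only for this sketch.
public final class MiniMapReduce {

    // Map: parse one input line into zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleImmutableEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group the values by key, keeping keys in sorted order.
    static TreeMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: summarize all values received for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in order: Map first, then shuffle and sort, then Reduce.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, sat=1, the=2}
        System.out.println(run(Arrays.asList("the cat sat", "the dog")));
    }
}
```

Each method is trivial on its own; the result only appears once the three are chained, which is the point the slide makes about simple pieces being powerful when combined.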
- 9. MapReduce: An Example
§ … in which Ian waves his hands around and attempts to explain the MapReduce flow
- 10. MapReduce Code for Hadoop
§ MapReduce processing in Hadoop is batch-oriented
§ Usually written in Java
  – This uses Hadoop’s API directly
  – You can do basic MapReduce in other languages
    – Using the Hadoop Streaming wrapper program
  – Some advanced features require Java code
- 11. Basic Java API Concepts
§ Some (very) basic concepts:
  – Input and output data is typed
  – The framework passes each input record to the Mapper in turn
    – A record is a (key, value) pair
    – For text files:
      – The key is the byte offset of the start of the line
      – The value is the line itself
  – Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort
  – Reducers receive (key, Iterable of values) sets, in sorted key order
  – Job is configured and executed using a driver class
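To make the "key is the byte offset, value is the line" idea concrete, here is a minimal plain-Java sketch that lists the records a Mapper would see for a small piece of text. It is an illustration only: the class TextRecords is a hypothetical name, and the sketch assumes UTF-8 text with "\n" line endings rather than reading a real file through Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not Hadoop API): shows (byte offset, line) records
// for newline-terminated UTF-8 text.
public final class TextRecords {

    // Returns a map from the byte offset of each line's start to the line itself.
    static Map<Long, String> records(String fileContents) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            out.put(offset, line);
            // Advance by the line's length in bytes, plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String data = "ab\ncde\nf\n";
        // prints:
        // 0 -> ab
        // 3 -> cde
        // 7 -> f
        records(data).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

The keys are offsets, not line numbers, which is why most Mappers over text files ignore the key and work only with the value.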
- 12. Data Flow
Map input:
  Nashville   J. Jones      12.95   2013-07-21
  Memphis     S. Smith      66.57   2013-07-21
  Nashville   T. Harding    55.35   2013-07-22
  Knoxville   S. Warne      10.99   2013-07-22
  Kingsport   M. Thompson   99.95   2013-07-22
Map output:
  Nashville   12.95
  Memphis     66.57
  Nashville   55.35
  Knoxville   10.99
  Kingsport   99.95
Shuffle and sort
Reduce input:
  Kingsport   [99.95]
  Knoxville   [10.99]
  Memphis     [66.57]
  Nashville   [12.95, 55.35]
Reduce output:
  Kingsport   99.95
  Knoxville   10.99
  Memphis     66.57
  Nashville   68.30
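The same data flow can be reproduced on the sample records with Java streams, which is a quick way to check the expected totals before writing any Hadoop code. This is a sketch, not Hadoop code: the class DataFlowSketch is a hypothetical name, and it assumes the four fields of each record are separated by tab characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM check of the data flow; assumes tab-separated records.
public final class DataFlowSketch {

    static Map<String, Double> totalsByStore(List<String> lines) {
        return lines.stream()
                // Map step: split each record into fields, dropping malformed ones.
                .map(line -> line.split("\t"))
                .filter(fields -> fields.length == 4)
                // Shuffle and sort plus Reduce: group by store name in sorted
                // key order and sum the sale amounts for each store.
                .collect(Collectors.groupingBy(
                        fields -> fields[0],
                        TreeMap::new,
                        Collectors.summingDouble(fields -> Double.parseDouble(fields[2]))));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Nashville\tJ. Jones\t12.95\t2013-07-21",
                "Memphis\tS. Smith\t66.57\t2013-07-21",
                "Nashville\tT. Harding\t55.35\t2013-07-22",
                "Knoxville\tS. Warne\t10.99\t2013-07-22",
                "Kingsport\tM. Thompson\t99.95\t2013-07-22");
        // Prints one "store total" line per store, in sorted key order.
        totalsByStore(input).forEach((store, total) ->
                System.out.printf(Locale.ROOT, "%s %.2f%n", store, total));
    }
}
```

Run on the five sample records, the totals match the Reduce output shown on the slide, including 68.30 for Nashville.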
- 13. Java MR Job Example: Mapper
package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: input key and value types (LongWritable, Text),
// followed by output key and value types (Text, DoubleWritable)
public class StoreSalesMapper extends Mapper<LongWritable, Text,
        Text, DoubleWritable> {
- 14. Java MR Job Example: Mapper
    /*
     * The map method is invoked once for each line of text in the
     * input data. The method receives a key of type LongWritable
     * (which corresponds to the byte offset in the current input
     * file), a value of type Text (representing the line of input
     * data), and a Context object (which allows us to print status
     * messages, among other things).
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
- 15. Java MR Job Example: Mapper
        // Convert value to a Java String
        String line = value.toString();

        // Defensive programming: ignore empty lines
        if (line.trim().isEmpty()) {
            return;
        }

        // Split record into tab-separated fields
        String[] fields = line.split("\t");

        // Even more defensive programming: ensure this line is not malformed
        if (fields.length != 4) {
            return;
        }
- 16. Java MR Job Example: Mapper
        // Extract fields based on position
        String storeName = fields[0];
        double saleValue = Double.parseDouble(fields[2]);

        // Output key and value
        context.write(new Text(storeName), new DoubleWritable(saleValue));
    }
}
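The Mapper's defensive checks cover blank and malformed lines, but Double.parseDouble would still throw a NumberFormatException if the amount field is not numeric. One possible way to extend the same idea is sketched below in plain Java so that it can be tried without a cluster. The class SaleParser and the method parseSale are hypothetical names, not part of the original example, and the sketch assumes tab-separated records with four fields.

```java
import java.util.OptionalDouble;

// Hypothetical helper (not in the original Mapper): parses the sale amount
// from a tab-separated record and reports failure instead of throwing.
public final class SaleParser {

    // Returns the sale amount, or an empty result if the record is blank,
    // has the wrong number of fields, or has a non-numeric amount.
    static OptionalDouble parseSale(String line) {
        if (line == null || line.trim().isEmpty()) {
            return OptionalDouble.empty();
        }
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(fields[2].trim()));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSale("Nashville\tJ. Jones\t12.95\t2013-07-21"));
        System.out.println(parseSale("Nashville\tJ. Jones\tn/a\t2013-07-21"));
    }
}
```

In a real Mapper, the empty case would typically mean returning without calling context.write, in the same way as the other checks.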
- 17. Java MR Job Example: Reducer
package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: input key and value types (Text, DoubleWritable),
// followed by output key and value types (Text, DoubleWritable)
public class SumReducer extends Reducer<Text, DoubleWritable,
        Text, DoubleWritable> {
- 18. Java MR Job Example: Reducer
    /*
     * The reduce method is invoked once for each key received from
     * the shuffle and sort phase of the MapReduce framework.
     * The method receives a key of type Text (representing the key),
     * a set of values of type DoubleWritable, and a Context object.
     */
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
- 19. Java MR Job Example: Reducer
        // used to sum up the store sales
        double sum = 0;

        // add to it for each new value received
        for (DoubleWritable value : values) {
            sum += value.get();
        }

        // Our output is the store name (key) and the sum (value)
        context.write(key, new DoubleWritable(sum));
    }
}
- 20. Java MR Job Example: Driver
package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class StoreSales {

    public static void main(String[] args) throws Exception {
- 21. Java MR Job Example: Driver
        // validate command line arguments (we require the user
        // to specify the HDFS paths to use for the job; see below)
        if (args.length != 2) {
            System.out.printf("Usage: StoreSales <input dir> <output dir>\n");
            System.exit(-1);
        }

        // Instantiate a Job object for our job's configuration.
        Job job = new Job();

        // configure input and output paths based on supplied arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
- 22. Java MR Job Example: Driver
        // tells Hadoop to copy the JAR containing this class
        // to cluster nodes, as required to run this job
        job.setJarByClass(StoreSales.class);

        // give the job a descriptive name. This is optional, but
        // helps us identify this job on a busy cluster
        job.setJobName("Store Sale Aggregator");

        // Specify which classes to use for the Mapper and Reducer
        job.setMapperClass(StoreSalesMapper.class);
        job.setReducerClass(SumReducer.class);
- 23. Java MR Job Example: Driver
        // specify the Mapper's output key and value classes
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // specify the job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // start the MapReduce job and wait for it to finish.
        // if it finishes successfully, return 0; otherwise 1.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
- 24. Demo
§ And now… the program actually running on a pseudo-distributed cluster
- 25. Conclusion
§ Obviously there’s much more to the Hadoop API than this
  – Partitioners
  – Combiners
  – Custom Writables, custom WritableComparables
  – DistributedCache
  – Counters
  – Etc., etc., etc.
§ …but even with just this amount of knowledge, you could write real-world Hadoop applications
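As a taste of the first item on that list, the job of a Partitioner is to decide which Reducer receives each key. The sketch below shows simple hash partitioning in plain Java. It is an assumption-laden illustration, not the Hadoop class itself: PartitionSketch and partitionFor are hypothetical names, and it hashes a Java String rather than a Hadoop Text key, so the reducer numbers it prints need not match what a real cluster would choose.

```java
// Hypothetical illustration of hash partitioning (not the Hadoop Partitioner API).
public final class PartitionSketch {

    // Maps a key to a reducer index in the range [0, numReducers).
    static int partitionFor(String key, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        // Masking with Integer.MAX_VALUE clears the sign bit so the
        // remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] stores = {"Kingsport", "Knoxville", "Memphis", "Nashville"};
        for (String store : stores) {
            System.out.println(store + " -> reducer " + partitionFor(store, 3));
        }
    }
}
```

The property that matters is that equal keys always map to the same index, so every value for a given store reaches the same Reducer.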
- 26. About Cloudera
§ Helps companies profit from all their data
  – Founded by experts from Facebook, Google, Oracle, and Yahoo
§ We offer products and services for large-scale data analysis
  – Software (CDH distribution and Cloudera Manager)
  – Consulting and support services
  – Training and certification
§ Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class