Apache Hadoop & Friends at Utah Java User's Group

Apache Hadoop &
Friends
Philip Zeyliger
philip@cloudera.com
@philz42 @cloudera
February 18, 2010

1

Hi there!

Software Engineer
Worked at

And, oh, I ski (though mostly in Tahoe)

2

@cloudera, I work on...

Apache Avro
Cloudera Desktop

3

Outline
Why should you care? (Intro)
Challenging yesteryear’s assumptions
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions

4

Data is everywhere.

Data is important.

5

“I keep saying that the sexy job
in the next 10 years will be
statisticians, and I’m not kidding.”

Hal Varian
(Google’s chief economist)

9

Are you throwing
away data?
Data comes in many shapes and sizes:
relational tuples, log files, semistructured
textual data (e.g., e-mail), …
Are you throwing it away because it
doesn’t ‘fit’?
10

So, what’s Hadoop?

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

11

Apache Hadoop is an open-source
system (written in Java!) to reliably
store and process
gobs of data
across many commodity computers.

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

12

Two Big Components
HDFS: Self-healing high-bandwidth clustered storage.
Map/Reduce: Fault-tolerant distributed computing.

13

Challenging some of
yesteryear’s
assumptions...

14

Assumption 1: Machines can be reliable...

Image: MadMan the Mighty CC BY-NC-SA
15

Hadoop Goal:

Separate distributed
system fault-tolerance
code from application logic.
Systems Programmers
Statisticians

16

Assumption 2: Machines have identities...
Image: Laughing Squid CC BY-NC-SA
17

Hadoop Goal:

Users should interact with
clusters, not machines.

18

Assumption 3: A data set fits on one machine...
Image: Matthew J. Stinson CC BY-NC
19

Hadoop Goal:

System should scale
linearly (or better) with
data size.

20

The M/R
Programming Model

21

You specify map()
and reduce()
functions.

The framework does
the rest.
22

fault-tolerance
(that’s what’s important)
(and that’s why Hadoop)

23

map()
map: (K₁, V₁) → list(K₂, V₂)

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException,
                                             InterruptedException {
    // context.write() can be called many times;
    // this is the default "identity mapper" implementation
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}

24

(the shuffle)

map output is assigned to a “reducer”

map output is sorted by key
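
A minimal plain-Java sketch of those two shuffle steps, with no Hadoop dependency. ShuffleSketch and its method names are hypothetical; the partition rule (hash code with the sign bit masked off, modulo the number of reducers) mirrors what a hash-based partitioner typically does, and the TreeMap stands in for the framework's sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: an in-memory stand-in for the shuffle, not Hadoop code.
final class ShuffleSketch {

    // Assign a key to a reducer: mask off the sign bit, then take the modulus.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group map output per reducer; each TreeMap keeps its keys sorted.
    static List<TreeMap<String, List<Long>>> shuffle(
            List<Map.Entry<String, Long>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Long>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> kv : mapOutput) {
            perReducer.get(partition(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
                Map.entry("dunster", 1L),
                Map.entry("adams", 1L),
                Map.entry("dunster", 1L));
        // Every value for a given key ends up at exactly one reducer.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

The property that matters is that all values for one key reach the same reducer, so reduce() sees the complete list for that key.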

25

reduce()
reduce: (K₂, iter(V₂)) → list(K₃, V₃)

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * This method is called once for each key. Most applications will define
   * their reduce class by overriding this method. The default implementation
   * is an identity function.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                        ) throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}

26

Putting it together...
[Diagram panels: Logical; Physical Flow; Physical]
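
To make the logical flow concrete, here is a small in-memory word count in plain Java: map, then group-and-sort, then reduce. It is a sketch under simplifying assumptions (one process, everything in memory, no Hadoop classes); LocalWordCount and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: the map -> shuffle -> reduce flow in a single JVM.
final class LocalWordCount {

    // map(): one input line in, one (word, 1) pair out per word.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1L));
            }
        }
        return out;
    }

    // reduce(): one key plus all of its values in, one total out.
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    static TreeMap<String, Long> run(List<String> lines) {
        // "Shuffle": group every value under its key, with keys kept sorted.
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce: called once per key.
        TreeMap<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

On a cluster the same three stages run on different machines; only map() and reduce() are application code.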

27

Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCR many images.
Learn ML models (e.g., Naive Bayes
for text classification).
Augment traditional BI/DW
technologies (by archiving raw data).
28

There’s more than the
Java API
Streaming: perl, python, ruby, whatever. stdin/stdout/stderr.
Pig: Higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
Hive: SQL interface. Great for analysts. Developed at Facebook.

Many tasks actually require a series
of M/R jobs; that’s ok!
29

A Typical Look...
Commodity servers (8-core, 8-16 GB
RAM, 4-12 TB disk, 2x 1 GbE NIC)
2-level network architecture
20-40 nodes per rack

30

So, how does it all
work?

31

dramatis personae
Starring...

NameNode (metadata server and database)

SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)

The Chorus…
DataNodes (block storage)
TaskTrackers (task execution)

Thanks to Zak Stone for earmuff image!
32

HDFS
[Diagram: the Namenode holds the fs metadata; Datanodes in one rack and in a different rack hold the blocks. Example files: 3x64MB file, 3 rep; 4x64MB file, 3 rep; small file, 7 rep.]
33
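
The block arithmetic behind the example files on that slide can be sketched in a few lines. This is a back-of-the-envelope helper, not HDFS code; BlockMath and its methods are hypothetical, and the 64 MB block size is the one shown on the slide.

```java
// Illustration only: how many blocks and block replicas a file needs.
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Block replicas stored across the cluster: each block is kept `replication` times.
    static long blockReplicas(long fileBytes, long blockBytes, int replication) {
        return blocks(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        // "3x64MB file, 3 rep": 3 blocks, 9 block replicas.
        System.out.println(blockReplicas(3 * 64 * MB, blockSize, 3));
        // "4x64MB file, 3 rep": 4 blocks, 12 block replicas.
        System.out.println(blockReplicas(4 * 64 * MB, blockSize, 3));
        // "Small file, 7 rep": 1 block, 7 block replicas.
        System.out.println(blockReplicas(1 * MB, blockSize, 7));
    }
}
```

With 3 replicas, losing one datanode still leaves two copies of each of its blocks to read from and to re-replicate from.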

HDFS Failures?
Datanode crash?
  Clients read another copy
  Background rebalance
Namenode crash?
  uh-oh

35

M/R
[Diagram: Tasktrackers run on the same machines as datanodes, in one rack and in a different rack. Legend: job on stars; different job; idle.]
36

M/R Failures
Task fails
  Try again?
  Try again somewhere else?
  Report failure
Retries possible because of idempotence
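
A rough sketch of that retry policy in plain Java. TaskRetrySketch is a hypothetical name and the attempt limit is just a parameter; the point is that re-running a task is only safe because a task attempt produces the same output from the same input.

```java
import java.util.function.IntFunction;

// Illustration only: retry an idempotent task a bounded number of times.
final class TaskRetrySketch {

    // Run the task up to maxAttempts times; return the first successful result,
    // or report failure once every attempt has failed.
    static <T> T runWithRetries(IntFunction<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.apply(attempt); // "try again", possibly somewhere else
            } catch (RuntimeException e) {
                last = e; // remember the failure and retry
            }
        }
        throw new IllegalStateException(
                "task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        // Hypothetical flaky task: fails on its first two attempts, then succeeds.
        String result = runWithRetries(attempt -> {
            if (attempt < 2) {
                throw new RuntimeException("simulated task failure");
            }
            return "done on attempt " + attempt;
        }, 4);
        System.out.println(result);
    }
}
```

If the task had side effects outside its own output (say, appending to a shared file), a retry could apply them twice, which is why idempotence matters here.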

38

Hadoop in the Wild
(yes, it’s used in production)

Yahoo! Hadoop clusters: >82 PB, >25k machines
(Eric14, HadoopWorld NYC ’09)

Google (M-R, but not Hadoop): 40 GB/s GFS read/write load
(Jeff Dean, LADIS ’09) [~3,500 TB/day]

Facebook: 4 TB new data per day; DW: 4800 cores, 5.5 PB
(Dhruba Borthakur, HadoopWorld)

39

The Hadoop Ecosystem
ETL Tools BI Reporting RDBMS

Pig (Data Flow) Hive (SQL) Sqoop
ZooKeeper (Coordination)

Avro (Serialization)
MapReduce (Job Scheduling/Execution System)

HBase (Key-Value store)

HDFS
(Hadoop Distributed File System)

40

Ok, fine, what next?
Get Hadoop!
Cloudera’s Distribution for Hadoop
http://hadoop.apache.org/
Try it out! (Locally, or on EC2)
Door Prize

41

Just one slide...

Software: Cloudera Distribution for
Hadoop, Cloudera Desktop, more…
Training and certification…
Free on-line training materials
(including video)
Support & Professional Services
@cloudera, blog, etc.
42

Okay, two slides

Talk to us if you’re going to do Hadoop
in your organization.

43

Questions?

philip@cloudera.com

(feedback? yes!)

(hiring? yes!)

44

Backup Slides

(i’ve got your back)

45

Important APIs
(→ is 1:many)

M/R Flow
Input Format: data → K₁,V₁
Mapper: K₁,V₁ → K₂,V₂
Combiner: K₂,iter(V₂) → K₂,V₂
Partitioner: K₂,V₂ → int
Reducer: K₂,iter(V₂) → K₃,V₃
Out. Format: K₃,V₃ → data

Other
Writable
JobClient
*Context
FileSystem
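
Of these, the Combiner is the easiest to overlook. A plain-Java sketch of the idea follows, assuming a sum-style job like word count; CombinerSketch is a hypothetical name, and the map output is a hand-written list rather than real framework data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: pre-aggregate one map task's output before the shuffle.
final class CombinerSketch {

    // Collapse (key, value) pairs from a single map task into one pair per key.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
                Map.entry("dunster", 1L),
                Map.entry("adams", 1L),
                Map.entry("dunster", 1L));
        // Three pairs go in, two come out; less data crosses the network.
        System.out.println(combine(mapOutput)); // {adams=1, dunster=2}
    }
}
```

This only works when the operation is associative and commutative, as summing is; that is why a sum reducer can double as its own combiner.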
46

the “grep” example

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir = new Path("grep-temp-" +
        Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      // write a single file
      sortJob.setNumReduceTasks(1);
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      // sort by decreasing freq
      sortJob.setOutputKeyComparatorClass(
          LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

47

$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop

$ bin/hadoop jar hadoop-0.18.3-
examples.jar grep input.txt output1
'dunster|adams'

$ cat output1/part-00000
4 dunster
2 adams
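
The same result can be reproduced without a cluster. The sketch below imitates the two jobs of the grep example in plain Java: count regex matches, then sort by decreasing count. LocalGrep and its methods are hypothetical names, and the alphabetical tie-break is an assumption added only to make the ordering deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the grep example's two jobs, run in memory.
final class LocalGrep {

    // Job 1 (grep-search): emit (match, 1) for every regex match, then sum per match.
    static Map<String, Long> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Job 2 (grep-sort): swap to (count, match) and sort by decreasing count.
    static List<String> sortByDecreasingCount(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            // Tie-break on the match text (an assumption, for determinism).
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "adams dunster kirkland dunster",
                "kirland dudley dunster",
                "adams dunster winthrop");
        // dunster is counted 4 times and adams twice, as in output1/part-00000.
        for (String line : sortByDecreasingCount(countMatches(input, "dunster|adams"))) {
            System.out.println(line);
        }
    }
}
```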

48

Job 1 of 2

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);
      ... // the try block continues with job 2

49

Job 2 of 2

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);
      // (implicit identity reducer)
      // write a single file
      sortJob.setNumReduceTasks(1);
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      // sort by decreasing freq
      sortJob.setOutputKeyComparatorClass(
          LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

50

The types there...
?, Text

Text, Long

Text, list(Long)

Text, Long

Long, Text

51

Traditional DW
BI / Reporting
Ad-hoc Queries
RDBMS (Aggregates)
Data Mining
ETL (Extraction, Transform, Load)
Storage (Raw Data)
Collection
Instrumentation
52

Facebook’s DW (phase M), 2008
Facebook Data Infrastructure
Scribe Tier MySQL Tier

Hadoop Tier

Oracle RAC Servers

Wednesday, April 1, 2009

53
