Thinking at scale with hadoop

1

Thinking at
Scale with
Hadoop

Ken Krugler

photo by: i_pinz, ﬂickr
Bixo Labs &
Scale Unlimited

Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

Friday, September 24, 2010

2

Introduction - Ken Krugler

Founder, Bixo Labs - http://bixolabs.com
Large scale web mining workﬂows, running in Amazon’s cloud
Instructor, Scale Unlimited - http://scaleunlimited.com
Teaching Hadoop Boot Camp
Founder, Krugle (2005 - 2009)
Code search startup
Parse/index 2.6 billion lines of open source code



3

Goals for Talk

Intro to Map-Reduce
Overview of Hadoop
Challenges with Map-Reduce
Solutions to Complexity


4

Introduction to
Map-Reduce

Ken Krugler

photo by: exfordy, ﬂickr
Bixo Labs &
Scale Unlimited



5

Definitions

Map -> The ‘map’ function in the MapReduce algorithm, user defined
Reduce -> The ‘reduce’ function in the MapReduce algorithm, user defined
Group -> The ‘magic’ that happens between Map and Reduce
Key Value Pair -> two units of data, exchanged between Map & Reduce



6

How MapReduce Works

Map translates input to keys and values to new keys and values
[K1,V1] Map [K2,V2]

System Groups each unique key with all its values
[K2,V2] Group [K2,{V2,V2,....}]

Reduce translates the values of each unique key to new keys and values
[K2,{V2,V2,....}] Reduce [K3,V3]



7

All Together
Mapper

Mapper Shufﬂe Reducer



Mapper



8

Canonical Example - Word Count

Word Count
Read a document, parse out the words, count the frequency of each word

Speciﬁcally, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
Sum the values (ones) for each unique word



9
[1, When in the Course of human events, it becomes]

[2, necessary for one people to dissolve the political bands]

[3, which have connected them with another, and to assume]
[K1,V1]

[n, {k,k,k,k,...}]

Map

[K2,V2] [When,1] [in,1] [the,1] [Course,1] [k,1]

User
Group
Deﬁned

[K2,{V2,V2,....}] [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}]

[connected,{1,1,1,1,1}] [k,{v,v,...}]

Reduce

[When,4] [peope,6] [dissolve,3]
[K3,V3]
[connected,5] [k,sum(v,v,v,..)]



10

Divide & Conquer

Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Mappers & Reducers can run Map/Reduce functions independently
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
A Reducer can invoke Reduce on an arbitrary number of the unique keys
Any number of Mappers & Reducers can be run on any number of nodes



11

Overview of
Hadoop

Ken Krugler

photo by: exfordy, ﬂickr
Bixo Labs &
Scale Unlimited



12

Logical Architecture
Cluster

Execution

Storage

Logically, Hadoop is simply a computing cluster that provides:
a Storage layer, and
an Execution layer



13

Logical Architecture
Users Cluster

Client

Client
Execution
job job job
Storage
Client

Where users can submit jobs that:
Run on the Execution layer, and
Read/write data from the Storage layer



14

Storage - HDFS
Geography Geography

Rack Rack Rack

Node Node Node Node ...

block block block block

block block block block

Performance
Optimized for many concurrent serialized reads of large ﬁles
But only a single writer (write-once, no append)
Fidelity
Replication via block replication



15

Execution
Geography Geography

Rack Rack Rack

Node Node Node Node ...
Performance map map map map map

Divide and Conquer
reduce reduce
reduce
MapReduce
Data Afﬁnity
Move code to data

Fidelity
Applications must choose to fail on bad data, or ignore bad data



16

Architectural Components
NameNode DataNode
DataNode
DataNode
DataNode data block

ns read/write
operations Secondary ns
operations
read/write ns
operations read/write
mapper
mapper
child jvm
mapper
child jvm
jobs tasks child jvm
Client JobTracker

TaskTracker
reducer
reducer
child jvm
reducer
child jvm
child jvm
Solid boxes are unique applications
Dashed boxes are child JVM instances on same node as parent
Dotted boxes are blocks of managed ﬁles on same node as parent



17

Challenges of
Map-Reduce

Ken Krugler

photo by: Graham Racher, ﬂickr
Bixo Labs &
Scale Unlimited

Copyright (c) 2008-2010 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.


18

Complexity of keys/values
Simple != Simplicity
People think about processing records/objects
The correct key definition changes during a workflow
A “value” is typically a set of fields



19

Complexity of chaining jobs
Most real-world workﬂows require multiple Map-Reduce jobs
Connecting data between jobs is tedious
Keeping key/value deﬁnitions in sync is error-prone
As an example...



20

A trivial ﬁlter/count/sort workﬂow
private void start(String inputPath, String outputPath, String filterUser) // run job and block until job is done
throws Exception { filterCountJob.waitForCompletion(false);
// counts the pages the users clicked and sort by number of clicks
// sort
// filter and count Job sortJob = new Job(conf, "sort job");
Path outputDir = new Path(outputPath); FileInputFormat.setInputPaths(sortJob, tmpOutput);
Path parent = outputDir.getParent(); FileOutputFormat.setOutputPath(sortJob, outputDir);
Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());
sortJob.setMapperClass(SortMapper.class);
// Create the job configuration sortJob.setMapOutputKeyClass(UserAndCount.class);
Configuration conf = new Configuration(); sortJob.setMapOutputValueClass(NullWritable.class);
conf.set(FilterMapper.USER_KEY, filterUser);
// sort are only stable with one reducer but you lose distribution.
Job filterCountJob = new Job(conf, "filter+count job"); sortJob.setNumReduceTasks(1);
FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath)); sortJob.setInputFormatClass(SequenceFileInputFormat.class);
FileOutputFormat.setOutputPath(filterCountJob, tmpOutput); sortJob.setOutputFormatClass(TextOutputFormat.class);
filterCountJob.setMapperClass(FilterMapper.class); // run job and block until job is done
filterCountJob.setMapOutputKeyClass(Text.class); sortJob.waitForCompletion(false);
filterCountJob.setMapOutputValueClass(LongWritable.class);
// Get rid of temp directory used to bridge between the two jobs
filterCountJob.setReducerClass(CounterReducer.class); FileSystem.get(conf).delete(tmpOutput, true);
filterCountJob.setOutputKeyClass(Text.class);
filterCountJob.setOutputValueClass(LongWritable.class);

filterCountJob.setInputFormatClass(TextInputFormat.class);
filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);



21

Not seeing the forest...

Too much low-level detail obscures intent
Not seeing intent leads to bugs
And makes optimization harder



22

Hard to use MR optimizations

MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}
Combiners can reduce map output size
Map-side joins where one of the sides is “small”



23

Solutions to Complexity

Cascading - Tuples, Pipes, and Operations
FlumeJava - Java Classes and Deferred Execution
Pig/Hive - “It’s just a big database”
Datameer - “It’s just a big spreadsheet”



24

Cascading

Ken Krugler

Bixo Labs &
Scale Unlimited



25

Cascading

An API for deﬁning data processing workﬂows
First public release Jan 2008
Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2
http://www.cascading.org



26

Cascading

Basic unit of data is “Tuple” - roughly equal to a record
Implements all standard data processing operations
function, ﬁlter, group by, co-group, aggregator, etc
Complex applications are built by chaining operations with pipes
And are planned into the minimum number of MapReduce jobs



27

Operations on Tuples

Map and Reduce communicate through Keys and Values
Our problems are really about elements of Keys and Values
e.g. “username”, not “line”
The effect is more reusability of our ‘map’ and ‘reduce’ operations



28

Mapping onto Hadoop

Functions & Filters are map operations
Aggregates & Buffers are reduce operations
Built-in support for joining, multi-ﬁeld sorting
Concept of “Taps” with “Schemes” as input and output formats



29

Loose Coupling
func

Source func Group aggr Sink

Above, all the possible relationships between operations and data
If Fields are the interface between operations, we can build ﬂexible apps



30

Layered Over MapReduce

Map Reduce Map Reduce

Source func CoGroup aggr temp GroupBy aggr func Sink

Source func func



31

Cascading Code

Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

Pipe pipe = new Pipe( "example" );

RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "s+" );
pipe = new Each( pipe, new Fields( "line" ), function );

pipe = new GroupBy( pipe, new Fields( "word" ) );

pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );

pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

FlowConnector.setApplicationJarClass( new Properties(), Main.class );
Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
flow.complete();



32
[head]

Hfs['SequenceFile[['ip', 'time', 'method', 'event', 'status', 'size']]']['output/logs/']']

['ip', 'time', 'method', 'event', 'status', 'size']
['ip', 'time', 'method', 'event', 'status', 'size']

Each('arrival rate')[DateParser[decl:'ts'][args:1]]

['ts'] ['ts']
['ts'] ['ts']

GroupBy('tsCount')[by:['ts']] Each('arrival rate')[ExpressionFunction[decl:'tm'][args:1]]

['tm']
tsCount['ts']
['tm']
['ts']

GroupBy('tmCount')[by:['tm']]
Every('tsCount')[Count[decl:'count']]

tmCount['tm']
['ts', 'count'] ['tm']
['ts']

Every('tmCount')[Count[decl:'count']]

Hfs['TextLine[['offset', 'line']->[ALL]]']['output/arrivalrate/sec']']
['tm', 'count']
['tm']

Hfs['TextLine[['offset', 'line']->[ALL]]']['output/arrivalrate/min']']

[tail]



33
[head]

Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']']

['', ''] ['', '']
Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']']
['', ''] ['', '']

['', '']
Each('')[RegexSplitter[decl:'', ''][args:1]] Each('')[RegexSplitter[decl:'', '', '', ''][args:1]]
['', '']

['', ''] ['', '']
Each('')[RegexSplitter[decl:'', '', ''][args:1]]
['', ''] ['', '']

TempHfs['SequenceFile[['', '']]'][] TempHfs['SequenceFile[['', '']]'][]

['', ''] ['', '']
['', ''] ['', ''] ['', '']
['', '']
CoGroup('')[by:a:['']b:['']]

a[''],b['']
['', ''] ['', '', '', '']
['', '']
Each('')[Identity[decl:'', '', '']]

['', ''] ['', '', '']
['', ''] ['', '', '']

TempHfs['SequenceFile[['', '', '']]'][] TempHfs['SequenceFile[['', '']]'][]

['', '', ''] ['', '']
['', '', ''] ['', '']

CoGroup('')[by:a:['']b:['']]
['', '', '']
['', '', ''] ['', '']
['', '', ''] a[''],b[''] ['', '']
['', '', ''] ['', '', '', '', '']

Each('')[Identity[decl:'', '', '', '']]

['', '', '', '']
['', '', '', '']

TempHfs['SequenceFile[['', '', '', '']]'][]
['', '', '', '']
['', '', '', ''] ['', '', '', '']
['', '', '', '']

CoGroup('')[by:a:['']b:['']] CoGroup('')[by:a:['']b:['']]

a[''],b[''] a[''],b['']
['', '', '', '', '', '', ''] ['', '', '', '', '', '']

TempHfs['SequenceFile[['', '', '', '', '', '', '']]'][] Each('')[Identity[decl:'', '', '', '']]

['', '', '', '', '', '', ''] ['', '', '', '']
['', '', '', '', '', '', ''] ['', '', '', '']

CoGroup('')[by:a:['']b:['']] TempHfs['SequenceFile[['', '', '', '']]'][]

a[''],b[''] ['', '', '', '']
['', '', '', '', '', '', '', '', ''] ['', '', '', '']

Each('')[Identity[decl:'', '']] CoGroup('')[by:a:['']b:['']]

['', ''] a[''],b['']
['', ''] ['', '', '', '', '', '', '']

TempHfs['SequenceFile[['', '']]'][] Each('')[Identity[decl:'', '', '']]

['', ''] ['', '', '']
['', ''] ['', '', '']

GroupBy('')[by:['']] TempHfs['SequenceFile[['', '', '']]'][]

a[''] ['', '', '']
['', ''] ['', '', '']

Every('')[Agregator[decl:'']] CoGroup('')[by:a:['']b:['']]

a[''],b['']
['', '', '', '', '']
['', '']
['', '']
Each('')[Identity[decl:'', '', '']]

['', '', '']
['', '', '']

TempHfs['SequenceFile[['', '', '']]'][]

['', '', '']
['', '', '']

GroupBy('')[by:['']]

a['']
['', '', '']

Every('')[Agregator[decl:'', '']]

['', '', '']
['', '', '']


[tail]



34

FlumeJava

Ken Krugler

Bixo Labs &
Scale Unlimited



35

FlumeJava

Recent paper (June 2010) by Googlites - Chambers et. al.
Internal project to make it easier to write Map-Reduce jobs
Equivalent open source project just started
Plume - http://github.com/tdunning/Plume



36

Java Classes, Deferred Exec

Java-centric view of map-reduce processing
Handful of classes to represent immutable, parallel collections
PCollection, PTable
Small number of primitive operations on classes
parallelDo, groupByKey, combineValues, ﬂatten
Deferred execution allows construction/optimization of workﬂow



37

FlumeJava Code

PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

Collection<String> words = lines.parallelDo(new DoFn<String,String>() {
void process(String line, EmitFn<String> emitFn) {
for (String word : splitIntoWords(line)) {
emitFn.emit(word);
}
}
}, collectionOf(strings()));



38

Results of Use at Google

Text
Text

Lines of code

Number of Map-Reduce Jobs
Text


39

Pig & Hive

photo by: Tambako the Jaguar, ﬂickr
Ken Krugler
Bixo Labs &
Scale Unlimited



40

Pig

Pig is a data flow ‘programming environment’ for processing large files
PigLatin is the text language it provides for defining data flows
Developed by researchers in Yahoo! and widely used internally
A sub-project of Hadoop
APL licensed
http://hadoop.apache.org/pig/
First incubator release v0.1.0 on Sept 2008



41

Pig

Provides two kinds of optimizations:
Physical - caching and stream optimizations at the MapReduce level
Algebraic - mathematical reductions at the syntax level
Allows for User Deﬁned Functions (UDF)
must register jar of functions via register path/some.jar
can be called directly from PigLatin
B = foreach A generate myfunc.MyEvalFunc($0);



42

PigLatin - Example

W = LOAD 'filename' AS (url, outlink);
G = GROUP W by url;
R = FOREACH G {
FW = FILTER W BY outlink eq 'www.apache.org';
PW = FW.outlink;
DW = DISTINCT PW;
GENERATE group, COUNT(DW);
}



43

Hive

Hive is a data warehouse built on top of ﬂat ﬁles
Developed by Facebook
Available as a Hadoop contribution in v0.19.0
http://wiki.apache.org/hadoop/Hive



44

Hive

Data Organization into Tables with logical and hash partitioning
A Metastore to store metadata about Tables/Partitions/Buckets
Partitions are ‘how’ the data is stored (collections of related rows)
Buckets further divide Partitions based on a hash of a column
metadata is actually stored in a RDBMS
A SQL like query language over object data stored in Tables
Custom types, structs of primitive types



45

Hive - Queries

Queries always write results into a table

FROM user
INSERT OVERWRITE TABLE user_active
SELECT user.*
WHERE user.active = true;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count_distinct(pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY pv_users.age;


46

Datameer
Analytics
Solution (DAS)

Ken Krugler
Bixo Labs &
Scale Unlimited



47

DAS - Big Spreadsheets

Commercial solution from Datameer
Similar to IBM BigSheets
GUI + workﬂow editor on top of Hadoop
Deﬁne data sources, operations (via spreadsheet), and data sinks
http://www.datameer.com



48



49



50

Summary

Ken Krugler
Bixo Labs &
Scale Unlimited



51

Thinking at Scale with Hadoop

Understand the underlying Map-Reduce architecture
Model the problem as a workﬂow with operations on records
Consider using Cascading or (eventually) Plume
Use Pig/Hive to solve query-centric problems
Look at DAS or BigSheets for easier analytics solutions



52

Resources

Hadoop mailing lists - many, e.g. common-user@hadoop.apache.org
Hadoop: The Deﬁnitive Guide by Tom White (O’Reilly)
Hadoop Training
Scale Unlimited: http://scaleunlimited.com/courses
Cloudera: http://www.cloudera.com/hadoop-training/
Cascading: http://www.cascading.org
DAS: http://www.datameer.com


53

Questions?



Thinking at scale with hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (19)

Semelhante a Thinking at scale with hadoop

Semelhante a Thinking at scale with hadoop (20)

Mais de Ken Krugler

Mais de Ken Krugler (9)

Último

Último (20)

Thinking at scale with hadoop