7. “I use Hadoop often. That’s the sound my Tivo makes every time I skip a commercial.”
— Brian Goetz, author of Java Concurrency in Practice
44. Applications
Protein folding
pharmaceutical research
Search engine indexing
walking billions of web pages
Product recommendations
based on other customer purchases
Sorting
terabytes to petabytes in size
Classification
government intelligence
46. -- Recommend up to 5 products bought by the customers who also bought this product
SELECT reccProd.name, reccProd.id
FROM products reccProd
WHERE reccProd.id IN
  (SELECT p2.productId
   FROM purchases p1, purchases p2
   WHERE p1.productId = thisProd
     AND p2.customerId = p1.customerId
     AND p2.productId <> thisProd)
LIMIT 5
51. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004
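The map/reduce model the abstract describes can be sketched in plain Java, with no Hadoop involved. This is a minimal word-count illustration only; the class name `MiniMapReduce` and the sample input are made up, and the "shuffle" is simulated with an in-memory grouping collector:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    // map: one input record -> a set of intermediate (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // reduce: a key plus all intermediate values sharing it -> one combined value
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // "shuffle" phase: group the intermediate pairs by key
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(word, counts)));
    }
}
```

The framework's value is that the grouping step above, trivial here, is what gets partitioned, distributed, and re-executed on failure across thousands of machines.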
52. “A programming model and implementation for processing and generating large data sets”
69. /**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
70. /** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter)
      throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}
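The core of this mapper is just `java.util.regex`: emit (match, 1) for every occurrence of the configured capture group. The same extract-and-count logic can be tried standalone, without Hadoop; in this sketch the class name `RegexExtract`, the sample text, and the pattern are all illustrative, and the tallying stands in for what a counting reducer would do:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Mirrors RegexMapper.map: emit one hit per match of the chosen
    // capture group, then tally the hits per distinct match.
    public static Map<String, Integer> extract(String text, String regex, int group) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            counts.merge(m.group(group), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // group 1 captures just the severity word
        System.out.println(extract("WARN disk, ERROR net, ERROR disk",
                                   "(WARN|ERROR)", 1)); // prints {WARN=1, ERROR=2}
    }
}
```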
71. Have Code,
Will Travel
Code travels to the data
Opposite of traditional systems
74. “scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
75. HDFS Basics
Open source implementation of the
Google File System (GFS)
Replicated data store
Stored in 64MB blocks
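The block size and replication factor on this slide are ordinary settings in `hdfs-site.xml`. A minimal sketch, assuming the classic Hadoop 0.x/1.x property names of this deck's era (the values shown are that era's defaults):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- copies of each block -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64MB, in bytes -->
  </property>
</configuration>
```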
76. HDFS
Rack location aware
Configurable redundancy factor
Self-healing
Looks almost like *NIX filesystem
95. Server Funerals
No pagers go off when machines die
Report of dead machines once a week
Clean out the carcasses
96. Robustness attributes prevented from
bleeding into application code
Data redundancy
Node death
Retries
Data geography
Parallelism
Scalability
128. Pig Questions
Ask big questions on unstructured
data
How many ___?
Should we ____?
Decide on the questions you want to
ask long after you’ve collected the
data.
129. Pig Data
999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com,
999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,ad2U
999993,female,Ruth,G,Carter,4890 Murphy Court,Shakopee,MN,55379,US,Ruth.G.Carter@mailinator.com,uaseu8e
999994,male,Lionel,J,Carter,2701 Irving Road,Saint Clairsville,OH,43950,US,Lionel.J.Carter@trashymail.c
999995,female,Georgia,C,Medina,4541 Cooks Mine Road,CLOVIS,NM,88101,US,Georgia.C.Medina@trashymail.com,
999996,male,Stanley,S,Cruz,1463 Jennifer Lane,Durham,NC,27703,US,Stanley.S.Cruz@pookmail.com,aehoh5rooG
999997,male,Justin,A,Delossantos,3169 Essex Court,MANCHESTER,VT,05254,US,Justin.A.Delossantos@mailinato
999998,male,Leonard,K,Baker,4672 Margaret Street,Houston,TX,77063,US,Leonard.K.Baker@trashymail.com,Aep
999999,female,Charissa,J,Thorne,2806 Cedar Street,Little Rock,AR,72211,US,Charissa.J.Thorne@trashymail.
1000000,male,Michael,L,Powell,2797 Turkey Pen Road,New York,NY,10013,US,Michael.L.Powell@mailinator.com
130. Pig Sample
Person = LOAD 'people.csv' using PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames
GENERATE group, COUNT(OrderedNames);
STORE NameCount INTO 'names.out';
163. [Diagram: cluster startup. start-all.sh on the primary node invokes start-dfs.sh and start-mapred.sh; start-dfs.sh launches the NameNode on the primary and a DataNode on each of the 1..N commodity nodes; start-mapred.sh launches the JobTracker on the primary and a TaskTracker on each of the 1..N commodity nodes.]