Talk given in September 2011 by Ted Dunning to the Bay Area Data Mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.
9. Sharded text indexing
[Diagram: input documents → map (assign documents to shards) → reducer (index text to local disk, then copy the index to clustered index storage) → search engine. A copy to the search engine's local disk is typically required before the index can be loaded.]
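The shard-assignment step in the mapper is just a partitioning decision. A minimal sketch, assuming Hadoop's new-style mapper API (the class name and shard count are hypothetical, not from the talk):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Route each document to a shard by hashing its id so that one
    // reducer ends up building the index for one shard.
    public class ShardAssignMapper extends Mapper<Text, Text, IntWritable, Text> {
      private static final int NUM_SHARDS = 16;  // illustrative shard count
      private final IntWritable shard = new IntWritable();

      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        // A stable hash sends the same document to the same shard every time.
        shard.set((docId.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS);
        context.write(shard, doc);
      }
    }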
10. Conventional data flow
[Diagram: the same indexing pipeline with its failure modes marked. Failure of a reducer causes garbage to accumulate on the local disk; failure of the search engine requires another download of the index from clustered storage.]
11. Simplified NFS data flows
[Diagram: input documents → map → reducer, which writes the index to the task work directory via NFS into clustered index storage. Failure of a reducer is cleaned up by the map-reduce framework, and the search engine reads the mirrored index directly.]
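To make "index to task work directory via NFS" concrete: since the cluster file system is NFS-mounted, the reducer can use ordinary file I/O. A minimal sketch using Lucene 3.x (the library choice and the mount path are assumptions, not from the talk):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class NfsIndexWriter {
      public static void main(String[] args) throws Exception {
        // Hypothetical NFS mount: the task work directory is an ordinary path.
        File workDir = new File("/mapr/my.cluster.com/tasks/shard-0042");

        // Plain local-file Lucene I/O -- no HDFS API, no copy step.
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(workDir),
            new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        doc.add(new Field("body", "an indexed document",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
      }
    }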
16. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram: centroids are read from HDFS to local disk by the distributed cache, mappers read them from local disk, and new centroids are written back to HDFS by map-reduce.]
17. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram: with MapR FS, mappers read the centroids directly over NFS and new centroids are written back by map-reduce; no distributed-cache copy is needed.]
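The combiner/reducer step on the two slides above does the same arithmetic in both data flows: add up the counts and the weighted point sums, then divide. A minimal stand-alone sketch (the Partial class is hypothetical; Hadoop plumbing is omitted):

    import java.util.Arrays;
    import java.util.List;

    /** A (count, mean-of-points) pair: a mapper emits (1, point). */
    class Partial {
      final long n;
      final double[] vec;
      Partial(long n, double[] vec) { this.n = n; this.vec = vec; }
    }

    public class KMeansReduce {
      /** Combiner/reducer logic for one cluster id: sum counts and
          weighted point sums, then emit (n, sum / n). */
      static Partial reduce(List<Partial> values) {
        long n = 0;
        double[] sum = new double[values.get(0).vec.length];
        for (Partial p : values) {
          n += p.n;
          // each value is (count, mean); n * mean recovers the weighted sum
          for (int i = 0; i < sum.length; i++) sum[i] += p.n * p.vec[i];
        }
        double[] mean = new double[sum.length];
        for (int i = 0; i < sum.length; i++) mean[i] = sum[i] / n;
        return new Partial(n, mean);  // the cluster's (n, sum/n)
      }

      public static void main(String[] args) {
        // two mapper outputs for one cluster: (1, point) each
        Partial out = reduce(Arrays.asList(
            new Partial(1, new double[] {1, 2}),
            new Partial(1, new double[] {3, 4})));
        System.out.println(out.n + " " + Arrays.toString(out.vec)); // 2 [2.0, 3.0]
      }
    }

Because the count n travels with the mean, any later combine step can recover the weighted sum as n times the mean, which is what makes the combiner safe to apply repeatedly.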
18. Poor man’s Pregel
• Mapper
• Lines in bold can use conventional I/O via NFS
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
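A sketch of how this loop might look inside a single mapper when the shared model lives on an NFS-mounted cluster file system (the path, the toy one-number "model," and all helper methods are hypothetical):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PregelishLoop {
      public static void main(String[] args) throws IOException, InterruptedException {
        Path shared = Paths.get("/mapr/my.cluster.com/models");  // hypothetical mount
        double model = 0;                                        // toy one-number model
        while (!converged(model)) {
          // read and accumulate the models written by the other tasks
          try (DirectoryStream<Path> peers = Files.newDirectoryStream(shared, "model-*")) {
            for (Path p : peers) {
              model += Double.parseDouble(
                  Files.readAllLines(p, StandardCharsets.UTF_8).get(0));
            }
          }
          model = accumulate(model);  // for each input: accumulate model
          // write the model where the peers can read it: plain file I/O via NFS
          Files.write(shared.resolve("model-" + taskId()),
              String.valueOf(model).getBytes(StandardCharsets.UTF_8));
          synchronize();  // wait for all peers, then reset input and repeat
        }
        System.out.println("summary: " + model);  // emit summary
      }

      static boolean converged(double m) { return m > 100; }
      static double accumulate(double m) { return m + 1; }
      static String taskId() { return "0"; }
      static void synchronize() throws InterruptedException { Thread.sleep(100); }
    }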
21. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
23. Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
27. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 EUR (thirty-three million one hundred
thousand euros only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
28. Mahout’s SGD
• Learns on-line per example
– O(1) memory
– O(1) time per training example
• Sequential implementation
– fast, but not parallel
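A minimal sketch of the sequential learner using Mahout's OnlineLogisticRegression (the feature count, prior, and hyper-parameters here are illustrative, not from the talk):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class SgdExample {
      public static void main(String[] args) {
        // 2 categories, 1000 features, L1 prior -- illustrative values
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 1000, new L1())
                .learningRate(1)
                .lambda(1e-4);

        // one example at a time: O(1) memory, O(1) time per example
        Vector features = new DenseVector(1000);
        features.setQuick(42, 1.0);            // toy feature
        learner.train(1, features);            // (target category, features)

        System.out.println(learner.classifyScalar(features));  // P(category 1)
      }
    }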
29. Special Features
• Hashed feature encoding
• Per-term annealing
– learn the boring stuff once
• Auto-magical learning knob turning
– learns correct learning rate, learns correct
learning rate for learning learning rate, ...
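Hashed feature encoding is what keeps memory constant: tokens are hashed straight into a fixed-size vector, so no dictionary is ever built or stored. A minimal sketch with Mahout's encoders (the vector size and field name are illustrative):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncoding {
      public static void main(String[] args) {
        // tokens are hashed into a fixed-size vector: no dictionary needed
        Vector v = new RandomAccessSparseVector(1000);   // illustrative size
        FeatureVectorEncoder words = new StaticWordValueEncoder("body");
        for (String token : "small example of hashed encoding".split(" ")) {
          words.addToVector(token, v);
        }
        System.out.println(v);
      }
    }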
37. And again
• AdaptiveLogisticRegression
– 20 x CrossFoldLearner
– evolves good learning and regularization rates
– 100 x more work than basic learner
– still faster than disk + encoding
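A minimal sketch of the adaptive wrapper (the sizes and the toy training stream are illustrative; the pool of learners and the evolution of rates are handled internally):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class AdaptiveExample {
      public static void main(String[] args) {
        // 2 categories, 1000 hashed features, L1 prior -- all illustrative
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, 1000, new L1());

        Vector v = new RandomAccessSparseVector(1000);
        v.setQuick(7, 1.0);                  // toy feature
        for (int i = 0; i < 10000; i++) {
          learner.train(1, v);               // toy training stream
        }
        learner.close();

        // the evolved pool's best cross-validated learner does the scoring
        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println(best.classifyScalar(v));
      }
    }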
38. A comparison
• Traditional view
– 400 x (read + OLR)
• Revised Mahout view
– 1 x (read + mu x 100 x OLR) x eta
– mu = efficiency from killing losers early
– eta = efficiency from stopping early
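To put rough numbers on the comparison (the values of mu and eta below are hypothetical, chosen only for illustration): with mu = 0.25 and eta = 0.5,

    1 x (read + 0.25 x 100 x OLR) x 0.5 = 0.5 x read + 12.5 x OLR

against 400 x read + 400 x OLR for the traditional view: hundreds of times less reading and tens of times less learning work.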