Talk given in September 2011 by Ted Dunning to the Bay Area Data Mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.
9. Sharded text indexing
[Diagram: input documents → map (assign documents to shards) → reducer (index text to local disk, then copy the index to clustered index storage) → search engine. A copy to the search engine's local disk is typically required before the index can be loaded.]
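The shard-assignment step in the mapper is just a partitioning decision. A minimal sketch, assuming Hadoop's new-style mapper API (the class name and shard count are hypothetical, not from the talk):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Route each document to a shard by hashing its id so that one
    // reducer ends up building the index for one shard.
    public class ShardAssignMapper extends Mapper<Text, Text, IntWritable, Text> {
      private static final int NUM_SHARDS = 16;  // illustrative shard count
      private final IntWritable shard = new IntWritable();

      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        // A stable hash sends the same document to the same shard every time.
        shard.set((docId.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS);
        context.write(shard, doc);
      }
    }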
10. Conventional data flow
[Diagram: the same indexing pipeline with its failure modes marked. Failure of a reducer causes garbage to accumulate on the local disk; failure of the search engine requires another download of the index from clustered storage.]
11. Simplified NFS data flows
[Diagram: input documents → map → reducer, which writes the index to the task work directory via NFS into clustered index storage. Failure of a reducer is cleaned up by the map-reduce framework, and the search engine reads the mirrored index directly.]
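To make "index to task work directory via NFS" concrete: since the cluster file system is NFS-mounted, the reducer can use ordinary file I/O. A minimal sketch using Lucene 3.x (the library choice and the mount path are assumptions, not from the talk):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class NfsIndexWriter {
      public static void main(String[] args) throws Exception {
        // Hypothetical NFS mount: the task work directory is an ordinary path.
        File workDir = new File("/mapr/my.cluster.com/tasks/shard-0042");

        // Plain local-file Lucene I/O -- no HDFS API, no copy step.
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(workDir),
            new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        doc.add(new Field("body", "an indexed document",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
      }
    }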
16. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram: centroids are read from HDFS to local disk by the distributed cache, mappers read them from local disk, and new centroids are written back to HDFS by map-reduce.]
17. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram: with MapR FS, mappers read the centroids directly over NFS and new centroids are written back by map-reduce; no distributed-cache copy is needed.]
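The combiner/reducer step on the two slides above does the same arithmetic in both data flows: add up the counts and the weighted point sums, then divide. A minimal stand-alone sketch (the Partial class is hypothetical; Hadoop plumbing is omitted):

    import java.util.Arrays;
    import java.util.List;

    /** A (count, mean-of-points) pair: a mapper emits (1, point). */
    class Partial {
      final long n;
      final double[] vec;
      Partial(long n, double[] vec) { this.n = n; this.vec = vec; }
    }

    public class KMeansReduce {
      /** Combiner/reducer logic for one cluster id: sum counts and
          weighted point sums, then emit (n, sum / n). */
      static Partial reduce(List<Partial> values) {
        long n = 0;
        double[] sum = new double[values.get(0).vec.length];
        for (Partial p : values) {
          n += p.n;
          // each value is (count, mean); n * mean recovers the weighted sum
          for (int i = 0; i < sum.length; i++) sum[i] += p.n * p.vec[i];
        }
        double[] mean = new double[sum.length];
        for (int i = 0; i < sum.length; i++) mean[i] = sum[i] / n;
        return new Partial(n, mean);  // the cluster's (n, sum/n)
      }

      public static void main(String[] args) {
        // two mapper outputs for one cluster: (1, point) each
        Partial out = reduce(Arrays.asList(
            new Partial(1, new double[] {1, 2}),
            new Partial(1, new double[] {3, 4})));
        System.out.println(out.n + " " + Arrays.toString(out.vec)); // 2 [2.0, 3.0]
      }
    }

Because the count n travels with the mean, any later combine step can recover the weighted sum as n times the mean, which is what makes the combiner safe to apply repeatedly.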
18. Poor man’s Pregel
• Mapper
• Lines in bold can use conventional I/O via NFS
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
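A sketch of how this loop might look inside a single mapper when the shared model lives on an NFS-mounted cluster file system (the path, the toy one-number "model," and all helper methods are hypothetical):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PregelishLoop {
      public static void main(String[] args) throws IOException, InterruptedException {
        Path shared = Paths.get("/mapr/my.cluster.com/models");  // hypothetical mount
        double model = 0;                                        // toy one-number model
        while (!converged(model)) {
          // read and accumulate the models written by the other tasks
          try (DirectoryStream<Path> peers = Files.newDirectoryStream(shared, "model-*")) {
            for (Path p : peers) {
              model += Double.parseDouble(
                  Files.readAllLines(p, StandardCharsets.UTF_8).get(0));
            }
          }
          model = accumulate(model);  // for each input: accumulate model
          // write the model where the peers can read it: plain file I/O via NFS
          Files.write(shared.resolve("model-" + taskId()),
              String.valueOf(model).getBytes(StandardCharsets.UTF_8));
          synchronize();  // wait for all peers, then reset input and repeat
        }
        System.out.println("summary: " + model);  // emit summary
      }

      static boolean converged(double m) { return m > 100; }
      static double accumulate(double m) { return m + 1; }
      static String taskId() { return "0"; }
      static void synchronize() throws InterruptedException { Thread.sleep(100); }
    }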
21. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
23. Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
27. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 EUR (thirty-three million one hundred
thousand euros only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
28. Mahout’s SGD
• Learns on-line per example
– O(1) memory
– O(1) time per training example
• Sequential implementation
– fast, but not parallel
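A minimal sketch of the sequential learner using Mahout's OnlineLogisticRegression (the feature count, prior, and hyper-parameters here are illustrative, not from the talk):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class SgdExample {
      public static void main(String[] args) {
        // 2 categories, 1000 features, L1 prior -- illustrative values
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 1000, new L1())
                .learningRate(1)
                .lambda(1e-4);

        // one example at a time: O(1) memory, O(1) time per example
        Vector features = new DenseVector(1000);
        features.setQuick(42, 1.0);            // toy feature
        learner.train(1, features);            // (target category, features)

        System.out.println(learner.classifyScalar(features));  // P(category 1)
      }
    }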
29. Special Features
• Hashed feature encoding
• Per-term annealing
– learn the boring stuff once
• Auto-magical learning knob turning
– learns correct learning rate, learns correct
learning rate for learning learning rate, ...
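Hashed feature encoding is what keeps memory constant: tokens are hashed straight into a fixed-size vector, so no dictionary is ever built or stored. A minimal sketch with Mahout's encoders (the vector size and field name are illustrative):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncoding {
      public static void main(String[] args) {
        // tokens are hashed into a fixed-size vector: no dictionary needed
        Vector v = new RandomAccessSparseVector(1000);   // illustrative size
        FeatureVectorEncoder words = new StaticWordValueEncoder("body");
        for (String token : "small example of hashed encoding".split(" ")) {
          words.addToVector(token, v);
        }
        System.out.println(v);
      }
    }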
37. And again
• AdaptiveLogisticRegression
– 20 x CrossFoldLearner
– evolves good learning and regularization rates
– 100 x more work than basic learner
– still faster than disk + encoding
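A minimal sketch of the adaptive wrapper (the sizes and the toy training stream are illustrative; the pool of learners and the evolution of rates are handled internally):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class AdaptiveExample {
      public static void main(String[] args) {
        // 2 categories, 1000 hashed features, L1 prior -- all illustrative
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, 1000, new L1());

        Vector v = new RandomAccessSparseVector(1000);
        v.setQuick(7, 1.0);                  // toy feature
        for (int i = 0; i < 10000; i++) {
          learner.train(1, v);               // toy training stream
        }
        learner.close();

        // the evolved pool's best cross-validated learner does the scoring
        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println(best.classifyScalar(v));
      }
    }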
38. A comparison
• Traditional view
– 400 x (read + OLR)
• Revised Mahout view
– 1 x (read + mu x 100 x OLR) x eta
– mu = efficiency from killing losers early
– eta = efficiency from stopping early
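To put rough numbers on the comparison (the values of mu and eta below are hypothetical, chosen only for illustration): with mu = 0.25 and eta = 0.5,

    1 x (read + 0.25 x 100 x OLR) x 0.5 = 0.5 x read + 12.5 x OLR

against 400 x read + 400 x OLR for the traditional view: hundreds of times less reading and tens of times less learning work.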