2. Outline
• ML 101
– Basic formulation
– ML is not data mining
– Generalization and optimality
• Issues using Hadoop for ML
– Iterations
– Sparseness
• Case studies:
– Learning URL Patterns for Webpage De-duplication, WSDM 2010.
– PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, VLDB 2009.
3. ML 101
• Basic problem:
– Matrix of data points and features.
– Each data point is labeled.
– Learn the labeling function and predict the labels of unseen data points.
– A numeric label makes the task regression; otherwise it is classification (a minimal fit/predict sketch follows this slide).
[Figure: an N×M table of N data points (rows) by M features/attributes (columns), with a label column.]
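To make the formulation concrete, here is a minimal sketch, assuming scikit-learn and invented toy data (both my additions; the slides name no library): fit a classifier on an N×M matrix with labels, then predict labels for unseen rows.

```python
# A minimal sketch of the basic formulation, assuming scikit-learn.
# X is the N x M matrix of data points by features; y is the label column.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 7))    # N=200 students, M=7 course marks (toy data)
y = (X.mean(axis=1) > 55).astype(int)      # toy labeling function: 1 = Pass, 0 = Fail

clf = DecisionTreeClassifier().fit(X, y)   # learn the labeling function
X_unseen = rng.integers(0, 100, size=(5, 7))
print(clf.predict(X_unseen))               # predict the labels of unseen data points
# With a numeric label, swap in a regressor: the task becomes regression.
```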
4. Data Mining vs Machine Learning
• Machine learning is about finding an approximation to the boundary separating the classes, with a guarantee that it generalizes.
• Data mining is about describing the data using simple algebra.
– Hadoop is perfect for data processing and mining.
• An example (students' marks, class Pass/Fail):

Student  Course1  Course2  Course3  Course4  Course5  Course6  Course7  Class
R1       88       76       43       54       90       55       49       Pass
R2       60       45       32       51       80       53       60       Fail
…        …        …        …        …        …        …        …        …
• A hard problem:
– Students who fail may not all fail due to the same course.
– Finding the pass/fail boundary per course is not easy (courses differ in how leniently they evaluate).
5. How does a typical learning algorithm solve this?
• Intuition 1: Courses in which everyone fails or everyone passes are not of much use here (comments? Let's assume an unknown range of marks).
• Intuition 2: Courses in which 50% pass and 50% fail? (Good, but these can over-fit if there is a big spread in marks.)
• Overall intuition: Courses which have a high density of labels and good separation are best.
• Optimality:
– Criteria:
Separability assumption – convex guarantee (we don't pass someone who got low marks in a course based on performance in other courses).
Metric space of features (triangle inequality).
– An approximation to the optimum can be obtained by greedy iterations or hill climbing (see the scoring sketch below).
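A small sketch of the overall intuition (my own illustration; the slides give no code): score each course by the label-entropy drop of its best single mark threshold. High density and good separation show up as a large drop. The rows beyond the slide's R1/R2 are invented so the scores are non-trivial.

```python
# Score each course by the entropy drop of its best single threshold:
# a quantified version of "high density + good separation".
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_score(marks, labels):
    """Try every mark as a {course <= t} threshold; return the best entropy drop."""
    base, best = entropy(labels), 0.0
    for t in sorted(set(marks)):
        left = [y for m, y in zip(marks, labels) if m <= t]
        right = [y for m, y in zip(marks, labels) if m > t]
        if not left or not right:
            continue  # everyone on one side: a useless split (Intuition 1)
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - cond)
    return best

# The slide's R1/R2 plus two invented rows.
rows = [([88, 76, 43, 54, 90, 55, 49], "Pass"),
        ([60, 45, 32, 51, 80, 53, 60], "Fail"),
        ([70, 80, 50, 60, 85, 58, 52], "Pass"),
        ([55, 50, 35, 48, 75, 50, 45], "Fail")]
labels = [y for _, y in rows]
for c in range(7):
    print("Course%d: %.3f" % (c + 1, best_threshold_score([r[c] for r, _ in rows], labels)))
```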
7. How does ML work – continued?
• An old class of learners – tree induction.
– [Split] Choose the attribute (subject) that describes the final class with the least encoding.
If {attribute {=,≤,≥} value} homogeneously describes the outcome, you are done.
Else, for each {attribute {=,≤,≥} value} group, choose another attribute and iterate from the top.
– Intuition: Look at the toughest course; whoever gets low marks there also fails overall. Among those who passed this course, look at which course they failed and split on that, and so on (a sketch of this recursion follows this slide).
– When do we stop? What do we mean by homogeneous?
– What is over-fit? How do we prune?
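A compact sketch of the induction loop just described, under my own simplifications (numeric {attribute ≤ value} splits only, entropy as the homogeneity measure, a depth cap as a crude stand-in for pruning; all names are mine):

```python
# Greedy tree induction: pick the best {attribute <= value} split by entropy
# drop, recurse on both sides, and stop when a node is homogeneous (one
# class), no split helps, or the depth cap is hit.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attr, value, gain) of the best {attr <= value} split."""
    base, best = entropy(labels), (None, None, 0.0)
    for a in range(len(rows[0])):
        for v in sorted({r[a] for r in rows}):
            lo = [y for r, y in zip(rows, labels) if r[a] <= v]
            hi = [y for r, y in zip(rows, labels) if r[a] > v]
            if not lo or not hi:
                continue
            gain = base - (len(lo) * entropy(lo) + len(hi) * entropy(hi)) / len(labels)
            if gain > best[2]:
                best = (a, v, gain)
    return best

def induce(rows, labels, depth=0, max_depth=5):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:  # homogeneous, or depth cap
        return majority                              # leaf: majority class
    a, v, _ = best_split(rows, labels)
    if a is None:                                    # no split improves things
        return majority
    lo = [(r, y) for r, y in zip(rows, labels) if r[a] <= v]
    hi = [(r, y) for r, y in zip(rows, labels) if r[a] > v]
    return {"split": (a, v),
            "le": induce(*map(list, zip(*lo)), depth + 1, max_depth),
            "gt": induce(*map(list, zip(*hi)), depth + 1, max_depth)}

print(induce([[88, 60], [60, 80], [70, 55]], ["Pass", "Fail", "Pass"]))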
8. How would I implement this in MapReduce?
• A series of MapReduce jobs.
• Each stage:
– Map:
Collect statistics: {attribute {=,≤,≥} value} → {#Class1, #Class2, …}
– Reduce:
Choose the best split (e.g., by gain ratio):
∀k ∈ K, IG(k) = Entropy(C) − ∑_{v ∈ {c(k)}} ( #{c(k) = v} / #c(k) ) · Entropy(C | c(k) = v)
• How good is this?
– Pretty bad (3B records took well over 100 hours on 100 nodes).
– Why?
The map output blows up the space: (N×M) × the number of maps.
– One quick solution: combiners (sketched below).
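A Hadoop Streaming-style sketch of one stage, under my own framing (the slides show no code; the plain-function structure, names, and toy data are mine): the mapper emits one pair per {attribute, value, class} cell, the combiner collapses those counts inside each map task so the shuffle is no longer (N×M) × #maps, and the reducer scores candidate {attribute ≤ value} splits with the information-gain formula above.

```python
# One MapReduce stage for tree induction, sketched as plain functions.
import math
from collections import Counter, defaultdict

def mapper(records):
    """Emit ((attr, value, label), 1) for every cell of every record."""
    for features, label in records:
        for a, v in enumerate(features):
            yield (a, v, label), 1

def combiner(pairs):
    """Pre-aggregate counts inside one map task to shrink the shuffle."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts.items()

def reducer(counted):
    """Pick the best {attr <= value} split per attribute by information gain."""
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (a, v, label), n in counted:
        per_attr[a][v][label] += n

    def H(c):  # entropy of a class-count histogram
        tot = sum(c.values())
        return -sum(x / tot * math.log2(x / tot) for x in c.values())

    best = {}
    for a, by_value in per_attr.items():
        total = sum(by_value.values(), Counter())
        base, n_tot = H(total), sum(total.values())
        for v in sorted(by_value):
            left = sum((by_value[u] for u in by_value if u <= v), Counter())
            right = total - left
            if not left or not right:
                continue
            ig = base - (sum(left.values()) * H(left) +
                         sum(right.values()) * H(right)) / n_tot
            if ig > best.get(a, (None, 0.0))[1]:
                best[a] = (v, ig)
    return best

# Toy run: one map task's output, combined, then reduced.
data = [([88, 76], "Pass"), ([60, 45], "Fail"), ([70, 80], "Pass")]
print(reducer(combiner(mapper(data))))
```

Without the combiner, every (record, attribute) pair travels through the shuffle; with it, each map task ships only one count per distinct (attribute, value, class) key.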
9. What else is bad?
• Data sparsity on the Internet:
– Any attribute we choose on the Internet follows a power law (the layman's 80:20 rule).
Lots of attribute values occur only once.
• Why is this bad? (Not a blame game.)
– Hadoop's problem:
Too many files.
Each file is a map.
Empty reducers.
– Our problem:
The majority of the splits are useless.
10. What tricks did we use?
• Observations:
– The first split is the hardest (you have to look at all the data).
In fact, it is difficult to beat the performance of a single box with sampling.
– Most of the long tail can be grouped together.
• Tricks:
– Speculation helps:
Not only Hadoop's speculative execution.
When doing the first split, you can already choose the candidates for the next few levels.
– At each split, group together all attribute values with meaninglessly small support (also use Gnu Natural Hash; a grouping sketch follows below).
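A sketch of the long-tail grouping trick (illustration only; the threshold and the OTHER bucket name are my choices): collapse attribute values with meaninglessly small support into one bucket before emitting split statistics, so the power-law tail of once-seen values cannot flood the shuffle with useless candidate splits.

```python
# Fold the power-law tail of rare attribute values into one bucket.
from collections import Counter

def group_tail(values, min_support=5, other="OTHER"):
    """Replace values seen fewer than min_support times with a single bucket."""
    support = Counter(values)
    return [v if support[v] >= min_support else other for v in values]

# 30 hosts that each occur once collapse into one OTHER bucket.
hosts = ["a.com"] * 50 + ["b.com"] * 20 + ["x%d.com" % i for i in range(30)]
print(Counter(group_tail(hosts)))  # Counter({'a.com': 50, 'OTHER': 30, 'b.com': 20})
```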
11. Performance
[Chart: time taken (s) vs. depth of the tree (1 to 10), comparing Single Node (Sampling), 100 Node (No grouping), 100 Node (Grouping), and 100 Node (Speculation); panels: our observations and Panda et al.]
12. To Conclude
• Hadoop is a great tool for data aggregations.
• With careful handling, it can achieve perfect scale-ups.
• Lots of research still needs to go into building ML tools on Hadoop:
– http://lucene.apache.org/mahout/
– Main pieces to build:
Smart ways to carry information across iterations.
Smart ways to avoid data sparsity.
– Small things Hadoop can help with:
Avoid creating unnecessary small files (run maps across a single file).
Automatic, balanced distribution of keys across reducers.