The document discusses Mahout, an open-source machine learning library. It describes Mahout's capabilities for recommendations, clustering, classification, and other machine learning tasks. Specifically, it covers Mahout's online training algorithms, such as logistic regression, which can train models on small to moderate-sized datasets efficiently and with low overhead. The document provides examples and details of Mahout's approach to online learning and model training.
3. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Wednesday, March 16, 2011
5. Classification in Detail
• Naive Bayes Family
    • Hadoop-based training
• Decision Forests
    • Hadoop-based training
• Logistic Regression (aka SGD)
    • fast on-line (sequential) training
19. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual benefit.
I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur (thirty-three million one hundred
thousand euros, only) into a foreign company's bank
account for our favor.
...
20. And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
23. Mahout’s SGD
• Learns on-line per example
    • O(1) memory
    • O(1) time per training example
• Sequential implementation
    • fast, but not parallel
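To make the per-example update concrete, here is a minimal Python sketch of logistic regression trained by stochastic gradient descent. It is not Mahout's code: the class name OnlineLogistic, the toy_stream data generator, and the learning rate are all assumptions made for illustration. The point it shows is that the learner keeps only a fixed-size weight vector, so memory and time per example do not depend on how many examples have been seen.

```python
import math
import random


class OnlineLogistic:
    """Binary logistic regression updated one example at a time (illustrative)."""

    def __init__(self, num_features, learning_rate=0.5):
        # The only state is one weight per feature, so memory does not
        # grow with the number of training examples.
        self.w = [0.0] * num_features
        self.learning_rate = learning_rate

    def predict(self, x):
        z = sum(wj * xj for wj, xj in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        # One gradient step on the log-likelihood of a single example.
        error = y - self.predict(x)
        for j, xj in enumerate(x):
            self.w[j] += self.learning_rate * error * xj


def toy_stream(n, seed=0):
    """Hypothetical data: x = [bias, v], label is 1 when v > 0.5."""
    rng = random.Random(seed)
    for _ in range(n):
        v = rng.random()
        yield [1.0, v], 1 if v > 0.5 else 0


model = OnlineLogistic(num_features=2)
for x, y in toy_stream(2000):
    model.train(x, y)  # each example is used once and then discarded
```

Because every example is consumed in order by a single model, the loop is sequential by construction, which matches the "fast, but not parallel" trade-off on the slide.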
24. Special Features
• Hashed feature encoding
• Per-term annealing
    • learn the boring stuff once
• Auto-magical learning knob turning
    • learns correct learning rate, learns correct learning rate for learning learning rate, ...
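Hashed feature encoding can be sketched in a few lines of Python. This is an illustration of the general hashing trick, not Mahout's encoder classes: the function names hashed_index and encode, the bucket count, the use of MD5, and the choice of two probes per token are all assumptions. Each token is hashed straight to a position in a fixed-size vector, so no dictionary of known terms has to be built or stored.

```python
import hashlib


def hashed_index(name, num_buckets, probe=0):
    """Map a feature name to a bucket using a stable hash (illustrative)."""
    digest = hashlib.md5(f"{probe}:{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def encode(tokens, num_buckets=1000, probes=2):
    """Encode tokens into a fixed-size vector without a vocabulary."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # Several probes per token soften the effect of a single collision.
        for probe in range(probes):
            vector[hashed_index(token, num_buckets, probe)] += 1.0
    return vector


tokens = "are you available tomorrow at noon".split()
vector = encode(tokens)
```

The vector length is fixed up front, so previously unseen words need no special handling; the cost is that two different words can occasionally share a bucket.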
33. General Structure
• OnlineLogisticRegression
    • Traditional logistic regression
    • Stochastic Gradient Descent
    • Per-term annealing
    • Too fast (for the disk + encoder)
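Per-term annealing can be pictured as giving every feature its own learning rate that shrinks as that feature is updated more often. The Python sketch below assumes a simple square-root decay schedule; the class name PerTermAnnealedLogistic, the base_rate and offset parameters, and the schedule itself are illustrative choices and are not taken from Mahout's source. Frequently seen terms settle quickly, while rare terms keep a large step size for when they finally appear.

```python
import math


class PerTermAnnealedLogistic:
    """Logistic regression with one decaying learning rate per feature."""

    def __init__(self, num_features, base_rate=0.5, offset=10.0):
        self.w = [0.0] * num_features
        self.counts = [0] * num_features  # updates applied to each feature
        self.base_rate = base_rate
        self.offset = offset

    def term_rate(self, j):
        # Assumed schedule: rate falls as the feature's update count grows.
        return self.base_rate * math.sqrt(self.offset / (self.offset + self.counts[j]))

    def predict(self, x):
        z = sum(wj * xj for wj, xj in zip(self.w, x))
        z = max(-30.0, min(30.0, z))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        error = y - self.predict(x)
        for j, xj in enumerate(x):
            if xj != 0.0:  # only features present in this example are annealed
                self.w[j] += self.term_rate(j) * error * xj
                self.counts[j] += 1


model = PerTermAnnealedLogistic(num_features=3)
for i in range(100):
    # Feature 0 is always present, feature 1 appears once, feature 2 never.
    x = [1.0, 1.0 if i == 0 else 0.0, 0.0]
    model.train(x, i % 2)
```

After this run the common feature has the smallest rate and the unseen feature still has the full base rate, which is one way to read "learn the boring stuff once".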
34. Next Level
• CrossFoldLearner
    • contains multiple primitive learners
    • online cross validation
    • 5x more work
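One way to do cross validation online is to keep several learners side by side and, for each incoming example, let one learner score it as held-out data while the others train on it. The Python sketch below follows that idea with five folds. The class names TinyLogistic and CrossFoldSketch, the round-robin fold assignment, and the toy data are assumptions for illustration and should not be read as Mahout's implementation.

```python
import math
import random


class TinyLogistic:
    """Minimal online logistic learner used as the primitive learner."""

    def __init__(self, num_features, learning_rate):
        self.w = [0.0] * num_features
        self.learning_rate = learning_rate
        self.trained = 0

    def predict(self, x):
        z = sum(wj * xj for wj, xj in zip(self.w, x))
        z = max(-30.0, min(30.0, z))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        error = y - self.predict(x)
        for j, xj in enumerate(x):
            self.w[j] += self.learning_rate * error * xj
        self.trained += 1


class CrossFoldSketch:
    """Holds `folds` learners; each example is held out from exactly one."""

    def __init__(self, num_features, folds=5, learning_rate=0.5):
        self.models = [TinyLogistic(num_features, learning_rate) for _ in range(folds)]
        self.seen = 0
        self.loss_sum = 0.0

    def train(self, x, y):
        held_out = self.seen % len(self.models)  # round-robin fold choice
        for i, model in enumerate(self.models):
            if i == held_out:
                # This learner has not trained on the example: score it.
                p = min(max(model.predict(x), 1e-9), 1.0 - 1e-9)
                self.loss_sum -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            else:
                model.train(x, y)
        self.seen += 1

    def mean_log_loss(self):
        return self.loss_sum / self.seen


def run_cross_fold(n=1000, seed=0):
    """Feed hypothetical data: x = [bias, v], label is 1 when v > 0.5."""
    rng = random.Random(seed)
    learner = CrossFoldSketch(num_features=2)
    for _ in range(n):
        v = rng.random()
        learner.train([1.0, v], 1 if v > 0.5 else 0)
    return learner
```

Every example touches all five learners (four updates and one scoring pass), which is where the roughly 5x cost comes from; in return, a held-out quality estimate is available at any point during training without a separate pass over the data.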
35. And again
• AdaptiveLogisticRegression
    • 20 x CrossFoldLearner
    • evolves good learning and regularization rates
    • 100 x more work than basic learner
    • still faster than disk + encoding
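The idea of evolving good rates can be sketched as a small population of learners that compete on held-out loss. The Python below is a deliberately simplified stand-in, not Mahout's algorithm: it evolves only the learning rate (not regularization), it scores each example before training on it instead of using a cross-fold learner, and the names Candidate, evolve, and run_evolution, the mutation width, and the epoch sizes are all hypothetical.

```python
import math
import random


class Candidate:
    """One online logistic learner with its own learning rate."""

    def __init__(self, num_features, rate):
        self.w = [0.0] * num_features
        self.rate = rate
        self.loss = 0.0
        self.count = 0

    def score_then_train(self, x, y):
        z = sum(wj * xj for wj, xj in zip(self.w, x))
        z = max(-30.0, min(30.0, z))
        p = min(max(1.0 / (1.0 + math.exp(-z)), 1e-9), 1.0 - 1e-9)
        # Score first, so the loss is measured on data not yet trained on.
        self.loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        self.count += 1
        for j, xj in enumerate(x):
            self.w[j] += self.rate * (y - p) * xj

    def mean_loss(self):
        return self.loss / self.count if self.count else float("inf")


def evolve(population, rng):
    """Keep the better half; refill with mutated copies of the survivors."""
    ranked = sorted(population, key=Candidate.mean_loss)
    survivors = ranked[: len(ranked) // 2]
    children = []
    for parent in survivors:
        # Multiplicative mutation keeps the rate positive.
        child = Candidate(len(parent.w), parent.rate * math.exp(rng.gauss(0.0, 0.5)))
        child.w = list(parent.w)  # the child starts from the parent's weights
        children.append(child)
    for survivor in survivors:
        survivor.loss, survivor.count = 0.0, 0  # compare fairly next epoch
    return survivors + children


def run_evolution(epochs=10, batch=200, size=20, seed=1):
    rng = random.Random(seed)
    population = [Candidate(2, 10 ** rng.uniform(-3, 0)) for _ in range(size)]
    for epoch in range(epochs):
        for _ in range(batch):
            v = rng.random()  # hypothetical data: label is 1 when v > 0.5
            x, y = [1.0, v], 1 if v > 0.5 else 0
            for candidate in population:
                candidate.score_then_train(x, y)
        if epoch < epochs - 1:
            population = evolve(population, rng)
    return population, min(population, key=Candidate.mean_loss)
```

All candidates share a single pass over the data, so the extra cost is computation only; the data is read and encoded once no matter how many candidates are in the population.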
36. A comparison
• Traditional view
    • 400 x (read + OLR)
• Revised Mahout view
    • 1 x (read + mu x 100 x OLR) x eta
    • mu = efficiency from killing losers early
    • eta = efficiency from stopping early
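The two cost expressions above can be compared with a small worked calculation. In the Python sketch below the cost units and the values chosen for read, olr, mu, and eta are hypothetical, and mu and eta are interpreted as fractions of the work that remains (values at or below 1), which is an assumption about how the slide intends them.

```python
def traditional_cost(read, olr, runs=400):
    """400 separate runs, each reading the data and training one learner."""
    return runs * (read + olr)


def revised_cost(read, olr, mu, eta, learners=100):
    """One pass: read once, train 100 learners' worth of work, scaled by mu and eta."""
    return 1 * (read + mu * learners * olr) * eta


# Hypothetical unit costs: reading and encoding is 10x the cost of one update.
read, olr = 10.0, 1.0
old = traditional_cost(read, olr)                # 400 * (10 + 1) = 4400
new = revised_cost(read, olr, mu=0.5, eta=0.5)   # (10 + 0.5 * 100 * 1) * 0.5 = 30
```

With these made-up numbers the single-pass arrangement is more than 100 times cheaper, mostly because the expensive read is paid once instead of 400 times.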
38. The Upshot
• One machine can go fast
    • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
    • simple sample server farm