An introduction to Apache Mahout presented at Apache BarCamp DC, May 19, 2012
A brief introduction to the examples and links to more resources for further exploration.
2. Drew Farris
Committer to Apache Mahout since 2/2010
..not as active in the past year
Author: Taming Text
My Company: (and BarCamp DC Sponsor)
3. Mahout (as in hoot) or Mahout (as in trout)?
A scalable machine learning library
4. A scalable machine learning library
‘large’ data sets
Often Hadoop
..but sometimes not
10. A scalable machine learning library
Recommendation Mining
Clustering
Classification
Association Mining
A reasonable linear algebra library
A reasonable library of collections
Other Stuff
11. Getting Started
Check out & build the code
▪ git clone git://git.apache.org/mahout.git
▪ mvn install -DskipTests=true
▪ The tests take a looong time to run and are not needed for the initial build
Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
14. Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
▪ Grant’s IBM Developerworks Article
▪ http://ibm.co/LUbptg (Nov 2011)
▪ Others @ http://bit.ly/IZ6PqE (wiki)
15. Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Publications (http://bit.ly/IZ6PqE)
Mailing Lists
▪ user-subscribe@mahout.apache.org
▪ (http://bit.ly/L1GSHB)
▪ dev-subscribe@mahout.apache.org
▪ (http://bit.ly/JPeNoE)
16. Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
Mailing Lists
Books!
▪ Mahout in Action: http://bit.ly/IWMvaz
▪ Taming Text: http://bit.ly/KkODZV
17. Kicking the Tires in examples/bin
classify-20newsgroups.sh
cluster-reuters.sh
cluster-syntheticcontrol.sh
asf-email-examples.sh
19. Kicking the Tires in examples/bin
cluster-reuters.sh
Premise: Group Related News Stories
Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
20. Kicking the Tires in examples/bin
cluster-syntheticcontrol.sh
▪ Premise: Cluster time series data
▪ normal, cyclic, increasing, decreasing, upward, downward shift
▪ Algorithms:
▪ canopy, kmeans, fuzzykmeans, dirichlet, meanshift
See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
27. General Outline:
Data Transformation
▪ From Native format to…
▪ ..Sequence Files; Typed Key, Value pairs
▪ ..Labeled Vectors
Model Training
Model Evaluation
Lather, Rinse, Repeat
Production
Lather, Rinse, Repeat
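The data transformation step above can be pictured with a small sketch. This is plain Python, not Mahout code: the record layout, the `/<label>/<doc number>` key convention, and the function names are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical raw input: (label, text) records in some "native" format.
raw_records = [
    ("sports", "the team won the game"),
    ("finance", "the market fell"),
]

def to_key_value_pairs(records):
    """Step 1: typed key/value pairs, keyed here as /<label>/<doc number>."""
    return [(f"/{label}/{i}", text) for i, (label, text) in enumerate(records)]

def to_labeled_vectors(pairs):
    """Step 2: sparse term-count vectors, with the label recovered from the key."""
    vectors = []
    for key, text in pairs:
        label = key.split("/")[1]
        vectors.append((label, dict(Counter(text.split()))))
    return vectors

pairs = to_key_value_pairs(raw_records)
vectors = to_labeled_vectors(pairs)
```

Training and evaluation then operate on the labeled vectors rather than on the raw text.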
28. mahout seq2sparse
Tokenize Documents
Count Words
Make Partial/Merge Vectors
TFIDF
Make Partial/Merge TFIDF Vectors
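As a rough illustration of what these steps compute, here is a single-machine sketch in plain Python. It uses the textbook weighting tf × log(N / df); Mahout's own weighting formula and its partial/merge vector passes may differ, and the function names are made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Tokenize Documents: lowercase and split on whitespace (a simplification)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """docs maps document id -> text; returns document id -> {term: weight}."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Count Words: document frequency, the number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        vectors[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors
```

A term that appears in every document gets weight zero, which is the point of the IDF factor: common words carry little information about any one document.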
29. View Sequence Files with:
mahout seqdumper -i /path/to/sequence/file
Check out shortcuts in:
src/conf/driver.classes.props
Run classes with:
mahout org.apache.mahout.SomeCoolNewFeature …
Standalone vs. Distributed
Standalone mode is default
Set HADOOP_CONF_DIR to use Hadoop
MAHOUT_LOCAL will force standalone
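The three rules above can be written down as a small decision function. This sketch only models what the slide says; it is not the logic of the actual `mahout` launcher script, which may consult other variables, and the function name is hypothetical.

```python
import os

def execution_mode(env=None):
    """Return "standalone" or "hadoop" following the three rules on the slide."""
    env = os.environ if env is None else env
    if env.get("MAHOUT_LOCAL"):
        return "standalone"   # MAHOUT_LOCAL forces standalone
    if env.get("HADOOP_CONF_DIR"):
        return "hadoop"       # a Hadoop configuration directory was provided
    return "standalone"       # the default
```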
30. asf-email-examples.sh (recommendation)
Premise: Recommend Interesting Threads
User based recommendation
Boolean preferences based on thread contribution
Implies boolean similarity measure – tanimoto, log-likelihood
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
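With boolean preferences, each user reduces to the set of threads they contributed to, and the Tanimoto measure compares two such sets as intersection size over union size. A minimal plain-Python sketch (not Mahout's implementation; the thread ids are invented):

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two collections of thread ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(a & b) / len(union)

# Two users sharing two of four distinct threads.
similarity = tanimoto({"t1", "t2", "t3"}, {"t2", "t3", "t4"})
```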
31. Recommendation Steps
Convert Mail to Sequence Files
Convert Sequence Files to Preferences
Prepare Preference Matrix
Row Similarity Job
Recommender Job
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
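The last three steps can be sketched on one machine as follows. This is an illustration in plain Python, not the Mahout jobs themselves: it assumes matrix rows are users, uses Tanimoto as the row similarity, and all names and data are hypothetical.

```python
from collections import defaultdict

def build_preference_matrix(preferences):
    """Prepare Preference Matrix: (user, thread) pairs -> user -> set of threads."""
    matrix = defaultdict(set)
    for user, thread in preferences:
        matrix[user].add(thread)
    return dict(matrix)

def row_similarity(matrix):
    """Row Similarity: Tanimoto score for every pair of users."""
    sims = {}
    users = sorted(matrix)
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            union = matrix[u] | matrix[v]
            score = len(matrix[u] & matrix[v]) / len(union) if union else 0.0
            sims[(u, v)] = sims[(v, u)] = score
    return sims

def recommend(user, matrix, sims, top_n=3):
    """Recommender: score unseen threads by the similarity of users who have them."""
    scores = defaultdict(float)
    for other, threads in matrix.items():
        if other == user:
            continue
        weight = sims.get((user, other), 0.0)
        for thread in threads - matrix[user]:
            scores[thread] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thread for thread, score in ranked[:top_n] if score > 0]
```

Threads that only dissimilar users contributed to end up with a score of zero and are dropped from the result.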
32. asf-email-examples.sh (classification)
Premise: Predict project mailing lists for incoming messages
Data labeled based on the mailing list it arrived on
Hold back a random 20% of data for testing, the rest for training.
Algorithms: Naïve Bayes (Standard, Complementary), SGD
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
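The 20% holdout can be sketched in a few lines of plain Python. The function name and the fixed seed are assumptions for the example; the example script performs the split with Mahout's own tooling.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items, then hold back test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(100))
```

Shuffling before cutting matters: mail archives are usually ordered by date or by list, so taking the first 20% would give a biased test set.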
33. Classification Steps
Convert Mail to Sequence Files
Sequence Files to Sparse Vectors
Modify Sequence File Labels
Split into Training and Test Sets
Train the Model
Test the Model
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
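To make the train and test steps concrete, here is a toy multinomial Naive Bayes with add-one smoothing in plain Python. It is a sketch of the standard textbook algorithm, not Mahout's implementation (and not the complementary variant); the labels and tokens used with it are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Train the Model: count documents per label and terms per label."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, term_counts, vocabulary

def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    class_doc_counts, term_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)  # class prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore terms never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen label/term pairs.
            score += math.log((term_counts[label][token] + 1)
                              / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, labeled_docs):
    """Test the Model: fraction of held-out documents labeled correctly."""
    correct = sum(1 for label, tokens in labeled_docs
                  if classify(model, tokens) == label)
    return correct / len(labeled_docs)
```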
34. asf-email-examples.sh (clustering)
Premise: Grouping Messages by Subject
Same Prep as Classification
Different Algorithms: (kmeans, dirichlet, minhash)
12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334)
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
35. Clustering Steps
Convert Mail to Sequence Files
Sequence Files to Sparse Vectors
Run Clustering (iterate)
Dump Results
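The "iterate" in the clustering step is easiest to see with k-means. Below is a minimal in-memory sketch in plain Python, assuming dense tuples as points and caller-supplied starting centroids; Mahout's distributed implementation works on sparse vectors and differs in detail.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments
```

Dumping results then amounts to printing each centroid alongside the points assigned to it.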
37. Mahout in Action
Owen, Anil, Dunning and Friedman
http://bit.ly/IWMvaz
Taming Text
Ingersoll, Morton and Farris
http://bit.ly/KkODZV
Editor's Notes
We encounter recommendations everywhere today, from books to music to people.
Clustering combines related items into groups, like text documents organized by topic.
Classification is assigning classes or categories to new data based on what we know about existing data.
Association mining identifies items that frequently appear together, whether shopping cart contents or frequently co-occurring terms.
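A brute-force way to see the idea is to count how often each pair of items shows up in the same basket. This plain-Python sketch is only an illustration; it is not the algorithm Mahout uses for association mining, and the `min_support` parameter name is an assumption.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    """Count co-occurring item pairs and keep those seen at least min_support times."""
    pair_counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so each pair is counted once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

carts = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs", "milk"]]
pairs = frequent_pairs(carts)
```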
It’s not the fastest linear algebra library, but it’s high performance and uses a reasonably small memory footprint. Based upon COLT from CERN.
It’s not the fastest collections library, but it implements collections of primitive types that use open addressing. Fundamental stuff that’s missing from java.util, and things that weren’t previously available under a commercially friendly license.