1) Mahout is an Apache project that builds a scalable machine learning library.
2) It aims to support a variety of machine learning tasks such as clustering, classification, and recommendation.
3) Mahout algorithms are implemented using MapReduce to scale linearly with large datasets.
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
SDEC2011 Mahout - the what, the how and the why
1. The what, the why and the how
Speaker: Robin Anil, Apache Mahout PMC Member
SDEC, Seoul, Korea, June 2011
2. about:me
● Apache Mahout PMC member
● A ML Believer
● Author of Mahout in Action
● Software Engineer @ Google
*in no particular order
● Previous Life: Google Summer of Code student for 2 years.
3. about:agenda
● Introducing Mahout
● Why Mahout?
● Birds eye view of Mahout
● Classic Machine learning problems
● Short overview of Mahout clustering
5. Scale!
● Scale to large datasets
○ Hadoop MapReduce implementations that scales linearly
with data.
○ Fast sequential algorithms whose runtime doesn’t depend
on the size of the data
○ Goal: To be as fast as possible for any algorithm
● Scalable to support your business case
○ Apache Software License 2
● Scalable community
○ Vibrant, responsive and diverse
○ Come to the mailing list and find out more
6. about:why
● Lack community
● Lack scalability
● Lack documentations and examples
● Lack Apache licensing
● Are not well tested
● Are Research oriented
● Not built over existing production quality libraries
7. Birds eye view of Mahout
● If you want to:
○ Encode
○ Analyze
○ Predict
○ Get top best
8. about:encode
● Process data and convert to vectors
● Dictionary based v/s Randomizer based
● Get best signals for generating vectors
○ Collocation information (ngrams)
○ Lp Normalization
Data Engineering Camp
1:1.0 2:1.0 3:1.0
9. about:analyze
● Cluster and group data to
● Cluster data
○ K-Means
○ Fuzzy K-Means
○ Canopy
○ Mean Shift
○ Dirichlet process clustering
○ Spectral Clustering
● Co-cluster features / dimensionality reduction
○ Latent Dirichlet Allocation (LDA)
○ Singular Value Decomposition
11. about:lda
● Grouping similar or co-occurring features into a topic
○ Topic “Lol Cat”:
■ Cat
■ Meow
■ Purr
■ Haz
■ Cheeseburger
■ Lol
12. about:predict
● Classification and Recommendation
● Classification:
○ Use features learn model
○ Apply model on unknown
● Recommendation
○ Use pairwise(user-item) information to learn model
○ For a given user return highly likely items
14. about:classify
● Plenty of algorithms
○ Naïve Bayes
○ Complementary Naïve Bayes
○ Random Forests
○ Stocastic Gradient Descent(regression)
● Learn a model from a manually classified data
● Predict the class of a new object based on its
features and the learned model
15. about:recommend
● Predict what the user likes based on
○ His/Her historical behavior
○ Aggregate behavior of people similar to him
16. about:recommend
● Different types of recommenders
○ User based
○ Item based
○ Co-occurrence based
● Full framework for storage, online
online and offline computation of recommendations
● Like clustering, there is a notion of similarity in users or items
○ Cosine, Tanimoto, Pearson and LLR
19. about:parallel-fp-growth
● Identify the most commonly
occurring patterns from
○ Sales Transactions
buy “Milk, eggs and bread”
○ Query Logs
ipad -> apple, tablet, iphone
○ Spam Detection
Yahoo! http://www.slideshare.net/hadoopusergroup/mail-
antispam
20. summary:in-short
● Plenty of overlap. There is no one algorithm to fit all problems.
● Analyze and iterate fast
● MapReduce implementations makes these Fly!
21. Did you know?
● Apache Mahout uses Colt high-performance collections
○ Open HashMaps instead of Chained HashMaps
○ Arrays of Primitive types
○ Available as Mahout Math library
● Mahout Vector uses integer encoding techniques to reduce
space.
● Fastest classifier in Mahout doesn’t use MapReduce!
○ And it learns online
○ And It doesn’t look at all the data.
22. How to use mahout
● Command line launcher bin/mahout
● See the list of tools and algorithms by running bin/mahout
● Run any algorithm by its shortname:
○ bin/mahout kmeans –help
● By default runs locally
● export HADOOP_HOME = /pathto/hadoop-0.20.2/
○ Runs on the cluster configured as per the conf files in the
hadoop directory
● Use driver classes to launch jobs:
○ KMeansDriver.runjob(Path input, Path output …)
23. Clustering Walkthrough (tiny example)
● Input: set of text files in a directory
● Download Mahout and unzip
○ mvn install
○ bin/mahout seqdirectory –i <input> –o <seq-output>
○ bin/mahout seq2sparse –i seq-output –o <vector-output>
○ bin/mahout kmeans –i<vector-output>
-c <cluster-temp> -o <cluster-output> -k 10 –cd 0.01 –x 20
24. Clustering Walkthrough (a bit more)
● Use bigrams: -ng 2
● Prune low frequency: –s 10
● Normalize: -n 2
● Use a distance measure : -dm org.apache.mahout.common.
distance.CosineDistanceMeasure
25. Clustering Walkthrough (viewing results)
● bin/mahout clusterdump
–s cluster-output/clusters-9/part-00000
-d vector-output/dictionary.file-*
-dt sequencefile -n 5 -b 100
● Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997
26. Road to Mahout v1.0
● Guiding Principles
○ Use the stable Hadoop API
○ Make vector the de-factor input format for all parts of code
○ Provide stable API for developers
27. Get Started
● http://mahout.apache.org
● dev@mahout.apache.org - Developer mailing list
● user@mahout.apache.org - User mailing list
● Check out the documentations and wiki for quickstart
● http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
29. Thanks to
● Apache Foundation
● Mahout Committers
● Google Summer of Code Organizers
● And Students
● Open source!
● And NHN for hosting this at Seoul!
● and the wonderful engineers (present and future) in the room.