3. What is Machine Learning?
NOT!
[Image: HAL 9000] Or? [Image: the Terminator]
http://en.wikipedia.org/wiki/Image:Hal-9000.jpg
http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg
6. Definition
• “Machine Learning is programming
computers to optimize a performance
criterion using example data or past
experience”
– Introduction to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
– Draws on many other fields: computer science, biology,
math, psychology, etc.
7. Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
– People still can help
8. Types
• Supervised
– Using labeled training data, create a
function that predicts the output for unseen
inputs
• Unsupervised
– Using unlabeled data, create a function
that describes structure in the data (e.g.
groupings), with no known outputs to learn from
• Semi-Supervised
– Uses labeled and unlabeled data
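As a rough illustration of the supervised case, the sketch below takes a few labeled points and predicts the label of an unseen input with a 1-nearest-neighbor rule. The data, labels, and function name are made up for illustration and are not taken from Mahout.

```python
import math

# Hypothetical labeled training data: (features, label) pairs.
TRAINING = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def predict(features, training=TRAINING):
    """Return the label of the training example closest to `features`."""
    nearest = min(training, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

print(predict((2.0, 1.0)))   # closest labeled point is (1.0, 1.0)
print(predict((7.5, 8.0)))   # closest labeled point is (8.0, 9.0)
```

An unsupervised method would receive only the feature tuples, without the labels, and would have to find the two groups on its own.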
9. Classification/Categorization
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy
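Spam filtering is a convenient case for a small worked example. The sketch below is a minimal multinomial Naive Bayes classifier with add-one smoothing, written from scratch. The training messages and function names are hypothetical, and this is not Mahout's implementation.

```python
import math
from collections import Counter

# Hypothetical labeled training messages.
TRAIN = [
    ("spam", "win cash now"),
    ("spam", "cheap cash offer"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting tomorrow"),
]

def train(examples):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in examples:
        doc_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]  # 0 for words unseen in this label
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(classify("cash offer now", model))
print(classify("notes from the meeting", model))
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters once messages get longer than these toy examples.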
10. Clustering
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
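To make "natural groupings" concrete, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) on 2-D points. The points, starting centroids, and iteration count are assumptions chosen for illustration; a scalable version would distribute the assignment step.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is one reason a cheaper pass is sometimes used first to pick them.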
11. Collaborative Filtering
• Recommend people and products
– User-User
• A user with similar tastes likes X, so you might too
– Item-Item
• People who bought X also bought Y
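The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. The purchase data and function names below are hypothetical, and real systems typically normalize these counts rather than using them raw.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase histories: user -> set of items bought.
PURCHASES = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"book", "lamp"},
    "dave": {"pen", "ink"},
}

def cooccurrence(purchases):
    """Count how often each ordered pair of items shares a buyer."""
    counts = defaultdict(Counter)
    for items in purchases.values():
        for x, y in permutations(items, 2):
            counts[x][y] += 1
    return counts

def also_bought(item, counts, n=2):
    """Items most often bought together with `item`, best first.

    Ties are broken alphabetically so the output is deterministic.
    """
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]

counts = cooccurrence(PURCHASES)
print(also_bought("book", counts))
```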
12. Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking
13. Other
• Image Analysis
• Robotics
• Games
• Higher-level natural language processing
• Many, many others
14. What is Apache Mahout?
• A Mahout is an elephant
trainer/driver/keeper, hence…
[Image] Hadoop (and other distributed techniques)
+ Machine Learning
= Mahout
15. What?
• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault tolerance
• Thus, Mahout’s Goal is:
– Scalable machine learning under the Apache
License
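For readers new to Map/Reduce, the sketch below imitates the programming model in a single Python process using the classic word-count example. It is not the Hadoop API; the function names are made up, and the `shuffle` step stands in for the grouping a framework performs between the map and reduce phases.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["the elephant", "the mahout drives the elephant"]))
```

Because each mapper call sees only one line and each reducer call sees only one key, both phases can be spread across many machines, which is where the scalability comes from.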
16. Why Mahout?
• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
• Personal: Learn more ML
• Intelligent Apps are the Present and Future
– See the Hadoop talks tomorrow and Friday!
• Goal: Overcome gaps the Apache Way!
17. Current Status
• Close to Initial release
– Focused on examples, docs, bug fixes
• What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering
• Canopy/K-Means/Fuzzy K-Means/Mean-shift
– Classifiers
• Naïve Bayes
• Complementary NB
– Evolutionary
• Integration with Watchmaker for fitness function
19. Taste: Movie Recommendations
• Given ratings by users of movies,
recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo
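The task on this slide can be sketched as a tiny user-based recommender: score each movie the target user has not rated by other users' ratings, weighted by how similar those users are. This does not use the Taste API; the ratings, the choice of cosine similarity, and the function names are all assumptions for illustration.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben": {"Alien": 5, "Brazil": 5, "Dune": 4},
    "cat": {"Alien": 1, "Casablanca": 5, "Emma": 4},
}

def similarity(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank movies the user has not rated by similarity-weighted rating sum."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann", RATINGS))
```

Here ann's ratings agree closely with ben's and poorly with cat's, so ben's unseen movie ranks ahead of cat's.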
20. Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+
• Each clustering implementation has an example
Job, ready to run, in
<MAHOUT_HOME>/examples
– org.apache.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
21. Classification: NB and CNB Examples
• 20 Newsgroups
– http://cwiki.apache.org/confluence/display/MA
• Wikipedia
– http://cwiki.apache.org/confluence/display/MA
23. What’s Next?
• Release 0.1!
• Shared Amazon Images (others?)
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• HBase and Hama support
• Normalize I/O format for data
• Solr Integration (SOLR-769)
• Other Algorithms: SVM, Linear Regression, etc.
24. When, Where, Who
• When? Now!
– Mahout is growing
• Who? You!
– We want Java programmers who:
• Are comfortable with math
• Like to work on large, hard problems
• Where?
– http://lucene.apache.org/mahout
– http://cwiki.apache.org/MAHOUT
– mahout-{user|dev}@lucene.apache.org
25. Resources
• “Programming Collective Intelligence”
by Toby Segaran
• “Data Mining - Practical Machine
Learning Tools and Techniques” by
Ian H. Witten and Eibe Frank
• Hadoop - http://hadoop.apache.org
• http://mloss.org/software/