- Mahout is an Apache project that builds scalable machine learning libraries for large datasets. It includes algorithms for classification, clustering, recommendation, and other tasks.
- The Mahout recommender system uses collaborative filtering to recommend items to users based on their preferences and the preferences of similar users. It has item-based and user-based approaches.
- The deck walks through an example: running the Mahout recommender on Hadoop against the Netflix dataset to produce movie recommendations. The job took about 16 minutes on the hardware configuration described.
2. “A lot of times, people don't know what they
want until you show it to them.”
Steve Jobs
“We don't make money when we sell things;
we make money when we help customers
make purchase decisions.”
Jeff Bezos, Amazon
Why is recommendation important?
3. An Apache project to build scalable machine learning libraries
● Focused on large data sets
● Adaptation of standard machine learning algorithms
● Runs on Apache Hadoop (MapReduce paradigm)
… or on a non-Hadoop node
4. Who is using Mahout?
Source: https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
6. Mahout Recommender
Collaborative filtering
● People often get the best recommendations from someone with similar taste
● People tend to like things that are similar to other things they like
● There are patterns in people's likes and dislikes
John: movie1, movie2, movie4, movie5
Bob: movie1, movie2, movie42
Will Bob like movie4 and movie5?
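The John and Bob question can be sketched in a few lines. This is a minimal illustration of user-based collaborative filtering, not Mahout code: the split of movies between John and Bob is an assumed reading of the slide, the Tanimoto (Jaccard) coefficient is just one possible similarity measure, and the function names are hypothetical.

```python
# Liked-movie sets. Which movie belongs to whom is an assumed reading of the slide.
likes = {
    "John": {"movie1", "movie2", "movie4", "movie5"},
    "Bob": {"movie1", "movie2", "movie42"},
}


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Score items the user has not seen by the similarity of users who liked them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = tanimoto(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend("Bob", likes))
```

With only two users, Bob's candidates are simply the movies John liked that Bob has not seen, each weighted by how much the two tastes overlap (2 shared movies out of 5 distinct ones).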
7. Mahout Recommender
Available recommenders
● Item-based
● User-based
Execution modes
● Taste: online but not distributed
● Hadoop: offline (batch) but distributed
Parameters
● Many coefficients to calculate user and item similarity and neighborhood
● Data model abstractions
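As an illustration of one such similarity coefficient, here is a sketch of the log-likelihood ratio computed on a 2x2 co-occurrence table for two items. It is the standard statistical formula, written from scratch rather than copied from Mahout, and the function and argument names are hypothetical. The counts mean: `k11` users liked both items, `k12` liked only the first, `k21` liked only the second, `k22` liked neither.

```python
from math import log


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of co-occurrence counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


# Two items that always appear together score high; independent items score near 0.
print(log_likelihood_ratio(10, 0, 0, 10))
print(log_likelihood_ratio(1, 1, 1, 1))
```

A high value means the two items co-occur far more (or less) often than chance would predict, which is why this kind of score is useful for item-to-item similarity on sparse preference data.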
9. 1st try!
Movie recommendation
Netflix dataset (http://www.netflixprize.com/)
● # of user tastes: 2,817,131
● # of movies: 17,770
● # of users: 472,891
Environment and performance
● Hadoop pseudo-distributed
● Computer
  ● Intel® Core™ i5-3317U CPU @ 1.70GHz × 4
  ● 6 GB RAM
● Total time: ~16 minutes
10. How to run?
1. Copy the input file to HDFS (Hadoop Distributed File System)
    hadoop fs -put qualifying.txt /netflix/input/data.txt
2. Run the recommender
    hadoop jar core/target/mahout-core-0.8-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
      -Dmapred.input.dir=/netflix/input/data.txt \
      -Dmapred.output.dir=/netflix/output \
      --numRecommendations 10 \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD
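Once the job finishes, the results land as text files under the output directory. The sketch below reads one result line, assuming the output format is a user ID, a tab, and a bracketed list of `itemID:score` pairs; that format is an assumption worth checking against the actual output. The sample line is made up for illustration and `parse_recommendations` is a hypothetical helper.

```python
def parse_recommendations(line):
    """Split one output line into (user_id, [(item_id, score), ...]).

    Assumed line format: "<userID>\t[<itemID>:<score>,<itemID>:<score>,...]"
    """
    user_id, _, items = line.rstrip("\n").partition("\t")
    recommendations = []
    for pair in items.strip("[]").split(","):
        if not pair:
            continue  # user with an empty recommendation list
        item_id, _, score = pair.partition(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_id), recommendations


# Made-up sample line, not real job output.
sample = "42\t[17:4.5,103:4.0]"
print(parse_recommendations(sample))
```

The files can be copied out of HDFS with `hadoop fs -get /netflix/output` and then fed to the parser line by line.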