Indic threads pune12-recommenders-apache-mahout
1. How to Build a Recommendation Engine Using Apache Mahout
Viraj Paripatyadar
GS Lab
2. Contents
• A recommendation problem
• What is a recommender
• Building a recommender using Mahout
• Tips and tweaks
• Recommender considerations
3. A book store
• Sells books:
• By various authors
• Of various categories
• On different subjects
• From various publishers
• Readers/buyers are asked to rate
• Readers/buyers can provide reviews
You walk into the store
(buy something for a friend)
4. The store owner
• Asks you what:
• your friend reads (already owns)
• your friend usually likes more
• Has data on what:
• his customers buy
• his customers rate and review
• Uses a few strategies
5. 1 - Find similar books
Depending on which books your friend has, pick
books:
• by the same author
• on the same/similar subject/s
• in the same category
• from the same publication
(those with highest sales numbers)
6. 2 - Find books with similar readership
• Define some similarity
• e.g. two books are as similar as the number of readers
rating both of them
• Define some limit of relevance
• e.g. only consider books that have more than 4 readers
in common
• Look for all books which are similar to books
your friend owns
Pick books from this set that your friend doesn’t
own
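The strategy above can be sketched in a few lines. This is only an illustration, not the store owner's actual method: it uses the sample ratings shown on the example-data slide, treats user 1 as the friend, and the function names and the `limit` parameter are hypothetical.

```python
from collections import defaultdict

# Sample ratings from the example-data slide: user -> {book: rating}
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def readers_by_book(ratings):
    """Map each book to the set of users who rated it."""
    readers = defaultdict(set)
    for user, books in ratings.items():
        for book in books:
            readers[book].add(user)
    return readers


def similar_unowned_books(ratings, friend, limit):
    """Return (book, score) pairs for books the friend does not own.

    The score is the largest number of readers a book shares with any
    book the friend owns; only scores strictly above `limit` are kept.
    """
    readers = readers_by_book(ratings)
    owned = set(ratings[friend])
    scores = {}
    for book, book_readers in readers.items():
        if book in owned:
            continue
        best = max(len(book_readers & readers[o]) for o in owned)
        if best > limit:
            scores[book] = best
    # Highest score first, book id as tie-breaker
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(similar_unowned_books(RATINGS, friend=1, limit=2))
```

With only five readers in the sample, a limit of 4 leaves nothing to recommend, so the call above uses a lower limit of 2.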
7. 3 - Find people with similar tastes
• Define some similarity
• e.g. two people are as similar as the number of books
they like from the same category
• Define some limit of relevance
• e.g. only consider the 3 top people when ordered
according to how similar they are to your friend
• Look for users similar to your friend and see
what they read
Pick books which these people like and your
friend doesn’t own
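A rough sketch of this strategy follows. The slide does not say how books map to categories or what counts as "liking" a book, so both are assumptions here: the `CATEGORY` table is invented purely for illustration, "liked" means a rating of 3 stars or more in the sample data, and the similarity is one possible reading of the slide's definition (the per-category overlap of liked books).

```python
from collections import Counter

# Hypothetical book -> category mapping (not part of the original deck)
CATEGORY = {
    101: "fiction", 102: "fiction", 103: "history", 104: "fiction",
    105: "science", 106: "history", 107: "science",
}

# Books each person likes: sample-data ratings of 3.0 or more (assumed cut-off)
LIKED = {
    1: {101, 102},
    2: {103},
    3: {104, 105, 107},
    4: {101, 103, 104, 106},
    5: {101, 102, 104, 105, 106},
}

OWNED_BY_FRIEND = {101, 102, 103}  # the friend is user 1


def category_profile(books):
    """Count liked books per category."""
    return Counter(CATEGORY[b] for b in books)


def similarity(a, b):
    """Number of liked books the two people share per category, summed."""
    overlap = category_profile(LIKED[a]) & category_profile(LIKED[b])
    return sum(overlap.values())


def top_similar(friend, n):
    """The n people most similar to the friend (ties broken by user id)."""
    others = [u for u in LIKED if u != friend]
    return sorted(others, key=lambda u: (-similarity(friend, u), u))[:n]


def recommend(friend, owned, n=3):
    """Books liked by the top-n similar people that the friend does not own."""
    votes = Counter()
    for person in top_similar(friend, n):
        votes.update(LIKED[person] - owned)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [book for book, _ in ranked]


print(top_similar(1, 3))
print(recommend(1, OWNED_BY_FRIEND))
```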
8. Example data
(userID,bookID,rating; one line per user)
1,101,5.0 1,102,3.0 1,103,2.5
2,101,2.0 2,102,2.5 2,103,5.0 2,104,2.0
3,101,2.5 3,104,4.0 3,105,4.5 3,107,5.0
4,101,5.0 4,103,3.0 4,104,4.5 4,106,4.0
5,101,4.0 5,102,3.0 5,103,2.0 5,104,4.0 5,105,3.5 5,106,4.0
• Your friend owns three books:
• Gave 5 stars to book 101 (likes hugely and talks about it all the time)
• Gave 3 stars to book 102 (has shown some liking to it)
• Gave 2.5 stars to book 103 (has read it, but didn’t say bad things about it)
Now we need to recommend books that your friend hasn’t seen
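As a starting point, the triples above can be loaded into a per-user structure and the candidate books listed. This is a minimal sketch in plain Python rather than Mahout code; it assumes one `userID,bookID,rating` triple per line, and the function names are hypothetical.

```python
import csv
from io import StringIO

# The sample data from the slide, one userID,bookID,rating triple per line
SAMPLE = """\
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
"""


def load_preferences(text):
    """Parse CSV triples into {user: {book: rating}}."""
    prefs = {}
    for user, book, rating in csv.reader(StringIO(text)):
        prefs.setdefault(int(user), {})[int(book)] = float(rating)
    return prefs


def unseen_books(prefs, user):
    """Books rated by anyone that `user` has not rated yet."""
    all_books = {book for ratings in prefs.values() for book in ratings}
    return sorted(all_books - set(prefs[user]))


prefs = load_preferences(SAMPLE)
print(unseen_books(prefs, 1))  # candidates for the friend (user 1)
```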
14. What is Apache Mahout
• Apache Mahout
• A machine learning library
• Works with Apache Hadoop
• Use cases:
• Recommenders
• Clustering
• Classification
15. Recommenders in Mahout
• Recommenders use data culled from user
behavior
• Recommending using Mahout
• Similarity between users or items
• Expressed as a number, typically between 0 and 1 (correlation-based measures such as Pearson range from -1 to 1)
• Neighborhood of users/items
• Recommendation using this info and an algorithm
• Generic
• Specialized
16. Similarity
• Various algorithms:
• Euclidean distance
• Pearson correlation
• Cosine measure
• Spearman correlation
• Tanimoto coefficient
• Log-likelihood
• Effectiveness dependent on the input data
• Influences running time and memory
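To make the dependence on input data concrete, here is a small sketch of three of the measures listed above, written in plain Python rather than taken from Mahout. The `1 / (1 + distance)` mapping for the Euclidean case is one common convention and is an assumption here; the ratings are four users from the sample data.

```python
from math import sqrt

# Ratings for four users from the deck's sample data: {book: rating}
U1 = {101: 5.0, 102: 3.0, 103: 2.5}
U3 = {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0}
U4 = {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0}
U5 = {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0}


def common_items(a, b):
    return sorted(set(a) & set(b))


def pearson(a, b):
    """Pearson correlation over co-rated items; None when undefined."""
    items = common_items(a, b)
    if len(items) < 2:
        return None
    xs = [a[i] for i in items]
    ys = [b[i] for i in items]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over co-rated items (assumed mapping)."""
    items = common_items(a, b)
    if not items:
        return None
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in items))
    return 1.0 / (1.0 + distance)


def tanimoto(a, b):
    """Overlap of rated items, ignoring the rating values."""
    union = set(a) | set(b)
    if not union:
        return None
    return len(set(a) & set(b)) / len(union)


for name, other in (("user 3", U3), ("user 4", U4), ("user 5", U5)):
    print(name, pearson(U1, other), euclidean_similarity(U1, other), tanimoto(U1, other))
```

For user 1 against user 5 the Pearson value is about 0.94, while against user 3 it is undefined because the two share only one rated book. The measures can rank the same neighbors differently, which is why the choice depends on the data.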
17. Neighborhood
• Nearest N neighborhood (say, 4):
[Diagram: user U with neighboring users 1 to 5]
• Threshold neighborhood (say, > 0.8):
[Diagram: user U with neighboring users 1 to 5]
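The two neighborhood styles can be compared with a short sketch. The similarity values below are made up for illustration (they are not from the deck), and the function names are hypothetical; in Mahout the corresponding classes are `NearestNUserNeighborhood` and `ThresholdUserNeighborhood`.

```python
# Hypothetical similarities between user U and users 1 to 5
SIMILARITY_TO_U = {1: 0.95, 2: 0.85, 3: 0.60, 4: 0.90, 5: 0.30}


def ranked(similarities):
    """Users ordered from most to least similar (ties broken by user id)."""
    return sorted(similarities, key=lambda u: (-similarities[u], u))


def nearest_n(similarities, n):
    """Fixed-size neighborhood: the n most similar users."""
    return ranked(similarities)[:n]


def above_threshold(similarities, threshold):
    """Variable-size neighborhood: every user more similar than the threshold."""
    return [u for u in ranked(similarities) if similarities[u] > threshold]


print(nearest_n(SIMILARITY_TO_U, 4))          # always four users
print(above_threshold(SIMILARITY_TO_U, 0.8))  # only the close ones
```

A nearest-N neighborhood always returns N users even if some are only weakly similar, whereas a threshold neighborhood may return few users or none.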
18. Recommender
• Recommenders
• Generic recommender
• User based
• Item based
• Slope-one recommender
• Singular Value Decomposition based
• Linear interpolation based
• Cluster-based
• Recommender rescorer
• Recommender evaluator
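To illustrate what a user-based recommender does with a neighborhood, here is a simplified sketch: each unseen item is scored by the similarity-weighted average of the neighbors' ratings. It is written in plain Python and is not Mahout's implementation. The neighbor weights are assumed to be precomputed (they are roughly the Pearson values of users 4 and 5 against user 1 in the sample data), and the function name is hypothetical.

```python
# Sample ratings for the friend (user 1) and two neighbors: {book: rating}
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}

# Assumed, precomputed similarity of each neighbor to user 1
NEIGHBOR_WEIGHTS = {4: 1.0, 5: 0.94}


def estimate_preferences(ratings, user, neighbor_weights):
    """Estimate ratings for items the user has not seen.

    Each estimate is the weighted average of the neighbors' ratings,
    using their similarity to the user as the weight.
    """
    seen = set(ratings[user])
    weighted_sum = {}
    weight_total = {}
    for neighbor, weight in neighbor_weights.items():
        for item, rating in ratings[neighbor].items():
            if item in seen:
                continue
            weighted_sum[item] = weighted_sum.get(item, 0.0) + weight * rating
            weight_total[item] = weight_total.get(item, 0.0) + weight
    estimates = {
        item: weighted_sum[item] / weight_total[item]
        for item in weighted_sum
        if weight_total[item] > 0
    }
    # Best estimate first, item id as tie-breaker
    return sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))


for item, score in estimate_preferences(RATINGS, 1, NEIGHBOR_WEIGHTS):
    print(item, round(score, 2))
```

Under these assumptions book 104 comes out on top with an estimate of about 4.26, followed by 106 and 105.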
19. A real-life Web application
• News aggregator-cum-reader
• Fetches news from a news service
• Shows the news in a uniform UI
• Lets readers read, like/dislike and comment on news
• Link social networks and share
• Make this a personalized newspaper
• Track user actions
• Derive and store preferences
• Generate recommendations
• Leverage social accounts, etc.
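One way to turn tracked actions into stored preferences is to give each action type a weight and sum the weights per reader and article. The sketch below shows the idea only; the action names and weights are hypothetical and would need tuning against real usage data.

```python
from collections import defaultdict

# Hypothetical weights per tracked action (not from the original deck)
ACTION_WEIGHTS = {"read": 1.0, "like": 2.0, "comment": 1.5, "dislike": -2.0}


def derive_preferences(actions):
    """Sum action weights into one preference per (user, article)."""
    prefs = defaultdict(float)
    for user, article, action in actions:
        prefs[(user, article)] += ACTION_WEIGHTS.get(action, 0.0)
    return dict(prefs)


def to_csv_lines(prefs):
    """Render preferences as userID,itemID,preference lines."""
    return [f"{user},{article},{value:.1f}"
            for (user, article), value in sorted(prefs.items())]


# Example log of (user, article, action) events
actions = [
    (1, 10, "read"),
    (1, 10, "like"),
    (2, 10, "read"),
    (2, 11, "dislike"),
]

for line in to_csv_lines(derive_preferences(actions)):
    print(line)
```

The output uses the same `userID,itemID,preference` layout as the book-store sample data, so it could serve as input to a recommender.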
20. Overall design
[Diagram components:]
• Clients, all connecting over REST: third-party applications, phone/tablet applications, Web application
• Controller API (REST)
• User and application data (MySQL)
• News aggregation and storage (HBase)
• Preferences and recommender (Mahout)
21. Recommender
[Diagram components:]
• REST service (Grizzly, Tomcat): fetch recommendations, input user actions
• Recommender (offline, run periodically)
• Input: database table dump from MySQL
22. How to extract data – one dimension
[Chart: news article readership (logarithmic scale, 1 to 10000) against number of news articles (1 to 9). Data labels: 4299, 511, 128, 51, 13, 4, 4, 2, 1.]
23. How to extract data – add dimensions
[Chart: news article readership and topic readership (logarithmic scale, 1 to 10000) against number of news articles / topics (1 to 57).]
24. How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each (0 to 40), against number of news articles/topics (0 to 800).]
25. How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each (0 to 9), against number of news articles/topics (5 to 85).]
26. How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each (0 to 3.5), against number of news articles/topics (95 to 395).]
27. Learnings
• Know thy user
• Frequency of visits
• Preference logic with respect to the user
• Know thy items
• Should have enough items per user
• Maximize items per action
• Should have enough intersections
• Should not be transient
• Use tweaking abilities
• Sharpen the saw