A new version of the collaborative filtering talk:
- Presenting the Netflix Prize story
- Discussing User-based and Item-based collaborative filtering, and various similarity metrics
- Discussing how to Map-Reduce the calculation
Collaborative filtering intro - Full
1. WE KNOW YOU WILL LIKE THIS
Introduction to Recommendation Engines
Monday, January 14, 13
2. ML
Supervised (X + Y):
  Regression: Y is numeric (e.g. Turnout: 30, 12, 25)
  Classification: Y is categorical (e.g. Class: Spam, Not Spam, Spam)
Unsupervised (X only):
  Clustering
  Hierarchical Clustering
3.
         Maraboo  Karnaf  Ima Adama  Liv
Idan     5        ?       3          ?
Shahar   4        3       ?          2
Gadi     ?        1       ?          5
Content/Model-Based
(Agnostic, Behavioural)
(predicting the rating)
Recommendation
15.
         Maraboo  Karnaf  Ima Adama  Liv
Idan     1        ?       1          ?
Shahar   1        1       ?          1
Gadi     ?        1       ?          1

         Maraboo  Karnaf  Ima Adama  Liv
Idan     5        ?       3          ?
Shahar   4        3       ?          2
Gadi     ?        1       ?          5
16.
         Maraboo  Karnaf  Ima Adama  Liv
Idan     1        ?       1          ?
Shahar   1        1       ?          1
Gadi     ?        1       ?          1
17.
         Maraboo  Karnaf  Ima Adama  Liv
Idan     5        ?       3          ?
Shahar   4        3       ?          2
Gadi     ?        1       ?          5
20. Jaccard Distance: “We share 5 preferences out of 7!”
Euclidean Distance
Cosine Similarity
Pearson’s Correlation Distance (1 - correlation): “Our preferences go in the same direction!” (but only 2 such preferences do...)
Log-Likelihood Ratio: Measure of “Surprise” at correlation
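A minimal sketch of the first four measures above, in plain Python. The function names and the sample vectors are illustrative assumptions, not code from the talk; the log-likelihood ratio is left out because it needs a full contingency table. The Pearson example uses the two items that Shahar and Gadi both rated in the ratings table (Karnaf and Liv), which shows how little evidence two shared preferences give.

```python
import math


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of liked items."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length rating vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def pearson_correlation(x, y):
    """Pearson's r over co-rated items (0.0 if either vector is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sd_x = math.sqrt(sum((p - mean_x) ** 2 for p in x))
    sd_y = math.sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# "We share 5 preferences out of 7": similarity 5/7, distance 2/7.
mine = {1, 2, 3, 4, 5, 6}
yours = {1, 2, 3, 4, 5, 7}
shared_distance = jaccard_distance(mine, yours)

# Binary rows for Idan and Shahar, with "?" treated as 0 (an assumption).
idan, shahar = [1, 0, 1, 0], [1, 1, 0, 1]
cos_idan_shahar = cosine_similarity(idan, shahar)

# Shahar and Gadi co-rated only Karnaf (3 vs 1) and Liv (2 vs 5).
r_shahar_gadi = pearson_correlation([3, 2], [1, 5])
pearson_distance = 1 - r_shahar_gadi
```

With only two co-rated items, Pearson's r can only come out as exactly +1 or -1 (or undefined), which is the caveat the slide hints at.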
22. Case study: Amazon
100,000,000 users
2,000,000 items
Each user expresses preference for 10 items
Each item has 500 reviews
User-Based CF: 100,000,000 x 100,000,000 similarity matrix; 2,000,000 x 500 sum terms
Item-Based CF: 2,000,000 x 2,000,000 similarity matrix; 2,000,000 x 10 sum terms
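The arithmetic behind this comparison can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures on the slide; the variable names are illustrative.

```python
USERS = 100_000_000
ITEMS = 2_000_000
ITEMS_PER_USER = 10      # each user expresses preference for 10 items
REVIEWS_PER_ITEM = 500   # each item has 500 reviews

# Both views describe the same number of observed preferences.
total_preferences_by_user = USERS * ITEMS_PER_USER
total_preferences_by_item = ITEMS * REVIEWS_PER_ITEM

# Size of the dense similarity matrix in each approach.
user_similarity_entries = USERS * USERS
item_similarity_entries = ITEMS * ITEMS
matrix_size_ratio = user_similarity_entries // item_similarity_entries

# Sum terms needed when scoring, as listed on the slide.
user_based_sum_terms = ITEMS * REVIEWS_PER_ITEM
item_based_sum_terms = ITEMS * ITEMS_PER_USER
sum_terms_ratio = user_based_sum_terms // item_based_sum_terms
```

Under these numbers the user-user matrix has 2,500 times as many entries as the item-item matrix, and the user-based approach needs 50 times as many sum terms.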
23. Interpretability
“People who go to La Colombe Torrefaction & FourSquare HQ tend to go here”
“Coffee Shop connoisseurs tend to come here”
24. Evaluation
Rating Problem: Predictive accuracy (regression) metrics
RMSE, MAE, etc.
Preference (Binary) Problem: Classification accuracy (IR) metrics
Accuracy, Precision, Recall, F-1, ROC, etc.
Benchmark vs. ‘random’ and ‘popular’
Ranking accuracy metrics: Similarity of permutations
Pearson’s correlation, Spearman’s rho, Kendall’s tau
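A small sketch of the first two metric families, assuming predicted ratings come as a list aligned with the actual ratings, and recommendations come as a set of item names. The helper names and sample values are hypothetical; ranking metrics are not covered here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for the rating (regression) problem."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error for the rating (regression) problem."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f1(recommended, relevant):
    """IR-style metrics for the binary preference problem."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical held-out ratings and the model's predictions for them.
actual_ratings = [5, 3, 4]
predicted_ratings = [4, 3, 2]
rating_rmse = rmse(actual_ratings, predicted_ratings)
rating_mae = mae(actual_ratings, predicted_ratings)

# Hypothetical top-3 recommendation list versus what the user really liked.
precision, recall, f1 = precision_recall_f1(
    {"Karnaf", "Liv", "Maraboo"}, {"Karnaf", "Ima Adama"}
)
```

The same functions can be run on a 'random' or 'popular' recommender to get the benchmark numbers the slide mentions.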
26. Challenges
Cold-start problems (new item, new user)
“Black” and “Grey” sheep
Exploration-exploitation and reinforcement learning
Scale
28. MapReduce Similarity Calculation
“User-based”

A:
           Maraboo  Karnaf  Ima Adama  Liv
Idan       1        ?       1          ?
Shahar     1        1       ?          1
Gadi       ?        1       ?          1
* u_i (Gadi):
Maraboo    ?
Karnaf     1
Ima Adama  ?
Liv        1
= A u_i (user similarity vector):
Idan       0
Shahar     2
Gadi       2

A^T:
           Idan  Shahar  Gadi
Maraboo    1     1       ?
Karnaf     ?     1       1
Ima Adama  1     ?       ?
Liv        ?     1       1
* A u_i:
Idan       0
Shahar     2
Gadi       2
= A^T (A u_i):
Maraboo    2
Karnaf     4
Ima Adama  0
Liv        4
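The user-based calculation on this slide can be reproduced with two matrix-vector products. The sketch below is plain Python on a single machine, not a MapReduce job, and it assumes that "?" is treated as 0, which matches the numbers on the slide. Helper names are illustrative.

```python
USERS = ["Idan", "Shahar", "Gadi"]
ITEMS = ["Maraboo", "Karnaf", "Ima Adama", "Liv"]

# Binary preference matrix A (users x items); "?" is stored as 0.
A = [
    [1, 0, 1, 0],  # Idan
    [1, 1, 0, 1],  # Shahar
    [0, 1, 0, 1],  # Gadi
]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# u_i: the preference vector of the user we recommend for (Gadi).
u_gadi = A[USERS.index("Gadi")]

# Step 1: A u_i gives Gadi's similarity to every user.
user_similarity = mat_vec(A, u_gadi)

# Step 2: A^T (A u_i) weights every item by the similar users who liked it.
item_scores = mat_vec(transpose(A), user_similarity)
scores_by_item = dict(zip(ITEMS, item_scores))
```

Here `user_similarity` is `[0, 2, 2]` and `scores_by_item` maps Maraboo to 2, Karnaf to 4, Ima Adama to 0 and Liv to 4, as on the slide.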
29. MapReduce Similarity Calculation
“Item-Based”

A^T:
           Idan  Shahar  Gadi
Maraboo    1     1       ?
Karnaf     ?     1       1
Ima Adama  1     ?       ?
Liv        ?     1       1
* A:
           Maraboo  Karnaf  Ima Adama  Liv
Idan       1        ?       1          ?
Shahar     1        1       ?          1
Gadi       ?        1       ?          1
= A^T A (item similarity matrix):
           Maraboo  Karnaf  Ima Adama  Liv
Maraboo    2        1       1          1
Karnaf     1        2       0          2
Ima Adama  1        0       1          0
Liv        1        2       0          2

A^T A * u_i (Gadi):
Maraboo    ?
Karnaf     1
Ima Adama  ?
Liv        1
= (A^T A) u_i:
Maraboo    2
Karnaf     4
Ima Adama  0
Liv        4

Similarity of item x to item y is <i_x, i_y>
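A sketch of the item-based variant: build the item similarity matrix A^T A once, then multiply it by one user's preference vector. As before this is plain single-machine Python with "?" treated as 0 (an assumption), and the helper names are illustrative.

```python
ITEMS = ["Maraboo", "Karnaf", "Ima Adama", "Liv"]

# Binary preference matrix A (users x items); "?" is stored as 0.
A = [
    [1, 0, 1, 0],  # Idan
    [1, 1, 0, 1],  # Shahar
    [0, 1, 0, 1],  # Gadi
]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def mat_mul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


# Entry (x, y) is the inner product <i_x, i_y> of the two item columns,
# i.e. the number of users who liked both items.
item_similarity = mat_mul(transpose(A), A)

# Scores for Gadi: (A^T A) u_i.
u_gadi = A[2]
item_scores = [
    sum(sim * pref for sim, pref in zip(row, u_gadi)) for row in item_similarity
]
```

The scores equal the user-based result A^T (A u_i) because matrix multiplication is associative; only the order of evaluation, and therefore the size of the intermediate result, differs.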
30. MapReduce Similarity Calculation
Recall row outer-product matrix multiplication:

A^T A:
           Maraboo  Karnaf  Ima Adama  Liv
Maraboo    2        1       1          1
Karnaf     1        2       0          2
Ima Adama  1        0       1          0
Liv        1        2       0          2
= u_Idan u_Idan^T:
           Maraboo  Karnaf  Ima Adama  Liv
Maraboo    1        0       1          0
Karnaf     0        0       0          0
Ima Adama  1        0       1          0
Liv        0        0       0          0
+ u_Shahar u_Shahar^T:
           Maraboo  Karnaf  Ima Adama  Liv
Maraboo    1        1       0          1
Karnaf     1        1       0          1
Ima Adama  0        0       0          0
Liv        1        1       0          1
+ u_Gadi u_Gadi^T:
           Maraboo  Karnaf  Ima Adama  Liv
Maraboo    0        0       0          0
Karnaf     0        1       0          1
Ima Adama  0        0       0          0
Liv        0        1       0          1

Only one user’s list of items is used every time!
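Because each outer product needs only one user's item list, the sum maps naturally onto a map step per user and a reduce step per item pair. The sketch below simulates that in-process with plain Python; it is not tied to Hadoop or any specific MapReduce framework, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# One record per user: the items that user liked.
PREFERENCES = {
    "Idan": ["Maraboo", "Ima Adama"],
    "Shahar": ["Maraboo", "Karnaf", "Liv"],
    "Gadi": ["Karnaf", "Liv"],
}


def map_user(items):
    """Map: emit ((item_x, item_y), 1) for each ordered pair in ONE user's list.

    These are exactly the non-zero entries of that user's outer product u u^T.
    """
    for item_x, item_y in product(items, repeat=2):
        yield (item_x, item_y), 1


def reduce_counts(pairs):
    """Reduce: sum the emitted values per (item_x, item_y) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


emitted = [pair for items in PREFERENCES.values() for pair in map_user(items)]
cooccurrence = reduce_counts(emitted)
```

The resulting dictionary holds only the non-zero entries of A^T A; pairs that never co-occur, such as Karnaf and Ima Adama, are simply absent.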
31. MapReduce Similarity Calculation
All of the classic similarity functions are made up of 3 stages:
Preprocess (uses only one ELEMENT)
Norm (can be done in reduce on one VECTOR)
Similarity utilizes the A^T A matrix joined with norm entries
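As an illustration of the three stages, here is a sketch of cosine similarity between items, run in-process rather than as real MapReduce jobs. The choice of cosine, the binarizing preprocess step, and all function names are assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations

PREFERENCES = {
    "Idan": {"Maraboo": 1, "Ima Adama": 1},
    "Shahar": {"Maraboo": 1, "Karnaf": 1, "Liv": 1},
    "Gadi": {"Karnaf": 1, "Liv": 1},
}


def preprocess(value):
    """Stage 1: transform a single ELEMENT (here: binarize it)."""
    return 1.0 if value > 0 else 0.0


def item_norms(preferences):
    """Stage 2: L2 norm of each item VECTOR, summed per item key."""
    squares = defaultdict(float)
    for items in preferences.values():
        for item, value in items.items():
            squares[item] += preprocess(value) ** 2
    return {item: math.sqrt(total) for item, total in squares.items()}


def dot_products(preferences):
    """Off-diagonal entries of A^T A, built from one user's list at a time."""
    dots = defaultdict(float)
    for items in preferences.values():
        for x, y in combinations(sorted(items), 2):
            dots[(x, y)] += preprocess(items[x]) * preprocess(items[y])
    return dict(dots)


def cosine_similarities(preferences):
    """Stage 3: join each A^T A entry with the norms of its two items."""
    norms = item_norms(preferences)
    return {
        (x, y): dot / (norms[x] * norms[y])
        for (x, y), dot in dot_products(preferences).items()
    }


similarities = cosine_similarities(PREFERENCES)
```

Item pairs are keyed in alphabetical order, so the Karnaf and Liv pair is stored under `("Karnaf", "Liv")`. Other measures would swap in a different preprocess and norm while reusing the same A^T A pass.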
32. Bibliography
Google News Personalization: Scalable Online Collaborative Filtering - Das, Datar, Garg, Rajaram, WWW2007
Logistic Regression and Collaborative Filtering for Sponsored Search Term Recommendation - Bartz, Murthi, Sebastian, EC2006
Evaluating Collaborative Filtering Recommender Systems - Herlocker, Konstan, Terveen, Riedl, ACM TOIS2004
A Survey of Collaborative Filtering Techniques - Su, Khoshgoftaar, AAI2009
An Introduction to Information Retrieval - Manning, Raghavan, Schutze, Cambridge Press
Mahout in Action - Friedman, Dunning, Anil, Owen, Manning Publications
Lessons from the Netflix Prize Challenge - Bell, Koren, KDD2009
Factorization meets the Neighbourhood: a Multifaceted Collaborative Filtering Model - Koren, KDD2008
Accurate Methods for the Statistics of Surprise and Coincidence - Dunning, ACL1993
Item-Based Collaborative Filtering Recommendation Algorithms - Sarwar, Konstan, Karypis, Riedl, WWW2001
Matrix Factorization Techniques for Recommender Systems - Koren, Bell, Volinsky, IEEE2009
recommenderlab: A Framework for Developing and Testing Recommendation Algorithms - Hahsler, 2001
Scalable Similarity-Based Neighbourhood Methods with MapReduce - Schelter, Boden, Markl, RecSys2012