1. Recommendations from the search engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
2. This whole presentation is about Ted Dunning’s
proposed approach to recommendations
Based on his 1993 paper (below)
– references at the end
Very simple method, dead easy to implement
– seems to work pretty well
Inspiration
3. Recommenders are usually designed to predict ratings
– Dunning believes this is the wrong approach
– people’s ratings don’t necessarily reflect what they’ll
buy
– go by what people do rather than what they say
You don’t want to recommend Bob Dylan
– everyone has already heard of him and knows what
they think of him
– you want to recommend things that are new to the user
You don’t want to recommend things everyone
likes
Thoughts on recommendations
4. Step 1
– work out which things tend to occur together
– that is, if you buy this, you’re likely to also buy that
– however, we only want pairs which are statistically
significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
The actual approach
6. User Item
u1 i1
u1 i2
u2 i1
u3 i2
u3 i3
u3 i4
... ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work
out what things tend to go together
– and whether this is significant
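As a rough illustration of the first half of this step, the sketch below counts how many users touched both items of each pair, using the sample log from this slide. The function name `cooccurrence_counts` and the choice to count each user at most once per pair are assumptions for illustration, not taken from the author's llr.py.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(log):
    """Count, for each unordered item pair, how many users touched both items."""
    # Group the log by user; a set means repeated actions count only once.
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Sorting gives every pair a single canonical (a, b) ordering.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)

# The (user, item) rows from the table on this slide.
log = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
       ("u3", "i2"), ("u3", "i3"), ("u3", "i4")]
pairs = cooccurrence_counts(log)
```

These raw counts say nothing yet about significance; that needs a statistical test on top of them.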
9. k[0][0] = the number in the matrix on the previous slide
k[0][1] = the sum of that whole column minus k[0][0]
k[1][0] = the sum of that whole row minus k[0][0]
k[1][1] = the sum of the entire matrix minus k[0][0] minus k[1][0] minus k[0][1]
Producing the k 2x2 matrix
How to compute the k matrix for a given cell in the matrix
on the previous slide
If the output of LLR(k) is above some threshold, the pair is considered significant.
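The recipe above can be sketched in Python as follows. This assumes the matrix on the previous slide is a square item-by-item co-occurrence matrix, and it uses the standard G² form of the log-likelihood ratio for a 2x2 table; the author's llr.py may compute it differently. The names `contingency` and `THRESHOLD`, the sample matrix, and the threshold value are made up for illustration.

```python
import math

def contingency(matrix, row, col):
    """Build the 2x2 k matrix for one cell of a square co-occurrence matrix."""
    k00 = matrix[row][col]
    k01 = sum(r[col] for r in matrix) - k00   # rest of that column
    k10 = sum(matrix[row]) - k00              # rest of that row
    total = sum(sum(r) for r in matrix)
    k11 = total - k00 - k10 - k01             # everything else
    return [[k00, k01], [k10, k11]]

def llr(k):
    """Log-likelihood ratio (G^2) of a 2x2 table of counts."""
    rows = [k[0][0] + k[0][1], k[1][0] + k[1][1]]
    cols = [k[0][0] + k[1][0], k[0][1] + k[1][1]]
    total = rows[0] + rows[1]
    if total == 0:
        return 0.0
    score = 0.0
    for i in range(2):
        for j in range(2):
            if k[i][j] > 0:  # an empty cell contributes nothing
                expected = rows[i] * cols[j] / total
                score += k[i][j] * math.log(k[i][j] / expected)
    return 2.0 * score

THRESHOLD = 10.0  # hypothetical cut-off; tune it on real data

matrix = [[0, 4, 1],
          [4, 0, 2],
          [1, 2, 0]]  # made-up co-occurrence counts
k = contingency(matrix, 0, 1)
significant = llr(k) > THRESHOLD
```

A table whose cells match what independence would predict scores 0, and the score grows as the observed counts move away from that expectation.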
10. Check the Python code on
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
Doing it for real
12. Take all the items and index them up with the
search engine in the usual way
– that is, each item has an id, a title, a description, etc.
Then, add a “magic” field
– put into it the IDs of all the items that appear in a
significant pair with this item
– let’s call this field “indicators”
Now we’re ready to do recommendations
Indexing with the search engine
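A minimal sketch of preparing such documents, independent of any particular search engine: each item becomes a dict of fields, and the "indicators" field holds the space-separated IDs of its significant partners. The function name `build_documents`, the dict layout, and the sample items are assumptions for illustration; a real setup would hand these fields to Lucene or a similar engine.

```python
from collections import defaultdict

def build_documents(items, significant_pairs):
    """Attach an 'indicators' field to each item's ordinary fields."""
    indicators = defaultdict(set)
    for a, b in significant_pairs:
        # A significant pair makes each item an indicator for the other.
        indicators[a].add(b)
        indicators[b].add(a)

    docs = []
    for item_id, fields in items.items():
        doc = dict(fields)  # copy the usual fields (title, description, ...)
        doc["id"] = item_id
        doc["indicators"] = " ".join(sorted(indicators[item_id]))
        docs.append(doc)
    return docs

# Hypothetical items and significant pairs.
items = {"i1": {"title": "A"}, "i2": {"title": "B"}, "i3": {"title": "C"}}
docs = build_documents(items, [("i1", "i2"), ("i2", "i3")])
```

Storing the IDs as whitespace-separated tokens lets an ordinary text analyzer treat each ID as a search term.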
13. Collect some set of items for which the user has
expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the “indicators” field
– the search results are your recommendations
That’s it!
– pack up, go home
Doing recommendations
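To make the query step concrete, here is a small in-memory stand-in for the search engine. It scores each document by how many of the user's item IDs appear in its "indicators" field. The function name `recommend`, the plain hit-count scoring, and the choice to skip items the user already has are assumptions for illustration; a real engine would rank with its own relevance formula.

```python
def recommend(docs, liked_ids, limit=10):
    """Rank documents by how many liked IDs appear in their indicators field."""
    liked = set(liked_ids)
    scored = []
    for doc in docs:
        if doc["id"] in liked:
            continue  # assumption: never recommend what the user already has
        hits = liked & set(doc["indicators"].split())
        if hits:
            scored.append((len(hits), doc["id"]))
    # Most hits first; ties are broken by ID to keep the order stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]

# Hypothetical index contents.
docs = [
    {"id": "i1", "indicators": "i2"},
    {"id": "i2", "indicators": "i1 i3"},
    {"id": "i3", "indicators": "i2 i4"},
    {"id": "i4", "indicators": "i3"},
]
results = recommend(docs, ["i1", "i3"])
```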
14. Imagine that you’re searching for movies, and you
type “the godfather”
– “the” appears in all documents, so documents matching that
get a low relevance score
– “godfather” appears in very few documents, so matches on
that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: “The Godfather”
and “The Daytrippers”
– nearly all movies have “The Godfather” as an indicator
– very few have “The Daytrippers”
– the second will therefore influence recommendations much
more
Why does it work?
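The effect can be reproduced with a toy inverse document frequency (IDF) calculation. The sketch uses the plain textbook form log(N / df), which is not Lucene's exact scoring formula, and the four-movie index is made up so that one indicator is everywhere and the other is rare. The function names `idf` and `score` are hypothetical.

```python
import math

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc["indicators"])
    return math.log(len(docs) / df) if df else 0.0

def score(doc, liked, docs):
    """Sum the IDF weights of the liked items found among the doc's indicators."""
    return sum(idf(term, docs) for term in liked if term in doc["indicators"])

# Made-up index: "godfather" is an indicator everywhere, "daytrippers" is rare.
docs = [
    {"id": "m1", "indicators": {"godfather"}},
    {"id": "m2", "indicators": {"godfather", "daytrippers"}},
    {"id": "m3", "indicators": {"godfather"}},
    {"id": "m4", "indicators": {"godfather"}},
]
liked = ["godfather", "daytrippers"]
scores = {doc["id"]: score(doc, liked, docs) for doc in docs}
```

With this data, a match on "godfather" adds nothing to the score because every document has it, so only the rare "daytrippers" indicator separates m2 from the rest.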
16. Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
Real demo with real data
17. llr.py
– this chews the data, producing the significant pairs
– takes a huge amount of memory and about 30 minutes
– I have made absolutely no attempt to optimize it
llr_index.py
– reads output of previous script, makes Lucene index
recom-ui.py
– the actual web application
Three scripts
25. Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
Things left to do
29. The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
References