SlideShare a Scribd company logo
1 of 29
Recommendations from the search
engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
1
This whole presentation is about Ted Dunning’s
proposed approach to recommendations
Based on his 1993 paper (below)
– references at the end
Very simple method, dead easy to implement
– seems to work pretty well
2
Inspiration
Usually designed as prediction of ratings
– Dunning believes this is the wrong approach
– people’s ratings don’t necessarily reflect what they’ll
buy
– go by what people do rather than what they say
You don’t want to recommend Bob Dylan
– everyone’s already heard about him, and know what
they think
– you want to recommend things that are new to the user
You don’t want to recommend things everyone
likes
3
Thoughts on recommendations
Step 1
– work out which things tend to occur together
– that is, if you buy this, you’re likely to also buy this
– however, we only want pairs which are statistically
significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
4
The actual approach
Statistically significant co-
occurrence
Part the first
User Item
u1 i1
u1 i2
u2 i1
u3 i2
u3 i3
u3 i4
... ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work
out what things tend to go together
– and whether this is significant
7
i1 i2 i3 i4 i5 i6 i7
i1 23 42 0 0 5 7
i2 23 6 1 129 2 10
i3 42 6 3 0 492 1
i4 0 1 3 2 3 1
i5 0 129 0 2 94 2
i6 5 2 492 3 94 1
i7 7 10 1 1 2 1
8
Item-to-item matrix
k[0][0] = the number in the matrix on
previous slide
k[0][1] = the sum of that whole column
minus k[0][0]
k[1][0] = the sum of that whole row
minus k[0][0]
k[1][1] = the sum of the entire matrix
minus k[0][0] minus k[1][0] minus
k[0][1]
9
Producing the k 2x2 matrix
How to compute the k matrix for a given cell in the matrix
on the previous slide
If the output of LLR(k) is above some threshold, the pair is considered significant.
Check the Python code on
– https://github.com/larsga/py-
snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
10
Doing it for real
Search engine as recommender
Part the second
Take all the items and index them up with the
search engine in the usual way
– that is, each title has an id, a title, a description, etc
Then, add a “magic” field
– put into it the IDs of all the items that appear in a
significant pair with this item
– let’s call this field “indicators”
Now we’re ready to do recommendations
12
Indexing with the search engine
Collect some set of items for which the user has
expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the “indicators” field
– the search results are your recommendations
That’s it!
– pack up, go home
13
Doing recommendations
Imagine that you’re searching for movies, and you
type “the godfather”
– “the” appears in all documents, so documents matching that
get a low relevance score
– “godfather” appears in very few documents, so matches on
that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: “The Godfather”
and “The Daytrippers”
– nearly all movies have “The Godfather” as an indicator
– very few have “The Daytrippers”
– the second will therefore influence recommendations much
more
14
Why does it work?
Trying it out for real
Part the third
Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-
snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
16
Real demo with real data
llr.py
– this chews the data, producing the significant pairs
– takes huge amount of memory and about 30 minutes
– have made absolutely no attempts to optimize it
llr_index.py
– reads output of previous script, makes Lucene index
recom-ui.py
– the actual web application
17
Three scripts
18
19
20
Liked one movie
21
Liked two movies
Movies with highest llr scor
together with this movie
22
Liked three movies
Recommendations are actually now spot-on. At least for me.
class Movie:
def GET(self, movieid):
nocache()
doc = search.do_query('id', movieid)[0]
#recoms = search.do_query('indicators', movieid)
recoms = [search.do_query('id', movieid)[0] for movieid in doc.bets]
if hasattr(session, 'liked'):
youlike = search.do_query('indicators', session.liked)
else:
youlike = []
return render.movie(doc, recoms, youlike)
23
Complete code for movie page
Further work
Winding up
Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
25
Things left to do
26
What is this?
From Ted Dunning’s slides
27
And this?
From Ted Dunning’s slides
28
And this?
From Ted Dunning’s slides
The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14
.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-
recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-
coincidence.html
29
References

More Related Content

What's hot

Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonKrishna Sankar
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformAndy Petrella
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonMark Conway
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopCosmoAIMS Bassett
 
Enhance discovery Solr and Mahout
Enhance discovery Solr and MahoutEnhance discovery Solr and Mahout
Enhance discovery Solr and Mahoutlucenerevolution
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchTrey Grainger
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data Vaibhav Kurkute
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Machine Learning in the age of Big Data
Machine Learning in the age of Big DataMachine Learning in the age of Big Data
Machine Learning in the age of Big DataDaniel Sârbe
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data ScienceSean Taylor
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchRachel Berryman
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewKevin Watters
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 

What's hot (20)

Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
 
Enhance discovery Solr and Mahout
Enhance discovery Solr and MahoutEnhance discovery Solr and Mahout
Enhance discovery Solr and Mahout
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Machine Learning in the age of Big Data
Machine Learning in the age of Big DataMachine Learning in the age of Big Data
Machine Learning in the age of Big Data
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 

Viewers also liked

Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques sun9413
 
Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Rajul Kukreja
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal RecommenderPat Ferrel
 
Publishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudPublishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudDeanta
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 
Amazon Item-to-Item Recommendations
Amazon Item-to-Item RecommendationsAmazon Item-to-Item Recommendations
Amazon Item-to-Item RecommendationsRoger Chen
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineTrey Grainger
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Report Writing - Conclusions & Recommendations sections
Report Writing - Conclusions & Recommendations sectionsReport Writing - Conclusions & Recommendations sections
Report Writing - Conclusions & Recommendations sectionsSherrie Lee
 

Viewers also liked (11)

Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
 
Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal Recommender
 
Publishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudPublishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the Cloud
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
Amazon Item-to-Item Recommendations
Amazon Item-to-Item RecommendationsAmazon Item-to-Item Recommendations
Amazon Item-to-Item Recommendations
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Report Writing - Conclusions & Recommendations sections
Report Writing - Conclusions & Recommendations sectionsReport Writing - Conclusions & Recommendations sections
Report Writing - Conclusions & Recommendations sections
 

Similar to Using the search engine as recommendation engine

Software Development is Upside Down
Software Development is Upside DownSoftware Development is Upside Down
Software Development is Upside Downallan kelly
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18DataconomyGmbH
 
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00t
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00tDefcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00t
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00tpseudor00t overflow
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]David Przybilla
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator ProgramGoDataDriven
 
Coaching teams in creative problem solving
Coaching teams in creative problem solvingCoaching teams in creative problem solving
Coaching teams in creative problem solvingFlowa Oy
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
Machine learning
Machine learningMachine learning
Machine learningAshok Masti
 
Chapter 02 collaborative recommendation
Chapter 02   collaborative recommendationChapter 02   collaborative recommendation
Chapter 02 collaborative recommendationAravindharamanan S
 
Chapter 02 collaborative recommendation
Chapter 02   collaborative recommendationChapter 02   collaborative recommendation
Chapter 02 collaborative recommendationAravindharamanan S
 
Recommendation Systems Roadtrip
Recommendation Systems RoadtripRecommendation Systems Roadtrip
Recommendation Systems RoadtripThe Real Dyl
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBMongoDB
 

Similar to Using the search engine as recommendation engine (20)

Software Development is Upside Down
Software Development is Upside DownSoftware Development is Upside Down
Software Development is Upside Down
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18
 
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00t
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00tDefcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00t
Defcon 21-caceres-massive-attacks-with-distributed-computing by pseudor00t
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator Program
 
Coaching teams in creative problem solving
Coaching teams in creative problem solvingCoaching teams in creative problem solving
Coaching teams in creative problem solving
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Machine learning
Machine learningMachine learning
Machine learning
 
Chapter 02 collaborative recommendation
Chapter 02   collaborative recommendationChapter 02   collaborative recommendation
Chapter 02 collaborative recommendation
 
Chapter 02 collaborative recommendation
Chapter 02   collaborative recommendationChapter 02   collaborative recommendation
Chapter 02 collaborative recommendation
 
Recommendation Systems Roadtrip
Recommendation Systems RoadtripRecommendation Systems Roadtrip
Recommendation Systems Roadtrip
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 

More from Lars Marius Garshol

JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformationLars Marius Garshol
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at SchibstedLars Marius Garshol
 
NoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityNoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityLars Marius Garshol
 
Linked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLinked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLars Marius Garshol
 
NoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityNoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityLars Marius Garshol
 
Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceLars Marius Garshol
 
Experiments in genetic programming
Experiments in genetic programmingExperiments in genetic programming
Experiments in genetic programmingLars Marius Garshol
 

More from Lars Marius Garshol (20)

JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at Schibsted
 
Kveik - what is it?
Kveik - what is it?Kveik - what is it?
Kveik - what is it?
 
Nature-inspired algorithms
Nature-inspired algorithmsNature-inspired algorithms
Nature-inspired algorithms
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
History of writing
History of writingHistory of writing
History of writing
 
NoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityNoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativity
 
Norwegian farmhouse ale
Norwegian farmhouse aleNorwegian farmhouse ale
Norwegian farmhouse ale
 
Archive integration with RDF
Archive integration with RDFArchive integration with RDF
Archive integration with RDF
 
The Euro crisis in 10 minutes
The Euro crisis in 10 minutesThe Euro crisis in 10 minutes
The Euro crisis in 10 minutes
 
Linked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLinked Open Data for the Cultural Sector
Linked Open Data for the Cultural Sector
 
NoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityNoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativity
 
Bitcoin - digital gold
Bitcoin - digital goldBitcoin - digital gold
Bitcoin - digital gold
 
Hops - the green gold
Hops - the green goldHops - the green gold
Hops - the green gold
 
Big data 101
Big data 101Big data 101
Big data 101
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practice
 
Approximate string comparators
Approximate string comparatorsApproximate string comparators
Approximate string comparators
 
Experiments in genetic programming
Experiments in genetic programmingExperiments in genetic programming
Experiments in genetic programming
 
Semantisk integrasjon
Semantisk integrasjonSemantisk integrasjon
Semantisk integrasjon
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Using the search engine as recommendation engine

  • 1. Recommendations from the search engine Sesam Hackathon, Warsaw, 2014-03-23 Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga 1
  • 2. This whole presentation is about Ted Dunning’s proposed approach to recommendations Based on his 1993 paper (below) – references at the end Very simple method, dead easy to implement – seems to work pretty well 2 Inspiration
  • 3. Usually designed as prediction of ratings – Dunning believes this is the wrong approach – people’s ratings don’t necessarily reflect what they’ll buy – go by what people do rather than what they say You don’t want to recommend Bob Dylan – everyone’s already heard about him, and know what they think – you want to recommend things that are new to the user You don’t want to recommend things everyone likes 3 Thoughts on recommendations
  • 4. Step 1 – work out which things tend to occur together – that is, if you buy this, you’re likely to also buy this – however, we only want pairs which are statistically significant Step 2 – index up the significant pairs in a search engine – use search to produce the actual results 4 The actual approach
  • 6. User Item u1 i1 u1 i2 u2 i1 u3 i2 u3 i3 u3 i4 ... ... The starting point Some kind of log of user actions User has – bought a movie | album | book | ... – opened a document – ... From this raw material, we can work out what things tend to go together – and whether this is significant
  • 7. 7
  • 8. i1 i2 i3 i4 i5 i6 i7 i1 23 42 0 0 5 7 i2 23 6 1 129 2 10 i3 42 6 3 0 492 1 i4 0 1 3 2 3 1 i5 0 129 0 2 94 2 i6 5 2 492 3 94 1 i7 7 10 1 1 2 1 8 Item-to-item matrix
  • 9. k[0][0] = the number in the matrix on previous slide k[0][1] = the sum of that whole column minus k[0][0] k[1][0] = the sum of that whole row minus k[0][0] k[1][1] = the sum of the entire matrix minus k[0][0] minus k[1][0] minus k[0][1] 9 Producing the k 2x2 matrix How to compute the k matrix for a given cell in the matrix on the previous slide If the output of LLR(k) is above some threshold, the pair is considered significant.
  • 10. Check the Python code on – https://github.com/larsga/py- snippets/tree/master/machine-learning/llr – this requires a lot of memory and CPU Or just use Mahout – RowSimilarityJob does exactly this 10 Doing it for real
  • 11. Search engine as recommender Part the second
  • 12. Take all the items and index them up with the search engine in the usual way – that is, each title has an id, a title, a description, etc Then, add a “magic” field – put into it the IDs of all the items that appear in a significant pair with this item – let’s call this field “indicators” Now we’re ready to do recommendations 12 Indexing with the search engine
  • 13. Collect some set of items for which the user has expressed a preference – by buying them, looking at them, rating them, whatever The IDs of these items are your query – search the “indicators” field – the search results are your recommendations That’s it! – pack up, go home 13 Doing recommendations
  • 14. Imagine that you’re searching for movies, and you type “the godfather” – “the” appears in all documents, so documents matching that get a low relevance score – “godfather” appears in very few documents, so matches on that get a high score – this is basically TF/IDF in a nutshell Now, imagine you liked two movies: “The Godfather” and “The Daytrippers” – nearly all movies have “The Godfather” as an indicator – very few have “The Daytrippers” – the second will therefore influence recommendations much more 14 Why does it work?
  • 15. Trying it out for real Part the third
  • 16. Again, the code is on Github – very simple webapp based on web.py and Lucene – https://github.com/larsga/py- snippets/tree/master/machine-learning/llr The underlying data is the MovieLens dataset – 10 million ratings of 10,000 movies by 72,000 users – http://grouplens.org/datasets/movielens/ 16 Real demo with real data
  • 17. llr.py – this chews the data, producing the significant pairs – takes huge amount of memory and about 30 minutes – have made absolutely no attempts to optimize it llr_index.py – reads output of previous script, makes Lucene index recom-ui.py – the actual web application 17 Three scripts
  • 18. 18
  • 19. 19
  • 21. 21 Liked two movies Movies with highest llr scor together with this movie
  • 22. 22 Liked three movies Recommendations are actually now spot-on. At least for me.
  • 23. class Movie: def GET(self, movieid): nocache() doc = search.do_query('id', movieid)[0] #recoms = search.do_query('indicators', movieid) recoms = [search.do_query('id', movieid)[0] for movieid in doc.bets] if hasattr(session, 'liked'): youlike = search.do_query('indicators', session.liked) else: youlike = [] return render.movie(doc, recoms, youlike) 23 Complete code for movie page
  • 25. Tweak the parameters a bit to see what happens Can we support a “Dislike” button? Test it with more kinds of data Learn how to do this with Mahout 25 Things left to do
  • 26. 26 What is this? From Ted Dunning’s slides
  • 27. 27 And this? From Ted Dunning’s slides
  • 28. 28 And this? From Ted Dunning’s slides
  • 29. The original 1993 paper – http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14 .5962 Ebook with lots of background but little detail – http://www.mapr.com/practical-machine-learning Slides covering the same material – www.slideshare.net/tdunning/building-multimodal- recommendation-engines-using-search-engines Blog post with actual equations – http://tdunning.blogspot.com/2008/03/surprise-and- coincidence.html 29 References