Collaborative Filtering at Scale
  London Hadoop User Group
  Sean Owen
  14 April 2011
Apache Mahout is …
  - Machine learning: collaborative filtering (recommenders), clustering, classification, frequent item set mining, and more
  - … at scale: much is implemented on Hadoop, with efficient data structures
Collaborative Filtering is …
  - Given a user’s preferences for items, guess which other items would be highly preferred
  - Only needs preferences; users and items are opaque
  - Many algorithms!
Collaborative Filtering is …
  - Sean likes “Scarface” a lot: (123,654,5.0)
  - Robin likes “Scarface” somewhat: (789,654,3.0)
  - Grant likes “The Notebook” not at all: (345,876,1.0)
  - … (Magic) …
  - (345,654,4.5): Grant may like “Scarface” quite a bit
Recommending people food
Item-Based Algorithm
  - Recommend items similar to a user’s highly-preferred items
Item-Based Algorithm
  - Have the user’s preferences for items
  - Know all items, and can compute a weighted average to estimate the user’s preference
  - What is the item-item similarity notion?

    for every item i that u has no preference for yet
      for every item j that u has a preference for
        compute a similarity s between i and j
        add u's preference for j, weighted by s, to a running average
    return top items, ranked by weighted average
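The item-based pseudo-code can be sketched in Python as follows. This is only an illustration: the function name `recommend`, the `similarity` callback, and the similarity table are hypothetical and are not Mahout's API.

```python
def recommend(user_prefs, all_items, similarity, top_n=3):
    """Estimate preferences for items the user has no preference for yet.

    user_prefs: dict mapping item -> preference value for one user u
    all_items:  iterable of every known item
    similarity: function (item_i, item_j) -> non-negative similarity s
    """
    estimates = {}
    for i in all_items:
        if i in user_prefs:
            continue  # only score items u has no preference for yet
        weighted_sum = 0.0
        weight_total = 0.0
        for j, pref in user_prefs.items():
            s = similarity(i, j)
            weighted_sum += s * pref   # u's preference for j, weighted by s
            weight_total += s
        if weight_total > 0:
            estimates[i] = weighted_sum / weight_total
    # Rank by weighted average, highest first.
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Hypothetical similarity table, for illustration only.
SIMILARITY = {
    ("D", "A"): 2, ("D", "B"): 1, ("D", "C"): 1,
    ("E", "A"): 0, ("E", "B"): 3, ("E", "C"): 1,
}


def table_similarity(i, j):
    return SIMILARITY.get((i, j), 0)


result = recommend({"A": 4.0, "B": 2.0, "C": 1.0},
                   ["A", "B", "C", "D", "E"], table_similarity)
print(result)
```

Any similarity notion can be plugged in through the callback, which is why the choice of item-item similarity is the open question on this slide.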
Item-Item Similarity
  - Could be based on content: two foods are similar if both are sweet, or both are cold
  - BUT: in collaborative filtering, similarity is based only on preferences (numbers)
  - Pearson correlation between ratings?
  - Log-likelihood ratio?
  - Simple co-occurrence: items are similar when they often appear in the same user’s set of preferences
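As a rough sketch of the simple co-occurrence notion, the snippet below counts how many users have both items of each pair in their preference set. The data and the name `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(prefs_by_user):
    """For each ordered item pair (i, j), count the users who have both."""
    counts = Counter()
    for items in prefs_by_user.values():
        for i, j in permutations(sorted(set(items)), 2):
            counts[(i, j)] += 1
    return counts


# Hypothetical data: the foods each animal has a preference for.
prefs_by_user = {
    "cat": ["hot dog", "ice cream"],
    "dog": ["hot dog", "ice cream", "blueberries"],
    "owl": ["blueberries", "ice cream"],
}
counts = cooccurrence_counts(prefs_by_user)
print(counts[("hot dog", "ice cream")])
```

The counts are symmetric, since a user who has both i and j contributes to (i, j) and to (j, i).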
Estimating preference
  Preference   Co-occurrence
  5            9
  5            16
  2            5
  Estimate = (5•9 + 5•16 + 2•5) / (9 + 16 + 5) = 135 / 30 = 4.5
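The arithmetic on this slide can be checked in a few lines of Python. The variable names are illustrative; the numbers are the ones on the slide.

```python
preferences = [5, 5, 2]      # the user's known preferences
cooccurrences = [9, 16, 5]   # co-occurrence of each rated item with the candidate

# Weighted average: each preference is weighted by its co-occurrence count.
weighted_sum = sum(p * c for p, c in zip(preferences, cooccurrences))
total_weight = sum(cooccurrences)
estimate = weighted_sum / total_weight
print(weighted_sum, total_weight, estimate)
```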
As matrix math
  - The user’s preferences are a vector
    - Each dimension corresponds to one item
    - The dimension value is the preference value
  - Item-item co-occurrences are a matrix
    - The entry at row i, column j is the number of times items i and j co-occur
  - Estimating preferences: co-occurrence matrix × preference (column) vector
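A small pure-Python sketch of the matrix form, using a made-up symmetric 4×4 co-occurrence matrix. Its last row reuses the co-occurrence values 9, 16, 5 with preferences 5, 5, 2. Note that the plain product gives an unnormalized score (135 here) rather than the weighted average; dividing by the sum of the weights is left out of this sketch.

```python
def mat_vec(matrix, vector):
    """Row-wise product: one dot product per matrix row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical co-occurrence matrix for items 0..3 (symmetric).
C = [
    [12,  3,  4,  9],
    [ 3, 20,  6, 16],
    [ 4,  6,  8,  5],
    [ 9, 16,  5, 18],
]
# Preferences for items 0..2; 0 means "no preference yet" for item 3.
prefs = [5, 5, 2, 0]

scores = mat_vec(C, prefs)
print(scores)
```

Only the entries for items without a preference (item 3 here) are candidates for recommendation.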
As matrix math
  (Figure: co-occurrence matrix. Annotations: 16 animals ate both hot dogs and ice cream; 10 animals ate blueberries.)
A different way to multiply
  - Normal: for each row of the matrix
    - Multiply (dot) the row with the column vector
    - Yields a scalar: one final element of the recommendation vector
  - Inside-out: for each element of the column vector
    - Multiply (scalar) with the corresponding matrix column
    - Yields a column vector: part of the final recommendation vector
    - Sum those to get the result
    - Can skip zero vector elements!
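The two multiplication orders can be compared with a short sketch. The matrix and vector are made up, and `multiply_by_rows` and `multiply_by_columns` are illustrative names.

```python
def multiply_by_rows(matrix, vector):
    """Normal: dot each row with the vector, one result element per row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def multiply_by_columns(matrix, vector):
    """Inside-out: scale each matrix column by one vector element and sum."""
    n_rows = len(matrix)
    result = [0] * n_rows
    for j, v in enumerate(vector):
        if v == 0:
            continue  # zero elements contribute nothing, so skip the column
        for i in range(n_rows):
            result[i] += v * matrix[i][j]
    return result


# Hypothetical co-occurrence matrix and preference vector.
C = [
    [12,  3,  4,  9],
    [ 3, 20,  6, 16],
    [ 4,  6,  8,  5],
    [ 9, 16,  5, 18],
]
prefs = [5, 5, 2, 0]

by_rows = multiply_by_rows(C, prefs)
by_columns = multiply_by_columns(C, prefs)
print(by_rows, by_columns)
```

A user's preference vector is typically mostly zeros, so the column-wise form touches only the columns of items the user has a preference for.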
As matrix math, again
  (Figure: the matrix multiplication again; values shown: 5, 5, 2.)
What is MapReduce?
  1. Input is a series of key-value pairs: (K1,V1)
  2. The map() function receives these and outputs 0 or more (K2,V2)
  3. All values for each K2 are collected together
  4. The reduce() function receives these and outputs 0 or more (K3,V3)
  - Very distributable and parallelizable
  - Most large-scale problems can be chopped into a series of such MapReduce jobs
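The four steps can be imitated in memory with a few lines of Python. This is a toy, single-process sketch, not Hadoop's API; `run_mapreduce` and the counting job are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map / collect / reduce pass over (K1, V1) records in memory."""
    grouped = defaultdict(list)
    for k1, v1 in records:                  # step 1: input (K1, V1) pairs
        for k2, v2 in mapper(k1, v1):       # step 2: map emits 0 or more (K2, V2)
            grouped[k2].append(v2)          # step 3: collect values per K2
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))  # step 4: reduce emits (K3, V3)
    return output


# Toy job: count how many preferences each item received.
def count_mapper(_position, line):
    _user, item, _pref = line.split(",")
    yield item, 1


def count_reducer(item, ones):
    yield item, sum(ones)


lines = ["123,654,5.0", "789,654,3.0", "345,876,1.0"]
records = list(enumerate(lines))  # K1 = position, V1 = line
result = run_mapreduce(records, count_mapper, count_reducer)
print(result)
```

In a real cluster the map calls and the per-key reduce calls run on different machines; the grouping step is what the framework does between them.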
Build user vectors (mapper), step 1
  - Input is a text file: user,item,preference
  - Mapper receives: K1 = file position (ignored), V1 = line of text file
  - Mapper outputs, for each line: K2 = user ID, V2 = (item ID, preference)
Build user vectors (reducer), step 1
  - Reducer receives: K2 = user ID, V2,… = (item ID, preference), …
  - Reducer outputs: K3 = user ID, V3 = Mahout Vector implementation
  - Mahout provides custom Writable implementations for efficient Vector storage
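A toy version of this user-vector job, run in memory. A plain dict stands in for Mahout's Vector, and `run_mapreduce` is a hypothetical single-process helper, not part of Hadoop or Mahout.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output


def parse_mapper(_position, line):
    """K1 = file position (ignored), V1 = 'user,item,preference' line."""
    user, item, pref = line.split(",")
    yield int(user), (int(item), float(pref))


def vector_reducer(user, item_prefs):
    """Gather one user's (item, preference) pairs into a sparse vector."""
    yield user, dict(item_prefs)


lines = ["123,654,5.0", "123,876,2.0", "789,654,3.0", "345,876,1.0"]
user_vectors = run_mapreduce(list(enumerate(lines)), parse_mapper, vector_reducer)
print(user_vectors)
```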
Count co-occurrence (mapper), step 2
  - Mapper receives: K1 = user ID, V1 = user Vector
  - Mapper outputs, for each pair of items: K2 = item ID, V2 = other item ID
Count co-occurrence (reducer), step 2
  - Reducer receives: K2 = item ID, V2,… = other item ID, …
  - Reducer tallies each other item and creates a Vector
  - Reducer outputs: K3 = item ID, V3 = column of co-occurrence matrix, as a Vector
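A toy sketch of the co-occurrence job. It assumes that an item is also paired with itself, so the diagonal counts how many users have that item; that choice, the dict-based vectors, and the `run_mapreduce` helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output


def pair_mapper(_user, vector):
    """Emit (item, other item) for every ordered pair in one user's vector."""
    items = sorted(vector)
    for item, other in product(items, repeat=2):
        yield item, other


def tally_reducer(item, others):
    """Tally the other items into one column of the co-occurrence matrix."""
    yield item, dict(Counter(others))


# Hypothetical user vectors: user ID -> {item ID: preference}.
user_vectors = [
    (1, {101: 5.0, 102: 3.0}),
    (2, {101: 2.0, 102: 4.0, 103: 1.0}),
    (3, {103: 5.0}),
]
columns = run_mapreduce(user_vectors, pair_mapper, tally_reducer)
print(columns)
```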
Partial multiply (mapper #1), step 3
  - Mapper receives: K1 = user ID, V1 = user Vector
  - Mapper outputs, for each item: K2 = item ID, V2 = (user ID, preference)
Partial multiply (mapper #2), step 3
  - Mapper receives: K1 = item ID, V1 = co-occurrence matrix column Vector
  - Mapper outputs: K2 = item ID, V2 = co-occurrence matrix column Vector
Partial multiply (reducer), step 3
  - Reducer receives: K2 = item ID, V2,… = (user ID, preference), … and the co-occurrence matrix column Vector
  - Reducer outputs, for each item ID: K3 = item ID, V3 = column vector and (user ID, preference) pairs
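A toy sketch of the partial-multiply job: two mappers feed one reducer, which joins each item's co-occurrence column with the (user ID, preference) pairs for that item. Tagging the values with "pref" and "column" is an assumption made here to tell the two inputs apart; `run_join` is a hypothetical in-memory helper.

```python
from collections import defaultdict


def run_join(inputs, reducer):
    """inputs: list of (records, mapper) pairs that share one reducer."""
    grouped = defaultdict(list)
    for records, mapper in inputs:
        for k1, v1 in records:
            for k2, v2 in mapper(k1, v1):
                grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output


def user_vector_mapper(user, prefs):
    """Mapper #1: re-key each preference by item ID."""
    for item, pref in prefs.items():
        yield item, ("pref", (user, pref))


def column_mapper(item, column):
    """Mapper #2: pass the co-occurrence column through unchanged."""
    yield item, ("column", column)


def join_reducer(item, tagged_values):
    """Pair the item's column with every (user ID, preference) for it."""
    column = None
    user_prefs = []
    for tag, payload in tagged_values:
        if tag == "column":
            column = payload
        else:
            user_prefs.append(payload)
    if column is not None:
        yield item, (column, user_prefs)


# Hypothetical inputs.
user_vectors = [(1, {101: 5.0, 102: 3.0}), (2, {101: 2.0})]
columns = [(101, {101: 2, 102: 1}), (102, {101: 1, 102: 1})]

joined = run_join(
    [(user_vectors, user_vector_mapper), (columns, column_mapper)],
    join_reducer,
)
print(joined)
```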
Aggregate (mapper), step 4
  - Mapper receives: K1 = item ID, V1 = column vector and (user ID, preference) pairs
  - Mapper outputs, for each user ID: K2 = user ID, V2 = column vector times preference
Aggregate (reducer), step 4
  - Reducer receives: K2 = user ID, V2,… = partial recommendation vectors
  - Reducer sums them to make the recommendation Vector and finds the top n values
  - Reducer outputs, for each top value: K3 = user ID, V3 = (item ID, value)
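A toy sketch of the aggregate job. The mapper scales each co-occurrence column by a user's preference, and the reducer sums the partial vectors and keeps the top n. For simplicity this sketch does not filter out items the user already has a preference for; the data, `TOP_N`, and the `run_mapreduce` helper are illustrative assumptions.

```python
from collections import defaultdict

TOP_N = 2  # hypothetical number of recommendations to keep per user


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output


def scale_mapper(_item, column_and_prefs):
    """Emit, per user, the column vector times that user's preference."""
    column, user_prefs = column_and_prefs
    for user, pref in user_prefs:
        yield user, {i: count * pref for i, count in column.items()}


def sum_reducer(user, partial_vectors):
    """Sum the partial vectors, then emit the top-n (item ID, value) pairs."""
    total = defaultdict(float)
    for partial in partial_vectors:
        for item, value in partial.items():
            total[item] += value
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, value in ranked[:TOP_N]:
        yield user, (item, value)


# Hypothetical output of the partial-multiply step.
joined = [
    (101, ({101: 2, 102: 1, 103: 1}, [(1, 5.0)])),
    (102, ({101: 1, 102: 2, 103: 3}, [(1, 3.0)])),
]
recommendations = run_mapreduce(joined, scale_mapper, sum_reducer)
print(recommendations)
```

Summing the scaled columns per user is the inside-out matrix multiplication, spread across mappers and reducers.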
Reality is more complex
Ready to try
  - Obtain and build Mahout from Subversion: http://mahout.apache.org/versioncontrol.html
  - Set up and run Hadoop in local pseudo-distributed mode
  - Copy input into local HDFS
  - Run the recommender job:

    hadoop jar mahout-0.5-SNAPSHOT.job \
      org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
      -Dmapred.input.dir=input \
      -Dmapred.output.dir=output
Mahout in Action
  - Action-packed coverage of recommenders, clustering, and classification
  - Final-ish PDF available now; final print copy in May 2011
  - http://www.manning.com/owen/
Questions?
  - Gmail: srowen
  - user@mahout.apache.org
  - http://mahout.apache.org

Collaborative filtering at scale

  • 1. Collaborative Filtering at Scale. London Hadoop User Group. Sean Owen, 14 April 2011.
  • 2. Apache Mahout is … Machine learning: collaborative filtering (recommenders), clustering, classification, frequent item set mining, and more … at scale: much implemented on Hadoop, efficient data structures.
  • 3. Collaborative Filtering is … Given a user’s preferences for items, guess which other items would be highly preferred. Only needs preferences; users and items are opaque. Many algorithms!
  • 4. Collaborative Filtering is … Sean likes “Scarface” a lot, Robin likes “Scarface” somewhat, Grant likes “The Notebook” not at all … As (user, item, preference) tuples: (123,654,5.0) (789,654,3.0) (345,876,1.0) … (Magic) yields (345,654,4.5) … that is, Grant may like “Scarface” quite a bit …
  • 5. Recommending people food
  • 6. Item-Based Algorithm. Recommend items similar to a user’s highly-preferred items.
  • 7. Item-Based Algorithm. Have the user’s preferences for items. Know all items, and can compute a weighted average to estimate the user’s preference. What is the item-item similarity notion?
       for every item i that u has no preference for yet
         for every item j that u has a preference for
           compute a similarity s between i and j
           add u's preference for j, weighted by s, to a running average
       return top items, ranked by weighted average
  • 8. Item-Item Similarity. Could be based on content … two foods are similar if both sweet, both cold. BUT: in collaborative filtering, it is based only on preferences (numbers). Pearson correlation between ratings? Log-likelihood ratio? Simple co-occurrence: items are similar when they appear often in the same user’s set of preferences.
  • 9. Estimating preference. Preferences: 5, 5, 2. Co-occurrences: 9, 16, 5. Estimate = (5•9 + 5•16 + 2•5) / (9 + 16 + 5) = 135 / 30 = 4.5
  • 10. As matrix math. The user’s preferences are a vector: each dimension corresponds to one item, and the dimension value is the preference value. Item-item co-occurrences are a matrix: row i / column j is the count of item i / j co-occurrence. Estimating preferences: co-occurrence matrix × preference (column) vector.
  • 11. As matrix math. 16 animals ate both hot dogs and ice cream. 10 animals ate blueberries.
  • 12. A different way to multiply. Normal: for each row of the matrix, multiply (dot) the row with the column vector; this yields a scalar, one final element of the recommendation vector. Inside-out: for each element of the column vector, multiply (scalar) with the corresponding matrix column; this yields a column vector, one part of the final recommendation vector. Sum those to get the result. Can skip zero vector elements!
  • 13. As matrix math, again (figure values: 5, 5, 2)
  • 14. What is MapReduce? (1) Input is a series of key-value pairs: (K1,V1). (2) The map() function receives these and outputs 0 or more (K2,V2). (3) All values for each K2 are collected together. (4) The reduce() function receives these and outputs 0 or more (K3,V3). Very distributable and parallelizable. Most large-scale problems can be chopped into a series of such MapReduce jobs.
  • 15. Build user vectors (mapper), job 1. Input is a text file: user,item,preference. Mapper receives K1 = file position (ignored), V1 = line of text file. Mapper outputs, for each line, K2 = user ID, V2 = (item ID, preference).
  • 16. Build user vectors (reducer), job 1. Reducer receives K2 = user ID; V2,… = (item ID, preference), … Reducer outputs K3 = user ID, V3 = Mahout Vector implementation. Mahout provides custom Writable implementations for efficient Vector storage.
  • 17. Count co-occurrence (mapper), job 2. Mapper receives K1 = user ID, V1 = user Vector. Mapper outputs, for each pair of items, K2 = item ID, V2 = other item ID.
  • 18. Count co-occurrence (reducer), job 2. Reducer receives K2 = item ID; V2,… = other item ID, … Reducer tallies each other item and creates a Vector. Reducer outputs K3 = item ID, V3 = column of co-occurrence matrix as Vector.
  • 19. Partial multiply (mapper #1), job 3. Mapper receives K1 = user ID, V1 = user Vector. Mapper outputs, for each item, K2 = item ID, V2 = (user ID, preference).
  • 20. Partial multiply (mapper #2), job 3. Mapper receives K1 = item ID, V1 = co-occurrence matrix column Vector. Mapper outputs K2 = item ID, V2 = co-occurrence matrix column Vector.
  • 21. Partial multiply (reducer), job 3. Reducer receives K2 = item ID; V2,… = (user ID, preference), … and co-occurrence matrix column Vector. Reducer outputs, for each item ID, K3 = item ID, V3 = column vector and (user ID, preference) pairs.
  • 22. Aggregate (mapper), job 4. Mapper receives K1 = item ID, V1 = column vector and (user ID, preference) pairs. Mapper outputs, for each user ID, K2 = user ID, V2 = column vector times preference.
  • 23. Aggregate (reducer), job 4. Reducer receives K2 = user ID; V2,… = partial recommendation vectors. Reducer sums to make the recommendation Vector and finds the top n values. Reducer outputs, for each top value, K3 = user ID, V3 = (item ID, value).
  • 24. Reality is more complex
  • 25. Ready to try. Obtain and build Mahout from Subversion (http://mahout.apache.org/versioncontrol.html). Set up and run Hadoop in local pseudo-distributed mode. Copy input into local HDFS. Run: hadoop jar mahout-0.5-SNAPSHOT.job org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=output
  • 26. Mahout in Action. Action-packed coverage of: recommenders, clustering, classification. Final-ish PDF available now; final print copy in May 2011. http://www.manning.com/owen/
  • 27. Questions? Gmail: srowen; user@mahout.apache.org; http://mahout.apache.org
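
The four jobs on slides 15 to 23 can be sketched in memory to see how the pieces fit together. This is a minimal illustration in plain Python, not Mahout's or Hadoop's code: the function names and the sample preferences are made up for the example, the diagonal of the co-occurrence matrix is left out (it only affects items the user has already rated), and already-rated items are dropped from the output, following the pseudocode on slide 7.

```python
from collections import defaultdict
from itertools import permutations


def build_user_vectors(lines):
    """Job 1: group 'user,item,preference' lines into one sparse vector per user."""
    vectors = defaultdict(dict)
    for line in lines:
        user, item, pref = line.strip().split(",")
        vectors[int(user)][int(item)] = float(pref)
    return dict(vectors)


def count_cooccurrence(user_vectors):
    """Job 2: for each item, tally how often each other item shares a user with it."""
    columns = defaultdict(lambda: defaultdict(int))
    for prefs in user_vectors.values():
        # Every ordered pair of distinct items in one user's vector co-occurs once.
        for item, other in permutations(prefs, 2):
            columns[item][other] += 1
    return {item: dict(col) for item, col in columns.items()}


def recommend(user_vectors, columns, top_n=2):
    """Jobs 3 and 4: inside-out multiply, then sum the partial vectors per user."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, prefs in user_vectors.items():
        # Only non-zero preferences are stored, so zero elements are skipped for free.
        for item, pref in prefs.items():
            for other, count in columns.get(item, {}).items():
                totals[user][other] += count * pref  # matrix column times preference
    result = {}
    for user, scores in totals.items():
        unseen = [(i, s) for i, s in scores.items() if i not in user_vectors[user]]
        unseen.sort(key=lambda pair: (-pair[1], pair[0]))  # best score first
        result[user] = unseen[:top_n]
    return result


# Hypothetical input in the user,item,preference format from slide 15.
sample = [
    "1,10,5.0", "1,11,3.0",
    "2,10,4.0", "2,11,2.0", "2,12,5.0",
    "3,11,4.0", "3,12,1.0",
]
user_vectors = build_user_vectors(sample)
columns = count_cooccurrence(user_vectors)
recs = recommend(user_vectors, columns)
print(recs)
```

The scores here are the raw matrix product from slides 10 to 13. Dividing each score by the sum of the co-occurrence counts that contributed to it would give the weighted average worked out on slide 9.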