Distributed Itembased Collaborative
Filtering with Apache Mahout

Sebastian Schelter
ssc@apache.org
twitter.com/sscdotopen

7. October 2010
Overview


1. What is Apache Mahout?
2. Introduction to Collaborative Filtering
3. Itembased Collaborative Filtering
4. Computing similar items with Map/Reduce
5. Implementations in Mahout
6. Further information




Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout   2
What is Apache Mahout?

A scalable Machine Learning library


       • scalable to reasonably large datasets (core algorithms
         implemented in Map/Reduce, runnable on Hadoop)
       • scalable to support your business case (Apache License)
       • scalable community


Use cases


       •   Clustering (group items that are topically related)
       •   Classification (learn to assign categories to documents)
       •   Frequent Itemset Mining (find items that appear together)
       •   Recommendation Mining (find items a user might like)




Recommendation Mining

= Help users find items they might like




Users, Items, Preferences


Terminology


       • users interact with items (books, videos, news, other users,...)
       • the preferences of each user towards a small subset of the
         items are known (numeric or boolean)


Algorithmic problems


       • Prediction: Estimate the preference of a user towards an item
         he/she does not know
       • Use Prediction for Top-N-recommendation: Find the N items a
         user might like best




Explicit and Implicit Ratings


Where do the preferences come from?


Explicit Ratings


       • users explicitly express their preferences (e.g. ratings with stars)
       • willingness of the users required


Implicit Ratings


       • interactions with items are interpreted as expressions of
         preference (e.g. purchasing a book, reading a news article)
       • interactions must be detectable




Collaborative Filtering


How does it work?


       • the past predicts the future: all predictions are derived from
         historical data (the preferences you already know)
       • completely content agnostic
       • very popular (e.g. used by Amazon, Google News)


Mathematically


       • user-item-matrix is created from the preference data
       • task is to predict missing entries by finding patterns
         in the known entries




A sample user-item-matrix


                   The Matrix    Alien    Inception

       Alice           5           1          4
       Bob             ?           2          5
       Peter           4           3          2
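The sample matrix can be held as a sparse mapping that stores only the known preferences. This is a minimal sketch; the names `ratings`, `item_vector` and `missing_entries` are illustrative and not part of Mahout.

```python
# Sparse user-item matrix: only known preferences are stored.
ratings = {
    "Alice": {"The Matrix": 5, "Alien": 1, "Inception": 4},
    "Bob": {"Alien": 2, "Inception": 5},  # "The Matrix" is unknown
    "Peter": {"The Matrix": 4, "Alien": 3, "Inception": 2},
}


def item_vector(ratings, item):
    """Return {user: rating} for one item, i.e. one column of the matrix."""
    return {user: prefs[item] for user, prefs in ratings.items() if item in prefs}


def missing_entries(ratings):
    """List the (user, item) cells that a recommender would have to predict."""
    items = sorted({item for prefs in ratings.values() for item in prefs})
    return [(u, i) for u in sorted(ratings) for i in items if i not in ratings[u]]


matrix_column = item_vector(ratings, "The Matrix")
to_predict = missing_entries(ratings)
```

Here `to_predict` contains the single cell marked "?" in the table.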




Itembased Collaborative Filtering


Algorithm


       • neighbourhood-based approach
       • works by finding similarly rated items in the user-item-matrix
       • estimates a user's preference towards an item by looking at
         his/her preferences towards similar items



Highly scalable


       • item similarities tend to be relatively static,
         can be precomputed offline periodically
       • fewer items than users in most scenarios
       • looking at a small number of similar items is sufficient
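Keeping only a few neighbours per item can be sketched as a top-k selection over precomputed pair similarities. The function name `top_k_similar` is hypothetical, and the similarity values are the sample figures from the worked example in this deck.

```python
import heapq
from collections import defaultdict


def top_k_similar(pair_similarities, k):
    """For every item, keep only its k most similar other items."""
    neighbours = defaultdict(list)
    for (a, b), s in pair_similarities.items():
        # A similarity is symmetric, so register it for both items.
        neighbours[a].append((s, b))
        neighbours[b].append((s, a))
    return {
        item: [(other, s) for s, other in heapq.nlargest(k, candidates)]
        for item, candidates in neighbours.items()
    }


# Sample values only.
sims = {
    ("Matrix", "Alien"): -0.47,
    ("Matrix", "Inception"): 0.47,
    ("Alien", "Inception"): -0.63,
}
top1 = top_k_similar(sims, 1)
```

Because the similarities change slowly, a table like `top1` could be recomputed offline and looked up at recommendation time.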

Example


Similarity of „The Matrix“ and „Inception“


       • rating vector of „The Matrix“: (5,-,4)
       • rating vector of „Inception“: (4,5,2)
       • isolate all cooccurred ratings (all cases
         where a user rated both items): (5,4) and (4,2)
       • pick a similarity measure to compute a
         similarity value between -1 and 1,
         e.g. Pearson-Correlation

\[
\mathrm{corr}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}} = 0.47
\]
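The Pearson correlation over co-occurring ratings could be computed as in the sketch below. The slide does not state over which ratings the item means are taken. This sketch assumes the co-occurring ratings only, so its outputs need not match the figures quoted on the slides. The function name `pearson` is illustrative.

```python
from math import sqrt


def pearson(pairs):
    """Pearson correlation over co-occurring ratings [(r_ui, r_uj), ...].

    Assumption: item means are taken over the co-occurring ratings only.
    Returns None when the value is undefined (fewer than two pairs,
    or one of the items has zero variance).
    """
    n = len(pairs)
    if n < 2:
        return None
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    numerator = sum((a - mean_i) * (b - mean_j) for a, b in pairs)
    spread_i = sqrt(sum((a - mean_i) ** 2 for a, _ in pairs))
    spread_j = sqrt(sum((b - mean_j) ** 2 for _, b in pairs))
    denominator = spread_i * spread_j
    return numerator / denominator if denominator else None


# Ratings of users who rated both "Alien" and "Inception".
alien_inception = pearson([(1, 4), (2, 5), (3, 2)])
```

Returning `None` for undefined cases is a design choice of this sketch; a caller could just as well skip such pairs.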
                                                          


Example


Prediction: Estimate Bob's preference towards „The Matrix“


       • look at all items that
            a) are similar to „The Matrix“
            b) have been rated by Bob
         => „Alien“, „Inception“

       • estimate the unknown preference with a weighted sum


\[
P_{Bob,Matrix} = \frac{s_{Matrix,Alien} \cdot r_{Bob,Alien} + s_{Matrix,Inception} \cdot r_{Bob,Inception}}{|s_{Matrix,Alien}| + |s_{Matrix,Inception}|} = 1.5
\]
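The weighted sum can be checked with a few lines of code. This is a sketch using the similarity values from the example (-0.47 for „Alien“, 0.47 for „Inception“); the function name `predict` is illustrative.

```python
def predict(similarities, user_ratings):
    """Estimate a preference as a similarity-weighted sum of the user's ratings.

    similarities: {item: similarity to the target item}
    user_ratings: {item: rating given by the user}
    """
    common = [item for item in similarities if item in user_ratings]
    norm = sum(abs(similarities[item]) for item in common)
    if norm == 0:
        return None  # no usable neighbours
    weighted = sum(similarities[item] * user_ratings[item] for item in common)
    return weighted / norm


sim_to_matrix = {"Alien": -0.47, "Inception": 0.47}
bob = {"Alien": 2, "Inception": 5}
estimate = predict(sim_to_matrix, bob)  # (-0.47*2 + 0.47*5) / (0.47 + 0.47)
```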



Algorithm in Map/Reduce

How can we compute the similarities
efficiently with Map/Reduce?


Key ideas


       • we can ignore pairs of items without
         a cooccurring rating
       • we need to see all cooccurring ratings
         for each pair of items in the end

Inspired by an algorithm designed to compute
the pairwise similarity of text documents


Mahout's implementation is more generalized to be usable with other
similarity measures, see DistributedVectorSimilarity and
RowSimilarityJob for more details

Algorithm in Map/Reduce - Pass 1

Map - make user the key
(Alice,Matrix,5)                                           Alice (Matrix,5)
(Alice,Alien,1)                                            Alice (Alien,1)
(Alice,Inception,4)                                        Alice (Inception,4)
(Bob,Alien,2)                                              Bob (Alien,2)
(Bob,Inception,5)                                          Bob (Inception,5)
(Peter,Matrix,4)                                           Peter (Matrix,4)
(Peter,Alien,3)                                            Peter (Alien,3)
(Peter,Inception,2)                                        Peter (Inception,2)

Reduce - create inverted index
Alice (Matrix,5)                                            Alice (Matrix,5)(Alien,1)(Inception,4)
Alice (Alien,1)
Alice (Inception,4)
Bob (Alien,2)                                               Bob (Alien,2)(Inception,5)
Bob (Inception,5)
Peter (Matrix,4)                                            Peter (Matrix,4)(Alien,3)(Inception,2)
Peter (Alien,3)
Peter (Inception,2)
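Pass 1 can be simulated in memory as follows. This is a sketch of the data flow only, not Mahout's Hadoop code; the names `map_pass1`, `reduce_pass1` and `run_pass1` are hypothetical, and a dictionary stands in for the shuffle phase.

```python
from collections import defaultdict

preferences = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]


def map_pass1(record):
    """Map: make the user the key."""
    user, item, rating = record
    yield user, (item, rating)


def reduce_pass1(user, values):
    """Reduce: collect all (item, rating) pairs of one user."""
    return user, list(values)


def run_pass1(records):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in map_pass1(record):
            groups[key].append(value)
    return dict(reduce_pass1(user, values) for user, values in groups.items())


user_vectors = run_pass1(preferences)
```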


Algorithm in Map/Reduce - Pass 2

Map - emit all cooccurred ratings

Alice (Matrix,5)(Alien,1)(Inception,4)                            Matrix,Alien (5,1)
                                                                  Matrix,Inception (5,4)
                                                                  Alien,Inception (1,4)
Bob (Alien,2)(Inception,5)                                        Alien,Inception (2,5)
Peter (Matrix,4)(Alien,3)(Inception,2)                            Matrix,Alien (4,3)
                                                                  Matrix,Inception (4,2)
                                                                  Alien,Inception (3,2)



Reduce - compute similarities
Matrix,Alien (5,1)                                                Matrix,Alien (-0.47)
Matrix,Alien (4,3)

Matrix,Inception (5,4)                                            Matrix,Inception (0.47)
Matrix,Inception (4,2)

Alien,Inception (1,4)                                             Alien,Inception (-0.63)
Alien,Inception (2,5)
Alien,Inception (3,2)
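Pass 2 can be sketched the same way. The map step emits one record per pair of items rated by the same user, and the reduce step sees all cooccurring ratings of a pair. As an assumption of this sketch, items are sorted by name before pairing so that each pair always gets the same key, which means the keys read `Alien,Matrix` rather than `Matrix,Alien`. The function names are hypothetical, and the similarity measure is passed in as a parameter.

```python
from collections import defaultdict
from itertools import combinations

# Output of pass 1: one entry per user.
user_vectors = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}


def map_pass2(item_ratings):
    """Map: emit the cooccurring ratings for every pair of items of one user."""
    for (item_a, r_a), (item_b, r_b) in combinations(sorted(item_ratings), 2):
        yield (item_a, item_b), (r_a, r_b)


def collect_cooccurrences(user_vectors):
    """Group the map output by item pair (stands in for the shuffle phase)."""
    grouped = defaultdict(list)
    for item_ratings in user_vectors.values():
        for pair, ratings in map_pass2(item_ratings):
            grouped[pair].append(ratings)
    return dict(grouped)


def reduce_pass2(grouped, similarity):
    """Reduce: apply a similarity measure to the ratings of each pair."""
    return {pair: similarity(ratings) for pair, ratings in grouped.items()}


cooccurrences = collect_cooccurrences(user_vectors)
# The number of cooccurrences serves as a stand-in similarity measure here.
counts = reduce_pass2(cooccurrences, len)
```

Pairs of items that no user rated together never appear as a key, which is the first of the key ideas listed earlier.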


Implementations in Mahout

ItemSimilarityJob


     • computes all item similarities
     • various configuration options:
        • similarity measure to use (e.g. cosine, Pearson-Correlation,
          Tanimoto-Coefficient, your own implementation)
        • maximum number of similar items per item
        • maximum number of cooccurrences considered
        • ...

     • Input: preference data as CSV file, each line represents a single
       preference in the form userID,itemID,value
     • Output: pairs of itemIDs with their associated similarity value
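Preference data in the userID,itemID,value form could be produced as in the sketch below. Mapping names to integer IDs is an assumption of this sketch, as is the helper name `to_preference_csv`; how the resulting file is passed to the job is not shown.

```python
import csv
import io


def to_preference_csv(preferences):
    """Write (user, item, value) triples as userID,itemID,value lines.

    Assumption: users and items are given integer IDs in order of appearance.
    Returns the CSV text and both ID mappings.
    """
    user_ids, item_ids = {}, {}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for user, item, value in preferences:
        uid = user_ids.setdefault(user, len(user_ids) + 1)
        iid = item_ids.setdefault(item, len(item_ids) + 1)
        writer.writerow([uid, iid, value])
    return out.getvalue(), user_ids, item_ids


csv_text, user_ids, item_ids = to_preference_csv([
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
])
```

The two mappings are returned so that itemIDs in the output can be translated back to names.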




Implementations in Mahout

RecommenderJob


     • Distributed Itembased Recommender
     • various configuration options:
        • similarity measure to use
        • number of recommendations per user
        • filter out some users or items
        • ...

     • Input: the preference data as CSV file, each line contains a
       preference in the form userID,itemID,value
     • Output: userIDs with associated recommended itemIDs and their
       scores




Further information


Mahout's website, wiki and mailing list
   • http://mahout.apache.org
   • user@mahout.apache.org


Mahout in Action, available through
Manning's Early Access Program
   • http://manning.com/owen


B. Sarwar et al: „Itembased collaborative filtering
recommendation algorithms“, 2001


T. Elsayed et al: „Pairwise document similarity in
large collections with MapReduce“, 2008



"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

mahout-cf

  • 1. Distributed Itembased Collaborative Filtering with Apache Mahout Sebastian Schelter ssc@apache.org twitter.com/sscdotopen 7. October 2010
  • 2. Overview
      1. What is Apache Mahout?
      2. Introduction to Collaborative Filtering
      3. Itembased Collaborative Filtering
      4. Computing similar items with Map/Reduce
      5. Implementations in Mahout
      6. Further information
  • 3. What is Apache Mahout?
    A scalable Machine Learning library
      • scalable to reasonably large datasets (core algorithms implemented in Map/Reduce, runnable on Hadoop)
      • scalable to support your business case (Apache License)
      • scalable community
    Use cases
      • Clustering (group items that are topically related)
      • Classification (learn to assign categories to documents)
      • Frequent Itemset Mining (find items that appear together)
      • Recommendation Mining (find items a user might like)
  • 4. Recommendation Mining = Help users find items they might like
  • 5. Users, Items, Preferences
    Terminology
      • users interact with items (books, videos, news, other users, ...)
      • the preferences of each user are known for only a small subset of the items (numeric or boolean)
    Algorithmic problems
      • Prediction: estimate the preference of a user towards an item he/she does not know
      • Use prediction for Top-N recommendation: find the N items a user might like best
  • 6. Explicit and Implicit Ratings
    Where do the preferences come from?
    Explicit Ratings
      • users explicitly express their preferences (e.g. ratings with stars)
      • requires the willingness of the users
    Implicit Ratings
      • interactions with items are interpreted as expressions of preference (e.g. purchasing a book, reading a news article)
      • interactions must be detectable
  • 7. Collaborative Filtering
    How does it work?
      • the past predicts the future: all predictions are derived from historical data (the preferences you already know)
      • completely content agnostic
      • very popular (e.g. used by Amazon, Google News)
    Mathematically
      • a user-item-matrix is created from the preference data
      • the task is to predict missing entries by finding patterns in the known entries
  • 8. A sample user-item-matrix

                The Matrix   Alien   Inception
      Alice         5          1         4
      Bob           ?          2         5
      Peter         4          3         2
  • 9. Itembased Collaborative Filtering
    Algorithm
      • neighbourhood-based approach
      • works by finding similarly rated items in the user-item-matrix
      • estimates a user's preference towards an item by looking at his/her preferences towards similar items
    Highly scalable
      • item similarities tend to be relatively static and can be precomputed offline periodically
      • fewer items than users in most scenarios
      • looking at a small number of similar items is sufficient
  • 10. Example
    Similarity of "The Matrix" and "Inception"
      • rating vector of "The Matrix": (5,-,4)
      • rating vector of "Inception": (4,5,2)
      • isolate all cooccurred ratings (all cases where a user rated both items)
      • pick a similarity measure to compute a similarity value between -1 and 1, e.g. Pearson-Correlation

    $$\mathrm{corr}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}} = 0.47$$
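The "isolate all cooccurred ratings" step from slide 10 can be sketched in a few lines of Python. This is a minimal illustration on the sample matrix from slide 8, not Mahout code; the `ratings` dictionary layout and the helper name `cooccurred_ratings` are assumptions made for this example.

```python
# Sample user-item matrix from slide 8; Bob has not rated "The Matrix".
ratings = {
    "Alice": {"The Matrix": 5, "Alien": 1, "Inception": 4},
    "Bob": {"Alien": 2, "Inception": 5},
    "Peter": {"The Matrix": 4, "Alien": 3, "Inception": 2},
}


def cooccurred_ratings(ratings, item_a, item_b):
    """Return (rating_a, rating_b) for every user who rated both items."""
    return [
        (prefs[item_a], prefs[item_b])
        for prefs in ratings.values()
        if item_a in prefs and item_b in prefs
    ]


pairs = cooccurred_ratings(ratings, "The Matrix", "Inception")
print(pairs)  # [(5, 4), (4, 2)]
```

Bob's rating of "Inception" is dropped because he has no rating for "The Matrix"; the chosen similarity measure is then applied to the remaining pairs only.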
  • 11. Example
    Prediction: Estimate Bob's preference towards "The Matrix"
      • look at all items that a) are similar to "The Matrix" and b) have been rated by Bob => "Alien", "Inception"
      • estimate the unknown preference with a weighted sum

    $$P_{\mathrm{Bob},\mathrm{Matrix}} = \frac{s_{\mathrm{Matrix},\mathrm{Alien}} \cdot r_{\mathrm{Bob},\mathrm{Alien}} + s_{\mathrm{Matrix},\mathrm{Inception}} \cdot r_{\mathrm{Bob},\mathrm{Inception}}}{|s_{\mathrm{Matrix},\mathrm{Alien}}| + |s_{\mathrm{Matrix},\mathrm{Inception}}|} = 1.5$$
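The weighted sum from slide 11 can be checked numerically. The sketch below assumes the similarity values listed on slide 14 (-0.47 for Matrix/Alien, 0.47 for Matrix/Inception) and Bob's ratings from slide 8; the function name `predict` and its dictionary arguments are illustrative and not part of Mahout's API.

```python
def predict(similarities, user_ratings):
    """Weighted sum of the user's ratings, weighted by item similarity.

    similarities: {item: similarity to the target item}
    user_ratings: {item: the user's rating}
    Only items present in both mappings contribute.
    """
    common = [item for item in similarities if item in user_ratings]
    numerator = sum(similarities[i] * user_ratings[i] for i in common)
    denominator = sum(abs(similarities[i]) for i in common)
    if denominator == 0:
        return None  # no similar rated items, so no estimate is possible
    return numerator / denominator


# Similarities to "The Matrix" as listed on slide 14.
sim_to_matrix = {"Alien": -0.47, "Inception": 0.47}
bob = {"Alien": 2, "Inception": 5}

print(round(predict(sim_to_matrix, bob), 2))  # 1.5
```

Taking absolute values in the denominator keeps the estimate on the rating scale's order of magnitude even when some similarities are negative, as is the case for "Alien" here.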
  • 12. Algorithm in Map/Reduce
    How can we compute the similarities efficiently with Map/Reduce?
    Key ideas
      • we can ignore pairs of items without a cooccurring rating
      • we need to see all cooccurring ratings for each pair of items in the end
    Inspired by an algorithm designed to compute the pairwise similarity of text documents.
    Mahout's implementation is more generalized to be usable with other similarity measures; see DistributedVectorSimilarity and RowSimilarityJob for more details.
  • 13. Algorithm in Map/Reduce - Pass 1
    Map - make user the key
      (Alice,Matrix,5)      →  Alice  (Matrix,5)
      (Alice,Alien,1)       →  Alice  (Alien,1)
      (Alice,Inception,4)   →  Alice  (Inception,4)
      (Bob,Alien,2)         →  Bob    (Alien,2)
      (Bob,Inception,5)     →  Bob    (Inception,5)
      (Peter,Matrix,4)      →  Peter  (Matrix,4)
      (Peter,Alien,3)       →  Peter  (Alien,3)
      (Peter,Inception,2)   →  Peter  (Inception,2)
    Reduce - create inverted index
      Alice  (Matrix,5)
      Alice  (Alien,1)       →  Alice  (Matrix,5)(Alien,1)(Inception,4)
      Alice  (Inception,4)
      Bob    (Alien,2)
      Bob    (Inception,5)   →  Bob    (Alien,2)(Inception,5)
      Peter  (Matrix,4)
      Peter  (Alien,3)       →  Peter  (Matrix,4)(Alien,3)(Inception,2)
      Peter  (Inception,2)
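Pass 1 can be simulated in plain Python to make the data flow concrete. This is a single-process sketch, not Hadoop or Mahout code: the names `map_pass1`, `reduce_pass1` and `run_pass1` are hypothetical, and the grouping loop merely stands in for the shuffle phase that the framework would perform.

```python
from collections import defaultdict

# (user, item, rating) triples from the sample user-item-matrix.
preferences = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]


def map_pass1(record):
    """Make the user the key: (user, item, rating) -> (user, (item, rating))."""
    user, item, rating = record
    yield user, (item, rating)


def reduce_pass1(user, item_ratings):
    """Collect all ratings of one user into a single record."""
    return user, list(item_ratings)


def run_pass1(records):
    # Stand-in for the shuffle phase: group mapper output by key.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_pass1(record):
            grouped[key].append(value)
    return dict(reduce_pass1(user, values) for user, values in grouped.items())


user_vectors = run_pass1(preferences)
print(user_vectors["Bob"])  # [('Alien', 2), ('Inception', 5)]
```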
  • 14. Algorithm in Map/Reduce - Pass 2
    Map - emit all cooccurred ratings
      Alice  (Matrix,5)(Alien,1)(Inception,4)   →  Matrix,Alien      (5,1)
                                                   Matrix,Inception  (5,4)
                                                   Alien,Inception   (1,4)
      Bob    (Alien,2)(Inception,5)             →  Alien,Inception   (2,5)
      Peter  (Matrix,4)(Alien,3)(Inception,2)   →  Matrix,Alien      (4,3)
                                                   Matrix,Inception  (4,2)
                                                   Alien,Inception   (3,2)
    Reduce - compute similarities
      Matrix,Alien      (5,1)
      Matrix,Alien      (4,3)   →  Matrix,Alien      (-0.47)
      Matrix,Inception  (5,4)
      Matrix,Inception  (4,2)   →  Matrix,Inception  (0.47)
      Alien,Inception   (1,4)
      Alien,Inception   (2,5)   →  Alien,Inception   (-0.63)
      Alien,Inception   (3,2)
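Pass 2 can be sketched the same way. The code below is a single-process illustration with hypothetical names (`map_pass2`, `group_cooccurrences`, `reduce_pass2`); it assumes every user's ratings are listed in the same item order so that each item pair always produces the same key. The similarity measure is passed in as a function, and the cosine shown here is only one possible choice: it is not Mahout's implementation and is not expected to reproduce the values on the slide.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Output of pass 1: one record per user with all of that user's ratings.
user_vectors = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}


def map_pass2(item_ratings):
    """Emit ((item_a, item_b), (rating_a, rating_b)) for each pair one user rated."""
    for (item_a, rating_a), (item_b, rating_b) in combinations(item_ratings, 2):
        yield (item_a, item_b), (rating_a, rating_b)


def group_cooccurrences(user_vectors):
    # Stand-in for the shuffle phase: group cooccurred ratings by item pair.
    grouped = defaultdict(list)
    for item_ratings in user_vectors.values():
        for key, value in map_pass2(item_ratings):
            grouped[key].append(value)
    return dict(grouped)


def cosine(rating_pairs):
    """Cosine of the two rating vectors, restricted to cooccurred ratings."""
    dot = sum(a * b for a, b in rating_pairs)
    norm_a = sqrt(sum(a * a for a, _ in rating_pairs))
    norm_b = sqrt(sum(b * b for _, b in rating_pairs))
    return dot / (norm_a * norm_b)


def reduce_pass2(grouped, similarity):
    """Apply the similarity measure to the cooccurred ratings of each item pair."""
    return {pair: similarity(pairs) for pair, pairs in grouped.items()}


grouped = group_cooccurrences(user_vectors)
print(grouped[("Matrix", "Alien")])  # [(5, 1), (4, 3)]
similarities = reduce_pass2(grouped, cosine)
```

Item pairs that no user rated together never appear as a key, which is the first key idea from slide 12: pairs without a cooccurring rating are ignored without any explicit check.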
  • 15. Implementations in Mahout
    ItemSimilarityJob
      • computes all item similarities
      • various configuration options:
          • similarity measure to use (e.g. cosine, Pearson-Correlation, Tanimoto-Coefficient, your own implementation)
          • maximum number of similar items per item
          • maximum number of cooccurrences considered
          • ...
      • Input: preference data as CSV file, each line represents a single preference in the form userID,itemID,value
      • Output: pairs of itemIDs with their associated similarity value
  • 16. Implementations in Mahout
    RecommenderJob
      • Distributed Itembased Recommender
      • various configuration options:
          • similarity measure to use
          • number of recommendations per user
          • filter out some users or items
          • ...
      • Input: the preference data as CSV file, each line contains a preference in the form userID,itemID,value
      • Output: userIDs with associated recommended itemIDs and their scores
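Both jobs read preference data as lines of the form userID,itemID,value. The sketch below writes the sample matrix in that shape. The numeric IDs (1 to 3 for users, 101 to 103 for items) are made up for this example, and mapping names to numeric IDs is itself an assumption; consult the Mahout documentation for the ID types the jobs accept.

```python
import csv
import io

# Hypothetical ID assignments for the sample users and items.
user_ids = {"Alice": 1, "Bob": 2, "Peter": 3}
item_ids = {"The Matrix": 101, "Alien": 102, "Inception": 103}

preferences = [
    ("Alice", "The Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "The Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

# Write one preference per line as userID,itemID,value.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for user, item, value in preferences:
    writer.writerow([user_ids[user], item_ids[item], value])

csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])  # 1,101,5
```

Bob's missing rating for "The Matrix" simply has no line: only known preferences are written.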
  • 17. Further information
    Mahout's website, wiki and mailing list
      • http://mahout.apache.org
      • user@mahout.apache.org
    Mahout in Action, available through Manning's Early Access Program
      • http://manning.com/owen
    B. Sarwar et al.: "Itembased collaborative filtering recommendation algorithms", 2001
    T. Elsayed et al.: "Pairwise document similarity in large collections with MapReduce", 2008