Simple Matrix Factorization for Recommendation
Sean Owen • Apache Mahout
Apache Mahout
•   Scalable machine learning
•   (Mostly) Hadoop-based
•   Clustering, classification and recommender engines
•   Nearest-neighbor
     •   User-based
     •   Item-based
     •   Slope-one
     •   Clustering-based
•   Latent factor
     •   SVD-based
     •   ALS
     •   More!
mahout.apache.org
Matrix = Associations

         Rose   Navy   Olive
Alice     0      +4     0
Bob       0      0      +2
Carol    -1      0      -2
Dave     +3      0      0

•   Things are associated
     •   Like people to colors
•   Associations have strengths
     •   Like preferences and dislikes
•   Can quantify associations
     •   Alice loves navy = +4, Carol dislikes olive = -2
•   We don’t know all associations
     •   Many implicit zeroes
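
As a minimal sketch of the association matrix above, the known strengths can be stored on their own and every other cell left as an implicit zero. The variable names (`users`, `colors`, `known`, `A`) are illustrative assumptions, not part of the slides.

```python
import numpy as np

users = ["Alice", "Bob", "Carol", "Dave"]
colors = ["Rose", "Navy", "Olive"]

# Only the known associations are stored; every other cell is an implicit zero.
known = {
    ("Alice", "Navy"): 4,
    ("Bob", "Olive"): 2,
    ("Carol", "Rose"): -1,
    ("Carol", "Olive"): -2,
    ("Dave", "Rose"): 3,
}

# Dense user-by-color matrix, filled from the sparse dictionary.
A = np.zeros((len(users), len(colors)))
for (user, color), strength in known.items():
    A[users.index(user), colors.index(color)] = strength

print(A)
```

For large data a sparse format would be the more natural choice; the dense array here only mirrors the small table.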
From One Matrix, Two
•   Like numbers, matrices can be factored
•   m•n matrix = m•k times k•n
•   Associations can decompose into others
•   Alice likes navy = Alice loves blues, and blues includes navy

P (m•n) = X (m•k) • Y’ (k•n)
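
A small sketch of the shapes involved, assuming hypothetical sizes m = 4, n = 3, k = 2 and random factors: multiplying an m•k matrix by a k•n matrix gives an m•n matrix, and each entry is the dot product of one row of X with one row of Y.

```python
import numpy as np

m, n, k = 4, 3, 2                      # hypothetical sizes, chosen for illustration
rng = np.random.default_rng(0)

X = rng.normal(size=(m, k))            # one k-feature row per user
Y = rng.normal(size=(n, k))            # one k-feature row per item
P = X @ Y.T                            # (m x k) times (k x n) gives (m x n)

print(P.shape)
# A single association is a user's features dotted with an item's features.
print(np.isclose(P[0, 1], X[0] @ Y[1]))
```

Because P is built from k features, its rank can be at most k, which is where the "few features" idea comes from.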
In Terms of Few Features
•   Can explain associations by appealing to underlying intermediate features (e.g. “blue-ness”)
•   Relatively few (one “blue-ness”, but many shades)

(Alice) → (Blue) → (Navy)
Losing Information is Helpful
•   When k (= features) is small, information is lost
•   Factorization is approximate (Alice appears to like blue-ish periwinkle too)

(Alice) → (Blue) → (Periwinkle), (Navy)
How to Compute?

P (m•n) = X (m•k) • Y’ (k•n)
Skip the Singular Value Decomposition for now …

A (m•n) = S (m•k) • Σ (k•k) • T’ (k•n)
Alternating Least Squares
•   Collaborative Filtering for Implicit Feedback Datasets
     www2.research.att.com/~yifanhu/PUB/cf.pdf
•   R = matrix of user-item interaction “strengths”
•   P = R reduced to 0 and 1
•   Factor as approximate P ≈ X•Y’
     •   Start with random Y
     •   Compute X such that X•Y’ best approximates P in the Frobenius / L2 norm (Least Squares)
     •   Repeat for Y (Alternating)
     •   Iterate, Iterate, Iterate
•   Large values in X•Y’ are good recommendations
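
The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not Mahout's implementation. It assumes nonnegative strengths in R, confidence weights C = 1 + α•R and an L2 penalty λ as in the cited paper, and the function names (`solve_side`, `objective`, `als_implicit`) are hypothetical.

```python
import numpy as np

def solve_side(fixed, P_side, C_side, lam):
    """Solve one regularized, weighted least-squares problem per row of P_side."""
    k = fixed.shape[1]
    out = np.empty((P_side.shape[0], k))
    for i in range(P_side.shape[0]):
        c = C_side[i]                                    # confidence weights for this row
        A = fixed.T @ (c[:, None] * fixed) + lam * np.eye(k)
        b = fixed.T @ (c * P_side[i])
        out[i] = np.linalg.solve(A, b)
    return out

def objective(P, C, X, Y, lam):
    """Weighted squared error plus L2 penalty on both factors."""
    err = P - X @ Y.T
    return float(np.sum(C * err ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2)))

def als_implicit(R, k=3, lam=2.0, alpha=40.0, iterations=10, seed=0):
    R = np.asarray(R, dtype=float)
    P = (R > 0).astype(float)           # P = R reduced to 0 and 1
    C = 1.0 + alpha * R                 # assumed confidence weighting (from the paper)
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(R.shape[1], k))      # start with random Y
    X = np.zeros((R.shape[0], k))
    for _ in range(iterations):
        X = solve_side(Y, P, C, lam)        # fix Y, compute X
        Y = solve_side(X, P.T, C.T, lam)    # fix X, repeat for Y
    return X, Y
```

Calling `X, Y = als_implicit(R)` and then `X @ Y.T` gives the approximation of P. Each half-step minimizes the objective exactly over one factor, so the objective never increases from one iteration to the next.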
Example

R (. = no interaction)          P

1   4   3   .   .               1   1   1   0   0
.   .   3   .   .               0   0   1   0   0
.   4   .   3   2               0   1   0   1   1
5   .   2   .   3               1   0   1   0   1
.   .   .   5   .               0   0   0   1   0
2   4   .   .   .               1   1   0   0   0
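
A short sketch of how P follows from R in this example, assuming the empty cells of R are stored as zeros:

```python
import numpy as np

# R from the example; empty cells are stored as 0 (an assumption of this sketch).
R = np.array([
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
])

# P keeps only whether an interaction exists, not how strong it is.
P = (R > 0).astype(int)
print(P)
```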
k = 3, λ=2, α=40
1 iteration

P ≈ X • Y’

P
1   1   1   0   0
0   0   1   0   0
0   1   0   1   1
1   0   1   0   1
0   0   0   1   0
1   1   0   0   0

X
2.18   -0.01    0.35
1.83   -0.11   -0.68
0.79    1.15   -1.80
0.97   -1.90   -2.12
1.01   -0.25   -1.77
2.33   -0.08    1.06

Y’
 0.43    0.48    0.48    0.16    0.10
-0.27    0.39   -0.13    0.03    0.05
-0.03   -0.09   -0.13   -0.47   -0.47
k = 3, λ=2, α=40
1 iteration

P                   ≈   X•Y’

1   1   1   0   0       0.94    1.00   1.00    0.18    0.07
0   0   1   0   0       0.84    0.89   0.99    0.60    0.50
0   1   0   1   1       0.07    0.99   0.46    1.01    0.98
1   0   1   0   1       1.00   -0.09   1.00    1.08    0.99
0   0   0   1   0       0.55    0.54   0.75    0.98    0.92
1   1   0   0   0       1.01    0.99   0.98   -0.13   -0.25
k = 3, λ=2, α=40
10 iterations

P                   ≈   X•Y’

1   1   1   0   0       0.96   0.99    0.99    0.38   0.93
0   0   1   0   0       0.44   0.39    0.98   -0.11   0.39
0   1   0   1   1       0.70   0.99    0.42    0.98   0.98
1   0   1   0   1       1.00   1.04    0.99    0.44   0.98
0   0   0   1   0       0.11   0.51   -0.13    1.00   0.57
1   1   0   0   0       0.97   1.00    0.68    0.47   0.91
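
To make "large values in X•Y’ are good recommendations" concrete, the sketch below takes the 10-iteration X•Y’ values above and picks, for each user, the highest-scoring item the user has not already interacted with. Skipping items already in P is an assumption of this sketch, not something the slides state.

```python
import numpy as np

P = np.array([
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
])

# X•Y’ after 10 iterations, copied from the values above.
scores = np.array([
    [0.96, 0.99, 0.99, 0.38, 0.93],
    [0.44, 0.39, 0.98, -0.11, 0.39],
    [0.70, 0.99, 0.42, 0.98, 0.98],
    [1.00, 1.04, 0.99, 0.44, 0.98],
    [0.11, 0.51, -0.13, 1.00, 0.57],
    [0.97, 1.00, 0.68, 0.47, 0.91],
])

# Mask out items the user already has, then take the best remaining item.
unseen_scores = np.where(P == 1, -np.inf, scores)
top_item = unseen_scores.argmax(axis=1)
print(top_item.tolist())  # zero-based item index per user: [4, 0, 0, 1, 4, 4]
```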
Interesting Because…

This is all very parallelizable by row, column
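
A minimal sketch of the row-wise independence, assuming plain regularized least squares (no confidence weights) and a thread pool in place of Hadoop. Each row of X depends only on the matching row of P and on the fixed Y, so rows can be solved in any order or in parallel. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def solve_row(p_row, Y, lam):
    """Least-squares features for one row of P, with Y held fixed."""
    k = Y.shape[1]
    return np.linalg.solve(Y.T @ Y + lam * np.eye(k), Y.T @ p_row)

def solve_rows_parallel(P, Y, lam=2.0, workers=4):
    """Solve every row of X independently on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda p: solve_row(p, Y, lam), P))
    return np.vstack(rows)

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))                          # hypothetical item features
P = rng.integers(0, 2, size=(6, 5)).astype(float)    # hypothetical 0/1 matrix
X = solve_rows_parallel(P, Y)
print(X.shape)
```

The same argument applies by column when X is fixed and Y is being solved.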
BONUS: Folding in New Data
•   Model building takes time
•   Sometimes need immediate, if approximate, updates for new data
•   For new user U, need new row, X_U•Y’ = Q_U, but have P_U
•   What is X_U?

•   Apply some right inverse to X•Y’ = Q:
     X•Y’•(Y’)⁻¹ = Q•(Y’)⁻¹, so X = Q•(Y’)⁻¹
•   OK, what is (Y’)⁻¹?
•   Of course (Y’•Y)•(Y’•Y)⁻¹ = I
•   So Y’•(Y•(Y’•Y)⁻¹) = I and right inverse is Y•(Y’•Y)⁻¹
•   X_U = Q_U•Y•(Y’•Y)⁻¹ and so X_U ≈ P_U•Y•(Y’•Y)⁻¹
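
The fold-in formula can be sketched as below, assuming Y has one row per item and k linearly independent columns so that Y’•Y is invertible. The function name `fold_in` is hypothetical. The sketch solves a small k•k linear system rather than forming the inverse explicitly.

```python
import numpy as np

def fold_in(p_u, Y):
    """Approximate a new user's feature row: x_u = p_u • Y • (Y'•Y)^-1."""
    # (Y'•Y) is symmetric, so the row form above equals solving (Y'•Y) x = Y' p_u.
    return np.linalg.solve(Y.T @ Y, Y.T @ p_u)

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))              # hypothetical item features (n = 5, k = 3)
p_new = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # hypothetical new user's 0/1 row

x_new = fold_in(p_new, Y)
print(x_new @ Y.T)                       # approximate scores for the new user
```

No model rebuild is needed: only the existing Y and the new user's row are used.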
In Mahout
•   org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
     •   Alternating least squares
     •   Distributed, Hadoop-based
•   org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender
     •   SVD-based
     •   Non-distributed, not Hadoop
•   MAHOUT-737
     •   Alternate implementation of alternating least squares
•   And more…
     •   DistributedLanczosSolver
     •   SequentialOutOfCoreSvd
     •   …
Myrrix
•   Complete product
     •   Real-time Serving Layer
     •   Hadoop-based Computation Layer
     •   Tuned, documented
•   Free / open: Serving Layer, for small data
•   Commercial: add Computation Layer for big data; Hosting
•   Matrix factorization-based, attractive properties
•   http://myrrix.com
Thank You
srowen at myrrix.com
mahout.apache.org

Time Series Foundation Models - current state and future directions
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Simple Matrix Factorization for Recommendation in Apache Mahout

  • 1. Simple Matrix Factorization for Recommendation (title graphic: 3×3 matrix 6 42 8 / 78 14 98 / 1 7 8). Sean Owen • Apache Mahout
  • 2. Apache Mahout (mahout.apache.org)
      • Scalable machine learning
      • (Mostly) Hadoop-based
      • Clustering, classification and recommender engines
      • Nearest-neighbor: user-based, item-based, slope-one, clustering-based
      • Latent factor: SVD-based, ALS, more!
  • 3. Matrix = Associations
      • Things are associated, like people to colors
      • Associations have strengths, like preferences and dislikes
      • Can quantify associations: Alice loves navy = +4, Carol dislikes olive = -2
      • We don’t know all associations: many implicit zeroes

               Rose   Navy   Olive
      Alice     0      +4     0
      Bob       0      0      +2
      Carol     -1     0      -2
      Dave      +3     0      0
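The "many implicit zeroes" point can be sketched by storing only the known associations and filling in zeroes on demand. This is a minimal illustration using the slide's own numbers; the variable names are made up for the example.

```python
# Only the known associations from the slide are stored.
# Any (person, color) pair that is missing is an implicit zero.
associations = {
    ("Alice", "Navy"): 4,
    ("Bob", "Olive"): 2,
    ("Carol", "Rose"): -1,
    ("Carol", "Olive"): -2,
    ("Dave", "Rose"): 3,
}

people = ["Alice", "Bob", "Carol", "Dave"]
colors = ["Rose", "Navy", "Olive"]

# Expand the sparse form into the dense matrix shown on the slide.
matrix = [[associations.get((p, c), 0) for c in colors] for p in people]

for person, row in zip(people, matrix):
    print(person, row)
```

Five stored values are enough to describe the 4-by-3 matrix; real user-item data is usually far sparser than this.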
  • 4. From One Matrix, Two
      • Like numbers, matrices can be factored
      • m•n matrix = m•k times k•n
      • Associations can decompose into others
      • Alice likes navy = Alice loves blues, and blues includes navy
      • Diagram: P (m•n) = X (m•k) • Y’ (k•n)
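A small sketch of "m•n matrix = m•k times k•n" follows. The two features ("blue-ness", "red-ness") and every number in `X` and `Y_T` are hypothetical, chosen only so that the Alice/navy entry comes out as the slide's +4.

```python
def matmul(a, b):
    """Multiply an m-by-k matrix by a k-by-n matrix (lists of rows)."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    return [[sum(a[i][f] * b[f][j] for f in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical factors with k = 2 made-up features: blue-ness, red-ness.
X = [                 # m = 4 people by k = 2 features
    [2.0, 0.0],       # Alice loves blues
    [0.0, 0.0],       # Bob
    [0.0, -0.5],      # Carol
    [0.0, 1.5],       # Dave
]
Y_T = [               # k = 2 features by n = 3 colors (Rose, Navy, Olive)
    [0.0, 2.0, 0.0],  # blue-ness: includes navy
    [2.0, 0.0, 0.0],  # red-ness: includes rose
]

P = matmul(X, Y_T)    # m = 4 by n = 3
print(P[0][1])        # Alice/Navy = (Alice loves blues) * (blues includes navy)
```

With only two features the olive column comes out as all zeroes, so Bob's and Carol's olive associations are not reproduced: the product is an approximation of the original matrix, not a copy.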
  • 5. In Terms of Few Features
      • Can explain associations by appealing to underlying intermediate features (e.g. “blue-ness”)
      • Relatively few (one “blue-ness”, but many shades)
      • Diagram: (Alice) connects to (Navy) through (Blue)
  • 6. Losing Information is Helpful
      • When k (= features) is small, information is lost
      • Factorization is approximate (Alice appears to like blue-ish periwinkle too)
      • Diagram: (Alice) connects through (Blue) to both (Navy) and (Periwinkle)
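The periwinkle effect can be sketched with a single feature (k = 1). All numbers here are made up for illustration: Alice has only ever expressed a preference for navy, yet routing everything through "blue-ness" yields a score for periwinkle as well.

```python
# Hypothetical single feature (k = 1): "blue-ness". All values are made up.
alice_blueness = 2.0                 # how much Alice likes blue things
color_blueness = {                   # how blue each color is
    "Navy": 2.0,
    "Periwinkle": 1.5,
    "Olive": 0.0,
}

# Alice's reconstructed association with each color goes through the feature.
scores = {color: alice_blueness * b for color, b in color_blueness.items()}
print(scores)
```

The detail that Alice never rated periwinkle is lost, and that loss is exactly what produces the recommendation.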
  • 7. How to Compute? Diagram: P (m•n) = X (m•k) • Y’ (k•n)
  • 8. Skip the Singular Value Decomposition for now … Diagram: A (m•n) = S (m•k) • Σ (k•k) • T’ (k•n)
  • 9. Alternating Least Squares
      • Collaborative Filtering for Implicit Feedback Datasets: www2.research.att.com/~yifanhu/PUB/cf.pdf
      • R = matrix of user-item interaction “strengths”
      • P = R reduced to 0 and 1
      • Factor as approximate P ≈ X•Y’
      • Start with random Y
      • Compute X such that X•Y’ best approximates P (Frobenius / L2 norm) (Least Squares)
      • Repeat for Y (Alternating)
      • Iterate, Iterate, Iterate
      • Large values in X•Y’ are good recommendations
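The loop above can be sketched in plain Python. This is an illustrative reading of the cited paper's weighted, regularized variant, not Mahout's implementation: each entry gets a confidence weight c = 1 + α·r, and λ penalizes large factors. The defaults k = 3, λ = 2, α = 40 are the values shown on the deck's example slides; the function names, the random seed and the use of 0 for "no interaction" in `R` are assumptions made for this example.

```python
import random

def solve(a, b):
    """Solve a.x = b for a small square system (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def half_step(r_rows, fixed, lam, alpha):
    """Hold `fixed` constant and solve a weighted least-squares problem per row."""
    k = len(fixed[0])
    out = []
    for r_row in r_rows:
        a = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for r, f in zip(r_row, fixed):
            c = 1.0 + alpha * r            # confidence weight from the strength r
            p = 1.0 if r > 0 else 0.0      # P = R reduced to 0 and 1
            for i in range(k):
                b[i] += c * p * f[i]
                for j in range(k):
                    a[i][j] += c * f[i] * f[j]
        out.append(solve(a, b))
    return out

def loss(r, x, y, lam, alpha):
    """Weighted squared error of X.Y' against P, plus the lambda penalty."""
    total = lam * sum(v * v for row in x + y for v in row)
    for r_row, x_u in zip(r, x):
        for r_ui, y_i in zip(r_row, y):
            pred = sum(s * t for s, t in zip(x_u, y_i))
            total += (1.0 + alpha * r_ui) * ((1.0 if r_ui > 0 else 0.0) - pred) ** 2
    return total

def als(r, k=3, lam=2.0, alpha=40.0, iterations=10, seed=0):
    rng = random.Random(seed)
    y = [[rng.random() for _ in range(k)] for _ in r[0]]   # start with random Y
    r_t = [list(col) for col in zip(*r)]                   # items by users
    for _ in range(iterations):
        x = half_step(r, y, lam, alpha)     # compute X with Y fixed
        y = half_step(r_t, x, lam, alpha)   # repeat for Y with X fixed
    return x, y

# Interaction strengths, with 0 standing in for "no interaction" (an assumption).
R = [
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
]

X, Y = als(R)
for x_u in X:   # large values in X.Y' at unobserved positions are candidates
    print(" ".join(f"{sum(s * t for s, t in zip(x_u, y_i)):5.2f}" for y_i in Y))
```

Because each half step solves its least-squares problem exactly, the loss cannot increase from one iteration to the next. The factors themselves depend on the random start, so the printed values need not match any particular published run.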
  • 10. Example (. = no recorded interaction)

      R                  P
      1  4  3  .  .      1  1  1  0  0
      .  .  3  .  .      0  0  1  0  0
      .  4  .  3  2      0  1  0  1  1
      5  .  2  .  3      1  0  1  0  1
      .  .  .  5  .      0  0  0  1  0
      2  4  .  .  .      1  1  0  0  0
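The step from R to P on this slide can be written in one line. The sketch assumes that the blank cells of R are zeroes and that R's values sit where P has ones, which is how the two matrices line up row by row.

```python
# R from the slide, with 0 standing in for a blank cell (an assumption).
R = [
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
]

# P keeps only whether an interaction happened, not how strong it was.
P = [[1 if r > 0 else 0 for r in row] for row in R]

for row in P:
    print(row)
```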
  • 11. k = 3, λ=2, α=40, 1 iteration: P ≈ X • Y’

      P                  X
      1  1  1  0  0      2.18  -0.01   0.35
      0  0  1  0  0      1.83  -0.11  -0.68
      0  1  0  1  1      0.79   1.15  -1.80
      1  0  1  0  1      0.97  -1.90  -2.12
      0  0  0  1  0      1.01  -0.25  -1.77
      1  1  0  0  0      2.33  -8.00   1.06

      Y’
       0.43   0.48   0.48   0.16   0.10
      -0.27   0.39  -0.13   0.03   0.05
      -0.03  -0.09  -0.13  -0.47  -0.47
  • 12. k = 3, λ=2, α=40, 1 iteration: P ≈ X•Y’

      P                  X•Y’
      1  1  1  0  0      0.94   1.00   1.00   0.18   0.07
      0  0  1  0  0      0.84   0.89   0.99   0.60   0.50
      0  1  0  1  1      0.07   0.99   0.46   1.01   0.98
      1  0  1  0  1      1.00  -0.09   1.00   1.08   0.99
      0  0  0  1  0      0.55   0.54   0.75   0.98   0.92
      1  1  0  0  0      1.01   0.99   0.98  -0.13  -0.25
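The product on slide 12 can be checked against the factors on slide 11 by multiplying them out. One caveat: slide 11 prints the second entry of the last row of X as -8.00. The sketch assumes -0.08 instead, because that value reproduces the last row of slide 12 to within rounding while -8.00 does not.

```python
# X (6 users by k = 3) from slide 11. The -0.08 in the last row is an
# assumption: the slide prints -8.00, which does not reproduce slide 12.
X = [
    [2.18, -0.01, 0.35],
    [1.83, -0.11, -0.68],
    [0.79, 1.15, -1.80],
    [0.97, -1.90, -2.12],
    [1.01, -0.25, -1.77],
    [2.33, -0.08, 1.06],
]

# Y' (k = 3 by 5 items) from slide 11.
Y_T = [
    [0.43, 0.48, 0.48, 0.16, 0.10],
    [-0.27, 0.39, -0.13, 0.03, 0.05],
    [-0.03, -0.09, -0.13, -0.47, -0.47],
]

# X.Y' as printed on slide 12.
SLIDE_12 = [
    [0.94, 1.00, 1.00, 0.18, 0.07],
    [0.84, 0.89, 0.99, 0.60, 0.50],
    [0.07, 0.99, 0.46, 1.01, 0.98],
    [1.00, -0.09, 1.00, 1.08, 0.99],
    [0.55, 0.54, 0.75, 0.98, 0.92],
    [1.01, 0.99, 0.98, -0.13, -0.25],
]

product = [[sum(X[i][f] * Y_T[f][j] for f in range(3)) for j in range(5)]
           for i in range(6)]

# Largest gap between the recomputed product and the slide's printed values.
max_diff = max(abs(product[i][j] - SLIDE_12[i][j])
               for i in range(6) for j in range(5))
print(f"largest difference from slide 12: {max_diff:.3f}")
```

The remaining differences are what one would expect from factors printed to two decimal places.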
  • 13. k = 3, λ=2, α=40, 10 iterations: P ≈ X•Y’

      P                  X•Y’
      1  1  1  0  0      0.96   0.99   0.99   0.38   0.93
      0  0  1  0  0      0.44   0.39   0.98  -0.11   0.39
      0  1  0  1  1      0.70   0.99   0.42   0.98   0.98
      1  0  1  0  1      1.00   1.04   0.99   0.44   0.98
      0  0  0  1  0      0.11   0.51  -0.13   1.00   0.57
      1  1  0  0  0      0.97   1.00   0.68   0.47   0.91
  • 14. Interesting Because… This is all very parallelizable by row, column
  • 15. BONUS: Folding in New Data
      • Model building takes time
      • Sometimes need immediate, if approximate, updates for new data
      • For new user U, need new row, X_U•Y’ = Q_U, but have P_U
      • What is X_U?
      • Apply some right inverse: X•Y’•(Y’)⁻¹ = Q•(Y’)⁻¹, so X = Q•(Y’)⁻¹
      • OK, what is (Y’)⁻¹?
      • Of course (Y’•Y)•(Y’•Y)⁻¹ = I
      • So Y’•(Y•(Y’•Y)⁻¹) = I and right inverse is Y•(Y’•Y)⁻¹
      • X_U = Q_U•Y•(Y’•Y)⁻¹ and so X_U ≈ P_U•Y•(Y’•Y)⁻¹
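The fold-in formula can be sketched as a small linear solve. Rather than forming (Y’•Y)⁻¹ explicitly, the example solves (Y’•Y)•x = Y’•P_U’, which gives the same X_U because Y’•Y is symmetric. Y’ is taken from slide 11; the new user's row `p_new` and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve a.x = b for a small square system (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Y' (k = 3 features by n = 5 items), copied from slide 11.
Y_T = [
    [0.43, 0.48, 0.48, 0.16, 0.10],
    [-0.27, 0.39, -0.13, 0.03, 0.05],
    [-0.03, -0.09, -0.13, -0.47, -0.47],
]

def fold_in(p_u, y_t):
    """Approximate X_U = P_U.Y.(Y'.Y)^-1 for one new user's 0/1 row."""
    k = len(y_t)
    # Y'.Y is a small k-by-k matrix, so this is cheap.
    gram = [[sum(s * t for s, t in zip(y_t[i], y_t[j])) for j in range(k)]
            for i in range(k)]
    # Y'.P_U' is a length-k vector.
    rhs = [sum(y * p for y, p in zip(y_t[i], p_u)) for i in range(k)]
    return solve(gram, rhs)

p_new = [0, 1, 1, 0, 0]   # hypothetical new user: interacted with items 2 and 3
x_new = fold_in(p_new, Y_T)

# The new user's scores are the new row of X.Y'.
scores = [sum(x_new[f] * Y_T[f][j] for f in range(3)) for j in range(5)]
print([round(s, 2) for s in scores])
```

Only a k-by-k system is solved per new user, which is why the update is immediate. It ignores the confidence weights and regularization used when the model was built, so it is approximate in that sense too.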
  • 16. In Mahout
      • org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob: alternating least squares; distributed, Hadoop-based
      • org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender: SVD-based; non-distributed, not Hadoop
      • MAHOUT-737: alternate implementation of alternating least squares
      • And more…: DistributedLanczosSolver, SequentialOutOfCoreSvd, …
  • 17. Myrrix
      • Complete product
      • Real-time Serving Layer
      • Hadoop-based Computation Layer
      • Tuned, documented
      • Free / open: Serving Layer, for small data
      • Commercial: add Computation Layer for big data; Hosting
      • Matrix factorization-based, attractive properties
      • http://myrrix.com
  • 18. Thank You. srowen at myrrix.com • mahout.apache.org