SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Dictionary Learning for
Massive Matrix Factorization
Arthur Mensch, Julien Mairal
Ga¨el Varoquaux, Bertrand Thirion
Inria Parietal, Inria Thoth
October 6, 2016
Introduction
Why am I here ?
Inria Parietal: machine learning for neuro-imaging
(fMRI data)
Matrix factorization: major ingredient in fMRI analysis
Very large datasets (2 TB): we designed faster algorithms
These algorithms can be used in collaborative filtering
D AX
Voxels
Time
=
k spatial maps Time
x
1
Work presented at ICML 2016
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
D´eroul´e
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 2 / 28
D´eroul´e
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 3 / 28
Collaborative filtering
Collaborative platform
n users rate a fraction of
p items
e.g movies, restaurants
Estimate ratings for
recommendation
Use the ratings of other users for recommendation
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 4 / 28
How to predict ratings ?
Credit: [Bell and Koren, 2007]
Joe like We were
soldiers, Black Hawk
down.
Bob and Alice like the
same films, and also
like Saving private
Ryan.
Joe should watch
Saving private Ryan,
because all of them
indeed likes war films.
Need to uncover topics in items
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 5 / 28
Predicting rate with scalar products
Embeddings to model the existence of genre/category/topics
Representative vectors for
users and items:
(αj
)1≤j≤n, (di )1≤i≤p ∈ Rk
q-th coefficient of di , αj
= affinity with the “topic” q
xij αj
di
k topics
di
αj
1
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
Predicting rate with scalar products
Embeddings to model the existence of genre/category/topics
Representative vectors for
users and items:
(αj
)1≤j≤n, (di )1≤i≤p ∈ Rk
q-th coefficient of di , αj
= affinity with the “topic” q
Ratings xij (item i, user j):
xij = di αj
( + biases)
= Common affinity for topics
xij αj
di
k topics
di
αj
1
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
Predicting rate with scalar products
Embeddings to model the existence of genre/category/topics
Representative vectors for
users and items:
(αj
)1≤j≤n, (di )1≤i≤p ∈ Rk
q-th coefficient of di , αj
= affinity with the “topic” q
Ratings xij (item i, user j):
xij = di αj
( + biases)
= Common affinity for topics
xij αj
di
k topics
di
αj
1
Learning problem: estimate D and A with known ratings
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
Matrix factorization
1Ω X AD
p
n k
n
=
1
X ∈ Rp×n ≈ DA ∈ Rp×k × Rk×n
Constraints / penalty on factors D and A
We only observe 1Ω X — Ω set of ratings provided by users
Recommender systems : millions of users, millions of items
How to scale matrix factorization to very large datasets ?
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 7 / 28
Formalism
Finding representation in Rk for items and users:
min
D∈Rp×k
A∈Rk×n (i,j)∈Ω
(xij − di αj
)2
+ λ( D 2
F + A 2
F )
= 1Ω (X − DA) 2
2 + λ( D 2
F + A 2
F ) 1Ω set of knownratings
2 reconstruction loss — 2 penalty for generalization
Existing methods
Alternated minimization
Stochastic gradient descent
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 8 / 28
Existing methods
Alternated minimization
Minimize over A, D: alternate between
D = min
D∈Rp×k
(i,j)∈Ω
(xij − di αj
)2
+ λ D 2
F
A = min
A∈Rk×n
(i,j)∈Ω
(xij − di αj
)2
+ λ A 2
F
No hyperparameters
Slow and memory expensive: use all ratings at each iteration
a.k.a. coordinate descent (variation in parameter update order)
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 9 / 28
Existing methods
Stochastic gradient descent
min
A,D
(i,j)∈Ω
fij (A, B)
def
= (xij − di αj
)2
+
1
cj
λ αj 2
2 +
1
ci
λ di
2
2
Gradient step for each rating:
(At, Dt) ← (At−1, Dt−1)−
1
ct
(A,D)fij (At−1, Dt−1)
Fast and memory efficient – won the Netflix prize
Very sensitive to step sizes (ct) – need to cross-validate
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 10 / 28
Towards a new algorithm
Best of both worlds ?
Fast and memory efficient algorithm
Little sensitive to hyperparameter setting
Subsampled online dictionary learning
Builds upon the online dictionary learning algorithm
popular in computer vision and interpretable learning (fMRI)
Adapt it to handle missing values efficiently
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 11 / 28
D´eroul´e
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 12 / 28
Dictionary learning
Recall: recommender system formalism
Non-masked matrix factorization with 2 penalty:
min
D∈Rp×k
A∈Rk×n
n
j=1
(xj
− D αj
)2
+ λ( D 2
F + A 2
F )
Penalties can be richer, and made into constraints
Dictionary learning
Learn the left side factor [Olshausen and Field, 1997]
min
D∈C
n
j=1
xj
−Dαj 2
2 +λΩ(αj
) αj
= argmin
α∈Rk
xi
−Dα 2
2 +λΩ(α)
Naive approach: alternated minimization
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 13 / 28
Online dictionary learning [Mairal et al., 2010]
At iteration t, select xt in {xj }j (user ratings), improve D
Single iteration complexity ∝ sample dimension O(p)
(Dt)t converges in a few epochs (one for large n)
xt αtD
p
n k n
=Stream
1
Very efficient in computer vision / networks / fMRI /
hyperspectral images
Can we use it efficiently for recommender systems ?
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 14 / 28
In short: Handling missing values
X
p
n
xt
Steam
Handle large n
n
Handle missing values
Online → online + partial
Batch →
online
Mtxt
Stream
Ignore
Unknown
Unaccessed
1
Leverage streaming + partial access to samples
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 15 / 28
In detail: online dictionary learning
Objective function involves latent codes (right side factor)
min
D∈C
1
t
t
i=1
xi − Dα∗
i (D) 2
2, α∗
i (D) = argmin
α
1
2
xi − Dα 2
2 + λΩ(α)
Replace latent codes by codes computed with old dictionaries
Build an upper-bounding surrogate function
min
1
t
t
i=1
xi −Dαi
2
2 αi = argmin
α
1
2
xi −Di−1α 2
2+λΩ(α)
Minimize surrogate — updateable online at low cost
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 16 / 28
In detail: online dictionary learning
Algorithm outline
1 Compute code
αt = argmin
α∈Rk
xt − Dt−1α 2
2 + λΩ(αt)
2 Update the surrogate function
gt =
1
t
t
i=1
xi − Dαi
2
2 = Tr (
1
2
D DAt − D Bt)
At = (1 −
1
t
)At−1 +
1
t
αtαt Bt = (1 −
1
t
)Bt−1 +
1
t
xtαt
3 Minimize surrogate
Dt = argmin
D∈C
gt(D) gt = DAt − Bt
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
In detail: online dictionary learning
Algorithm outline
1 Compute code – xt → complexity depends on p
αt = argmin
α∈Rk
xt − Dt−1α 2
2 + λΩ(αt)
2 Update the surrogate function – Complexity in O(p)
gt =
1
t
t
i=1
xi − Dαi
2
2 = Tr (
1
2
D DAt − D Bt)
At = (1 −
1
t
)At−1 +
1
t
αtαt Bt = (1 −
1
t
)Bt−1 +
1
t
xtαt
3 Minimize surrogate – Complexity in O(p)
Dt = argmin
D∈C
gt(D) gt = DAt − Bt
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
Specification for a new algorithm
Mtxt
Stream
Ignore
p
n
1
Constrained : use only known
ratings from Ω
Efficient: single iteration in O(s),
# of ratings provided by user t
Principled: follows the online
matrix factorization algorithm as
much as possible
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 18 / 28
Missing values in practice
Data stream: (xt)t → masked (Mtxt)t
= ratings from user t
Dimension: p (all items) → s (rated items)
Use only Mtxt in algorithm computation
→ complexity in O(s)
Mtxt
Stream
Ignore
p
n
1
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
Missing values in practice
Data stream: (xt)t → masked (Mtxt)t
= ratings from user t
Dimension: p (all items) → s (rated items)
Use only Mtxt in algorithm computation
→ complexity in O(s)
Mtxt
Stream
Ignore
p
n
1
Adaptation to make
Modify all parts of the algorithm to obtain O(s) complexity
1 Code
computation
2 Surrogate
update
3 Surrogate
minimization
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
Subsampled online dictionary learning
Check out paper !
Original online MF
1 Code computation
αt = argmin
α∈Rk
xt − Dt−1α 2
2
+ λΩ(αt )
2 Surrogate aggregation
At =
1
t
t
i=1
αi αi
Bt = Bt−1 +
1
t
(xt αt − Bt−1)
3 Surrogate minimization
Dj
← p⊥
Cr
j
(Dj
−
1
(At )j,j
(DAj
t −Bj
t ))
Our algorithm
1 Code computation: masked loss
αt = argmin
α∈Rk
Mt (xt − Dt−1α) 2
2
+ λ
rk Mt
p
Ω(αt )
2 Surrogate aggregation
At =
1
t
t
i=1
αi αi
Bt = Bt−1 +
1
t
i=1 Mi
(Mt xt αt − Mt Bt−1)
3 Surrogate minimization
Mt Dj
← p⊥
Cj
(Mt Dj
−
1
(At )j,j
Mt (D(Aj
t − (Bj
t ))
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 20 / 28
D´eroul´e
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 21 / 28
Experiments
Validation : Test RMSE (rating prediction) vs CPU time
Baseline : Coordinate descent solver [Yu et al., 2012] for
min
D∈Rp×k
A∈Rk×n (i,j)∈Ω
(xij − di αj
)2
+ λ( D 2
F + A 2
F )
Fastest solver available apart from SGD — hyperparameters
↑ Our method has a learning rate with little influence
Datasets : Movielens, Netflix
Publicly available
Larger one in the industry...
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 22 / 28
Results
Scalable algorithm: speed-up improves with size
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 23 / 28
Performance
Dataset Test RMSE Convergence time Speed
CD SODL CD SODL -up
ML 1M 0.872 0.866 6 s 8 s ×0.75
ML 10M 0.802 0.799 223 s 60 s ×3.7
NF (140M) 0.938 0.934 1714 s 256 s ×6.8
Outperform coordinate descent beyond 10M ratings
Same prediction performance
Speed-up 6.8× on Netflix
Simple model: RMSE is not state-of-the-art
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 24 / 28
Robustness to learning rate
Learning rate in algorithm to be set in [0.75, 1] (← theory)
In practice: Just set it in [0.8, 1]
1 10 40Epoch
0.80
0.81
0.82
0.83
0.84
0.85
0.86
0.87
RMSEontestset
Learning rate β0.75
0.78
0.81
0.83
0.86
0.89
0.92
0.94
0.97
1.00
MovieLens 10M
.1 1 10 20
0.93
0.94
0.95
0.96
0.97
0.98
0.99
Netflix
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 25 / 28
Conclusion
Take-home message
Online matrix factorization can be adapted
to handle missing value efficiently, with very
good performance in reccommender system
Mtxt
Stream
Ignore
p
n
1Algorithm usable in any rich model involving matrix factorization
Python package http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
Conclusion
Take-home message
Online matrix factorization can be adapted
to handle missing value efficiently, with very
good performance in reccommender system
Mtxt
Stream
Ignore
p
n
1Algorithm usable in any rich model involving matrix factorization
Python package http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications
Questions ?
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
Appendix: Resting-state fMRI
Online dictionary learning
235 h run time
1 full epoch
10 h run time
1
24 epoch
Proposed method
10 h run time
1
2 epoch, reduction r=12
Qualitatively, usable maps are obtained 10× faster
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 27 / 28
Bibliography I
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).
Lessons from the Netflix prize challenge.
ACM SIGKDD Explorations Newsletter, 9(2):75–79.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for
recommender systems.
In Proceedings of the International Conference on Data Mining, pages
765–774. IEEE.
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 28 / 28

Mais conteúdo relacionado

Mais procurados

Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
MLconf
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
MLconf
 

Mais procurados (19)

Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slides
 
Learning to Reconstruct
Learning to ReconstructLearning to Reconstruct
Learning to Reconstruct
 
Review_Cibe Sridharan
Review_Cibe SridharanReview_Cibe Sridharan
Review_Cibe Sridharan
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
TensorFlow in 3 sentences
TensorFlow in 3 sentencesTensorFlow in 3 sentences
TensorFlow in 3 sentences
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
Interaction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and PhysicsInteraction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and Physics
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning Algorithm
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman network
 
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
 

Semelhante a Massive Matrix Factorization : Applications to collaborative filtering

Talk icml
Talk icmlTalk icml
Talk icml
Bo Li
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
butest
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
SURF 2012 Final Report(1)
SURF 2012 Final Report(1)SURF 2012 Final Report(1)
SURF 2012 Final Report(1)
Eric Zhang
 
slides-defense-jie
slides-defense-jieslides-defense-jie
slides-defense-jie
jie ren
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
butest
 

Semelhante a Massive Matrix Factorization : Applications to collaborative filtering (20)

Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender Systems
 
Talk icml
Talk icmlTalk icml
Talk icml
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
SURF 2012 Final Report(1)
SURF 2012 Final Report(1)SURF 2012 Final Report(1)
SURF 2012 Final Report(1)
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Means
 
slides-defense-jie
slides-defense-jieslides-defense-jie
slides-defense-jie
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
 
AlgorithmAnalysis2.ppt
AlgorithmAnalysis2.pptAlgorithmAnalysis2.ppt
AlgorithmAnalysis2.ppt
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
WISS 2015 - Machine Learning lecture by Ludovic Samper
WISS 2015 - Machine Learning lecture by Ludovic Samper WISS 2015 - Machine Learning lecture by Ludovic Samper
WISS 2015 - Machine Learning lecture by Ludovic Samper
 
Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)
 

Último

Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 

Último (20)

Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.ppt
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 

Massive Matrix Factorization : Applications to collaborative filtering

  • 1. Dictionary Learning for Massive Matrix Factorization Arthur Mensch, Julien Mairal Ga¨el Varoquaux, Bertrand Thirion Inria Parietal, Inria Thoth October 6, 2016
  • 2. Introduction Why am I here ? Inria Parietal: machine learning for neuro-imaging (fMRI data) Matrix factorization: major ingredient in fMRI analysis Very large datasets (2 TB): we designed faster algorithms These algorithms can be used in collaborative filtering D AX Voxels Time = k spatial maps Time x 1 Work presented at ICML 2016 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
  • 3. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 2 / 28
  • 4. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 3 / 28
  • 5. Collaborative filtering Collaborative platform n users rate a fraction of p items e.g movies, restaurants Estimate ratings for recommendation Use the ratings of other users for recommendation Arthur Mensch Dictionary Learning for Massive Matrix Factorization 4 / 28
  • 6. How to predict ratings ? Credit: [Bell and Koren, 2007] Joe like We were soldiers, Black Hawk down. Bob and Alice like the same films, and also like Saving private Ryan. Joe should watch Saving private Ryan, because all of them indeed likes war films. Need to uncover topics in items Arthur Mensch Dictionary Learning for Massive Matrix Factorization 5 / 28
  • 7. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 8. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 9. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Learning problem: estimate D and A with known ratings Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 10. Matrix factorization 1Ω X AD p n k n = 1 X ∈ Rp×n ≈ DA ∈ Rp×k × Rk×n Constraints / penalty on factors D and A We only observe 1Ω X — Ω set of ratings provided by users Recommender systems : millions of users, millions of items How to scale matrix factorization to very large datasets ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 7 / 28
  • 11. Formalism Finding representation in Rk for items and users: min D∈Rp×k A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ( D 2 F + A 2 F ) = 1Ω (X − DA) 2 2 + λ( D 2 F + A 2 F ) 1Ω set of knownratings 2 reconstruction loss — 2 penalty for generalization Existing methods Alternated minimization Stochastic gradient descent Arthur Mensch Dictionary Learning for Massive Matrix Factorization 8 / 28
  • 12. Existing methods Alternated minimization Minimize over A, D: alternate between D = min D∈Rp×k (i,j)∈Ω (xij − di αj )2 + λ D 2 F A = min A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ A 2 F No hyperparameters Slow and memory expensive: use all ratings at each iteration a.k.a. coordinate descent (variation in parameter update order) Arthur Mensch Dictionary Learning for Massive Matrix Factorization 9 / 28
  • 13. Existing methods Stochastic gradient descent min A,D (i,j)∈Ω fij (A, B) def = (xij − di αj )2 + 1 cj λ αj 2 2 + 1 ci λ di 2 2 Gradient step for each rating: (At, Dt) ← (At−1, Dt−1)− 1 ct (A,D)fij (At−1, Dt−1) Fast and memory efficient – won the Netflix prize Very sensitive to step sizes (ct) – need to cross-validate Arthur Mensch Dictionary Learning for Massive Matrix Factorization 10 / 28
  • 14. Towards a new algorithm Best of both worlds ? Fast and memory efficient algorithm Little sensitive to hyperparameter setting Subsampled online dictionary learning Builds upon the online dictionary learning algorithm popular in computer vision and interpretable learning (fMRI) Adapt it to handle missing values efficiently Arthur Mensch Dictionary Learning for Massive Matrix Factorization 11 / 28
  • 15. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 12 / 28
  • 16. Dictionary learning Recall: recommender system formalism Non-masked matrix factorization with 2 penalty: min D∈Rp×k A∈Rk×n n j=1 (xj − D αj )2 + λ( D 2 F + A 2 F ) Penalties can be richer, and made into constraints Dictionary learning Learn the left side factor [Olshausen and Field, 1997] min D∈C n j=1 xj −Dαj 2 2 +λΩ(αj ) αj = argmin α∈Rk xi −Dα 2 2 +λΩ(α) Naive approach: alternated minimization Arthur Mensch Dictionary Learning for Massive Matrix Factorization 13 / 28
  • 17. Online dictionary learning [Mairal et al., 2010] At iteration t, select xt in {xj }j (user ratings), improve D Single iteration complexity ∝ sample dimension O(p) (Dt)t converges in a few epochs (one for large n) xt αtD p n k n =Stream 1 Very efficient in computer vision / networks / fMRI / hyperspectral images Can we use it efficiently for recommender systems ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 14 / 28
  • 18. In short: Handling missing values X p n xt Steam Handle large n n Handle missing values Online → online + partial Batch → online Mtxt Stream Ignore Unknown Unaccessed 1 Leverage streaming + partial access to samples Arthur Mensch Dictionary Learning for Massive Matrix Factorization 15 / 28
  • 19. In detail: online dictionary learning Objective function involves latent codes (right side factor) min D∈C 1 t t i=1 xi − Dα∗ i (D) 2 2, α∗ i (D) = argmin α 1 2 xi − Dα 2 2 + λΩ(α) Replace latent codes by codes computed with old dictionaries Build an upper-bounding surrogate function min 1 t t i=1 xi −Dαi 2 2 αi = argmin α 1 2 xi −Di−1α 2 2+λΩ(α) Minimize surrogate — updateable online at low cost Arthur Mensch Dictionary Learning for Massive Matrix Factorization 16 / 28
  • 20. In detail: online dictionary learning Algorithm outline 1 Compute code αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt) 2 Update the surrogate function gt = 1 t t i=1 xi − Dαi 2 2 = Tr ( 1 2 D DAt − D Bt) At = (1 − 1 t )At−1 + 1 t αtαt Bt = (1 − 1 t )Bt−1 + 1 t xtαt 3 Minimize surrogate Dt = argmin D∈C gt(D) gt = DAt − Bt Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
  • 21. In detail: online dictionary learning Algorithm outline 1 Compute code – xt → complexity depends on p αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt) 2 Update the surrogate function – Complexity in O(p) gt = 1 t t i=1 xi − Dαi 2 2 = Tr ( 1 2 D DAt − D Bt) At = (1 − 1 t )At−1 + 1 t αtαt Bt = (1 − 1 t )Bt−1 + 1 t xtαt 3 Minimize surrogate – Complexity in O(p) Dt = argmin D∈C gt(D) gt = DAt − Bt Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
  • 22. Specification for a new algorithm Mtxt Stream Ignore p n 1 Constrained : use only known ratings from Ω Efficient: single iteration in O(s), # of ratings provided by user t Principled: follows the online matrix factorization algorithm as much as possible Arthur Mensch Dictionary Learning for Massive Matrix Factorization 18 / 28
  • 23. Missing values in practice Data stream: (xt)t → masked (Mtxt)t = ratings from user t Dimension: p (all items) → s (rated items) Use only Mtxt in algorithm computation → complexity in O(s) Mtxt Stream Ignore p n 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
  • 24. Missing values in practice Data stream: (xt)t → masked (Mtxt)t = ratings from user t Dimension: p (all items) → s (rated items) Use only Mtxt in algorithm computation → complexity in O(s) Mtxt Stream Ignore p n 1 Adaptation to make Modify all parts of the algorithm to obtain O(s) complexity 1 Code computation 2 Surrogate update 3 Surrogate minimization Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
  • 25. Subsampled online dictionary learning Check out paper ! Original online MF 1 Code computation αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt ) 2 Surrogate aggregation At = 1 t t i=1 αi αi Bt = Bt−1 + 1 t (xt αt − Bt−1) 3 Surrogate minimization Dj ← p⊥ Cr j (Dj − 1 (At )j,j (DAj t −Bj t )) Our algorithm 1 Code computation: masked loss αt = argmin α∈Rk Mt (xt − Dt−1α) 2 2 + λ rk Mt p Ω(αt ) 2 Surrogate aggregation At = 1 t t i=1 αi αi Bt = Bt−1 + 1 t i=1 Mi (Mt xt αt − Mt Bt−1) 3 Surrogate minimization Mt Dj ← p⊥ Cj (Mt Dj − 1 (At )j,j Mt (D(Aj t − (Bj t )) Arthur Mensch Dictionary Learning for Massive Matrix Factorization 20 / 28
  • 26. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 21 / 28
  • 27. Experiments Validation : Test RMSE (rating prediction) vs CPU time Baseline : Coordinate descent solver [Yu et al., 2012] for min D∈Rp×k A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ( D 2 F + A 2 F ) Fastest solver available apart from SGD — hyperparameters ↑ Our method has a learning rate with little influence Datasets : Movielens, Netflix Publicly available Larger one in the industry... Arthur Mensch Dictionary Learning for Massive Matrix Factorization 22 / 28
  • 28. Results Scalable algorithm: speed-up improves with size Arthur Mensch Dictionary Learning for Massive Matrix Factorization 23 / 28
  • 29. Performance Dataset Test RMSE Convergence time Speed CD SODL CD SODL -up ML 1M 0.872 0.866 6 s 8 s ×0.75 ML 10M 0.802 0.799 223 s 60 s ×3.7 NF (140M) 0.938 0.934 1714 s 256 s ×6.8 Outperform coordinate descent beyond 10M ratings Same prediction performance Speed-up 6.8× on Netflix Simple model: RMSE is not state-of-the-art Arthur Mensch Dictionary Learning for Massive Matrix Factorization 24 / 28
  • 30. Robustness to learning rate Learning rate in algorithm to be set in [0.75, 1] (← theory) In practice: Just set it in [0.8, 1] 1 10 40Epoch 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 RMSEontestset Learning rate β0.75 0.78 0.81 0.83 0.86 0.89 0.92 0.94 0.97 1.00 MovieLens 10M .1 1 10 20 0.93 0.94 0.95 0.96 0.97 0.98 0.99 Netflix Arthur Mensch Dictionary Learning for Massive Matrix Factorization 25 / 28
  • 31. Conclusion Take-home message Online matrix factorization can be adapted to handle missing value efficiently, with very good performance in reccommender system Mtxt Stream Ignore p n 1Algorithm usable in any rich model involving matrix factorization Python package http://github.com/arthurmensch/modl Article/slides at http://amensch.fr/publications Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
  • 32. Conclusion Take-home message Online matrix factorization can be adapted to handle missing value efficiently, with very good performance in reccommender system Mtxt Stream Ignore p n 1Algorithm usable in any rich model involving matrix factorization Python package http://github.com/arthurmensch/modl Article/slides at http://amensch.fr/publications Questions ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
  • 33. Appendix: Resting-state fMRI Online dictionary learning 235 h run time 1 full epoch 10 h run time 1 24 epoch Proposed method 10 h run time 1 2 epoch, reduction r=12 Qualitatively, usable maps are obtained 10× faster Arthur Mensch Dictionary Learning for Massive Matrix Factorization 27 / 28
  • 34. Bibliography I [Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79. [Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60. [Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325. [Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE. Arthur Mensch Dictionary Learning for Massive Matrix Factorization 28 / 28