SlideShare a Scribd company logo
1 of 56
Download to read offline
May 9, 2014
Collaborative
Filtering with Spark
Chris Johnson
@MrChrisJohnson
Friday, May 9, 14
Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin
Friday, May 9, 14
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
4
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
Collaborative Filtering - “The Netflix Prize”
5
Friday, May 9, 14
Collaborative Filtering
6
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
Friday, May 9, 14
7
Collaborative Filtering at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
Friday, May 9, 14
Section name 8
Friday, May 9, 14
Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
Friday, May 9, 14
• = bias for user
• = bias for item
• = regularization parameter
Explicit Matrix Factorization 10
Chris
Inception
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
•Approximate ratings matrix by the product of low-
dimensional user and movie matrices
•Minimize RMSE (root mean squared error)
• = user rating for movie
• = user latent factor vector
• = item latent factor vector
X Y
Friday, May 9, 14
Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Replace Stream counts with binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a
function of stream counts as weights
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• =i tem latent factor vector
X Y
Friday, May 9, 14
Alternating Least Squares 12
• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of loss function with respect to user’s vector, set
equal to 0, and solve
– Results in a system of linear equations with closed form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Alternating Least Squares 13
• Note that:
• Then, we can pre-compute once per iteration
– and only contain non-zero elements for tracks that
the user streamed
– Using sparse matrix operations we can then compute each user’s
vector efficiently in time where is the number of
tracks the user streamed
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
14
Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Section name 15
Friday, May 9, 14
Scaling up Implicit Matrix Factorization
with Hadoop
16
Friday, May 9, 14
Hadoop at Spotify 2009
17
Friday, May 9, 14
Hadoop at Spotify 2014
18
700 Nodes in our London data center
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
19
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
20
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
Friday, May 9, 14
Hadoop suffers from I/O overhead
21
IO Bottleneck
Friday, May 9, 14
Spark to the rescue!!
22
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
Friday, May 9, 14
Section name 23
Friday, May 9, 14
First Attempt
24
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
First Attempt
25
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
26
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
27
Friday, May 9, 14
First Attempt
28
•Issues:
–Unnecessarily sending multiple copies of item vector to each node
–Unnecessarily shuffling data across cluster at each iteration
–Not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
Second Attempt
29
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
30
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
31
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
32
Friday, May 9, 14
Second Attempt
33
Friday, May 9, 14
Second Attempt
34
•Issues:
–Still Unnecessarily shuffling data across cluster at each iteration
–Still not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
So, what are we missing?...
35
•Partitioner: Defines how the elements in a key-value pair RDD are
partitioned across the cluster.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
36
•partitionBy(partitioner): Partitions all elements of the same key to the
same node in the cluster, as defined by the partitioner.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
37
•mapPartitions(func): Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type Iterator[T] =>
Iterator[U] when running on an RDD of type T.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
function()
function()
function()
Friday, May 9, 14
So, what are we missing?...
38
• persist(storageLevel): Set this RDD's storage level to persist (cache)
its values across operations after the first time it is computed.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
Third Attempt
39
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
40
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
41
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
42
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
43
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
44
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third attempt
45
Friday, May 9, 14
Third attempt
46
Friday, May 9, 14
Third attempt
47
Friday, May 9, 14
ALS Running Times
48
Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Friday, May 9, 14
Section name 49
Fin
Friday, May 9, 14
Section name 50
Friday, May 9, 14
Section name 51
Friday, May 9, 14
Section name 52
Friday, May 9, 14
Section name 53
Friday, May 9, 14
Section name 54
Friday, May 9, 14
Section name 55
Friday, May 9, 14
Section name 56
Friday, May 9, 14

More Related Content

What's hot

The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal RecommenderPat Ferrel
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data AnalyticsAnkur Bansal
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksZimin Park
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Philip Fisher-Ogden
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringViet-Trung TRAN
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at SpotifyRohan Agrawal
 
Advanced Schema Design Patterns
Advanced Schema Design Patterns Advanced Schema Design Patterns
Advanced Schema Design Patterns MongoDB
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...Databricks
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 

What's hot (20)

The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal Recommender
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networks
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filtering
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at Spotify
 
Advanced Schema Design Patterns
Advanced Schema Design Patterns Advanced Schema Design Patterns
Advanced Schema Design Patterns
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 
EVCache at Netflix
EVCache at NetflixEVCache at Netflix
EVCache at Netflix
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 

Viewers also liked

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2njit-ronbrown
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization MachinesPavel Kalaidin
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Domonkos Tikk
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorizationrubyyc
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix FactorizationTatsuya Yokota
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsAladejubelo Oluwashina
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومfaradars
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringDKALab
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Viewers also liked (17)

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization Machines
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix Factorization
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender Systems
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دوم
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative Filtering
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Similar to Collaborative Filtering with Spark

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Sc Huang
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfAlphaIssaghaDiallo
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzikitzik cohen
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network Cigdem Aslay
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systemsfuchaoqun
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systemsNAVER Engineering
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptxJasonTuran2
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
Ire presentation
Ire presentationIre presentation
Ire presentationRaj Patel
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINERaj Patel
 

Similar to Collaborative Filtering with Spark (20)

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdf
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzik
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systems
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systems
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
Ire presentation
Ire presentationIre presentation
Ire presentation
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
 

Recently uploaded

%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 

Recently uploaded (20)

%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 

Collaborative Filtering with Spark

  • 1. May 9, 2014 Collaborative Filtering with Spark Chris Johnson @MrChrisJohnson Friday, May 9, 14
  • 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a graduate student at UT Austin Friday, May 9, 14
  • 3. 3 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 4. 4 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 5. Collaborative Filtering - “The Netflix Prize” 5 Friday, May 9, 14
  • 6. Collaborative Filtering 6 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Image via Erik Bernhardsson Friday, May 9, 14
  • 7. 7 Collaborative Filtering at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing Friday, May 9, 14
  • 9. Explicit Matrix Factorization 9 Movies Users Chris Inception •Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies Friday, May 9, 14
  • 10. • = bias for user • = bias for item • = regularization parameter Explicit Matrix Factorization 10 Chris Inception ? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4 •Approximate ratings matrix by the product of low- dimensional user and movie matrices •Minimize RMSE (root mean squared error) • = user rating for movie • = user latent factor vector • = item latent factor vector X Y Friday, May 9, 14
  • 11. Implicit Matrix Factorization 11 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 •Replace Stream counts with binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights • = bias for user • = bias for item • = regularization parameter • = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector X Y Friday, May 9, 14
  • 12. Alternating Least Squares 12 • Initialize user and item vectors to random noise • Fix item vectors and solve for optimal user vectors – Take the derivative of loss function with respect to user’s vector, set equal to 0, and solve – Results in a system of linear equations with closed form solution! • Fix user vectors and solve for optimal item vectors • Repeat until convergence code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 13. Alternating Least Squares 13 • Note that: • Then, we can pre-compute once per iteration – and only contain non-zero elements for tracks that the user streamed – Using sparse matrix operations we can then compute each user’s vector efficiently in time where is the number of tracks the user streamed code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 14. 14 Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 16. Scaling up Implicit Matrix Factorization with Hadoop 16 Friday, May 9, 14
  • 17. Hadoop at Spotify 2009 17 Friday, May 9, 14
  • 18. Hadoop at Spotify 2014 18 700 Nodes in our London data center Friday, May 9, 14
  • 19. Implicit Matrix Factorization with Hadoop 19 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1 Figure via Erik Bernhardsson Friday, May 9, 14
  • 20. Implicit Matrix Factorization with Hadoop 20 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector! Figure via Erik Bernhardsson Friday, May 9, 14
  • 21. Hadoop suffers from I/O overhead 21 IO Bottleneck Friday, May 9, 14
  • 22. Spark to the rescue!! 22 Vs http://www.slideshare.net/Hadoop_Summit/spark-and-shark Spark Hadoop Friday, May 9, 14
  • 24. First Attempt 24 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 25. First Attempt 25 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 26. First Attempt 26 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 28. First Attempt 28 •Issues: –Unnecessarily sending multiple copies of item vector to each node –Unnecessarily shuffling data across cluster at each iteration –Not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 29. Second Attempt 29 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 30. Second Attempt 30 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 31. Second Attempt 31 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 34. Second Attempt 34 •Issues: –Still Unnecessarily shuffling data across cluster at each iteration –Still not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 35. So, what are we missing?... 35 •Partitioner: Defines how the elements in a key-value pair RDD are partitioned across the cluster. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 36. So, what are we missing?... 36 •partitionBy(partitioner): Partitions all elements of the same key to the same node in the cluster, as defined by the partitioner. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 37. So, what are we missing?... 37 •mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 function() function() function() Friday, May 9, 14
  • 38. So, what are we missing?... 38 • persist(storageLevel): Set this RDD's storage level to persist (cache) its values across operations after the first time it is computed. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 39. Third Attempt 39 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 40. Third Attempt 40 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 41. Third Attempt 41 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 42. Third Attempt 42 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 43. Third Attempt 43 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 44. Third Attempt 44 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 48. ALS Running Times 48 Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf Friday, May 9, 14