1. Netflix Recommendations: Beyond the 5 Stars
ACM SF-Bay Area, October 22, 2012
Xavier Amatriain
Personalization Science and Engineering - Netflix
@xamat
2. Outline
1. The Netflix Prize & the Recommendation
Problem
2. Anatomy of Netflix Personalization
3. Data & Models
4. And…
a) Consumer (Data) Science
b) Or Software Architectures
4. What we were interested in:
§ High quality recommendations
§ Proxy question: accuracy in predicted rating
§ Improve by 10% = $1 million!
Results
§ Top 2 algorithms (SVD & RBM) still in production
5. What about the final prize ensembles?
§ Our offline studies showed they were too computationally
intensive to scale
§ Expected improvement not worth the engineering effort
§ Plus… focus had already shifted to other issues that
had more impact than rating prediction.
14. Genre rows
§ Personalized genre rows focus on user interest
§ Also provide context and “evidence”
§ Important for member satisfaction – moving personalized rows to top on
devices increased retention
§ How are they generated?
§ Implicit: based on user’s recent plays, ratings, & other interactions
§ Explicit taste preferences
§ Hybrid: combine the above
§ Also take into account:
§ Freshness - has this been shown before?
§ Diversity – avoid repeating tags and genres, limit number of TV genres, etc.
21. Similars
§ Displayed in many different contexts
§ In response to user actions/context (search, queue add…)
§ "More like…" rows
22. Anatomy of a Personalization - Recap
§ Everything is a recommendation: not only rating
prediction, but also ranking, row selection, similarity…
§ We strive to make it easy for the user, but…
§ We want the user to be aware and be involved in the
recommendation process
§ Deal with implicit/explicit and hybrid feedback
§ Add support/explanations for recommendations
§ Consider issues such as diversity or freshness
24. Big Data @Netflix
§ Almost 30M subscribers
§ Ratings: 4M/day
§ Searches: 3M/day
§ Plays: 30M/day
§ 2B hours streamed in Q4 2011
§ 1B hours in June 2012
25. Smart Models
§ Logistic/linear regression
§ Elastic nets
§ SVD and other MF models
§ Restricted Boltzmann Machines
§ Markov Chains
§ Different clustering approaches
§ LDA
§ Association Rules
§ Gradient Boosted Decision Trees
§ …
26. SVD
X[m × n] = U[m × r] S[r × r] (V[n × r])^T
§ X: m × n matrix (e.g., m users, n videos)
§ U: m × r matrix (m users, r concepts)
§ S: r × r diagonal matrix (strength of each 'concept') (r: rank of the matrix)
§ V: n × r matrix (n videos, r concepts)
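As a toy illustration of this factorization (a sketch with made-up ratings, using numpy; not Netflix code):

import numpy as np

# Toy ratings matrix X: m users x n videos (made-up values)
X = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [2., 1., 4., 5.]])

r = 2  # number of latent "concepts" to keep

# Full SVD, truncated to rank r: X ~= U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r = U[:, :r]             # m x r user-to-concept matrix
S_r = np.diag(s[:r])       # r x r concept strengths
V_r = Vt[:r, :].T          # n x r video-to-concept matrix

X_hat = U_r @ S_r @ V_r.T  # low-rank reconstruction used for prediction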
27. Simon Funk's SVD
§ One of the most interesting findings during the Netflix Prize came out of a blog post
§ Incremental, iterative, and approximate way to compute the SVD using gradient descent
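A minimal sketch of that idea (a hypothetical helper, not Simon Funk's actual code): learn user and item factors by stochastic gradient descent over the observed ratings only.

import numpy as np

def funk_svd(ratings, n_users, n_items, f=20, lr=0.005, reg=0.02, epochs=20):
    """ratings: list of (user, item, rating) triples; returns factors P (users) and Q (items)."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))      # user factor vectors p_u
    Q = rng.normal(scale=0.1, size=(n_items, f))      # item factor vectors q_v
    for _ in range(epochs):
        for u, v, r in ratings:
            err = r - P[u] @ Q[v]                     # error on this single observed rating
            pu = P[u].copy()
            P[u] += lr * (err * Q[v] - reg * P[u])    # gradient step with L2 regularization
            Q[v] += lr * (err * pu - reg * Q[v])
    return P, Q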
28. SVD for Rating Prediction
§ User factor vectors p_u ∈ ℝ^f and item factor vectors q_v ∈ ℝ^f
§ Baseline b_uv = μ + b_u + b_v (user & item deviation from average)
§ Predict rating as r'_uv = b_uv + p_u^T q_v
§ SVD++ (Koren et al.): asymmetric variation with implicit feedback
  r'_uv = b_uv + q_v^T ( |R(u)|^(−1/2) Σ_{j∈R(u)} (r_uj − b_uj) x_j + |N(u)|^(−1/2) Σ_{j∈N(u)} y_j )
§ Where:
§ q_v, x_v, y_v ∈ ℝ^f are three item factor vectors
§ Users are not parametrized, but rather represented by:
§ R(u): items rated by user u
§ N(u): items for which the user has given implicit preference (e.g. rated vs. not rated)
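A direct transcription of the SVD++ prediction into Python, as a sketch (the parameter names follow the formula, not any particular library's API, and R(u), N(u) are assumed non-empty):

import numpy as np

def svdpp_predict(u, v, mu, b_user, b_item, Q, X, Y, R, N, ratings):
    """Q, X, Y: item factor matrices (one row per item); R[u]/N[u]: explicit/implicit item sets;
    ratings[(u, j)]: observed rating of user u on item j."""
    b_uv = mu + b_user[u] + b_item[v]
    explicit = sum((ratings[(u, j)] - (mu + b_user[u] + b_item[j])) * X[j] for j in R[u])
    implicit = sum(Y[j] for j in N[u])
    profile = explicit / np.sqrt(len(R[u])) + implicit / np.sqrt(len(N[u]))
    return b_uv + Q[v] @ profile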
29. Artificial Neural Networks – 4 generations
§ 1st - Perceptrons (~60s)
§ Single layer of hand-coded features
§ Linear activation function
§ Fundamentally limited in what they can learn to do.
§ 2nd - Back-propagation (~80s)
§ Back-propagate error signal to get derivatives for learning
§ Non-linear activation function
§ 3rd - Belief Networks (~90s)
§ Directed acyclic graph composed of (visible & hidden) stochastic variables
with weighted connections.
§ Infer the states of the unobserved variables & learn interactions between
variables to make network more likely to generate observed data.
30. Restricted Boltzmann Machines
§ Restrict the connectivity to make learning easier.
§ Only one layer of hidden units (although multiple layers are possible)
§ No connections between hidden units
§ Hidden units are independent given the visible states, so we can quickly get an unbiased sample from the posterior distribution over hidden "causes" when given a data-vector
§ RBMs can be stacked to form Deep Belief Nets (DBN) – the 4th generation of ANNs
(Diagram: bipartite graph with hidden units j and visible units i)
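Because the hidden units are independent given the visible layer, sampling them is a single vectorized step. A minimal sketch for a binary RBM (made-up shapes, not Netflix's implementation):

import numpy as np

def sample_hidden(v, W, b_hidden, rng=None):
    """One step of h ~ P(h | v) for a binary RBM.
    v: visible vector, W: (visible x hidden) weight matrix, b_hidden: hidden biases."""
    rng = rng or np.random.default_rng()
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_hidden)))   # sigmoid activation, one per hidden unit
    return (rng.random(p_h.shape) < p_h).astype(float), p_h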
32. Ranking
§ Key algorithm, sorts titles in most contexts
33. Ranking
§ Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
§ Goal: find the best possible ordering of a set of videos for a user within a specific context in real-time
§ Objective: maximize consumption
§ Aspirations: played & "enjoyed" titles have best score
§ Akin to CTR forecast for ads/search results
Factors
§ Accuracy
§ Novelty
§ Diversity
§ Freshness
§ Scalability
§ …
34. Ranking
§ Popularity is the obvious baseline
§ Ratings prediction is a clear secondary data
input that allows for personalization
§ We have added many other features (and tried
many more that have not proved useful)
§ What about the weights?
§ Based on A/B testing
§ Machine-learned
35. Example: Two features, linear model
(Plot: items 1-5 placed by Popularity on the x-axis and Predicted Rating on the y-axis; the score determines the Final Ranking)
Linear Model: f_rank(u,v) = w1 * p(v) + w2 * r(u,v) + b
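In code, the scoring and sorting steps of that toy ranker could look like this (made-up weights and values, purely illustrative):

def frank(popularity, predicted_rating, w1=1.0, w2=2.0, b=0.0):
    """Linear ranking score: f_rank(u, v) = w1 * p(v) + w2 * r(u, v) + b."""
    return w1 * popularity + w2 * predicted_rating + b

# Score and sort a candidate set for one user, highest score first
videos = [("A", 0.9, 3.1), ("B", 0.4, 4.6), ("C", 0.7, 3.9)]  # (title, p(v), r(u,v))
ranked = sorted(videos, key=lambda t: frank(t[1], t[2]), reverse=True)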
40. Learning to rank
§ Machine learning problem: goal is to construct ranking
model from training data
§ Training data can have partial order or binary judgments
(relevant/not relevant).
§ Resulting order of the items typically induced from a
numerical score
§ Learning to rank is a key element for personalization
§ You can treat the problem as a standard supervised
classification problem
41. Learning to Rank Approaches
1. Pointwise
§ Ranking function minimizes loss function defined on individual
relevance judgment
§ Ranking score based on regression or classification
§ Ordinal regression, Logistic regression, SVM, GBDT, …
2. Pairwise
§ Loss function is defined on pair-wise preferences
§ Goal: minimize number of inversions in ranking
§ Ranking problem is then transformed into the binary classification
problem
§ RankSVM, RankBoost, RankNet, FRank…
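To make the pairwise formulation concrete, here is a minimal hinge-style pairwise loss in the spirit of RankSVM (a sketch, not any library's actual API): each training pair (i, j) says item i should rank above item j, and the loss penalizes inversions.

import numpy as np

def pairwise_hinge_loss(scores, pairs, margin=1.0):
    """scores: model score per item; pairs: (i, j) preferences where i should beat j."""
    return sum(max(0.0, margin - (scores[i] - scores[j])) for i, j in pairs) / len(pairs)

scores = np.array([2.3, 1.1, 0.4])
pairs = [(0, 1), (0, 2), (1, 2)]           # ground-truth preference pairs
print(pairwise_hinge_loss(scores, pairs))  # ~0.1: only the (1, 2) pair violates the margin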
42. Learning to rank - metrics
§ Quality of ranking measured using metrics such as:
§ Normalized Discounted Cumulative Gain:
  NDCG = DCG / IDCG, where DCG = relevance_1 + Σ_{i=2..n} relevance_i / log₂(i)
§ Mean Reciprocal Rank (MRR):
  MRR = (1/|H|) Σ_{h∈H} 1/rank(h)
§ Fraction of Concordant Pairs (FCP):
  FCP = Σ_{i≠j} CP(x_i, x_j) / (n(n−1)/2)
§ Others…
§ But, it is hard to optimize machine-learned models directly on these measures (they are not differentiable)
§ Recent research on models that directly optimize ranking measures
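A small plain-Python sketch of the DCG/NDCG and MRR formulas above (toy relevance values):

import math

def dcg(relevances):
    """DCG = relevance_1 + sum_{i=2..n} relevance_i / log2(i), with 1-based positions."""
    return relevances[0] + sum(r / math.log2(i) for i, r in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

def mrr(first_hit_ranks):
    """MRR = average of 1/rank over the set H of queries/users."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

print(ndcg([3, 2, 3, 0, 1]))  # relevances in the order the model ranked the items
print(mrr([1, 3, 2]))         # first relevant item at positions 1, 3 and 2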
43. Learning to Rank Approaches
3. Listwise
a. Indirect Loss Function
§ RankCosine: similarity between ranking list and ground truth as loss function
§ ListNet: KL-divergence as loss function by defining a probability distribution
§ Problem: optimization of listwise loss function may not optimize IR metrics
b. Directly optimizing IR measures (difficult since they are not differentiable)
§ Directly optimize IR measures through Genetic Programming
§ Directly optimize measures with Simulated Annealing
§ Gradient descent on smoothed version of objective function (e.g. CLiMF presented at RecSys 2012 or TFMAP at SIGIR 2012)
§ SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
§ AdaRank uses boosting to optimize NDCG
44. Similars
§ Different similarities computed from different sources: metadata, ratings, viewing data…
§ Similarities can be treated as data/features
§ Machine-learned models improve our concept of "similarity"
45. Data & Models - Recap
§ All sorts of feedback from the user can help generate better
recommendations
§ Need to design systems that capture and take advantage of
all this data
§ The right model is as important as the right data
§ It is important to come up with new theoretical models, but
also need to think about application to a domain, and practical
issues
§ Rating prediction models are only part of the solution to
recommendation (think about ranking, similarity…)
46. More data or better models?
Really?
Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (former Kosmix)
47. More data or better models?
Sometimes, it's not about more data
48. More data or better models?
[Banko and Brill, 2001]
Norvig: "Google does not have better Algorithms, only more Data"
Many features / low-bias models
49. More data or better models?
Model performance vs. sample size (actual Netflix system)
(Plot: performance metric, y-axis 0 to 0.09, vs. number of training examples, x-axis 0 to 6,000,000)
Sometimes, it's not about more data
50. More data or better models?
Data without a sound approach = noise
52. Consumer Science
§ Main goal is to effectively innovate for customers
§ Innovation goals
§ “If you want to increase your success rate, double
your failure rate.” – Thomas Watson, Sr., founder of
IBM
§ The only real failure is the failure to innovate
§ Fail cheaply
§ Know why you failed/succeeded
53. Consumer (Data) Science
1. Start with a hypothesis:
§ Algorithm/feature/design X will increase member engagement
with our service, and ultimately member retention
2. Design a test
§ Develop a solution or prototype
§ Think about dependent & independent variables, control,
significance…
3. Execute the test
4. Let data speak for itself
54. Offline/Online testing process
(Flow: Offline testing (days) → [success] → Online A/B testing (weeks to months) → [success] → Rollout feature to all users; [fail] loops back)
55. Offline testing
§ Optimize algorithms offline
§ Measure model performance, using metrics such as:
§ Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of
Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
§ Offline performance used as an indication to make informed
decisions on follow-up A/B tests
§ A critical (and unsolved) issue is how offline metrics can
correlate with A/B test results.
§ Extremely important to define a coherent offline evaluation
framework (e.g. How to create training/testing datasets is not
trivial)
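For instance, one common choice (a sketch of the general idea, not necessarily Netflix's framework) is a time-based split, so that the test set only contains interactions that happen after everything in the training set:

def temporal_split(events, train_fraction=0.8):
    """events: list of (timestamp, user, item, feedback) tuples; returns (train, test).
    Splitting by time avoids leaking future behaviour into the training set."""
    events = sorted(events, key=lambda e: e[0])
    cut = int(len(events) * train_fraction)
    return events[:cut], events[cut:]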
56. Executing A/B tests
§ Many different metrics, but ultimately trust user
engagement (e.g. hours of play and customer retention)
§ Think about significance and hypothesis testing
§ Our tests usually have thousands of members and 2-20 cells
§ A/B Tests allow you to try radical ideas or test many
approaches at the same time.
§ We typically have hundreds of customer A/B tests running
§ Decisions on the product always data-driven
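As one concrete (and generic) example of such significance testing, a two-proportion z-test on a retention-style metric; the counts below are made up and this is not Netflix's actual methodology:

import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """z-statistic comparing e.g. retention rate in the control cell (A) vs. a test cell (B)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_ztest(4200, 5000, 4320, 5000)  # made-up retention counts per cell
print(abs(z) > 1.96)  # True -> difference is significant at the 5% level (two-sided)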
57. What to measure
§ OEC: Overall Evaluation Criteria
§ In an A/B test framework, the measure of success is key
§ Short-term metrics do not always align with long term
goals
§ E.g. CTR: generating more clicks might mean that our
recommendations are actually worse
§ Use long term metrics such as LTV (Life time value)
whenever possible
§ In Netflix, we use member retention
58. What to measure
§ Short-term metrics can sometimes be informative, and
may allow for faster decision-taking
§ At Netflix we use many such as hours streamed by users or
%hours from a given algorithm
§ But, be aware of several caveats of using early decision mechanisms
§ Initial effects appear to trend. See "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" [Kohavi et al., KDD 2012]
59. Consumer Data Science - Recap
§ Consumer Data Science aims to innovate for the
customer by running experiments and letting data speak
§ This is mainly done through online AB Testing
§ However, we can speed up innovation by experimenting
offline
§ But, both for online and offline experimentation, it is
important to choose the right metric and experimental
framework
64. Event & Data Distribution
• UI devices should broadcast many
different kinds of user events
• Clicks
• Presentations
• Browsing events
• …
• Events vs. data
• Some events only need to be
propagated and trigger an action
(low latency, low information per
event)
• Others need to be processed and
“turned into” data (higher latency,
higher information quality).
• And… there are many in between
• Real-time event flow managed
through internal tool (Manhattan)
• Data flow mostly managed through
Hadoop.
66. Offline Jobs
• Two kinds of offline jobs
• Model training
• Batch offline computation of
recommendations/
intermediate results
• Offline queries either in Hive or Pig
• Need a publishing mechanism
that solves several issues
• Notify readers when result of
query is ready
• Support different repositories
(s3, cassandra…)
• Handle errors, monitoring…
• We do this through Hermes
68. Computation
• Two ways of computing personalized
results
• Batch/offline
• Online
• Each approach has pros/cons
• Offline
+ Allows more complex computations
+ Can use more data
- Cannot react to quick changes
- May result in staleness
• Online
+ Can respond quickly to events
+ Can use most recent data
- May fail because of SLA
- Cannot deal with “complex”
computations
• It’s not an either/or decision
• Both approaches can be combined
70. Signals & Models
• Both offline and online algorithms are
based on three different inputs:
• Models: previously trained from
existing data
• (Offline) Data: previously
processed and stored information
• Signals: fresh data obtained from
live services
• User-related data
• Context data (session, date,
time…)
72. Results
• Recommendations can be serviced
from:
• Previously computed lists
• Online algorithms
• A combination of both
• The decision on where to service the
recommendation from can respond to
many factors including context.
• Also, important to think about the
fallbacks (what if plan A fails)
• Previously computed lists/intermediate
results can be stored in a variety of
ways
• Cache
• Cassandra
• Relational DB
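A sketch of the fallback logic (hypothetical object names, not Netflix's actual services): try the online algorithm within its latency budget, otherwise serve a previously computed list.

def get_recommendations(user_id, context, online_ranker, precomputed_store, timeout_s=0.1):
    """Plan A: compute online within the SLA; plan B: fall back to a precomputed list."""
    try:
        return online_ranker.rank(user_id, context, timeout=timeout_s)
    except Exception:  # timeout, missing signals, downstream service error...
        personalized = precomputed_store.get(user_id)
        return personalized if personalized else precomputed_store.get("popular", [])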
73. Alerts and Monitoring
§ A non-trivial concern in large-scale recommender
systems
§ Monitoring: continuously observe quality of system
§ Alert: fast notification if quality of system goes below a
certain pre-defined threshold
§ Questions:
§ What do we need to monitor?
§ How do we know something is "bad enough" to alert?
74. What to monitor
§ Staleness
§ Monitor time since last data update
(Chart annotation: "Did something go wrong here?")
75. What to monitor
§ Algorithmic quality
§ Monitor different metrics by comparing what users do and what
your algorithm predicted they would do
76. What to monitor
§ Algorithmic quality
§ Monitor different metrics by comparing what users do and what your algorithm predicted they would do
(Chart annotation: "Did something go wrong here?")
77. What to monitor
§ Algorithmic source for users
§ Monitor how users interact with different algorithms
(Chart annotations: "Algorithm X", "New version", "Did something go wrong here?")
78. When to alert
§ Alerting thresholds are hard to tune
§ Avoid unnecessary alerts (the “learn-to-ignore problem”)
§ Avoid important issues being noticed before the alert happens
§ Rules of thumb
§ Alert on anything that will impact user experience significantly
§ Alert on issues that are actionable
§ If a noticeable event happens without an alert… add a new alert
for next time
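A minimal sketch of the staleness and quality checks described above (the thresholds are made-up placeholders; tuning them is exactly the hard part):

import time

def staleness_alert(last_update_ts, max_age_s=6 * 3600):
    """Alert if the data behind a recommendation row has not been refreshed recently enough."""
    return time.time() - last_update_ts > max_age_s

def quality_alert(predicted_ctr, observed_ctr, max_relative_drop=0.2):
    """Alert if what users actually do diverges too much from what the model predicted."""
    return observed_ctr < predicted_ctr * (1 - max_relative_drop)

if staleness_alert(last_update_ts=0) or quality_alert(predicted_ctr=0.05, observed_ctr=0.03):
    print("ALERT: investigate before it significantly impacts the user experience")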
80. The Personalization Problem
§ The Netflix Prize simplified the recommendation problem
to predicting ratings
§ But…
§ User ratings are only one of the many data inputs we have
§ Rating predictions are only part of our solution
§ Other algorithms such as ranking or similarity are very important
§ We can reformulate the recommendation problem
§ Function to optimize: probability a user chooses something and
enjoys it enough to come back to the service
81. More data +
Better models +
More accurate metrics +
Better approaches & architectures
Lots of room for improvement!