Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
1. Introduction
Train/Valid split, underfitting and overfitting
Learning Curve in recommendation engines
2. Evaluation functions
Basic metrics for recommender engines (Precision, Recall, TPR, TNR...)
Regression, Classification, Ranking metrics
3. Loss functions
Optimization problems and loss function properties
Regression, Classification, Ranking losses
4. Practical recommendations
Regularization, HP optimization, embedding evaluation
Introduction
Previous Meetup Recap: Recommendation Engine Types
Recommendation
engine
Content-based
Collaborative-filtering
Hybrid engine
Memory-based
Model-based
Item-Item
User-User
User-Item
Model When? Problem definition Solution strategies
Content-based Item Cold start Least Square, Deep Learning
Item-Item n_users >> n_items Affinity Matrix
User-User n_user << n_items KNN, Affinity Matrix
User-Item Better performance Matrix Factorization, Deep Learning
Previous Meetup Recap: Recommendation Engine Models
Global Avg User Avg Item-Item Linear Linear + Reg Matrix Fact Deep Learning
Domains Baseline Baseline users >> items Known “I” Known “I” Unknown “I” Extra datasets
Model Complexity Trivial Trivial Simple Linear Linear Linear Non-linear
Time Complexity + + +++ ++++ ++++ ++++ ++
Overfit/Underfit Underfit Underfit May Underfit May Overfit May Perform Bad May Overfit Can Overfit
Hyper-Params 0 0 0 1 2 2–3 many
Implementation Numpy Numpy Numpy Numpy Numpy LightFM, Spark NNet libraries
Optimization Problem – Matrix Factorization Example
(or R)
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Ratings of User #1
Embedding of User #1
Embedding of Item #1
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Ratings of User #m
To Item #n
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
AVAILABLE DATASET
?
?
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I, such that the difference between each datapoint in R and the product
of the corresponding user and item embeddings is minimal.
(or R)
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I, such that the difference between each datapoint in R and the product
of the corresponding user and item embeddings is minimal.
(or R)
3. How are we going to solve the problem?
2. What properties are we looking for in our outputs?
- Exact rating vs like/dislike vs ranking predictions
1. What type of data do we have?
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
Business decisions
Technical decisions
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
EVALUATION FUNCTIONS
LOSS FUNCTIONS
RANDOM SEARCH, GP
COMPARE METRICS
ML FOR RECOMMENDATION
Business decisions
Technical decisions
Objectives Types (from data point of view)
Classification
● click/no-click
● like/dislike/missing
● estimated probability of like (e.g. watch time)
Regression
● absolute rating (e.g. from 1/5 to 5/5)
● number of interactions
Ranking
● estimated order of preference (e.g. watch time)
● pairwise comparisons
Unsupervised
● clustering of items
● clustering of users
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
A LOSS FUNCTION THAT PENALIZES ERRORS
ON ALL-STAR RATINGS MORE HEAVILY
RANKING LOSS FUNCTION
CLASSIFICATION LOSS FUNCTION
Cross Validation – In Traditional Machine Learning
[Diagram: 4-fold cross validation — folds 1–4 rotate so that each fold serves once as the validation set]
Cross Validation – In Recommendation Engines
Dataset
Cross Validation – In Recommendation Engines
Split so that every user is present in both train and valid
Stronger: split so that every user has an 80/20 train/valid ratio
Dataset
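A per-user 80/20 split can be sketched in numpy — for each user, hold out ~20% of their interactions for validation so every user appears on both sides (the function name and toy data are illustrative):

```python
import numpy as np

def per_user_split(user_ids, valid_frac=0.2, seed=0):
    """Return boolean masks (train, valid) so that every user keeps
    ~(1 - valid_frac) of their interactions in train."""
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    valid = np.zeros(len(user_ids), dtype=bool)
    for u in np.unique(user_ids):
        idx = np.where(user_ids == u)[0]   # this user's interaction rows
        rng.shuffle(idx)
        n_valid = max(1, int(len(idx) * valid_frac))
        valid[idx[:n_valid]] = True        # hold out a slice per user
    return ~valid, valid

# toy interaction log: user id of each (user, item, rating) row
users = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
train, valid = per_user_split(users)
```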
Underfitting and Overfitting
Model fails to learn
relations in data
Model is a good fit
for the data
Model fails to
generalize
[Plots: the same data fit by increasingly complex models (left to right), each evaluated on new samples]
Underfitting and Overfitting
[Plot: training vs validation error as model complexity increases — underfitting on the left, overfitting on the right]
Underfitting and Overfitting
[Learning curve: loss function or metric vs epoch]
Mini-Batch Gradient Descent
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
● plot error vs epoch
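The loop above can be sketched in numpy for the matrix-factorization example — a toy mini-batch SGD on observed (user, item, rating) triples; shapes, learning rate, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 4, 2
# toy observed triples (user, item, rating)
data = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0), (3, 3, 1.0), (4, 2, 2.0)]

U = 0.1 * rng.standard_normal((n_users, k))    # user embeddings
I = 0.1 * rng.standard_normal((n_items, k))    # item embeddings
lr, n_epochs, batch_size = 0.05, 200, 2
errors = []

for epoch in range(n_epochs):
    order = rng.permutation(len(data))         # shuffle the batches
    for start in range(0, len(data), batch_size):
        batch = [data[j] for j in order[start:start + batch_size]]
        for u, i, r in batch:
            pred = U[u] @ I[i]                 # compute the prediction
            err = pred - r                     # compute the error
            gu, gi = err * I[i], err * U[u]    # compute the gradients
            U[u] -= lr * gu                    # update the parameters
            I[i] -= lr * gi
    # track mean squared error per epoch (plot this vs epoch)
    errors.append(np.mean([(U[u] @ I[i] - r) ** 2 for u, i, r in data]))
```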
Underfitting and Overfitting
A very simple way of checking underfitting
[Scatter plot: predicted Y vs ground-truth Y]
If the model always predicts the same value regardless of the input, it is underfitting.
Evaluation Functions
What do we want to evaluate?
Classification
● True Positive Rate (TPR)
● True Negative Rate (TNR)
● Precision
● F-measure
Regression
● Mean Square Error (MSE)
Ranking
● Recall@K
● Precision@K
● CG, DCG, nDCG
Ranking/Classification metrics
● AUC
Some common evaluation functions
Regression
Mean Square Error (MSE)
● Easy to compute
● Linear gradient
● Can also be used as loss function
Mean Absolute Error (MAE)
● Easy to compute
● Easy to interpret
● Discontinuous gradient
● Can’t be used as loss function
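Both regression metrics are one-liners in numpy (toy ratings, illustrative values):

```python
import numpy as np

y_true = np.array([4.0, 3.0, 5.0, 1.0])   # true ratings
y_pred = np.array([3.5, 3.0, 4.0, 2.0])   # model predictions

mse = np.mean((y_true - y_pred) ** 2)     # penalizes large errors quadratically
mae = np.mean(np.abs(y_true - y_pred))    # same units as the ratings
```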
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
Model 1 Model 2
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
User’s likes
User’s dislikes
Model recommendations
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
User’s likes
User’s dislikes
Model recommendations
Recall = 5/7
Precision = 5/5 = 1
Recall = 7/7 = 1
Precision = 7/9
Model 1 Model 2
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
Classification 1/2
True Positive Rate (a.k.a TPR, Recall, Sensitivity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TPR <= 1 (higher is better)
True Negative Rate (a.k.a TNR, Selectivity, Specificity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TNR <= 1 (higher is better)
Classification 2/2
Precision
● Easy to understand
● Useful for likes/dislikes datasets
● Measure quality of recommendation
● 0 <= Precision <= 1 (higher is better)
F-measure
● Balance precision and recall
● Not good for recommendation, because it
doesn’t take True Negatives into account
● 0 <= F-measure <= 1 (higher is better)
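The four metrics can be computed from confusion counts; the example reuses Model 1 from the Toy Story slides (assuming, as shown there, 7 liked and 2 disliked items, with 5 likes and no dislikes recommended):

```python
def classification_metrics(tp, fp, tn, fn):
    """Basic recommender classification metrics from confusion counts."""
    tpr = tp / (tp + fn)                       # recall / sensitivity
    tnr = tn / (tn + fp)                       # specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, tnr, precision, f1

# Model 1: 5 of the 7 likes recommended, neither of the 2 dislikes
tpr, tnr, precision, f1 = classification_metrics(tp=5, fp=0, tn=2, fn=2)
```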
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user’s total number of positive items
● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie True rating Predicted score
Toy Story 1 1.0 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Total Positive = ?
Recall@K = ?
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user’s total number of positive items
● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie True rating Predicted score
Toy Story 1 1.0 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Total Positive = 4
Recall@K = 2 / 4
top 1
top 2
top 3
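A minimal numpy sketch of Recall@K reproducing this worked example (assuming the table columns are true rating, then predicted score):

```python
import numpy as np

def recall_at_k(true_ratings, pred_scores, k):
    """Fraction of the user's positive items found in the top-k predictions."""
    top_k = np.argsort(pred_scores)[::-1][:k]           # k best-scored items
    positives = np.where(np.asarray(true_ratings) > 0)[0]
    hits = len(set(top_k) & set(positives))
    return hits / len(positives)

# the slide's example: TS1, TS2, KP1, KP2, A1
true_ratings = [1.0, 0.9, 0.7, 0.6, -0.2]
pred_scores  = [0.9, 0.7, 0.1, -0.1, 0.4]
r = recall_at_k(true_ratings, pred_scores, k=3)          # top-3 = {TS1, TS2, A1}
```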
Ranking 1/3
Recall@K
● In math terms:
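The formula on the slide is an image; it can be reconstructed from the definition above (notation assumed: TopK(u) is the set of the K highest-scored items for user u, Pos(u) the set of the user's positive items):

```latex
\mathrm{Recall@K}(u) \;=\; \frac{\left|\,\mathrm{TopK}(u) \cap \mathrm{Pos}(u)\,\right|}{\left|\mathrm{Pos}(u)\right|}
```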
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score is 1 if the user has K or more positive items and the top K only contains positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie True rating Predicted score
Toy Story 1 1.0 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Precision@K = ?
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score is 1 if the user has K or more positive items and the top K only contains positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie True rating Predicted score
Toy Story 1 1.0 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Precision@K = 2 / 3
top 1
top 2
top 3
Ranking 2/3
Precision@K
● In math terms:
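The formula on the slide is an image; it can be reconstructed from the definition above (notation assumed: TopK(u) is the set of the K highest-scored items for user u, Pos(u) the set of the user's positive items):

```latex
\mathrm{Precision@K}(u) \;=\; \frac{\left|\,\mathrm{TopK}(u) \cap \mathrm{Pos}(u)\,\right|}{K}
```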
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie True rating Predicted score
Toy Story 1 1 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = ?
CG = ?
DCG = ?
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie True rating Predicted score
Toy Story 1 1 0.9
Toy Story 2 0.9 0.7
Kung Fu Panda 1 0.7 0.1
Kung Fu Panda 2 0.6 -0.1
Annabelle 1 -0.2 0.4
K = 3
TOP K = {TS1, TS2, A1}
CG = 1.0 + 0.9 - 0.2
DCG = 1/1 + 0.9/2 - 0.2/3
top 1
top 2
top 3
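A minimal sketch of CG and DCG reproducing this example. Note the slides discount by 1/position; the more common DCG definition discounts by 1/log2(position + 1):

```python
import numpy as np

def cg_dcg(true_ratings, pred_scores, k):
    """CG and DCG over the top-k predicted items, with the slides'
    1/position discount (standard DCG uses 1/log2(position + 1))."""
    order = np.argsort(pred_scores)[::-1][:k]   # k best-scored items
    gains = np.asarray(true_ratings)[order]     # their true ratings
    cg = gains.sum()
    dcg = (gains / np.arange(1, k + 1)).sum()   # discount by position
    return cg, dcg

true_ratings = [1.0, 0.9, 0.7, 0.6, -0.2]       # TS1, TS2, KP1, KP2, A1
pred_scores  = [0.9, 0.7, 0.1, -0.1, 0.4]
cg, dcg = cg_dcg(true_ratings, pred_scores, k=3)
```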
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
Hybrid Ranking/Classification
AUC
● Vary positive prediction threshold (not just 0)
● Compute TPR and FPR for all possible positive thresholds
● Build Receiver Operating Characteristic (ROC) curve
● Integrate Area Under the ROC Curve (AUC)
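AUC can also be computed directly as the probability that a randomly chosen positive is scored above a randomly chosen negative, which is equivalent to integrating the ROC curve — a minimal sketch with illustrative data:

```python
import numpy as np

def auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as half."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()    # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.1, -0.1, 0.4, -0.3]
a = auc(labels, scores)
```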
Loss functions
Loss Functions vs Evaluation Functions
Evaluation Metrics
● Expensive to evaluate
● Often not smooth
● Often not even differentiable
Loss Functions
● Smooth approximations of your evaluation metric
● Well suited for SGD
Loss Functions: How are we going to solve the problem?
Classification loss
● Logistic
● Cross Entropy
● Kullback-Leibler Divergence
Regression loss
● Mean Square Error (MSE)
Ranking loss
● WARP
● BPR
Some common loss functions
Optimization Problems – Basic Formulation with RMSE
Goal: find U and I s.t. the difference between each datapoint in R and the product of the corresponding
user and item embeddings is minimal
(or R)
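The objective on the slide is an image; it can be reconstructed from the definitions (assuming Ω is the set of observed (u, i) entries of R, and U_u, I_i the corresponding embedding rows):

```latex
\min_{U, I} \; J(U, I) \;=\; \sum_{(u,i) \in \Omega} \left( R_{ui} - U_u \cdot I_i \right)^2
```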
Optimization Problems – General Formulation
Goal: find U and I s.t. the loss function J is minimized.
(or R)
Convex vs Non-Convex Optimization
Convex Non-convex
Convex Optimization
Non-Convex Optimization
Loss Functions – Regression
Mean Square Error
● Typically used loss function for regression; it is smooth and easy to understand.
Regularized Mean Square Error
● Mean square error plus a regularization term to avoid overfitting.
Loss Functions – Classification
Logistic
● Typically used loss function for classification. Smooth gradient around zero and steep for large errors.
Loss Functions – Classification
Logistic
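The formula on this slide is an image; a standard form of the logistic loss, assuming labels y ∈ {−1, +1} and a raw model score ŷ:

```latex
\ell(y, \hat{y}) \;=\; \log\left(1 + e^{-y\,\hat{y}}\right)
```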
Loss Functions – Ranking
Weighted Approximate-Rank Pairwise (WARP)
● Approximates DCG-like evaluation metrics
● Smooth and tractable computation
Bayesian Personalised Ranking (BPR)
● Approximates AUC
● Smooth and tractable computation
● Requires binary comparisons (good for binary comparison feedback)
Practical Recommendations
Practical Recommendations
(1) Always compute baseline metrics
(2) Always analyze underfitting vs overfitting
(3) Always do hyperparameter optimization
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
(6) Always ask for feedback from end users
Practical Recommendations
(1) Always compute baseline metrics
(2) Always analyze underfitting vs overfitting
(3) Always do hyperparameter optimization
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
(6) Always ask for feedback from end users
COMPARING WITH GLOBAL MODELS IS EASY
IF OVERFITTING, USE REGULARIZATION
GRID SEARCH OR GAUSSIAN PROCESS
TPR, TNR, PRECISION, ETC.
ITEM/ITEM SIMILARITIES
EVERYTHING IS ABOUT USER TASTE
(1) Always compute baseline metrics
Global Avg User Avg Item-Item Linear Linear + Reg Matrix Fact Deep Learning
Domains Baseline Baseline users >> items Known “I” Known “I” Unknown “I” Extra datasets
Model Complexity Trivial Trivial Simple Linear Linear Linear Non-linear
Time Complexity + + +++ ++++ ++++ ++++ ++
Overfit/Underfit Underfit Underfit May Underfit May Overfit May Perform Bad May Overfit Can Overfit
Hyper-Params 0 0 0 1 2 2–3 many
Implementation Numpy Numpy Numpy Numpy Numpy LightFM, Spark NNet libraries
(2) Always analyze underfitting vs overfitting
Model-based
● Dropout
● Bagging
Loss-based regularization
● L1 norm: best convex approximation of the sparsity-inducing L0 norm
● L2 norm: very smooth, easy to optimize
Data Augmentation
● Negative Sampling
(3) Always do hyperparameter optimization
Grid Search
Brute force over all combinations of the parameters
Exponential cost: for 20 parameters with only 10 values each, you need 10^20 complete runs
Random Search
Uniformly sample combinations of the parameters
Very easy to implement, very useful in practice
Gaussian Process Optimization
Meta-learning of the validation error given hyper-parameters
Solve exploration/exploitation tradeoff
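Random search is only a few lines; `validation_error` below is a hypothetical stand-in for a full train-and-evaluate run, and the sampling ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(lr, reg):
    """Hypothetical stand-in: replace with training a model with these
    hyper-parameters and returning its validation metric."""
    return (np.log10(lr) + 2) ** 2 + (np.log10(reg) + 3) ** 2

best, best_hp = float("inf"), None
for _ in range(50):                        # 50 random configurations
    lr = 10 ** rng.uniform(-4, 0)          # sample on a log scale
    reg = 10 ** rng.uniform(-5, -1)
    err = validation_error(lr, reg)
    if err < best:                         # keep the best configuration
        best, best_hp = err, (lr, reg)
```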
(3) Always do hyperparameter optimization
[Plots: validation metric vs hyper-parameter value — a metric to minimize (left) and a metric to maximize (right)]
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
Item embeddings
● In general, we combine item embeddings with: FEATURES | IMAGE EMBS | NLP EMBS
● After getting the embeddings, we always compute Top-K similarities on well-known items
● We use the item embeddings to create clusters and analyze how good they are
(6) Always ask for feedback from end users
RECOMMENDATION IS ALL ABOUT USERS’ TASTE
ASK THEM FOR FEEDBACK!!
Conclusions
Losses and metrics summary table
Name Category loss eval batch-SGD supports-implicit Comments
MSE Regr ✓ ✓ ✓ ✓ linear gradient
MAE Regr ✓ ✓ easy to interpret
Logistic / XE / KL Classif ✓ ✓ ✓ ✓ flexible truth
Exponential Classif ✓ ✓ exploding gradient
Recall (global) Classif ✓ ✓ ✓ requires negative
Precision (global) Classif ✓ ✓ ✓ requires negative
F-measure (global) Classif ✓ ✓ ✓ requires negative
MRR Ranking ✓ considers only 1 item
nDCG Ranking ✓ requires rank
WARP Ranking ✓ for nDCG, p@k, r@k
AUC Hybrid ✓ ✓ ✓ requires negative
BPR Hybrid ✓ ✓ for AUC
Recall@k Hybrid ✓ requires ≤ k positives
Precision@k Hybrid ✓ requires ≥k positives
Questions
Thank YOU!
Negative Sampling
Problem
● Unary feedback: the best model will always predict “1” for each user and item.
● In general:
○ your model is used in real life to predict (user, item) pairs outside the sparse dataset.
○ you can’t train on the full (#users x #items) dense matrix.
Negative Sampling Solution
● unary → binary (e.g. click/missing); binary → ternary (e.g. like/dislike/missing)
● sample strategy matters a lot (i.e. how to split train and valid)
● how many negative samples matters a lot
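A minimal negative-sampling sketch for unary feedback (dataset sizes and `sample_negatives` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 4, 6
# observed positive (user, item) clicks — unary feedback
positives = {(0, 1), (0, 2), (1, 0), (2, 3), (3, 5)}

def sample_negatives(positives, n_per_positive=1):
    """Draw unobserved (user, item) pairs as negatives.
    How many to draw per positive is an important hyper-parameter."""
    negatives = set()
    while len(negatives) < n_per_positive * len(positives):
        u = int(rng.integers(n_users))
        i = int(rng.integers(n_items))
        if (u, i) not in positives:        # only unobserved pairs
            negatives.add((u, i))
    return negatives

negatives = sample_negatives(positives, n_per_positive=2)
```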
Negative Sampling
Split negative feedback in the same proportion
Underfitting and Overfitting – Take Home
(1) For cross-validation, split the data so that almost all users appear in both training and validation
(2) Use negative sampling to avoid overfitting in your models
(3) Always use learning curves to get more insights about underfitting vs overfitting
(4) Compute mean and variance of your predictions to get insights about underfitting vs overfitting
Loss Functions – Classification
● Equivalent to cross-entropy between the truth and the predicted probability (for 2-classes model)
● Equivalent to Kullback-Leibler divergence between the truth and the predicted probability
● Often used for deep-learning based recommendation engines
● Smooth gradient around zero and steep for large errors
Logistic

Mais conteúdo relacionado

Mais procurados

Recommendation engines
Recommendation enginesRecommendation engines
Recommendation engines
Georgian Micsa
 

Mais procurados (6)

Data models
Data modelsData models
Data models
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
 
Recommendation engines
Recommendation enginesRecommendation engines
Recommendation engines
 
Advanced data modeling
Advanced data modelingAdvanced data modeling
Advanced data modeling
 
Shallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender SystemShallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender System
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 

Semelhante a Recommender Systems from A to Z – Model Evaluation

Marketing Research Ppt
Marketing Research PptMarketing Research Ppt
Marketing Research Ppt
Vivek Sharma
 

Semelhante a Recommender Systems from A to Z – Model Evaluation (20)

Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Recommender Systems from A to Z – The Right Dataset
Recommender Systems from A to Z – The Right DatasetRecommender Systems from A to Z – The Right Dataset
Recommender Systems from A to Z – The Right Dataset
 
Mr4 ms10
Mr4 ms10Mr4 ms10
Mr4 ms10
 
Evaluation metrics for binary classification - the ultimate guide
Evaluation metrics for binary classification - the ultimate guideEvaluation metrics for binary classification - the ultimate guide
Evaluation metrics for binary classification - the ultimate guide
 
Yelp Dataset Challenge
Yelp Dataset ChallengeYelp Dataset Challenge
Yelp Dataset Challenge
 
Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies
 
Machine learning4dummies
Machine learning4dummiesMachine learning4dummies
Machine learning4dummies
 
Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick View
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptx
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
 
Marketing Research Ppt
Marketing Research PptMarketing Research Ppt
Marketing Research Ppt
 
Agile estimation
Agile estimationAgile estimation
Agile estimation
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsBIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
 
Gigabyte scale Amazon Product Reviews Sentiment Analysis Challenge: A scalabl...
Gigabyte scale Amazon Product Reviews Sentiment Analysis Challenge: A scalabl...Gigabyte scale Amazon Product Reviews Sentiment Analysis Challenge: A scalabl...
Gigabyte scale Amazon Product Reviews Sentiment Analysis Challenge: A scalabl...
 
How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversions
 
Evaluation of multilabel multi class classification
Evaluation of multilabel multi class classificationEvaluation of multilabel multi class classification
Evaluation of multilabel multi class classification
 

Último

POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 

Último (20)

Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mapping
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 

Recommender Systems from A to Z – Model Evaluation

  • 1.
  • 2. Recommender Systems from A to Z Part 1: The Right Dataset Part 2: Model Training Part 3: Model Evaluation Part 4: Real-Time Deployment
  • 3. Recommender Systems from A to Z Part 1: The Right Dataset Part 2: Model Training Part 3: Model Evaluation Part 4: Real-Time Deployment
  • 4. 1. Introduction Train/Valid split, underfitting and overfitting Learning Curve in recommendation engines 2. Evaluation functions Basic metrics for recommender engines (Precision, Recall, TPR, TNR...) Regression, Classification, Ranking metrics 3. Loss functions Optimization problems and Losses functions properties Regression, Classification, Ranking losses 4. Practical recommendations Regularization, HP optimization, Embeddings evaluations.
  • 6. Previous Meetup Recap: Recommendation Engine Types Recommendation engine Content-based Collaborative-filtering Hybrid engine Memory-based Model-based Item-Item User-User User-Item Model When? Problem definition Solutions strategies Content-based Item Cold start Least Square, Deep Learning Item-Item n_users >> n_items Affinity Matrix User-User n_user << n_items KNN, Affinity Matrix User-Item Better performance Matrix Factorization, Deep Learning
  • 7. Previous Meetup Recap: Recommendation Engine Models Global Avg User Avg Item-Item Linear Linear + Reg Matrix Fact Deep Learning Domains Baseline Baseline users >> items Known “I” Known “I” Unknown “I” Extra datasets Model Complexity Trivial Trivial Simple Linear Linear Linear Non-linear Time Complexity + + +++ ++++ ++++ ++++ ++ Overfit/Underfit Underfit Underfit May Underfit May Overfit May Perform Bad May Overfit Can Overfit Hyper-Params 0 0 0 1 2 2–3 many Implementation Numpy Numpy Numpy Numpy Numpy LightFM, Spark NNet libraries
  • 8. Optimization Problem – Matrix Factorization Example (or R)
  • 9. Optimization Problem – Matrix Factorization Example Optimization problem (definitions) Sparse matrix of ratings with m users and n items Dense matrix of users embeddings Dense matrix of items embeddings
  • 10. Optimization Problem – Matrix Factorization Example Optimization problem (definitions) Ratings of User #1 Embedding of User #1 Embedding of Item #1 Sparse matrix of ratings with m users and n items Dense matrix of users embeddings Dense matrix of items embeddings Ratings of User #m To Item #n
  • 11. Optimization Problem – Matrix Factorization Example Optimization problem (definitions) AVAILABLE DATASET ? ? Sparse matrix of ratings with m users and n items Dense matrix of users embeddings Dense matrix of items embeddings
  • 12. Optimization Problem – Matrix Factorization Example Our goal is to find U and I, such as the difference between each datapoint in R and and the product between each user and item is minimal. (or R)
  • 13. Optimization Problem – Matrix Factorization Example Our goal is to find U and I, such as the difference between each datapoint in R and and the product between each user and item is minimal. (or R) 3. How are we going to solve the problem? 2. What properties are we looking in our outputs? - Exact rating vs like/dislike vs ranking predictions 1. What type of data do we have?
  • 14. Ask the Right Questions (1) What type of data do we have? (2) What properties are we looking in our outputs? (3) How are we going to solve the problem? (4) Which hyper-parameters of my model are the best? (5) Which model is the best? Business decisions Technical decisions
  • 15. Ask the Right Questions (1) What type of data do we have? (2) What properties are we looking in our outputs? (3) How are we going to solve the problem? (4) Which hyper-parameters of my model are the best? (5) Which model is the best? EVALUATION FUNCTIONS LOSS FUNCTIONS RANDOM SEARCH, GP COMPARE METRICS ML FOR RECOMMENDATION Business decisions Technical decisions
  • 16. Objectives Types (from data point of view) Classification ● clic/no-click ● like/dislike/missing ● estimated probability of like (e.g. watch time) Regression ● absolute rating (e.g. from 1/5 to 5/5) ● number of interactions Ranking ● estimated order of preference (e.g. watch time) ● pairwise comparisons Unsupervised ● clustering of items ● clustering of users
  • 17. Choosing the Right Objective (from business point of view) Absolute Predictions vs Relative Predictions Does only the order of the predictions matter? Sensitivity vs Specificity Is a false positive worse than a false negative? Skewness Is misclassifying an all-star favorite worse than misclassifying a casual like?
  • 18. Choosing the Right Objective (from business point of view) Absolute Predictions vs Relative Predictions Does only the order of the predictions matter? Sensitivity vs Specificity Is a false positive worse than a false negative? Skewness Is misclassifying an all-star favorite worse than misclassifying a casual like? LOSS FUNCTION THAT PENALIZES ERRORS ON ALL-STAR RATINGS MORE RANKING LOSS FUNCTION CLASSIFICATION LOSS FUNCTION
  • 19. Cross Validation – In Traditional Machine Learning (diagram: 4-fold cross validation, the folds rotating so that each fold serves once as the validation set)
  • 20. Cross Validation – In Recommendation Engines Dataset
  • 21. Cross Validation – In Recommendation Engines Split such that every user is present in both train and valid More strongly: split such that every user has 80% of their interactions in train and 20% in valid Dataset
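The per-user split described above can be sketched in a few lines of Python (a minimal illustration; the `(user, item, rating)` tuple format and the 80/20 ratio are assumptions):

```python
import random

def per_user_split(interactions, train_frac=0.8, seed=0):
    """Split interactions so that every user appears in both train and valid."""
    rng = random.Random(seed)
    by_user = {}
    for user, item, rating in interactions:
        by_user.setdefault(user, []).append((user, item, rating))
    train, valid = [], []
    for user, rows in by_user.items():
        rng.shuffle(rows)
        cut = max(1, int(len(rows) * train_frac))  # keep at least one row in train
        train.extend(rows[:cut])
        valid.extend(rows[cut:])
    return train, valid

# toy dataset: user u1 with 10 interactions, user u2 with 5
data = [("u1", "i%d" % k, 1) for k in range(10)] + [("u2", "i%d" % k, 0) for k in range(5)]
train, valid = per_user_split(data)
```

With `train_frac=0.8`, u1 contributes 8 rows to train and 2 to valid, and u2 contributes 4 and 1, so both users are seen at training and validation time.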
  • 23. Underfitting and Overfitting (diagram, increasing model complexity from left to right: the model fails to learn relations in the data; the model is a good fit for the data; the model fails to generalize to new samples)
  • 25. Underfitting and Overfitting (diagram: validation error vs model complexity, moving from underfitting to overfitting as the model grows more complex)
  • 26. Underfitting and Overfitting epoch Loss Function or Metric Mini-Batch Gradient Descent for epoch in n_epochs: ● shuffle the batches ● for batch in n_batches: ○ compute the predictions for the batch ○ compute the error for the batch ○ compute the gradient for the batch ○ update the parameters of the model ● plot error vs epoch
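The mini-batch loop above can be sketched in NumPy for the matrix-factorization case (the function name, hyper-parameter defaults, and toy data are illustrative, not from the deck):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=4, lr=0.05, n_epochs=20, batch_size=32, seed=0):
    """Mini-batch SGD for matrix factorization, following the pseudocode above."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    I = 0.1 * rng.standard_normal((n_items, k))
    coords = np.array(ratings, dtype=float)      # rows of (user, item, rating)
    errors = []
    for epoch in range(n_epochs):
        rng.shuffle(coords)                      # shuffle the batches
        epoch_err = 0.0
        for start in range(0, len(coords), batch_size):
            batch = coords[start:start + batch_size]
            u = batch[:, 0].astype(int)
            i = batch[:, 1].astype(int)
            r = batch[:, 2]
            pred = np.sum(U[u] * I[i], axis=1)   # predictions for the batch
            err = pred - r                       # error for the batch
            epoch_err += float(np.sum(err ** 2))
            gU = err[:, None] * I[i]             # gradient w.r.t. the user rows
            gI = err[:, None] * U[u]             # gradient w.r.t. the item rows
            np.add.at(U, u, -lr * gU)            # update the parameters
            np.add.at(I, i, -lr * gI)
        errors.append(epoch_err / len(coords))   # data for the error-vs-epoch plot
    return U, I, errors

# tiny synthetic check: a rank-2 ratings matrix the model should be able to fit
rng = np.random.default_rng(1)
true_U = rng.standard_normal((8, 2))
true_I = rng.standard_normal((6, 2))
R = true_U @ true_I.T
data = [(u, i, R[u, i]) for u in range(8) for i in range(6)]
U, I, errors = train_mf(data, 8, 6)
```

Plotting `errors` against the epoch index gives exactly the learning curve the slide describes; on this synthetic rank-2 matrix the error decreases over epochs.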
  • 27. Underfitting and Overfitting A very simple way of checking underfitting: compare the ground truth Y with the model predictions. If the model always predicts (almost) the same value, it is underfitting. Predicted Y Underfitting
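That constant-prediction check can be automated with a small heuristic (the `ratio` threshold is an arbitrary assumption):

```python
import numpy as np

def looks_underfit(y_true, y_pred, ratio=0.1):
    """Heuristic underfitting check: the predictions vary far less than the targets."""
    return float(np.std(y_pred)) < ratio * float(np.std(y_true))
```

For example, `looks_underfit([1, 2, 3, 4, 5], [3.0, 3.0, 3.0, 3.0, 3.0])` flags a model that always predicts the mean.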
  • 29. What do we want to evaluate? Classification ● True Positive Rate (TPR) ● True Negative Rate (TNR) ● Precision ● F-measure Regression ● Mean Square Error (MSE) Ranking ● Recall@K ● Precision@K ● CG, DCG, nDCG Ranking/Classification metrics ● AUC Some common evaluation functions
  • 30. Regression Mean Square Error (MSE) ● Easy to compute ● Linear gradient ● Can also be used as loss function Mean Absolute Error (MAE) ● Easy to compute ● Easy to interpret ● Non-differentiable at zero (discontinuous gradient) ● Harder to use directly as a loss function
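Both metrics are one-liners in NumPy; a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error: mean of the squared residuals."""
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(d ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(d)))
```

A single large residual dominates MSE but not MAE: for `y_true = [1, 2, 3]` and `y_pred = [1, 2, 5]`, MSE is 4/3 while MAE is only 2/3.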
  • 31. Classification – Precision vs Recall TS = Toy Story KP = Kung Fu Panda TD = How to train your dragon A = Annabelle Model 1 Model 2 TS1 TS2 TS3 KP1 KP2 TS4 KP3 A1 A2 User’s likes User’s dislikes Model recommendations TS1 TS2 TS3 KP1 KP2 TS4 KP3 A1 A2
  • 32. Classification – Precision vs Recall TS = Toy Story KP = Kung Fu Panda TD = How to train your dragon A = Annabelle User’s likes User’s dislikes Model recommendations (items: TS1 TS2 TS3 KP1 KP2 TS4 KP3 A1 A2) Model 1: Recall = 5/7, Precision = 5/5 = 1 Model 2: Recall = 7/7 = 1, Precision = 7/9
  • 33. Classification 1/2 True Positive Rate (a.k.a TPR, Recall, Sensitivity) ● Easy to understand ● Useful for likes/dislikes datasets ● Measure of global bias of a model ● 0 <= TPR <=1 (higher is better) True Negative Rate (a.k.a TNR, Selectivity, Specificity) ● Easy to understand ● Useful for likes/dislikes datasets ● Measure of global bias of a model ● 0 <= TNR <=1 (higher is better)
  • 34. Classification 2/2 Precision ● Easy to understand ● Useful for likes/dislikes datasets ● Measures the quality of recommendations ● 0 <= Precision <=1 (higher is better) F-measure ● Balances precision and recall ● Not good for recommendation, because it doesn’t take True Negatives into account ● 0 <= F-measure <= 1 (higher is better)
  • 35. Ranking 1/3 Recall@K ● Count the positive items among the top K items predicted for each user ● Divide that number by the number of positive items for each user ● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K ● Independent of the exact values of the predictions, only their relative rank matters Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = ? TOP K Positive = ? Total Positive = ? Recall@K = ?
  • 36. Ranking 1/3 Recall@K ● Count the positive items among the top K items predicted for each user ● Divide that number by the number of positive items for each user ● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K ● Independent of the exact values of the predictions, only their relative rank matters Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = {TS1, TS2, A1} (top 1, top 2, top 3 by prediction) TOP K Positive = {TS1, TS2} = 2 Total Positive = 4 Recall@K = 2 / 4
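The worked example above can be reproduced in a few lines of NumPy (treating a true rating > 0 as a positive, as the slide does):

```python
import numpy as np

def recall_at_k(y_true, y_score, k):
    """Positives among the top-K predictions, divided by total positives."""
    top_k = np.argsort(y_score)[::-1][:k]       # indices of the K highest scores
    positives = np.asarray(y_true) > 0
    total = positives.sum()
    return float(positives[top_k].sum() / total) if total else 0.0

# TS1, TS2, KP1, KP2, A1: (true rating, predicted score)
true_r = [1.0, 0.9, 0.7, 0.6, -0.2]
scores = [0.9, 0.7, 0.1, -0.1, 0.4]
print(recall_at_k(true_r, scores, 3))           # 2 positives in the top 3, 4 total → 0.5
```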
  • 38. Ranking 2/3 Precision@K ● Count the positive items among the top K items predicted for each user ● Divide that number by K for each user ● A perfect score is 1 if the user has K or more positive items and the top K only contains positives ● Independent of the exact values of the predictions, only their relative rank matters Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = ? TOP K Positive = ? Precision@K = ?
  • 39. Ranking 2/3 Precision@K ● Count the positive items among the top K items predicted for each user ● Divide that number by K for each user ● A perfect score is 1 if the user has K or more positive items and the top K only contains positives ● Independent of the exact values of the predictions, only their relative rank matters Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = {TS1, TS2, A1} (top 1, top 2, top 3 by prediction) TOP K Positive = {TS1, TS2} = 2 Precision@K = 2 / 3
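Precision@K only changes the denominator of the previous metric; a minimal sketch on the same example:

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Positives among the top-K predictions, divided by K."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float((np.asarray(y_true)[top_k] > 0).sum()) / k

true_r = [1.0, 0.9, 0.7, 0.6, -0.2]
scores = [0.9, 0.7, 0.1, -0.1, 0.4]
print(precision_at_k(true_r, scores, 3))        # 2 positives out of K=3 → 2/3
```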
  • 41. Ranking 3/3 CG, DCG, and nDCG ● CG: Sum the true ratings of the Top K items predicted for each user ● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1] ● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings ● The bigger the score the better Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = ? CG = ? DCG = ?
  • 42. Ranking 3/3 CG, DCG, and nDCG ● CG: Sum the true ratings of the Top K items predicted for each user ● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1] ● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings ● The bigger the score the better Movie (true rating, prediction): Toy Story 1 (1.0, 0.9), Toy Story 2 (0.9, 0.7), Kung Fu Panda 1 (0.7, 0.1), Kung Fu Panda 2 (0.6, -0.1), Annabelle 1 (-0.2, 0.4) K = 3 TOP K = {TS1, TS2, A1} (top 1, top 2, top 3 by prediction) CG = 1.0 + 0.9 - 0.2 = 1.7 DCG = 1/1 + 0.9/2 - 0.2/3 ≈ 1.383
  • 43. Ranking 3/3 CG, DCG, and nDCG ● CG: Sum the true ratings of the Top K items predicted for each user ● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1] ● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
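A sketch of DCG and nDCG as presented here; note that the slide weights position i by 1/i, whereas the more common convention discounts by 1/log2(i+1):

```python
import numpy as np

def dcg_at_k(y_true, y_score, k):
    """Sum of the true ratings of the top-K predictions, weighted by 1/position."""
    top_k = np.argsort(y_score)[::-1][:k]
    gains = np.asarray(y_true, dtype=float)[top_k]
    return float(np.sum(gains / np.arange(1, len(gains) + 1)))

def ndcg_at_k(y_true, y_score, k):
    """DCG normalized by the DCG of the ideal (true-rating) ordering."""
    ideal = dcg_at_k(y_true, y_true, k)
    return dcg_at_k(y_true, y_score, k) / ideal if ideal else 0.0

true_r = [1.0, 0.9, 0.7, 0.6, -0.2]
scores = [0.9, 0.7, 0.1, -0.1, 0.4]
```

On the slide's example `dcg_at_k(true_r, scores, 3)` reproduces 1/1 + 0.9/2 - 0.2/3, and a prediction that matches the true ordering gives nDCG = 1.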
  • 44. Hybrid Ranking/Classification AUC ● Vary positive prediction threshold (not just 0) ● Compute TPR and FPR for all possible positive thresholds ● Build Receiver Operating Characteristic (ROC) curve ● Integrate Area Under the ROC Curve (AUC)
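Rather than sweeping thresholds explicitly as described above, AUC can also be computed through its equivalent pairwise formulation: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch:

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true > 0]
    neg = y_score[y_true <= 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5                               # undefined without both classes
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins) / (len(pos) * len(neg))
```

The pairwise form is O(pos × neg); a sort-based version scales better but obscures the interpretation.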
  • 46. Loss Functions vs Evaluation Functions Evaluation Metrics ● Expensive to evaluate ● Often not smooth ● Often not even differentiable Loss Functions ● Smooth approximations of your evaluation metric ● Well suited for SGD
  • 47. Loss Functions: How we are going to solve the problem? Classification loss ● Logistic ● Cross Entropy ● Kullback-Leibler Divergence Regression loss ● Mean Square Error (MSE) Ranking loss ● WARP ● BPR Some common loss functions
  • 48. Optimization Problems – Basic Formulation with RMSE Goal: find U and I s.t. the difference between each data point in R and the product of the corresponding user and item embeddings is minimal (or R)
  • 49. Optimization Problems – General Formulation Goal: find U and I s.t. the loss function J is minimized. (or R)
  • 50. Convex vs Non-Convex Optimization Convex Non-convex
  • 53. Loss Functions – Regression Mean Square Error ● Typically used loss function for regression: smooth and easy to understand. Regularized Mean Square Error ● Mean square error plus a regularization term to avoid overfitting.
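The regularized objective can be written down directly, assuming a list of observed (user, item, rating) triples (the λ value and function name are illustrative):

```python
import numpy as np

def regularized_mse(U, I, ratings, lam=0.1):
    """MSE over the observed ratings plus L2 penalties on both embedding matrices."""
    err2 = sum((float(U[u] @ I[i]) - r) ** 2 for u, i, r in ratings)
    return err2 / len(ratings) + lam * (float(np.sum(U ** 2)) + float(np.sum(I ** 2)))

# degenerate 1-user, 1-item example: zero error, only the penalty remains
U = np.array([[1.0, 0.0]])
I = np.array([[1.0, 0.0]])
print(regularized_mse(U, I, [(0, 0, 1.0)]))     # 0 + 0.1 * (1 + 1) = 0.2
```

Larger λ shrinks the embeddings toward zero, trading training error for generalization.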
  • 54. Loss Functions – Classification Logistic ● Typically used loss function for classification. Smooth gradient around zero and steep for large errors.
  • 55. Loss Functions – Classification Logistic
  • 56. Loss Functions – Ranking Weighted Approximate-Rank Pairwise (WARP) ● Approximates DCG-like evaluation metrics ● Smooth and tractable computation Bayesian Personalised Ranking (BPR) ● Approximates AUC ● Smooth and tractable computation ● Based on pairwise comparisons (well suited to binary feedback)
  • 58. Practical Recommendations (1) Always compute baseline metrics (2) Always analyze underfitting vs overfitting (3) Always do hyperparameter optimization (4) Always compute multiple metrics for your models (5) Always analyze the clustering properties of the items/users (6) Always ask for feedback from end users
  • 59. Practical Recommendations (1) Always compute baseline metrics: COMPARE WITH GLOBAL MODELS (IT’S EASY) (2) Always analyze underfitting vs overfitting: IF OVERFITTING, USE REGULARIZATION (3) Always do hyperparameter optimization: RANDOM SEARCH OR GAUSSIAN PROCESS (4) Always compute multiple metrics for your models: TPR, TNR, PRECISION, ETC. (5) Always analyze the clustering properties of the items/users: ITEM/ITEM SIMILARITIES (6) Always ask for feedback from end users: EVERYTHING IS ABOUT USER TASTE
  • 60. (1) Always compute baseline metrics Global Avg User Avg Item-Item Linear Linear + Reg Matrix Fact Deep Learning Domains Baseline Baseline users >> items Known “I” Known “I” Unknown “I” Extra datasets Model Complexity Trivial Trivial Simple Linear Linear Linear Non-linear Time Complexity + + +++ ++++ ++++ ++++ ++ Overfit/Underfit Underfit Underfit May Underfit May Overfit May Perform Bad May Overfit Can Overfit Hyper-Params 0 0 0 1 2 2–3 many Implementation Numpy Numpy Numpy Numpy Numpy LightFM, Spark NNet libraries
  • 61. (2) Always analyze underfitting vs overfitting Model-based ● Dropout ● Bagging Loss-based regularization ● L1 norm: the best convex approximation of the sparsity-inducing L0 norm ● L2 norm: very smooth, easy to optimize Data Augmentation ● Negative Sampling
  • 62. (3) Always do hyperparameter optimization Grid Search Brute force over all the combinations of the parameters Exponential cost: for 20 parameters, to get only 10 evaluations each, you need 10^20 complete runs Random Search Uniformly sample combinations of the parameters Very easy to implement, very useful in practice Gaussian Process Optimization Meta-learning of the validation error given hyper-parameters Solve exploration/exploitation tradeoff
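Random search is only a few lines of Python; a minimal sketch, where the toy objective and search space are assumptions for illustration:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Uniformly sample hyper-parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)               # validation error, to minimize
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective with a known optimum at lr=0.1, k=16 (illustrative only)
space = {"lr": [0.001, 0.01, 0.1], "k": [4, 8, 16]}
toy_error = lambda p: abs(p["lr"] - 0.1) + abs(p["k"] - 16) / 16
best, score = random_search(toy_error, space, n_trials=50)
```

Unlike grid search, the cost is fixed at `n_trials` runs no matter how many parameters the space contains; a Gaussian-process optimizer would replace the uniform sampling with a model of the validation error.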
  • 63. (3) Always do hyperparameter optimization (plots: hyper-parameter search traces for a metric to minimize and a metric to maximize)
  • 64. (4) Always compute multiple metrics for your models
  • 65. (5) Always analyze the clustering properties of the items/users Item embeddings ● In general, we combine item embeddings with: FEATURES | IMAGE EMBS | NLP EMBS ● After getting the embeddings, we always compute Top-K similarities on well-known items ● We use the item embeddings to create clusters and analyze how good they are
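The Top-K similarity check can be sketched with cosine similarity over the item embeddings (the toy 2-D embeddings are illustrative):

```python
import numpy as np

def top_k_similar(embeddings, item, k=3):
    """Return the indices of the k items closest to `item` by cosine similarity."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize each row
    sims = E @ E[item]
    sims[item] = -np.inf                               # exclude the item itself
    return np.argsort(sims)[::-1][:k]

# toy embeddings: items 0 and 1 point the same way, item 3 is opposite to item 0
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar(emb, 0, 2))                        # [1 2]
```

Running this on a few well-known items (e.g. checking that Toy Story's neighbors are other animated movies) is a cheap sanity check on the learned embeddings.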
  • 66. (6) Always ask for end-user feedback RECOMMENDATION IS ALL ABOUT USERS’ TASTE: ASK THEM FOR FEEDBACK!!
  • 68. Losses and metrics summary table Name Category loss eval batch-SGD supports implicit Comments MSE Regr ✓ ✓ ✓ ✓ linear gradient MAE Regr ✓ ✓ easy to interpret Logistic / XE / KL Classif ✓ ✓ ✓ ✓ flexible truth Exponential Classif ✓ ✓ exploding gradient Recall (global) Classif ✓ ✓ ✓ requires negatives Precision (global) Classif ✓ ✓ ✓ requires negatives F-measure (global) Classif ✓ ✓ ✓ requires negatives MRR Ranking ✓ considers only 1 item nDCG Ranking ✓ requires rank WARP Ranking ✓ for nDCG, p@k, r@k AUC Hybrid ✓ ✓ ✓ requires negatives BPR Hybrid ✓ ✓ for AUC Recall@k Hybrid ✓ requires ≤ k positives Precision@k Hybrid ✓ requires ≥ k positives
  • 71. Negative Sampling Problem ● Unary feedback: the best model will always predict “1” for every user and item. ● In general: ○ your model is used in real life to predict (user, item) pairs outside the sparse dataset. ○ you can’t train on the full (#users x #items) dense matrix. Negative Sampling Solution ● unary→binary (e.g. click/missing), binary→ternary (e.g. like/dislike/missing) ● the sampling strategy matters a lot (i.e. how to split train and valid) ● how many negative samples you draw matters a lot
  • 73. Negative Sampling Split negative feedback in the same proportion
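A minimal negative-sampling sketch for unary feedback (the 1:1 ratio and the `(user, item, label)` tuple format are assumptions; a production sampler would also account for item popularity and keep the train/valid proportions the slide describes):

```python
import random

def add_negative_samples(positives, n_users, n_items, ratio=1, seed=0):
    """Turn unary (click-only) data into binary data by sampling unseen pairs as 0s."""
    rng = random.Random(seed)
    seen = {(u, i) for u, i, _ in positives}
    negatives = []
    while len(negatives) < ratio * len(positives):
        u, i = rng.randrange(n_users), rng.randrange(n_items)
        if (u, i) not in seen:                  # only pairs with no observed feedback
            seen.add((u, i))
            negatives.append((u, i, 0))
    return positives + negatives

pos = [(0, 0, 1), (0, 1, 1), (1, 2, 1)]
data = add_negative_samples(pos, n_users=4, n_items=4)
```

Without the sampled zeros, a model trained on `pos` alone could reach zero loss by predicting 1 everywhere, which is exactly the failure mode described above.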
  • 74. Underfitting and Overfitting – Take Home (1) For cross-validation, split the data such that almost all users appear in both training and validation (2) Use negative sampling to avoid overfitting in your models (3) Always use learning curves to get more insights about underfitting vs overfitting (4) Compute the mean and variance of your predictions to get insights about underfitting vs overfitting
  • 75. Loss Functions – Classification ● Equivalent to cross-entropy between the truth and the predicted probability (for 2-classes model) ● Equivalent to Kullback-Leibler divergence between the truth and the predicted probability ● Often used for deep-learning based recommendation engines ● Smooth gradient around zero and steep for large errors Logistic
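The equivalence claimed above can be checked numerically: the logistic loss on ±1 labels equals the 2-class cross-entropy on 0/1 labels for the same logits (a sketch; `log1p` keeps the computation stable for moderate logits):

```python
import numpy as np

def logistic_loss(y_pm, logits):
    """Mean logistic loss on ±1 labels: log(1 + exp(-y * f(x)))."""
    y = np.asarray(y_pm, dtype=float)
    z = np.asarray(logits, dtype=float)
    return float(np.mean(np.log1p(np.exp(-y * z))))

def binary_cross_entropy(y01, logits):
    """Mean 2-class cross-entropy on 0/1 labels with sigmoid probabilities."""
    y = np.asarray(y01, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

z = [2.0, -1.0, 0.5]
print(logistic_loss([1, -1, 1], z), binary_cross_entropy([1, 0, 1], z))
```

The two functions agree term by term: for y = +1, log(1 + e^(-z)) = -log(sigmoid(z)), and for y = -1, log(1 + e^(z)) = -log(1 - sigmoid(z)).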