Recommender systems have become an important personalization technique
on the web and are widely used especially in e-commerce applications.
However, operators of web shops and other platforms are challenged by
the large variety of available algorithms and the multitude of their
possible parameterizations. Since the quality of the recommendations that are
given can have a significant business impact, the selection
of a recommender system should be made based on well-founded evaluation
data. The literature on recommender system evaluation offers a large
variety of evaluation metrics but provides little guidance on how to choose
among them. The paper which is presented in this presentation focuses on the often neglected aspect of clearly defining the goal of an evaluation and how this goal relates to the
selection of an appropriate metric. We discuss several well-known
accuracy metrics and analyze how they reflect different evaluation goals. Furthermore, we present some less well-known metrics, as well as a variation of the area-under-the-curve measure, that are particularly suitable for evaluating
recommender systems in e-commerce applications.
Setting Goals and Choosing Metrics for Recommender System Evaluations
1. Setting Goals and Choosing Metrics for Recommender
System Evaluations
Gunnar Schröder, Maik Thiele, Wolfgang Lehner
Gunnar Schröder
T-Systems Multimedia Solutions
Dresden University of Technology
UCERSTI 2 Workshop at the 5th ACM Conference on Recommender Systems
Chicago, October 23, 2011
2. How Do You Evaluate Recommender Systems?
RMSE, MAE, Precision, Recall, F1-Measure, ROC Curves, Mean Average Precision, Area under the Curve
Accuracy metrics, non-accuracy metrics
Qualitative techniques, quantitative techniques, user-centric evaluation
But why do you do it exactly this way?
Setting Goals and Choosing Metrics for Recommender System Evaluation - Gunnar Schröder
3. Some of the Issues This Paper Tries to Address
A large variety of metrics have been published
Some metrics are highly correlated [Herlocker 2004]
Little guidance for evaluating recommenders and choosing metrics
Which aspects of the usage scenario and the data influence the choice?
Which metrics are applicable?
What do these metrics express?
What are differences among them?
Which metric represents our use-case best?
How much do the metrics suffer from biases?
4. Factors That Influence the Choice of Evaluation Metrics
Objectives for recommender usage: business goals, user interests
Recommender task and interaction: prediction, classification, ranking, similarity, presentation
Preference data: explicit, implicit; unary, binary, numerical
=> Choice of metrics
5. Major Classes of Evaluation Metrics
Prediction Accuracy Metrics
Ranking Accuracy Metrics
Classification Accuracy Metrics
Non-Accuracy Metrics
(Illustration: a ranked list of predicted scores 5.0, 4.8, 4.7, 4.3, 3.8, 3.2, 2.4, 2.1, 1.6, 1.2)
6. Why Precision, Recall and F1-Measure May Fool You
Ideal recommender (examples a–f) vs. worst-case recommender (examples g–l)
Four recommendations (R1–R4), e.g. Precision@4
Ten items with a varying ratio of relevant items (1–9 relevant items)
Precision, recall, and F1-measure are very sensitive to the ratio of relevant items (Figure 3)
They fail to distinguish between an ideal recommender and a worst-case recommender when the ratio of relevant items is varied
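The effect can be reproduced in a few lines. The sketch below is my own illustrative reconstruction of the Figure 3 setup (not the paper's exact numbers): it scores an ideal and a worst-case recommender over ten items while the number r of relevant items varies.

```python
def precision_recall_f1_at_k(ranking, relevant, k):
    """Classification accuracy measures for the top-k of a ranked item list."""
    hits = sum(1 for item in ranking[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

n, k = 10, 4
for r in range(1, 10):
    relevant = set(range(r))
    ideal = list(range(n))                       # relevant items ranked first
    worst = list(range(r, n)) + list(range(r))   # relevant items ranked last
    p_ideal, *_ = precision_recall_f1_at_k(ideal, relevant, k)
    p_worst, *_ = precision_recall_f1_at_k(worst, relevant, k)
    print(f"r={r}: ideal P@4 = {p_ideal:.2f}, worst-case P@4 = {p_worst:.2f}")
```

An ideal recommender with one relevant item scores P@4 = 0.25, while a worst-case recommender with nine relevant items scores P@4 = 0.75: the metric rewards the relevant-item ratio rather than ranking quality.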
7. What is the Ideal Length for a Top-k Recommendation List?
A typical ranking produced by a recommender on a set of ten items, four of which are relevant
The length of the top-k recommendation list is varied in examples a (k=1) to j (k=10)
Figure 1
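The trade-off as k grows can be sketched numerically. The ranking below is a hypothetical example (the exact ranking of the paper's Figure 1 is not reproduced here): precision tends to fall with k, recall rises, and F1 peaks somewhere in between.

```python
# Hypothetical ranking of ten items; True marks a relevant item.
relevance = [True, False, True, True, False, False, True, False, False, False]

def metrics_at_k(relevance, k):
    """Precision, recall, and F1 for the top-k prefix of a ranked list."""
    hits = sum(relevance[:k])
    p = hits / k
    r = hits / sum(relevance)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

for k in range(1, len(relevance) + 1):
    p, r, f1 = metrics_at_k(relevance, k)
    print(f"k={k:2d}  P@k={p:.2f}  R@k={r:.2f}  F1@k={f1:.2f}")
```

For this particular ranking, F1@4 (0.75) exceeds F1@10, even though recall is perfect at k=10, so the "ideal" list length depends on which measure you optimize.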
8. What is the Ideal Length for a Top-k Recommendation List?
A typical ranking produced by a recommender on a set of ten items, four of which are relevant
The length of the top-k recommendation list is varied in examples a (k=1) to j (k=10)
Figure 1
9. What is the Ideal Length for a Top-k Recommendation List?
A typical ranking produced by a recommender on a set of ten items, four of which are relevant
The length of the top-k recommendation list is varied in examples a (k=1) to j (k=10)
Figure 1
Markedness = Precision + InvPrecision − 1
Informedness = Recall + InvRecall − 1
Matthews Correlation = √(Informedness × Markedness)
[Powers 2007]
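These measures follow directly from a binary confusion matrix. A minimal sketch (the confusion-matrix counts below are a hypothetical top-4 example, not the paper's): the Matthews correlation is written in its standard covariance form, which Powers shows equals the geometric mean of informedness and markedness up to sign.

```python
import math

def markedness_informedness_mcc(tp, fp, fn, tn):
    """Classification measures from a binary confusion matrix."""
    precision = tp / (tp + fp)
    inv_precision = tn / (tn + fn)   # a.k.a. negative predictive value
    recall = tp / (tp + fn)
    inv_recall = tn / (tn + fp)      # a.k.a. specificity
    markedness = precision + inv_precision - 1
    informedness = recall + inv_recall - 1
    # Matthews correlation; for a non-degenerate matrix it equals
    # sqrt(markedness * informedness) up to sign [Powers 2007].
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return markedness, informedness, mcc

# Hypothetical example: top-4 list over 10 items, 4 relevant, 3 hits
m, i, c = markedness_informedness_mcc(tp=3, fp=1, fn=1, tn=5)
print(m, i, c)
```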
10. From Simple Classification Measures to Partial Ranking Measures
Moving a single relevant item through the recommender's ranking (examples a–j)
Idea: Consider both classification and ranking for the top-k recommendations (Figure 2)
Area under the Curve => Limited Area under the Curve
Boolean Kendall’s Tau => Limited Boolean Kendall’s Tau
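The paper gives the precise definitions of the limited variants. As context, here is a sketch of the standard (unlimited) area under the curve for a ranked list with binary relevance, computed as the fraction of (relevant, irrelevant) pairs ranked in the correct order; per the slide, the "limited" variants restrict such comparisons to the top-k list (see the paper for the exact definitions).

```python
def auc_from_ranking(relevance):
    """AUC for a ranked list with binary relevance: the fraction of
    (relevant, irrelevant) pairs where the relevant item is ranked higher.
    The paper's *limited* variant restricts this to the top-k list."""
    concordant = mixed_pairs = 0
    for i in range(len(relevance)):
        for j in range(i + 1, len(relevance)):
            if relevance[i] != relevance[j]:
                mixed_pairs += 1
                if relevance[i]:  # relevant item ranked above irrelevant one
                    concordant += 1
    return concordant / mixed_pairs if mixed_pairs else 1.0

print(auc_from_ranking([True, True, False, False]))   # ideal ranking -> 1.0
print(auc_from_ranking([False, False, True, True]))   # worst case    -> 0.0
```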
11. A Further, More Complex Example to Study at Home
Figure 4
Conclusions:
For classification, use markedness, informedness, and Matthews correlation instead of precision, recall, and F1-measure
Limited area under the curve and limited boolean Kendall’s tau are useful metrics for
top-k recommender evaluations
12. Conclusion and Contributions
Important aspects that influence the metric choice
Objectives for recommender usage
Recommender task and interaction
Aspects of preference data
Some problems of Precision, Recall and F1-Measure
The advantages of markedness, informedness, and Matthews correlation
Two new metrics that measure the ranking of a limited top-k list
Limited area under the curve, limited boolean Kendall’s tau
Guidelines for choosing a metric (See paper)
13. Thank You Very Much!
Do not hesitate to contact me if you have any questions, comments, or answers!
Slides are available via e-mail or SlideShare