Evaluation 1: System-Oriented
Tetsuya Sakai
@tetsuyasakai
Waseda University
August 24, 2015 @ ASSIA 2015, Taipei.
About Tetsuya Sakai
• Professor – Department of Computer Science at Waseda University
• Associate Dean – IT Strategies Division of Waseda University
• Visiting professor – National Institute of Informatics
• Researcher in information retrieval, natural language processing,
interaction
• Editor-in-chief (Asia/Australasia) – Information Retrieval Journal (Springer)
• SIGIR 2013 PC co-chair
• SIGIR 2017 general co-chair
• NTCIR general co-chair
• Toshiba → Cambridge U → Toshiba → NewsWatch
→ Microsoft Research Asia → Waseda
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
• IR researchers’ goal: build systems
that satisfy the user’s information
needs.
• We cannot ask users all the time, so
we need measures as surrogates of
user satisfaction/performance.
• “If you cannot measure it, you
cannot improve it.”
http://zapatopi.net/kelvin/quotes/
[Diagram ("Why measure?"): systems are improved against a measure, and the measure serves as a surrogate for user satisfaction. Does it correlate with user satisfaction?]
Improvements that don’t add up [Armstrong09]
Armstrong et al. analysed 106 papers from SIGIR ’98-’08,
CIKM ’04-’08 that used TREC data, and reported:
• Researchers often use low baselines
• Researchers claim statistically significant improvements,
but the results are often not competitive with the best
TREC systems
• IR effectiveness has not really improved over a decade!
What we want vs. what we've got?
The best IR system in the world
I’ve invented
an IR system
A
I’ve built Test
Collection A
to evaluate it
A
I’ve evaluated my
system with A and
it’s the best
I’ve invented
an IR system
B
I’ve built Test
collection B
to evaluate it
B
I’ve evaluated my
system with B and
it’s the best
A typical test collection
Document collection + topic set: each topic comes with relevance assessments (relevant/nonrelevant documents), collectively called the “qrels”.
Example topic: the Sakai Lab home page
  sakailab.com: relevant
  www.f.waseda.jp/tetsuya/: relevant
  http://tanabe-agency.co.jp/talent/sakai_masato/: nonrelevant
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
Recall, Precision and E-measure
[vanRijsbergen79]
• E-measure = (|A∪B| – |A∩B|) / (|A| + |B|)
  = 1 – 1/(0.5*(1/Prec) + 0.5*(1/Rec)),
  where Prec = |A∩B|/|B|, Rec = |A∩B|/|A|,
  A: relevant docs, B: retrieved docs.
• A generalised form:
  E = 1 – 1/(α*(1/Prec) + (1-α)*(1/Rec))
    = 1 – (β²+1)*Prec*Rec/(β²*Prec + Rec),
  where α = 1/(β²+1).
F-measure
• F-measure = 1 – E-measure
  = 1/(α*(1/Prec) + (1-α)*(1/Rec))
  = (β²+1)*Prec*Rec/(β²*Prec + Rec),
  where α = 1/(β²+1).
• F with β=b is often expressed as Fb.
• F1 = 2*Prec*Rec/(Prec+Rec), i.e. the harmonic mean of Prec and Rec.
• The user attaches β times as much importance to Rec as to Prec
  (dE/dRec = dE/dPrec when Prec/Rec = β). [vanRijsbergen79]
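To make the set-retrieval definitions concrete, here is a minimal Python sketch (the function and variable names are my own, not from the slides) that computes Prec, Rec, Fβ and E from a set of relevant documents A and a set of retrieved documents B:

```python
def prf(relevant, retrieved, beta=1.0):
    """Set-retrieval Precision, Recall, F_beta and E from two sets of doc IDs."""
    a, b = set(relevant), set(retrieved)
    inter = len(a & b)
    prec = inter / len(b) if b else 0.0
    rec = inter / len(a) if a else 0.0
    if prec == 0.0 and rec == 0.0:
        f = 0.0
    else:
        b2 = beta * beta
        f = (b2 + 1) * prec * rec / (b2 * prec + rec)
    return prec, rec, f, 1.0 - f   # E-measure = 1 - F

# Example: A = relevant docs, B = retrieved docs
print(prf({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"}))  # Prec=0.667, Rec=0.5, F1=0.571
```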
Harmonic vs. arithmetic mean
[Two surface plots over (Prec, Rec): the harmonic mean F1 and the arithmetic mean (Prec+Rec)/2.]
• Prec=0, Rec=1: F1=0, (Prec+Rec)/2=0.5
• Prec=0.1, Rec=0.9: F1=0.18, (Prec+Rec)/2=0.5
• Prec=0.5, Rec=0.5: F1=0.5, (Prec+Rec)/2=0.5
With the harmonic mean, balance between Prec and Rec is important; with the arithmetic mean, balance is NOT important.
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
Interpolated precision
Example ranked list (R = 5 relevant docs for this topic):

rank r  relevance  Rec(r)  Prec(r)
1       relevant   0.2     1
2       nonrel     0.2     0.5
3       nonrel     0.2     0.33
4       relevant   0.4     0.5
5       relevant   0.6     0.6
6       nonrel     0.6     0.5
7       relevant   0.8     0.57
8       nonrel     0.8     0.5

Interpolated Precision at recall level i: IPi = max Prec(r) over r s.t. Rec(r) >= i

i     IPi
0     1
0.1   1
0.2   1
0.3   0.6
0.4   0.6
0.5   0.6
0.6   0.6
0.7   0.57
0.8   0.57
0.9   0
1     0

“The major issue addressed by interpolation is that it rarely happens that
any particular recall point is achieved.” [Buckley05, p.56]
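A minimal Python sketch (my own code) of interpolated precision at the 11 standard recall levels, reproducing the example above:

```python
def interpolated_precision(rels, R, levels=None):
    """rels: 0/1 relevance flags down the ranked list; R: total number of relevant docs.
    Returns {recall level i: IP_i}, IP_i = max Prec(r) over ranks r with Rec(r) >= i."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    hits, points = 0, []                # (Rec(r), Prec(r)) at each rank r
    for r, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / R, hits / r))
    return {i: max((p for rec, p in points if rec >= i), default=0.0) for i in levels}

# Example ranked list from the slide: rel, non, non, rel, rel, non, rel, non (R=5)
ip = interpolated_precision([1, 0, 0, 1, 1, 0, 1, 0], R=5)
print(ip[0.3])  # 0.6
print(ip[0.8])  # 0.571...
```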
Recall-precision graphs
[Plot: interpolated precision at recall level i (y-axis) against recall level i (x-axis), using the IPi values from the previous slide.]
Interpolated Precision: IPi = max Prec(r) over r s.t. Rec(r) >= i.
To draw a Rec-Prec curve for a set T of topics, plot ΣT IPi / |T| for each i.
Average Precision [Buckley05]
• Introduced at TREC-2 (1993), implemented in trec_eval by Buckley.
• AP = (1/R) * Σ_r I(r) * C(r)/r, where
  R: total number of relevant docs
  r: document rank
  I(r): flag indicating a relevant doc at rank r
  C(r): number of relevant docs within ranks [1, r]
• The most widely used binary-relevance IR metric since the 1990s, but it cannot distinguish between Systems A and B in the example below.
[Example: two ranked lists, System A and System B, containing highly relevant and partially relevant documents in different orders; binary-relevance AP gives them the same score.]
A user model for AP [Robertson08]
• Different users stop scanning the
ranked list at different ranks. They
only stop at a relevant document.
• The user distribution is uniform
across all (R) relevant documents.
• At each stopping point, compute
utility (Prec).
• Hence AP is the expected utility for
the user population.
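A minimal Python sketch (my own code) of AP as defined above, which is also the expected utility under the stopping model just described: every relevant rank is a possible stopping point, and Prec at that point is the utility.

```python
def average_precision(rels, R):
    """rels: 0/1 relevance flags down the ranked list; R: total number of relevant docs.
    AP = (1/R) * sum over ranks r of I(r) * C(r)/r."""
    hits, total = 0, 0.0
    for r, rel in enumerate(rels, start=1):
        if rel:
            hits += 1               # C(r)
            total += hits / r       # utility (Prec) at this stopping point
    return total / R

print(average_precision([1, 0, 0, 1, 1, 0, 1, 0], R=5))
# (1/5)*(1/1 + 2/4 + 3/5 + 4/7) = 0.534...
```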
Normalised Discounted Cumulative Gain [Jarvelin02]
• Introduced at ACM SIGIR 2000 / ACM TOIS 2002; a variant of the sliding ratio [Pollack68].
• Popular “Microsoft version” [Burges05]:
  nDCG = [ Σ_{r=1..md} g(r)/log(r+1) ] / [ Σ_{r=1..md} g*(r)/log(r+1) ],
  where
  md: document cutoff (e.g. 10)
  g(r): gain value at rank r (e.g. 1 if the doc is partially relevant, 3 if it is highly relevant)
  g*(r): gain value at rank r of an ideal ranked list.
• The original definition [Jarvelin02] is not recommended: a system that returns a relevant document at rank 1 and one that returns a relevant document at rank b are treated as equally effective, where b is the logarithm base (patience parameter). The b's cancel out in the Burges definition.
nDCG: an example
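As a concrete illustration (my own code, not the worked example from the original slide), here is a minimal sketch of a Burges-style nDCG consistent with the definition above, with gains taken directly from the relevance levels and a log2(r+1) discount:

```python
import math

def ndcg(gains, ideal_gains, cutoff=10):
    """Microsoft-style nDCG: sum of g(r)/log2(r+1) over r <= cutoff,
    normalised by the same sum over an ideally ranked list (g*(r))."""
    def dcg(gs):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs[:cutoff], start=1))
    return dcg(gains) / dcg(sorted(ideal_gains, reverse=True))

# A partially relevant doc (gain 1) at rank 1, a highly relevant one (gain 3) at rank 2;
# the topic has one highly relevant and two partially relevant docs in total.
print(ndcg([1, 3, 0, 1], ideal_gains=[3, 1, 1], cutoff=10))
```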
Q-measure [Sakai05AIRS,Sakai07IPM]
• A graded-relevance version of AP (see also Graded AP [Robertson10]).
• Same user model as AP, but the utility is computed using the blended ratio BR(r) instead of Prec(r):
  Q = (1/R) * Σ_r I(r) * BR(r),
  BR(r) = (C(r) + β*cg(r)) / (r + β*cg*(r)),
  where cg(r) is the cumulative gain at rank r, cg*(r) is the cumulative gain at rank r of an ideal ranked list, and β is a patience parameter (β=0 ⇒ BR=Prec, hence Q=AP).
• BR combines Precision and normalised cumulative gain (nCG) [Jarvelin02].
Value of the first relevant document at rank r according to BR(r) (binary relevance, R=5) [Sakai14PROMISE]
[Plot of BR(r) for r = 1..20, for β = 0.1, 1 and 10.]
r <= R ⇒ BR(r) = (1+β)/(r+βr) = 1/r = P(r)
r > R ⇒ BR(r) = (1+β)/(r+βR)
Large β ⇒ more tolerance to relevant docs at low ranks.
Q: An example (with β=1)
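Again as an illustration rather than the slide's own example, a minimal Python sketch (my own code) of Q-measure using the blended ratio defined above:

```python
def q_measure(gains, ideal_gains, beta=1.0):
    """gains: gain values g(r) down the ranked list (0 for nonrelevant docs);
    ideal_gains: gain values of all relevant docs for this topic.
    Q = (1/R) * sum over r of I(r) * BR(r),
    BR(r) = (C(r) + beta*cg(r)) / (r + beta*cg*(r))."""
    ideal = sorted(ideal_gains, reverse=True)
    R = len(ideal)
    hits, cg, cg_ideal, total = 0, 0.0, 0.0, 0.0
    for r, g in enumerate(gains, start=1):
        cg += g
        cg_ideal += ideal[r - 1] if r <= R else 0.0
        if g > 0:                     # I(r) = 1
            hits += 1                 # C(r)
            total += (hits + beta * cg) / (r + beta * cg_ideal)
    return total / R

# Binary relevance with beta=0 reduces Q to AP:
print(q_measure([1, 0, 0, 1, 1, 0, 1, 0], ideal_gains=[1] * 5, beta=0.0))  # = AP = 0.534...
```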
Normalised Cumulative Utility [Sakai08EVIA]
• Generalises AP and Q.
• NCU = Σ_r Pr(r) * NU(r), where Pr(r) is a stopping probability distribution over ranks (rank-biased or graded-uniform) and NU(r) is the normalised utility at rank r (Prec(r) or BR(r)).
Expected Reciprocal Rank [Chapelle09]
• "ERR can be seen as a special case of Normalized Cumulative Utility (NCU)" [Chapelle09, p.625]
• No recall component.
• ERR = Σ_r (1/r) * P(r), where 1/r is the utility at rank r, P(r) = R(r) * Π_{i<r} (1 - R(i)) is the probability that the user is finally satisfied at r, and R(i) = (2^g(i) - 1) / 2^gmax is the probability that the document at rank i satisfies the user.
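A minimal Python sketch (my own code) of ERR as defined in [Chapelle09]: the utility at rank r is 1/r, and the stopping probability is driven by the per-document satisfaction probabilities R(i) = (2^g(i) - 1)/2^gmax.

```python
def err(gains, max_grade=3):
    """Expected Reciprocal Rank: sum over r of (1/r) * P(user is finally satisfied at r)."""
    p_continue, total = 1.0, 0.0
    for r, g in enumerate(gains, start=1):
        r_prob = (2 ** g - 1) / 2 ** max_grade   # P(doc at rank r satisfies the user)
        total += p_continue * r_prob / r         # satisfied here and not earlier
        p_continue *= 1.0 - r_prob
    return total

print(err([3, 1, 0, 3]))   # diminishing return: the second highly relevant doc adds little
```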
ERR’s diminishing return property
“Thus, if for example document two merely restates information
already gleaned from document one and hence is of no actual benefit
to this user, he may wish to assign it a negative document utility, no
matter how ‘relevant’ its content might have been to the original
information need.” [Cooper73, p.90]
“This is a diminishing return property which seems highly desirable for
most IR tasks: if we have already shown a lot of relevant documents,
there should be less added value in showing more relevant
documents.” [Chapelle11,p.582]
Rank-biased NCU [Sakai08EVIA] also has this property
Ranked retrieval measures: summary 1 (not exhaustive)
[Comparison table over AP, Q-measure, ERR and nDCG; criteria: handling graded relevance, diminishing return (navigational intent), discriminative power [Sakai06SIGIR,07SIGIR], and how widely each measure is used (widely used in general, or used widely at NTCIR).]
• NCU = f( stopping_probability_over_r, utility_at_r ) [Sakai08EVIA].
• Discriminative power: how many statistically significant system pairs can be obtained (see Section 5).
• There are a few graded-relevance versions of AP, but AP almost always means binary-relevance AP.
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
Time-Biased Gain (TBG) [Smucker12]
• TBG accumulates the gain at each rank r, discounted according to the time needed to reach r: the value of information decays with time.
• Time to reach r: the user reads (r-1) snippets, and possibly clicks some docs and reads them.
• Snippet reading time: a constant. Doc reading time: linear in doc length.
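To make the idea concrete, here is a minimal sketch under assumptions of my own: exponential decay with a half-life parameter, a constant snippet-reading time, document reading time linear in document length, and relevance-dependent click probabilities. The constants below are illustrative placeholders, not the calibrated values of [Smucker12].

```python
import math

def time_biased_gain(docs, half_life=224.0, t_snippet=4.4,
                     t_per_word=0.018, p_click_rel=0.64, p_click_non=0.39):
    """docs: list of (gain, doc_length_in_words) down the ranked list.
    Expected gain at each rank is discounted by exp(-ln2 * T / half_life),
    where T is the time (in seconds) spent reaching that rank."""
    t, total = 0.0, 0.0
    for gain, length in docs:
        p_click = p_click_rel if gain > 0 else p_click_non
        total += p_click * gain * math.exp(-math.log(2) * t / half_life)
        t += t_snippet + p_click * t_per_word * length   # snippet + expected doc reading time
    return total

print(time_biased_gain([(1, 1000), (0, 500), (1, 2000)]))
```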
U-measure [Sakai13SIGIR]
• U can be used not only for traditional IR, but also for various other
tasks such as session IR, aggregated search, summarisation, question
answering etc.
• While other measures are based on ranks, U abandons the notion of rank and focusses on the amount of text that the user has read within a search session.
• Instead of ranks, U uses the positions of relevant pieces of information on a trailtext.
Trailtext for U
Just concatenate all the texts that the
user has (probably) read.
For web search, one simple user model
would be to assume that users read all
snippets, plus parts of relevant
documents
Position-based discounting for U
[Figure: a trailtext in which each result contributes a fixed-length snippet, and relevant results additionally contribute their full text.]
• If the nonrel at rank 2 (snippet only) is replaced with a rel (snippet + full text), the value of the rel at rank 4 is always reduced ⇒ U satisfies diminishing return.
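A minimal sketch (my own code and parameter choices, not the exact formulation of [Sakai13SIGIR]) of the U-measure idea: lay the pieces of text the user reads out on a trailtext, place each gain at its character position, and discount it linearly until it reaches zero at a patience parameter L.

```python
def u_measure(segments, L=132_000, N=1.0):
    """segments: list of (text_length_in_chars, gain) in the order the user reads them
    (e.g. snippet, clicked doc, snippet, ...). Each gain is placed at the trailtext
    offset where the user finishes reading that segment, and discounted linearly:
    d(pos) = max(0, 1 - pos/L)."""
    pos, total = 0, 0.0
    for length, gain in segments:
        pos += length                          # position on the trailtext after this segment
        total += gain * max(0.0, 1.0 - pos / L)
    return total / N

# Rank 1: snippet (no gain) + relevant doc; rank 2: snippet only (nonrelevant).
print(u_measure([(100, 0), (3000, 3), (100, 0)]))
```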
Ranked retrieval measures: summary 2 (not exhaustive)
[Comparison table over AP, Q-measure, ERR, nDCG, TBG and U-measure; criteria: handling graded relevance, diminishing return (navigational intent), discriminative power [Sakai06SIGIR,07SIGIR], considering document lengths and search engine snippets, handling nonlinear traversal [Sakai14PROMISE], and how widely each measure is used (TBG: TREC Contextual Suggestion).]
• Users do NOT always scan the ranked list from top to bottom!
Diversified search – a new IR task (since 2003 or so)
• Given an ambiguous/underspecified query, produce a single Search Engine Result Page (SERP) that satisfies different user intents!
• Challenge: balancing relevance and diversity.
[SERP diagram: put highly relevant results near the top and cover many intents; give more space to popular intents? to informational intents?]
• Diversity test collections have relevance assessments for each intent, rather than for each topic.
Diversified search measures: summary
[Comparison table over α-nDCG [Clarke08], ERR-IA [Chapelle11], D#-nDCG [Sakai11SIGIR], DIN#-nDCG and P+Q# [Sakai12WWW], and U-IA [Sakai13SIGIR]; criteria: handling per-intent graded relevance, handling intent probabilities, handling both informational and navigational intents, per-intent diminishing return, discriminative power [Sakai06SIGIR,07SIGIR], the concordance test (does the measure agree with simple measures?) [Sakai12WWW,13IRJ], considering document lengths and search engine snippets, and how widely each measure is used.]
• See also the M-measure at the NTCIR MobileClick task.
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
So you used a test collection that has n=20 topics to
compute nDCG scores for two systems X and Y.
Which system is more effective?
Scores for X, Y: per-topic nDCG scores (table not reproduced).
Per-topic difference for each topic j: d_j = x_j - y_j.
Sample mean of the differences: 0.0750.
Sample variance of the differences: 0.0251.
Random sampling from normal distributions: assume that the per-topic scores of X and Y are random samples from their population distributions, and that the per-topic differences are normally distributed.
Under the above assumptions, the sample mean of the differences d̄ obeys a normal distribution N(μ, σ²/n), where μ is the population mean of the difference and σ² is the population variance.
Which system is more effective? Or, which of these hypotheses is true?
• H0 (null hypothesis): if you look at the populations, X and Y are equally effective (the population mean of the differences is zero).
• H1 (alternative hypothesis): if you look at the populations, X and Y are actually different.
Which of these hypotheses is true? All we have is the sample data.
From the sample we compute the t statistic t0 = d̄ / sqrt(V/n), where d̄ is the sample mean of the differences and V = Σ_j (d_j - d̄)² / (n-1) is their sample variance (a sum of squares divided by its degrees of freedom; the degrees of freedom is the number of independent variables in a sum of squares, i.e. the accuracy of the sum).
If H0 is true, this t statistic obeys a t distribution with φ=(n-1) degrees of freedom.
[Plot: t distribution density curves for φ=4 and φ=99, with the observed value of t0 computed from the sample marked on the horizontal axis.]
P-value: the area under the curve beyond t0 = the probability of observing t0 or something more extreme IF H0 is true.
[Plot: the same t distributions, now with the central (1-α) region and the tails beyond it marked.]
Significance level α: the tail areas under the curve = a pre-determined probability (e.g. 5%) of observing something very rare IF H0 is true.
[Plot: the same t distributions, with tail areas α/2 on each side, the central (1-α) region, and the p-value region beyond t0.]
If p-value <= α, then something highly unlikely (e.g. a 5% chance) under H0 has happened ⇒ H0 is probably wrong, with (1-α) (e.g. 95%) confidence! We reject H0, and say that the difference is statistically significant at the significance level of α. The population means are probably different!
[Plot: the same t distributions, with the p-value region larger than α.]
If p-value > α, then what we have observed is something we expect under H0. We accept H0, and say that the difference is NOT statistically significant at the significance level of α. This just means that we cannot tell from the data whether H0 is true.
Example: paired t-test using Excel
Significance level α = 0.05 (95% confidence)
Sample size n = 20
Degrees of freedom φ = 20-1 = 19
Sample mean of the differences = 0.0750
Sample variance of the differences = 0.0251
t statistic t0 = 0.0750 / sqrt(0.0251/20) = 2.116
p-value = T.DIST.2T( 2.116, 19 ) = 0.048 < α
⇒ X is statistically significantly better than Y at α=0.05.
Mean nDCG over 20 topics: X 0.3450, Y 0.2700.
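The same test computed in Python rather than Excel; this minimal sketch works from the summary statistics reported above, and the final comment shows the scipy call one would use on the raw per-topic scores.

```python
from math import sqrt
from scipy import stats

n = 20                      # topics
mean_diff = 0.0750          # sample mean of the per-topic nDCG differences (from the slide)
var_diff = 0.0251           # sample variance of the differences (from the slide)

t0 = mean_diff / sqrt(var_diff / n)        # t statistic
p = 2 * stats.t.sf(abs(t0), df=n - 1)      # two-sided p-value, like T.DIST.2T
print(t0, p)                               # ~2.117, ~0.048

# With the raw per-topic score lists x and y one would simply call:
# stats.ttest_rel(x, y)
```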
Limitations of significance testing (1)
• Normality assumptions: computer-based alternatives (bootstrap
[Savoy97, Sakai06SIGIR], randomisation test [Smucker07]) that do not
rely on the assumptions are available. But the results are similar to
those obtained by the t-test.
• Dichotomous decision:
p-value = 0.049 < α ⇒ statistically significant! Publish a paper!
p-value = 0.051 > α ⇒ not statistically significant! Put it in the drawer!
Saying “p-value=0.049” is much more informative than
saying “significant at α=0.05”. Report the p-value! [Sakai14forum]
Limitations of significance testing (2)
[Plot: t distributions with the observed t0 and the p-value region.]
We get a statistically significant result whenever the p-value is small ⇔ the t-value is large. The t-value is large when
(a) the sample size n is large; or
(b) the sample effect size d̄/√V is large (the difference measured in standard deviation units).
If n is large, you can get a statistically significant result with ANYTHING!
Limitations of significance testing (3)
[Plot: t distributions with the observed t0 and the p-value region.]
The t-value is large when (a) the sample size n is large; or (b) the sample effect size is large.
Don't just report the p-value. Report the sample effect size! [Sakai14forum]
The sample effect size is 0.4734 in the previous example; this reflects how substantial the difference may be.
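Continuing the same example, the sample effect size is simply the mean difference divided by the standard deviation of the differences:

```python
from math import sqrt

mean_diff, var_diff = 0.0750, 0.0251
effect_size = mean_diff / sqrt(var_diff)   # difference in standard deviation units
print(effect_size)                         # ~0.4734
```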
What about the sample size n?
In significance testing, there are four important parameters. If three of them are set, the fourth one is uniquely determined:
• α: probability of Type I error (detecting a nonexistent difference)
• β: probability of Type II error (missing a true difference)
• effect size: magnitude of the difference
• sample size n: number of topics

             H0 is true   H1 is true
H0 accepted  1-α          β
H0 rejected  α            1-β

Statistical power (1-β): the ability to detect a true difference.
While IR test collections typically have n=50 topics, it is possible to determine the right n by setting α, β, and the minimum effect size that you want to detect [Sakai15IRJ].
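A minimal sketch of solving for the fourth parameter once the other three are fixed, using statsmodels' power analysis for a paired (one-sample-of-differences) t-test; the α, power and minimum effect size below are illustrative choices, not values prescribed by [Sakai15IRJ].

```python
from statsmodels.stats.power import TTestPower

# Fix alpha, power (1 - beta) and the minimum effect size we want to detect,
# then solve for the required number of topics n.
n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                             alternative='two-sided')
print(n)   # ~34 topics
```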
Comparing more than two systems
• Conducting a t-test for every system pair is not good (though there
are exceptions [Sakai15IRJ]) - the familywise error rate problem.
• Use a proper multiple comparison procedure.
• Recommended: randomised Tukey HSD test
[Carterette12,Sakai14PROMISE].
• [Sakai14forum] says do an ANOVA (analysis of variance) test first,
followed by a Tukey HSD test. But this also causes a problem similar
to the familywise error rate. If you are interested in the difference
between every system pair, conduct Tukey without conducting
ANOVA.
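A minimal sketch (my own implementation, not the Discpower tool) of the randomised Tukey HSD idea for a topics-by-systems score matrix: the null distribution of the largest difference between system means is built by randomly permuting each topic's row, and every observed pairwise difference is compared against that single distribution.

```python
import itertools
import numpy as np

def randomised_tukey_hsd(scores, n_trials=10_000, seed=0):
    """scores: (n_topics, n_systems) array of per-topic effectiveness scores.
    Returns {(i, j): p-value} for every system pair, controlling the familywise error rate."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=0)

    # Null distribution of the largest difference between any two system means,
    # obtained by shuffling the system labels independently within each topic.
    max_ranges = np.empty(n_trials)
    for trial in range(n_trials):
        permuted = np.array([rng.permutation(row) for row in scores])
        m = permuted.mean(axis=0)
        max_ranges[trial] = m.max() - m.min()

    return {(i, j): float(np.mean(max_ranges >= abs(means[i] - means[j])))
            for i, j in itertools.combinations(range(scores.shape[1]), 2)}

# Example: 20 topics x 3 systems of random nDCG-like scores.
demo = np.random.default_rng(1).uniform(0, 1, size=(20, 3))
print(randomised_tukey_hsd(demo, n_trials=2000))
```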
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
Summary
• Ranked retrieval measures serve as surrogates of user satisfaction/performance, with different sets of assumptions. They may be compared using discriminative power [Sakai06SIGIR,07SIGIR], the concordance test [Sakai12WWW,13IRJ], etc. We want measures that reliably measure what we want to measure!
• Principles and limitations of statistical significance testing, esp. the paired t-test. Report the p-values and effect sizes [Sakai14forum]. Type I errors, Type II errors (1-power), effect sizes and sample sizes. A multiple comparison procedure should be used for more than two systems.
Let’s write good IR papers!
Tools (by Tetsuya Sakai)
• NTCIREVAL (computes various evaluation measures)
http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html
• BOOTS (bootstrap hypothesis test as an alternative to the t-test)
http://research.nii.ac.jp/ntcir/tools/boots-en.html
• Discpower (randomisation test as an alternative to the t-test,
and randomised Tukey HSD test for comparing more than two systems)
http://research.nii.ac.jp/ntcir/tools/discpower-en.html
• Topic set size design Excel tools (how many topics do we need?):
http://www.f.waseda.jp/tetsuya/tools.html
LECTURE OUTLINE
1. Why evaluate?
2. Set retrieval evaluation measures
3. Ranked retrieval evaluation measures
4. More evaluation measures
5. Statistical significance, power, effect sizes
6. Summary
7. References
References (1)
[Armstrong09] Armstrong, T.G., Moffat, A., Webber, W. and Zobel, J.: Improvements that Don’t Add Up: Ad-hoc Retrieval Results Since
1998, ACM CIKM 2009, pp.601-610, 2009.
[Buckley05] Buckley, C. and Voorhees, E.M.: Retrieval System Evaluation, In TREC: Experiment and Evaluation in Information Retrieval
(Voorhees, E.M. and Harman, D.K., eds.), Chapter 3, The MIT Press, 2005.
[Burges05] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to Rank Using Gradient
Descent, ICML 2005, pp.89-96, 2005.
[Chapelle09] Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected Reciprocal Rank for Graded Relevance, ACM CIKM 2009,
pp.621-630, 2009.
[Chapelle11] Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based Diversification of Web Search Results: Metrics
and Algorithms, Information Retrieval, 14(6), pp.572-592, 2011.
[Clarke08] Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Buttcher, S. and MacKinnon, I.: Novelty and Diversity in
Information Retrieval Evaluation, ACM SIGIR 2008, pp.659-666, 2008.
[Cooper73] Cooper, W.S.: On Selecting a Measure of Retrieval Effectiveness, JASIS 24(2), pp.87–100, 1973.
References (2)
[Jarvelin02] Jarvelin, K. and Kekalainen, J.: Cumulated Gain-based Evaluation of IR Techniques, ACM TOIS, 20(4), pp.422-446, 2002.
[Pollack68] Pollack, S.M.: Measures for the Comparison of Information Retrieval Systems, American Documentation, 19(4), pp.387-
397, 1968.
[Robertson08] Robertson, S.E.: A New Interpretation of Average Precision, ACM SIGIR 2008, pp.689-690, 2008.
[Robertson10] Robertson, S.E., Kanoulas, E., Yilmaz, E.: Extending Average Precision to Graded Relevance Judgments, ACM SIGIR 2010,
pp.603-610, 2010.
[Savoy97] Savoy, J.: Statistical Inference in Retrieval Effectiveness Evaluation, Information Processing and Management, 33(4), pp.495-
512, 1997.
References (3)
[Sakai05AIRS] Sakai, T.: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS 2004 (LNCS 3411), pp.251-262, 2005.
[Sakai06SIGIR] Sakai, T.: Evaluating Evaluation Metrics based on the Bootstrap, ACM SIGIR 2006, pp.525-532, 2006.
[Sakai07SIGIR] Sakai, T.: Alternatives to Bpref, ACM SIGIR 2007, pp.71-78, 2007.
[Sakai07IPM] Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and
Management, 43(2), pp.531-548, 2007.
[Sakai08EVIA] Sakai, T. and Robertson, S.: Modelling A User Population for Designing Information Retrieval Metrics, EVIA 2008, pp.30-
41, 2008.
[Sakai11SIGIR] Sakai, T. and Song, R.: Evaluating Diversified Search Results Using Per-Intent Graded Relevance, ACM SIGIR 2011,
pp.1043-1052, 2011.
[Sakai12WWW] Sakai, T.: Evaluation with Informational and Navigational Intents, WWW 2012, pp.499-508, 2012.
[Sakai13SIGIR] Sakai, T. and Dou, Z.: Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation,
ACM SIGIR 2013, pp.473-482, 2013.
[Sakai13IRJ] Sakai, T. and Song, R.: Diversified Search Evaluation: Lessons from the NTCIR-9 INTENT Task, Information Retrieval, 16(4),
pp.504-529, Springer, 2013.
[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests, PROMISE Winter School 2013: Bridging between Information Retrieval and
Databases (LNCS 8173), Springer, pp.116-163, 2014.
[Sakai14forum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), pp.3-12, 2014.
[Sakai15IRJ] Sakai, T.: Topic Set Size Design, Information Retrieval Journal, submitted.
References (4)
[Smucker07] Smucker, M.D., Allan, J. and Carterette, B.: A Comparison of Statistical Significance Tests for Information Retrieval
Evaluation, ACM CIKM 2007, pp.623-632, 2007.
[Smucker12] Smucker, M.D. and Clarke, C.L.A.: Time-based Calibration of Effectiveness Measures, ACM SIGIR 2012, pp. 95–104 , 2012.
[vanRijsbergen79] van Rijsbergen, C.J., Information Retrieval, Chapter 7, Butterworths, 1979.