Evia2017wcw

Evaluating Evaluation Measures with Worst‐Case Confidence Interval Widths
Tetsuya Sakai
tetsuyasakai@acm.org
Waseda University, Japan
December 5, 2017 @ EVIA 2017, NII

Which evaluation measures should we use?
An example of existing work:
This paper says that we don’t need
to report P@10 if we report AP

TALK OUTLINE
1. Popular methods for evaluating evaluation
measures
2. Topic set size design
3. Evaluating evaluation measures with WCW
4. EVIA reviewers’ comments
5. Conclusions and future work

Rank correlation
Does not tell us which measure is better.
Merely tells us whether one measure is similar to
another.
Ranking by Measure M1: System A, System B, System C
Ranking by Measure M2: System A, System C, System B
Correlation statistics: Kendall’s τ, τ_AP, Spearman’s ρ
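As a rough illustration of the three-system example above, the following sketch computes Kendall's τ for two tie-free rankings. The function name `kendall_tau` is introduced here for illustration only; it is not from the talk.

```python
from itertools import combinations

def kendall_tau(ranking_1, ranking_2):
    """Kendall's tau between two rankings of the same items (no ties assumed)."""
    pos_1 = {item: i for i, item in enumerate(ranking_1)}
    pos_2 = {item: i for i, item in enumerate(ranking_2)}
    concordant = discordant = 0
    for a, b in combinations(ranking_1, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_1[a] - pos_1[b]) * (pos_2[a] - pos_2[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

m1 = ["System A", "System B", "System C"]
m2 = ["System A", "System C", "System B"]
tau = kendall_tau(m1, m2)
print(round(tau, 3))  # one of the three pairs is swapped: (2 - 1) / 3
```

The value only says how similar the two rankings are; it carries no information about which measure is the better one.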

Preference agreement [Sanderson+10]
query → search → SERP1, SERP2
Assessor preference: SERP1 > SERP2
Measure A: SERP1 > SERP2 (Agreement)
Measure B: SERP1 < SERP2 (Disagreement)
This is great but:
‐ assessor ≠ searcher
‐ many assessors required to ensure reliability
‐ many assessors = high cost
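A minimal sketch of how an agreement rate could be tallied, assuming each SERP pair comes with one assessor preference and one score per SERP from the measure. The data and the name `agreement_rate` are hypothetical.

```python
def agreement_rate(assessor_prefs, measure_scores):
    """Fraction of SERP pairs where the measure prefers the same SERP as the assessor.

    assessor_prefs: list of 1 or 2 (the SERP the assessor preferred)
    measure_scores: list of (score_serp1, score_serp2); tied pairs are skipped
    """
    agree = total = 0
    for pref, (s1, s2) in zip(assessor_prefs, measure_scores):
        if s1 == s2:
            continue  # the measure expresses no preference
        total += 1
        measure_pref = 1 if s1 > s2 else 2
        if measure_pref == pref:
            agree += 1
    return agree / total

# Hypothetical data for four SERP pairs.
prefs = [1, 1, 2, 1]
measure_a = [(0.8, 0.5), (0.6, 0.4), (0.3, 0.7), (0.9, 0.2)]
measure_b = [(0.4, 0.6), (0.6, 0.4), (0.3, 0.7), (0.1, 0.2)]
rate_a = agreement_rate(prefs, measure_a)
rate_b = agreement_rate(prefs, measure_b)
print(rate_a, rate_b)
```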

Swap method [Voorhees+02]
For a given measure M…
Topic set → do a random split into Split half 1 and Split half 2, B times
Split half 1: System1 < System2, System1 < System3, System2 > System3
Split half 2: System1 < System2, System1 < System3, System2 < System3 (Swap!)
Stable measures give us consistent results regardless of the topic set
Given 50 topics, we can only discuss the case with 25 topics directly
unless bootstrap topic sets are used [Sakai06]
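The split-half idea can be sketched as below. This is a simplified version that only counts how often two disjoint halves disagree on the order of a system pair; it does not bin pairs by score difference, and the synthetic scores and the name `swap_rate` are assumptions made for illustration.

```python
import random
from itertools import combinations

def swap_rate(scores, num_splits=1000, seed=0):
    """scores[t][s] = score of system s on topic t (for one measure M)."""
    rng = random.Random(seed)
    num_topics = len(scores)
    num_systems = len(scores[0])
    half = num_topics // 2
    swaps = comparisons = 0
    for _ in range(num_splits):
        topics = list(range(num_topics))
        rng.shuffle(topics)
        half_1, half_2 = topics[:half], topics[half:2 * half]
        for a, b in combinations(range(num_systems), 2):
            # Equal-sized halves, so summed differences order systems like means do.
            diff_1 = sum(scores[t][a] - scores[t][b] for t in half_1)
            diff_2 = sum(scores[t][a] - scores[t][b] for t in half_2)
            comparisons += 1
            if diff_1 * diff_2 < 0:  # the halves disagree: a swap
                swaps += 1
    return swaps / comparisons

# Synthetic 50-topic score matrix for three systems (not real data).
data_rng = random.Random(42)
true_quality = [0.30, 0.35, 0.50]
scores = [[min(1.0, max(0.0, q + data_rng.gauss(0, 0.15))) for q in true_quality]
          for _ in range(50)]
rate = swap_rate(scores, num_splits=200)
print(rate)
```

With 50 topics each half holds 25, which is why the method speaks directly only about topic sets of size 25.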

Discriminative power [Sakai06,07]
Topic set → (Randomised) Tukey HSD test etc. for every system pair
(System1 vs System2, System1 vs System3, System2 vs System3)
Measure 1: 1<2 (p=0.003), 1<3 (p=0.035), 2>3 (p=0.071)
Measure 2: 1<2 (p=0.201), 1<3 (p=0.523), 2>3 (p=0.721)
With Measure 1, we conclude 1<2, 1<3. With Measure 2, there is no conclusion.
Measure 1 is more discriminative.
Stable measures give us more confidence given a topic set size
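A minimal sketch using the p-values from the example above, assuming discriminative power is summarised as the fraction of system pairs found significantly different at a chosen α (0.05 here). The significance test itself is not implemented; the p-values are taken as given.

```python
def discriminative_power(p_values, alpha=0.05):
    """Fraction of system pairs whose p-value falls below alpha."""
    significant = sum(1 for p in p_values.values() if p < alpha)
    return significant / len(p_values)

# p-values per system pair, copied from the slide's example.
measure_1 = {("1", "2"): 0.003, ("1", "3"): 0.035, ("2", "3"): 0.071}
measure_2 = {("1", "2"): 0.201, ("1", "3"): 0.523, ("2", "3"): 0.721}

dp_1 = discriminative_power(measure_1)
dp_2 = discriminative_power(measure_2)
print(dp_1, dp_2)  # Measure 1 separates two of three pairs, Measure 2 none
```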

Topic set size design [Sakai16]
• Applies sample size design from statistics [Nagata03]
• Based on statistical requirements and a variance
estimate of a particular evaluation measure, obtain the
right topic set size
• Sakai’s three tools
‐ t‐test‐based tool
http://www.f.waseda.jp/tetsuya/BOOK/samplesizeTTEST2.xlsx
‐ one‐way ANOVA‐based tool
http://www.f.waseda.jp/tetsuya/BOOK/samplesizeANOVA2.xlsx
‐ confidence interval‐based tool
http://www.f.waseda.jp/tetsuya/BOOK/samplesizeCI2.xlsx

samplesizeANOVA2.xlsx
(one‐way ANOVA statistical power)
INPUT:
α (Type I error probability: detecting a nonexistent diff)
β (Type II error probability: missing a true diff)
[Select a sheet for (α, β)]
m (#systems to compare)
σ̂² (estimated common variance)
minD (minimum detectable range)
OUTPUT:
n (topic set size required)
When the true diff between best and worst systems is minD or larger,
ensure 100(1‐β)% statistical power

samplesizeCI2.xlsx
INPUT:
α (for 100(1‐α)% confidence interval)
δ (Worst‐case Confidence interval Width)
(estimated variance for the difference between two systems)
OUTPUT:
n (topic set size required)
The CI for the diff between ANY system pair should be no wider than WCW (=δ)
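To make the input/output relationship concrete, here is a rough sketch that uses a plain normal approximation: the CI for a mean difference has width 2·z·sqrt(var_diff/n), so requiring width ≤ δ gives n ≥ 4·z²·var_diff/δ². This is a stand-in, not the spreadsheet's actual computation, and the variance value is hypothetical.

```python
import math
from statistics import NormalDist

def topic_set_size_for_ci(alpha, delta, var_diff):
    """Smallest n whose normal-approximation CI width is at most delta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return math.ceil(4 * z * z * var_diff / (delta * delta))

# Hypothetical inputs: 95% CI, WCW of 0.15, difference variance of 0.1.
n = topic_set_size_for_ci(alpha=0.05, delta=0.15, var_diff=0.1)
print(n)
```

Because the approximation ignores the t distribution and any correction the tool applies, it should be read as indicative only, especially for small n.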

How to estimate the common variance
Given a topic‐by‐run score matrix for a particular evaluation measure,
obtain the residual variance VE of ANOVA. This is an unbiased estimate
of the common variance σ². (For the ANOVA‐based tool)
In this study, we let the variance estimate for the difference between
two systems be 2VE. (For the CI‐based tool)
(If two sets of scores have the same variance, the variance of the
score differences is double that in the worst case.)
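A minimal sketch of the variance estimate, assuming a one-way ANOVA with runs as groups and topics as observations, so that VE is the within-run sum of squares divided by m(n − 1). Whether the talk uses exactly this ANOVA layout is an assumption here; the tiny matrix is made up.

```python
def residual_variance(matrix):
    """matrix[t][s] = score of run s on topic t; returns the one-way ANOVA V_E."""
    n = len(matrix)        # number of topics
    m = len(matrix[0])     # number of runs
    ss_within = 0.0
    for s in range(m):
        column = [matrix[t][s] for t in range(n)]
        mean = sum(column) / n
        ss_within += sum((x - mean) ** 2 for x in column)
    return ss_within / (m * (n - 1))

# Made-up 3-topic by 2-run matrix.
matrix = [[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]]
v_e = residual_variance(matrix)   # input for the ANOVA-based tool
var_diff = 2 * v_e                # input for the CI-based tool
print(v_e, var_diff)
```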

Proposal: use WCW curves to compare evaluation measure stability (1)
• Using samplesizeCI2 with estimated variances for the measures, we
should be able to draw curves like these:
[Figure: WCW plotted against topic set size, one curve per measure]
Comparison of measures in terms of WCW is valid since we want CIs to be
as tight as possible for any measure

[Figure: WCW plotted against topic set size, one curve per measure]
Unlike discriminative power and the swap method, we can easily consider
a wide range of topic set sizes

[Figure: WCW plotted against topic set size, one curve per measure]
For a given topic set size, we can discuss the diffs across measures
that practically matter:
n=50 ⇒ WCW≈0.15 for Q and nDCG
and WCW>0.20 for AP and nERR

Another example (more in paper)
According to this data set,
D‐nDCG and D#‐nDCG achieve
smaller WCWs than α‐nDCG and
nERR‐IA

How to draw a WCW curve in practice
1. Compute a variance estimate from a topic‐by‐run matrix of the
measure in question.
2. Instead of entering (α=0.05, δ=WCW, variance estimate for the
difference) to samplesizeCI2, enter (α=0.05, β=0.20, m=10, common
variance estimate, minD=WCW) to samplesizeANOVA2 to obtain the topic
set size.
3. Try different WCWs and record the n’s to draw the curve.
samplesizeCI2 cannot handle large topic set sizes due to a limitation in Excel.
Why samplesizeANOVA2 can be used instead can be explained analytically (see paper).
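For readers without the spreadsheets, a curve of the same general shape can be sketched by going the other way: fix n and compute the width 2·z·sqrt(var_diff/n) under a normal approximation. This is not the procedure above (which reads n off samplesizeANOVA2 for each WCW), and the measure names and variances below are hypothetical.

```python
import math
from statistics import NormalDist

def wcw(n, var_diff, alpha=0.05):
    """Normal-approximation CI width for a mean difference over n topics."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * math.sqrt(var_diff / n)

# Hypothetical difference variances for two measures.
hypothetical_var_diff = {"measure_X": 0.06, "measure_Y": 0.12}
topic_set_sizes = [10, 25, 50, 100, 200, 400]

curves = {
    name: [(n, wcw(n, v)) for n in topic_set_sizes]
    for name, v in hypothetical_var_diff.items()
}
for name, points in curves.items():
    print(name, [(n, round(w, 3)) for n, w in points])
```

A measure with a smaller variance estimate yields a curve that lies below the other at every topic set size, which is the comparison the WCW curves are meant to support.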

Reviewer 1
"The paper is clearly relevant to EVIA and I am
confident that the audience will ask questions to
help them grasp what the paper is contributing, but
it was hard going."
⇒ Apologies… This is a sequel to my IRJ paper, but
making it stand‐alone while avoiding self‐plagiarism
was very tough. I will try to do a better job in the
book I’m writing:
Laboratory Experiments in Information Retrieval:
Sample Sizes, Effect Sizes, and Statistical Power (Springer)

Reviewer 3
"So, somehow, I got to the end of reading the paper
and didn't feel that I'd picked up even the basis of
what it was claiming. And I didn't get told what a
WCW was, sorry."
⇒ Apologies again. See my responses to Reviewer 1.
But please note: even the abstract (of the submitted
version) says: “WCW is the worst‐case width of a
confidence interval (CI) for the difference between
any two systems, given a topic set size.”

Reviewer 2
“I found the paper certainly interesting, a great
match to Evia, I have strong doubts about whether
this is a good way to evaluate evaluation measures,
and what purpose does it serve to have a powerful
enough measure.”
⇒ My view is that statistical stability of a measure is
a useful property. We can use smaller topic sets.
That’s more economical. Note that I never claimed
that high stability (power) is a sufficient condition for
a good evaluation measure.

Conclusions and future work
Advantages of using WCW‐curves for comparing
evaluation measures:
‐ Comparison in terms of WCW is valid because we want
a tight CI with any measure;
‐ Unlike discriminative power and the swap method, it
provides a reliable view across different topic set sizes;
‐ For a given topic set size, we can discuss the differences
in WCW that practically matter.
FUTURE WORK: Apply WCW to a wide range of tasks and
measures, and compare with discriminative power etc.

References
[Nagata03] How to Design the Sample Size (in Japanese),
Asakura Shoten, 2003.
[Sakai06] Evaluating Evaluation Metrics based on the
Bootstrap, ACM SIGIR 2006.
[Sakai07] Alternatives to Bpref, ACM SIGIR 2007.
[Sakai16] Topic Set Size Design, Information Retrieval
Journal 19(3), 2016.
http://link.springer.com/content/pdf/10.1007%2Fs10791‐015‐9273‐z.pdf
(open access)
[Sanderson+10] Do User Preferences and Evaluation
Measures Line Up? ACM SIGIR 2010.
[Voorhees+02] The Effect of Topic Set Size on Retrieval
Experiment Error, ACM SIGIR 2002.
