1) The document proposes using worst-case confidence interval (WCW) widths to compare the stability of evaluation measures across different topic set sizes.
2) It describes how to calculate WCW curves for measures by using statistical tools to determine the topic set size needed for a given WCW. These curves allow comparison of measure stability over a range of set sizes.
3) The author responds to reviewer comments, acknowledging some difficulty in presentation but arguing WCW analysis provides a valid and useful way to evaluate measures that considers stability across set sizes.
7. Discriminative power [Sakai06,07]
Topic set -> system pairs (System1 vs System2, System1 vs System3, System2 vs System3) -> (randomised) Tukey HSD test etc.

Pair    Measure 1    Measure 2
1<2     p=0.003      p=0.201
1<3     p=0.035      p=0.523
2>3     p=0.071      p=0.721

Stable measures give us more confidence given a topic set size.
With Measure 1, we conclude 1<2 and 1<3 (both p-values are below 0.05). With Measure 2, no pair is significant, so there is no conclusion.
Measure 1 is more discriminative.
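
The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from [Sakai06,07]: it takes the p-values shown on the slide as given (it does not run the Tukey HSD test itself) and reports the fraction of system pairs that come out significant for each measure. The threshold of 0.05 and the names `ALPHA`, `significant_pairs` and `discriminative_power` are assumptions made for the example; the slide does not state a significance level.

```python
# Assumed significance level; the slide does not state one explicitly.
ALPHA = 0.05

# p-values for each system pair, copied from the slide.
measure1 = {"1<2": 0.003, "1<3": 0.035, "2>3": 0.071}
measure2 = {"1<2": 0.201, "1<3": 0.523, "2>3": 0.721}


def significant_pairs(p_values, alpha=ALPHA):
    """Return the system pairs whose p-value is below alpha."""
    return [pair for pair, p in p_values.items() if p < alpha]


def discriminative_power(p_values, alpha=ALPHA):
    """Fraction of system pairs found significantly different at alpha."""
    if not p_values:
        return 0.0
    return len(significant_pairs(p_values, alpha)) / len(p_values)


if __name__ == "__main__":
    for name, p_values in (("Measure 1", measure1), ("Measure 2", measure2)):
        pairs = significant_pairs(p_values)
        power = discriminative_power(p_values)
        print(f"{name}: significant pairs = {pairs}, fraction = {power:.2f}")
```

Under these assumptions, Measure 1 separates two of the three pairs and Measure 2 separates none, which matches the slide's conclusion that Measure 1 is more discriminative.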