SIGIR 2013
Dublin, Ireland · July 30th
Picture by Philip Milne
On the Measurement of
Test Collection Reliability
@julian_urbano University Carlos III of Madrid
Mónica Marrero University Carlos III of Madrid
Diego Martín Technical University of Madrid
Gratefully supported
by Student Travel Grant
Is System A More Effective
than System B?
[Figure: scale of Δeffectiveness from -1 to 1, marking the observed difference 𝒅 against 0]
Get a test collection and evaluate
Measure the average difference 𝒅
and conclude which one is better
Samples
Test collections are samples from a
larger, possibly infinite, population
Documents, queries and assessors
𝒅 is only an estimate
How reliable is our conclusion?
Reliability vs. Cost
Building reliable collections is easy…
Just use more documents, more queries,
more assessors
…but it is prohibitively expensive
Our best bet is to increase query set size
Data-based approach
1. Randomly split query set
2. Compute indicators of reliability
based on those two subsets
3. Extrapolate to larger query sets
…with some variations
Voorhees’98, Zobel’98, Buckley & Voorhees’00,
Voorhees & Buckley’02, Sanderson & Zobel’05,
Sakai’07, Voorhees’09
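Steps 1 and 2 of the data-based approach can be sketched as follows. This is a minimal illustration, not the procedure of any of the cited papers: the score matrix is synthetic, the names `mean_scores`, `kendall_tau` and `split_half_tau` are made up for this example, and step 3 (extrapolation to larger query sets) is left out.

```python
import random
from itertools import combinations


def mean_scores(scores, queries):
    """Mean score of every system over the given query indices."""
    return [sum(row[q] for q in queries) / len(queries) for row in scores]


def kendall_tau(x, y):
    """Kendall tau-a between two score vectors (tied pairs count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def split_half_tau(scores, rng):
    """Randomly split the query set in two and correlate the two system rankings."""
    n_queries = len(scores[0])
    queries = list(range(n_queries))
    rng.shuffle(queries)
    half = n_queries // 2
    first, second = queries[:half], queries[half:2 * half]
    return kendall_tau(mean_scores(scores, first), mean_scores(scores, second))


# Synthetic scores[system][query], for illustration only (assumption).
rng = random.Random(0)
quality = [0.2, 0.3, 0.4, 0.5, 0.6]
scores = [[min(1.0, max(0.0, rng.gauss(m, 0.15))) for _ in range(50)] for m in quality]

taus = [split_half_tau(scores, rng) for _ in range(100)]
print("mean tau over 100 random splits:", sum(taus) / len(taus))
```

Each half only contains half of the queries, which is why the indicators have to be extrapolated afterwards to the size of the full set and beyond.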
Data-based Reliability Indicators
based on results with two collections
Kendall 𝝉 correlation
stability of the ranking of systems
𝝉 𝑨𝑷 correlation
adds a top-heaviness component
Absolute sensitivity
minimum absolute 𝒅 s.t. swaps <5%
Relative sensitivity
minimum relative 𝒅 s.t. swaps <5%
Power ratio
statistically significant results
Minor conflict ratio
statistically non-significant swap
Major conflict ratio
statistically significant swap
RMSE
differences in 𝒅
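To make the "top-heaviness" of the AP correlation concrete, here is a small sketch using the usual definition of the coefficient: for each system below the top of the evaluated ranking, take the fraction of systems above it that the reference ranking also places above it, average those fractions, and rescale to [-1, 1]. The function name `tau_ap` and the toy scores are assumptions for this example, and ties are not handled.

```python
def tau_ap(reference, estimated):
    """AP correlation of `estimated` system scores against `reference` scores.

    Both arguments are lists with one score per system; assumes no ties.
    """
    n = len(estimated)
    # Systems sorted from best to worst according to the estimated scores.
    order = sorted(range(n), key=lambda s: estimated[s], reverse=True)
    total = 0.0
    for i in range(1, n):
        current = order[i]
        # Systems ranked above `current` that the reference also ranks above it.
        correct = sum(1 for above in order[:i] if reference[above] > reference[current])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


reference = [4, 3, 2, 1]
swap_top = [3, 4, 2, 1]      # the two best systems are swapped
swap_bottom = [4, 3, 1, 2]   # the two worst systems are swapped
print(tau_ap(reference, swap_top), tau_ap(reference, swap_bottom))
```

Both perturbed rankings contain a single swap, so Kendall τ treats them the same, while the AP correlation penalizes the swap at the top more heavily.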
Generalizability Theory
Directly address variability of scores
G-study
Estimate variance components
from previous, representative, data
D-study
Estimate reliability based on
estimated variance components
G-study
$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$$
$\sigma_s^2$: system differences, our goal!
$\sigma_q^2$: query difficulty
$\sigma_{s:q}^2$: some systems better for some queries
Estimated using Analysis of Variance
From previous data,
usually an existing test collection
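As an illustration of the G-study, the variance components can be estimated from a systems-by-queries score matrix with the textbook expected-mean-squares formulas for a two-way design with one score per cell. This is a sketch under assumptions, not the authors' R code: the function name `g_study` is made up, negative estimates are truncated to zero, and with a single score per cell the interaction component cannot be separated from residual error.

```python
def g_study(scores):
    """Estimate (var_s, var_q, var_sq) from scores[system][query] via ANOVA."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[q] for row in scores) / n_s for q in range(n_q)]

    # Mean squares for systems, queries and the residual (interaction + error).
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ms_res = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s)
        for q in range(n_q)
    ) / ((n_s - 1) * (n_q - 1))

    # Solve the expected mean squares for the components; truncate at zero.
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    var_q = max((ms_q - ms_res) / n_s, 0.0)
    var_sq = ms_res
    return var_s, var_q, var_sq


# Toy matrix: 3 systems by 4 queries, made-up scores.
toy = [
    [0.30, 0.10, 0.50, 0.20],
    [0.40, 0.15, 0.55, 0.35],
    [0.45, 0.30, 0.70, 0.40],
]
print(g_study(toy))
```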
D-study
Relative stability
$$E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_{s:q}^2}{n_q'}}$$
Absolute stability
$$\Phi = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_q^2 + \sigma_{s:q}^2}{n_q'}}$$
Easy to estimate how many queries we
need for a certain stability level
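The two D-study coefficients and the required number of queries follow directly from the formulas above. Solving $E\rho^2 \geq t$ for $n_q'$ gives $n_q' \geq \frac{t}{1-t} \cdot \frac{\sigma_{s:q}^2}{\sigma_s^2}$, and the same with $\sigma_q^2 + \sigma_{s:q}^2$ in the numerator for $\Phi$. The sketch below assumes the variance components are already estimated; the function names and the example values are made up.

```python
import math


def relative_stability(var_s, var_sq, n_q):
    """E rho^2 for a collection with n_q queries."""
    return var_s / (var_s + var_sq / n_q)


def absolute_stability(var_s, var_q, var_sq, n_q):
    """Phi for a collection with n_q queries."""
    return var_s / (var_s + (var_q + var_sq) / n_q)


def queries_for_relative(var_s, var_sq, target):
    """Smallest n_q such that E rho^2 reaches the target."""
    return math.ceil(target * var_sq / (var_s * (1.0 - target)))


def queries_for_absolute(var_s, var_q, var_sq, target):
    """Smallest n_q such that Phi reaches the target."""
    return math.ceil(target * (var_q + var_sq) / (var_s * (1.0 - target)))


# Made-up variance components, for illustration only.
var_s, var_q, var_sq = 0.010, 0.030, 0.021
print(relative_stability(var_s, var_sq, 50))
print(absolute_stability(var_s, var_q, var_sq, 50))
print(queries_for_relative(var_s, var_sq, 0.95))
print(queries_for_absolute(var_s, var_q, var_sq, 0.95))
```

Because $\sigma_q^2$ only enters $\Phi$, absolute stability never exceeds relative stability for the same number of queries, and it needs at least as many queries to reach the same target.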
Generalizability Theory
Proposed by Bodoff’07
Kanoulas & Aslam’09
derive optimal gain & discount in nDCG
TREC Million Query Track
≈80 queries sufficient for stable rankings
≈130 queries for stable absolute scores
In this Paper / Talk
How sensitive is the D-study to the
initial data used in the G-study?
How to interpret G-theory in practice,
why $E\rho^2 = 0.95$ and $\Phi = 0.95$?
From the above two, review the
reliability of >40 TREC test collections
variability of G-theory
indicators of reliability
Data
43 TREC collections
from TREC-3 to TREC 2011
12 tasks across 10 tracks
Ad Hoc, Web, Novelty, Genomics,
Robust, Terabyte, Enterprise, Million
Query, Medical and Microblog
Experiment
Vary number of queries in G-study
from $n_q = 5$ to full set
Use all runs available
Run D-study
Compute $E\rho^2$, $\Phi$
Compute $n_q'$ to reach 0.95 stability
200 random trials
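One possible reading of this experiment, sketched with synthetic data: in each trial, draw a random subset of queries, estimate the variance components from that subset only, and project $E\rho^2$ for the full query set size. The data, the function names and the choice to report only the minimum and maximum are assumptions of this example, not the setup used in the paper.

```python
import random


def variance_components(scores):
    """ANOVA estimates (var_s, var_sq) from scores[system][query]."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[q] for row in scores) / n_s for q in range(n_q)]
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_res = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s)
        for q in range(n_q)
    ) / ((n_s - 1) * (n_q - 1))
    return max((ms_s - ms_res) / n_q, 0.0), ms_res


def spread_of_e_rho2(scores, subset_size, n_trials, rng):
    """Min and max projected E rho^2 over random query subsets of a given size."""
    n_q = len(scores[0])
    estimates = []
    for _ in range(n_trials):
        subset = rng.sample(range(n_q), subset_size)
        sub = [[row[q] for q in subset] for row in scores]
        var_s, var_sq = variance_components(sub)
        denominator = var_s + var_sq / n_q  # D-study for the full set size
        estimates.append(var_s / denominator if denominator > 0 else 0.0)
    return min(estimates), max(estimates)


# Synthetic collection: 20 systems, 50 queries (assumption, illustration only).
rng = random.Random(1)
difficulty = [rng.gauss(0.0, 0.10) for _ in range(50)]
scores = [
    [0.3 + 0.01 * s + difficulty[q] + rng.gauss(0.0, 0.10) for q in range(50)]
    for s in range(20)
]

for size in (5, 10, 25, 50):
    low, high = spread_of_e_rho2(scores, size, 200, rng)
    print(size, round(low, 3), round(high, 3))
```

With all 50 queries every trial uses the same data, so the spread collapses to zero; the smaller the subset, the wider the range of estimates.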
Variability due to queries
We may get $E\rho^2 = 0.9$ or
$E\rho^2 = 0.3$, depending on
what 10 queries we use
Experiment (II)
The same, but vary number of systems
from $n_s = 5$ to full set
Use all queries available
200 random trials
Variability due to systems
We may get $E\rho^2 = 0.9$ or
$E\rho^2 = 0.5$, depending on
what 20 systems we use
Results
G-Theory is very sensitive to initial data
Need about 50 queries and 50 systems for
differences in $E\rho^2$ and $\Phi$ below 0.1
Number of queries for $E\rho^2 = 0.95$
may change by orders of magnitude
Microblog2011 (all 184 systems and 30 queries):
need 63 to 133 queries
Medical2011 (all 34 queries and 40 systems):
need 109 to 566 queries
Use Confidence Intervals
Bodoff’08
Confidence intervals in G-study
But what about the D-study?
Feldt’65 and Arteaga et al.’82
Work reasonably well even when
assumptions are violated (Brennan’01)
Example
Account for variability
in initial data
Required number of
queries to reach the
lower end of the interval
Summary in TREC
that is, the 43 collections we study here
$E\rho^2$: mean=0.88 sd=0.1
95% conf. intervals are 0.1 long
$\Phi$: mean=0.74 sd=0.2
95% conf. intervals are 0.19 long
interpretation of G-Theory
indicators of reliability
Experiment
Split query set in 2 subsets
from $n_q = 10$ to full set / 2
Use all runs available
Run D-study
Compute $E\rho^2$ and $\Phi$ and map onto $\tau$,
sensitivity, power, conflicts, etc.
50 random trials
>28,000 datapoints
Example: $E\rho^2 \to \tau$
$E\rho^2 = 0.95 \to \tau \approx 0.85$
$\tau = 0.9 \to E\rho^2 \approx 0.97$
[Figure labels: Million Query 2007, Million Query 2008]
*All mappings in the paper
Future Predictions
Allows us to make more informed
decisions within a collection
What about a new collection?
Fit a single model for each mapping
with 90% and 95% prediction intervals
Assess whether a larger collection
is really worth the effort
Example: $E\rho^2 \to \tau$
[Figure: mapping plot marking the current collection and the target]
*All mappings in the paper
Example: $\Phi \to$ relative sensitivity
review of TREC collections
Outline
Estimate $E\rho^2$ and $\Phi$, with 95%
confidence intervals, and full query set
Map onto $\tau$, sensitivity, power,
conflicts, etc.
Results within task offer historical
perspective since 1994
Example: Ad Hoc 3-8
$E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$
minor conflicts $\in [0.6, 8.2]\%$
major conflicts $\in [0.02, 1.38]\%$
Queries to get $E\rho^2 = 0.95$: $[37, 233]$
Queries to get $\Phi = 0.95$: $[116, 999]$
50 queries were used
*All collections and mappings in the paper
Example: Web Ad Hoc
TREC-8 to TREC-2001: WT2g and WT10g
$E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$
Queries to get $E\rho^2 = 0.95$: $[40, 220]$
TREC-2009 to TREC-2011: ClueWeb09
$E\rho^2 \in [0.8, 0.83] \to \tau \in [0.53, 0.59]$
Queries to get $E\rho^2 = 0.95$: $[107, 438]$
50 queries were used
Historical Trend
Decreasing within and across tracks?
Systems getting better for specific problems?
Increasing task-specificity in queries?
summing up
Generalizability Theory
Regarded as a more appropriate,
easy-to-use and powerful tool
to assess test collection reliability
Very sensitive to the initial data
used to estimate variance components
Almost impossible to interpret
in practical terms
Sensitivity of G-Theory
About 50 queries and 50 systems
are needed for robust estimates
Caution if building a new collection
Can always use confidence intervals
Interpretation of G-Theory
Empirical mapping onto traditional
indicators of reliability like 𝝉 correlation
$\tau = 0.9 \to E\rho^2 \approx 0.97$
$E\rho^2 = 0.95 \to \tau \approx 0.85$
Historical Reliability in TREC
On average, $E\rho^2 = 0.88 \to \tau \approx 0.7$
Some collections clearly unreliable
Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
50 queries not enough for stable
rankings, about 200 are needed
Implications
Fixing a minimum number of queries
across tracks is unrealistic
Not even across editions of the same task
Need to analyze on a case-by-case
basis, while building the collections
to be continued…
Future Work
Study assessor effect
Study document-collection effect
Better models to map G-Theory
onto data-based indicators
We fitted theoretically correct(-ish) models,
but in practice theory does not hold
Methods to reliably measure reliability
while building the collection
Source Code Online
Code for R stats software
G-study and D-study
Required number of queries
Map onto data-based indicators
Confidence intervals
…in two simple steps
G-Theory too sensitive to initial data
Questionable with small collections
Compute confidence intervals
Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$
50 queries not enough for stable rankings
Fixing a minimum number of
queries across tasks is unrealistic
Need to analyze on a case-by-case basis

 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

On the Measurement of Test Collection Reliability

  • 1. SIGIR 2013 Dublin, Ireland · July 30th Picture by Philip Milne On the Measurement of Test Collection Reliability @julian_urbano University Carlos III of Madrid Mónica Marrero University Carlos III of Madrid Diego Martín Technical University of Madrid
  • 3. Is System A More Effective than System B? [Figure: Δeffectiveness axis from -1 to 1, marking 0 and the observed difference 𝑑]
  • 4. Is System A More Effective than System B? Get a test collection and evaluate Measure the average difference 𝒅 and conclude which one is better
  • 5. Samples Test collections are samples from a larger, possibly infinite, population Documents, queries and assessors 𝒅 is only an estimate How reliable is our conclusion?
  • 6. Reliability vs. Cost Building reliable collections is easy… Just use more documents, more queries, more assessors …but it is prohibitively expensive Our best bet is to increase query set size
  • 7. Data-based approach 1. Randomly split query set 2. Compute indicators of reliability based on those two subsets 3. Extrapolate to larger query sets …with some variations Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09
  • 8. Data-based Reliability Indicators based on results with two collections Kendall $\tau$ correlation stability of the ranking of systems $\tau_{AP}$ correlation adds a top-heaviness component Absolute sensitivity minimum absolute 𝒅 s.t. swaps <5% Relative sensitivity minimum relative 𝒅 s.t. swaps <5%
  • 9. Data-based Reliability Indicators based on results with two collections Power ratio statistically significant results Minor conflict ratio statistically non-significant swap Major conflict ratio statistically significant swap RMSE differences in 𝒅
  • 10. Generalizability Theory Directly address variability of scores G-study Estimate variance components from previous, representative, data D-study Estimate reliability based on estimated variance components
  • 11. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ Estimated using Analysis of Variance From previous data, usually an existing test collection
  • 12. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ Estimated using Analysis of Variance From previous data, usually an existing test collection $\sigma_s^2$: system differences, our goal!
  • 13. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ Estimated using Analysis of Variance From previous data, usually an existing test collection $\sigma_s^2$: system differences, our goal! $\sigma_q^2$: query difficulty
  • 14. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ Estimated using Analysis of Variance From previous data, usually an existing test collection $\sigma_s^2$: system differences, our goal! $\sigma_q^2$: query difficulty $\sigma_{s:q}^2$: some systems better for some queries
  • 15. D-study Relative stability $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$ Absolute stability $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$ Easy to estimate how many queries we need for a certain stability level
  • 16. Generalizability Theory Proposed by Bodoff’07 Kanoulas & Aslam’09 derive optimal gain & discount in nDCG TREC Million Query Track ≈80 queries sufficient for stable rankings ≈130 queries for stable absolute scores
  • 17. In this Paper / Talk How sensitive is the D-study to the initial data used in the G-study? How to interpret G-theory in practice: why $E\rho^2 = 0.95$ and $\Phi = 0.95$? From the above two, review the reliability of >40 TREC test collections
  • 19. Data 43 TREC collections from TREC-3 to TREC 2011 12 tasks across 10 tracks Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog
  • 20. Experiment Vary number of queries in G-study from $n_q = 5$ to full set Use all runs available Run D-study Compute $E\rho^2$, $\Phi$ Compute $n_q'$ to reach 0.95 stability 200 random trials
  • 22. Variability due to queries We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on what 10 queries we use
  • 23. Experiment (II) The same, but vary number of systems from $n_s = 5$ to full set Use all queries available 200 random trials
  • 25. Variability due to systems We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on what 20 systems we use
  • 26. Results G-Theory is very sensitive to initial data Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1 Number of queries for $E\rho^2 = 0.95$ may vary by orders of magnitude Microblog2011 (all 184 systems and 30 queries): need 63 to 133 queries Medical2011 (all 34 queries and 40 systems): need 109 to 566 queries
  • 27. Use Confidence Intervals Bodoff’08 Confidence intervals in G-study But what about the D-study? Feldt’65 and Arteaga et al.’82 Work reasonably well even when assumptions are violated Brennan’01
  • 31. Example Required number of queries to reach the lower end of the interval
  • 32. Summary in TREC that is, the 43 collections we study here $E\rho^2$: mean=0.88 sd=0.1 95% conf. intervals are 0.1 long $\Phi$: mean=0.74 sd=0.2 95% conf. intervals are 0.19 long
  • 34. Experiment Split query set in 2 subsets from $n_q = 10$ to full set / 2 Use all runs available Run D-study Compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc. 50 random trials >28,000 datapoints
  • 35. Example: $E\rho^2 \rightarrow \tau$ *All mappings in the paper
  • 36. Example: $E\rho^2 \rightarrow \tau$ $E\rho^2 = 0.95 \rightarrow \tau \approx 0.85$ *All mappings in the paper
  • 37. Example: $E\rho^2 \rightarrow \tau$ $\tau = 0.9 \rightarrow E\rho^2 \approx 0.97$ *All mappings in the paper
  • 38. Example: $E\rho^2 \rightarrow \tau$ Million Query 2007 Million Query 2008 *All mappings in the paper
  • 39. Future Predictions Allows us to make more informed decisions within a collection What about a new collection? Fit a single model for each mapping with 90% and 95% prediction intervals Assess whether a larger collection is really worth the effort
  • 40. Example: $E\rho^2 \rightarrow \tau$ *All mappings in the paper
  • 41. Example: $E\rho^2 \rightarrow \tau$ current collection *All mappings in the paper
  • 42. Example: $E\rho^2 \rightarrow \tau$ current collection target *All mappings in the paper
  • 43. Example: $\Phi \rightarrow$ relative sensitivity
  • 44. Example: $\Phi \rightarrow$ relative sensitivity
  • 45. review of TREC collections
  • 46. Outline Estimate $E\rho^2$ and $\Phi$, with 95% confidence intervals, and full query set Map onto $\tau$, sensitivity, power, conflicts, etc. Results within task offer historical perspective since 1994
  • 47. Example: Ad Hoc 3-8 $E\rho^2 \in [0.86, 0.93] \rightarrow \tau \in [0.65, 0.81]$ minor conflicts $\in [0.6, 8.2]$% major conflicts $\in [0.02, 1.38]$% Queries to get $E\rho^2 = 0.95$: $[37, 233]$ Queries to get $\Phi = 0.95$: $[116, 999]$ 50 queries were used *All collections and mappings in the paper
  • 48. Example: Web Ad Hoc TREC-8 to TREC-2001: WT2g and WT10g $E\rho^2 \in [0.86, 0.93] \rightarrow \tau \in [0.65, 0.81]$ Queries to get $E\rho^2 = 0.95$: $[40, 220]$ TREC-2009 to TREC-2011: ClueWeb09 $E\rho^2 \in [0.8, 0.83] \rightarrow \tau \in [0.53, 0.59]$ Queries to get $E\rho^2 = 0.95$: $[107, 438]$ 50 queries were used
  • 50. Historical Trend Systems getting better for specific problems?
  • 53. Generalizability Theory Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability Very sensitive to the initial data used to estimate variance components Almost impossible to interpret in practical terms
  • 54. Sensitivity of G-Theory About 50 queries and 50 systems are needed for robust estimates Caution if building a new collection Can always use confidence intervals
  • 55. Interpretation of G-Theory Empirical mapping onto traditional indicators of reliability like $\tau$ correlation $\tau = 0.9 \rightarrow E\rho^2 \approx 0.97$ $E\rho^2 = 0.95 \rightarrow \tau \approx 0.85$
  • 56. Historical Reliability in TREC On average, $E\rho^2 = 0.88 \rightarrow \tau \approx 0.7$ Some collections clearly unreliable Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011 50 queries not enough for stable rankings, about 200 are needed
  • 57. Implications Fixing a minimum number of queries across tracks is unrealistic Not even across editions of the same task Need to analyze on a case-by-case basis, while building the collections
  • 59. Future Work Study assessor effect Study document-collection effect Better models to map G-Theory onto data-based indicators We fitted theoretically correct(-ish) models, but in practice theory does not hold Methods to reliably measure reliability while building the collection
  • 60. Source Code Online Code for R stats software G-study and D-study Required number of queries Map onto data-based indicators Confidence intervals ..in two simple steps
  • 61. G-Theory too sensitive to initial data Questionable with small collections Compute confidence intervals Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$ 50 queries not enough for stable rankings Fixing a minimum number of queries across tasks is unrealistic Need to analyze on a case-by-case basis
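
The G-study and D-study slides above can be illustrated with a minimal Python sketch. This is not the authors' R code. It assumes a fully crossed design with one effectiveness score per (system, query) pair, so the interaction component is confounded with residual error, and it truncates negative variance estimates at zero, which is a common convention rather than something stated on the slides. The function names `g_study`, `d_study` and `required_queries` and the toy score matrix are hypothetical.

```python
import math


def g_study(scores):
    """Estimate variance components from a systems x queries score matrix.

    Assumption: fully crossed design, one score per cell, so the
    system-query interaction cannot be separated from residual error.
    Returns (var_s, var_q, var_sq).
    """
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[j] for row in scores) / n_s for j in range(n_q)]

    # Mean squares of a two-way ANOVA without replication.
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ms_res = sum(
        (scores[i][j] - sys_means[i] - qry_means[j] + grand) ** 2
        for i in range(n_s)
        for j in range(n_q)
    ) / ((n_s - 1) * (n_q - 1))

    # Negative estimates are truncated at zero (assumed convention).
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    var_q = max((ms_q - ms_res) / n_s, 0.0)
    return var_s, var_q, ms_res


def d_study(var_s, var_q, var_sq, n_q):
    """Return (E rho^2, Phi) for a hypothetical query set of size n_q."""
    e_rho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return e_rho2, phi


def required_queries(var_s, error_var, target=0.95):
    """Smallest n_q such that var_s / (var_s + error_var / n_q) >= target.

    Pass var_sq as error_var for E rho^2, or var_q + var_sq for Phi.
    """
    return math.ceil(target * error_var / (var_s * (1.0 - target)))


if __name__ == "__main__":
    # Toy matrix with made-up scores: 3 systems (rows) x 4 queries (columns).
    toy = [
        [0.30, 0.55, 0.10, 0.42],
        [0.35, 0.60, 0.20, 0.40],
        [0.25, 0.50, 0.05, 0.45],
    ]
    var_s, var_q, var_sq = g_study(toy)
    print(d_study(var_s, var_q, var_sq, n_q=50))
    print(required_queries(var_s, var_sq), required_queries(var_s, var_q + var_sq))
```

Because the D-study only plugs point estimates into two ratios, any instability in the estimated components carries straight through to the required number of queries, which is the sensitivity the talk warns about.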