Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

@julian_urbano, University Carlos III of Madrid
@m_schedl, Johannes Kepler University

AdMIRe 2012 · Lyon, France · April 17th
Picture by ERdi43 (Wikipedia)

Problem
evaluation of IR systems is costly
Annotations
time consuming
expensive
boring
(Bad) Consequence
small and biased test collections
unlikely to change from year to year
Solution
apply low-cost evaluation methodologies

Timeline (1960-2011)
Text IR: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today)
Music IR: ISMIR (2000-today), MIREX (2005-today)

nearly 2 decades of Meta-Evaluation in Text IR
some good practices inherited from here
a lot of things have happened here!

Minimal Test Collections (MTC) [Carterette et al.]
estimate the ranking of systems
with very few judgments (high incompleteness)

Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year
to make thousands of judgments
Year  Teams  Systems  Queries  Results  Judgments  Overlap
2006      5        6       60    1,800      1,629      10%
2007      8       12      100    6,000      4,832      19%
2009      9       15      100    7,500      6,732      10%
2010      5        8      100    4,000      2,737      32%
2011     10       18      100    9,000      6,322      30%

evaluation
with
incomplete judgments

Basic Idea
treat similarity scores as random variables
can be estimated with uncertainty

gain of an arbitrary document: $G_i \sim$ multinomial

$$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$$

$$\mathcal{L}_{BROAD} = \{0, 1, 2\} \qquad \mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$$

whenever document i is judged:
$$E[G_i] = l \qquad Var[G_i] = 0$$
*all variance formulas in the paper
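
A minimal sketch of these two cases. The uniform prior over the BROAD scale and the helper name `gain_moments` are illustrative assumptions, not taken from the slides.

```python
from fractions import Fraction

def gain_moments(probs):
    """Return (E[G], Var[G]) for a gain distribution given as {level: probability}."""
    mean = sum(p * level for level, p in probs.items())
    var = sum(p * (level - mean) ** 2 for level, p in probs.items())
    return mean, var

BROAD = (0, 1, 2)

# Unjudged document: assumed uniform prior over the BROAD levels.
uniform = {level: Fraction(1, len(BROAD)) for level in BROAD}
unjudged = gain_moments(uniform)

# Judged document: all probability mass sits on the observed level (here 2).
judged = gain_moments({2: Fraction(1)})
```

Under the uniform assumption the unjudged document has expectation 1 and variance 2/3; once judged, the expectation is the observed level and the variance drops to 0.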

AG@k is also treated as a random variable

$$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$$

the sum iterates all documents (in practice, only the top k retrieved)
$A_i$ is the ranking at which document i was retrieved

Ultimate Goal
compute a good estimate with the least effort

Comparing Two Systems
what really matters is the sign of the difference
$$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$$
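
The expected difference for a single query could be computed along these lines. This is a sketch: the function name, the document identifiers and the expected gains are made up for illustration.

```python
def expected_delta_ag(ranking_a, ranking_b, expected_gain, k):
    """E[dAG@k] for one query, given each system's ranked list of document ids."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    total = 0.0
    for doc in top_a | top_b:
        # I(A_i <= k) - I(B_i <= k): +1 if only A retrieves doc, -1 if only B, 0 if both.
        indicator = (doc in top_a) - (doc in top_b)
        total += expected_gain[doc] * indicator
    return total / k

# Hypothetical data: d1 and d4 are judged, d2 is unjudged (expected gain 1 under a uniform BROAD prior).
expected_gain = {"d1": 2.0, "d2": 1.0, "d3": 1.0, "d4": 0.0}
delta = expected_delta_ag(["d1", "d2", "d3"], ["d2", "d4", "d1"], expected_gain, k=2)
```

A positive value means system A is expected to score higher than B on this query.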

Evaluating Several Queries
$$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$$

the sum iterates all queries

The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then
judge another document, else stop judging

Distribution of AG@k
what are the possible assignments of similarity?

$$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$$

the sum iterates all possible permutations of k similarity assignments
ultimately depends on the distribution of $G_i$

Plain English
the ratio of similarity assignments s.t. AG@k=z
For Complex Measures or Large Similarity Scales
run Monte Carlo simulation
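
A Monte Carlo estimate of the distribution of AG@k might look like the following sketch. The input format (one `(levels, weights)` pair per top-k document), the sample size and the example data are assumptions made for illustration.

```python
import random
from collections import Counter

def simulate_ag(gain_dists, n_samples=10000, seed=0):
    """Estimate P(AG@k = z) by sampling a similarity assignment for the top k documents.

    gain_dists holds one (levels, weights) pair per document in the top k.
    """
    rng = random.Random(seed)
    k = len(gain_dists)
    counts = Counter()
    for _ in range(n_samples):
        total = 0
        for levels, weights in gain_dists:
            total += rng.choices(levels, weights=weights)[0]
        counts[total / k] += 1
    return {z: c / n_samples for z, c in sorted(counts.items())}

# Hypothetical top 5 on the BROAD scale: two judged documents (gains 2 and 0)
# and three unjudged documents with an assumed uniform distribution.
uniform = ([0, 1, 2], [1, 1, 1])
example = [([2], [1]), ([0], [1]), uniform, uniform, uniform]
dist = simulate_ag(example)
mean_ag = sum(z * p for z, p in dist.items())
```

Each key of `dist` is a value z of AG@5 and each value is the fraction of sampled assignments that produced it, which is the "ratio of similarity assignments" described above.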

Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries

1. take a sample of k documents. Mean = X1 (each sample is a query; its mean is AG@k for a single query)
2. take a sample of k documents. Mean = X2
...
Q. take a sample of k documents. Mean = XQ
Mean of sample means = X (mean AG@k over all queries)

Central Limit Theorem
as Q→∞, X approximates a normal distribution
regardless of the distribution of G

AG@k is Normally Distributed
use the normal cumulative distribution function Φ
$$P(\Delta AG@k \le 0) = \Phi\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$$
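
The normal approximation and the stopping rule can be sketched with the Python standard library. The function names are hypothetical, and defining confidence as the larger of P(ΔAG@k ≤ 0) and its complement is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def prob_delta_not_positive(mean_delta, var_delta):
    """P(dAG@k <= 0) under the normal approximation."""
    if var_delta == 0:
        # Everything is judged: the sign is known with certainty.
        return 1.0 if mean_delta <= 0 else 0.0
    return NormalDist().cdf(-mean_delta / sqrt(var_delta))

def confidence(mean_delta, var_delta):
    """Assumed definition: probability of the more likely sign of the difference."""
    p = prob_delta_not_positive(mean_delta, var_delta)
    return max(p, 1 - p)

def keep_judging(mean_delta, var_delta, alpha=0.05):
    """Stopping rule: judge another document while alpha < P(dAG@k <= 0) < 1 - alpha."""
    return alpha < prob_delta_not_positive(mean_delta, var_delta) < 1 - alpha
```

For example, `keep_judging(0.01, 0.01)` is true because the estimated difference is small relative to its uncertainty, whereas `keep_judging(0.5, 0.01)` is false.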
[Figure: density of AG@5 on the BROAD scale (0 to 2) and on the FINE scale (0 to 100)]

Confidence as a Function of # Judgments
[Figure: confidence in the ranking of systems (50% to 100%) against percent of judgments (0% to 100%)]
we can stop judging
or keep judging to be really confident
or waste our time

what documents should we judge?
those that maximize the confidence

The Trick
documents retrieved by both systems are useless
there is no need to judge them
whatever Gi is, it is added and then subtracted
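
To see why, take a document i that both systems retrieve in their top k. Its term in the sum for ΔAG@k vanishes regardless of the value of Gi:

$$G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) = G_i \cdot (1 - 1) = 0$$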

Comparing Several Systems
compute a weight wi for each query-document
judge the document with largest effect

wi in the Original MTC
wi = largest weight across system pairs
reduces to # of system pairs affected by query-doc i

wi Dependent on Confidence
if we are highly confident about a pair of systems
we do not need to judge another of their documents

even if it has the largest weight

$$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$$

weight inversely proportional to confidence
the sum iterates system pairs with low confidence
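
A sketch of the confidence-dependent weights. It assumes that the excluded pairs are those whose confidence already exceeds 1 − α; the slides only say that the sum covers pairs with low confidence. The function name and the example data are hypothetical.

```python
def document_weights(top_k, confidences, unjudged, alpha=0.05):
    """Weight of each unjudged query-document.

    top_k:       {system: set of query-document ids in its top k}
    confidences: {(A, B): current confidence in the sign of the difference}
    """
    weights = {doc: 0.0 for doc in unjudged}
    for (a, b), conf in confidences.items():
        if conf > 1 - alpha:
            continue  # assumed meaning of the excluded pairs: already confident enough
        for doc in unjudged:
            indicator = (doc in top_k[a]) - (doc in top_k[b])
            weights[doc] += (1 - conf) * indicator ** 2
    return weights

# Hypothetical example with three systems and three unjudged query-documents.
top_k = {"A": {"d1", "d2"}, "B": {"d2", "d3"}, "C": {"d1", "d3"}}
confidences = {("A", "B"): 0.6, ("A", "C"): 0.99, ("B", "C"): 0.8}
weights = document_weights(top_k, confidences, ["d1", "d2", "d3"])
best = max(weights, key=weights.get)
```

Here `d1` gets the largest weight because it separates both of the pairs that are still uncertain.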

better results than traditional weights

MTC for ΔAG@k

while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)
    $i^* \leftarrow \arg\max_i w_i$    (select the best document from all unjudged query-documents)
    judge query-document $i^*$ (obtain true $gain_{i^*}$)
    $E[G_{i^*}] \leftarrow gain_{i^*}$
    $Var[G_{i^*}] \leftarrow 0$
    update (increase confidence)
end while
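
The loop can be put together in a small, self-contained sketch. Several things here are assumptions rather than the authors' implementation: a uniform prior over the BROAD scale, independent gains (so variances add), confidence defined as the probability of the more likely sign, and an `oracle` dictionary standing in for the human assessor.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

PRIOR_MEAN, PRIOR_VAR = 1.0, 2.0 / 3.0  # assumed uniform prior over BROAD = {0, 1, 2}

def pair_confidence(a, b, runs, k, judged):
    """Confidence in the sign of the mean dAG@k between systems a and b."""
    n_queries = len(runs[a])
    mean = var = 0.0
    for q in range(n_queries):
        top_a, top_b = set(runs[a][q][:k]), set(runs[b][q][:k])
        for doc in top_a ^ top_b:  # documents retrieved by both cancel out
            sign = 1 if doc in top_a else -1
            if (q, doc) in judged:
                mean += sign * judged[(q, doc)]
            else:
                mean += sign * PRIOR_MEAN
                var += PRIOR_VAR  # assumes independent gains
    mean /= k * n_queries
    var /= (k * n_queries) ** 2
    if var == 0:
        return 1.0  # every relevant document is judged
    p = NormalDist().cdf(-mean / sqrt(var))
    return max(p, 1 - p)

def mtc(runs, k, oracle, alpha=0.05):
    """Judge one query-document at a time until average confidence exceeds 1 - alpha."""
    pairs = list(combinations(sorted(runs), 2))
    n_queries = len(next(iter(runs.values())))
    judged = {}
    while True:
        conf = {p: pair_confidence(*p, runs, k, judged) for p in pairs}
        if sum(conf.values()) / len(pairs) > 1 - alpha:
            break
        weights = {}
        for (a, b), c in conf.items():
            if c > 1 - alpha:
                continue  # skip pairs we are already confident about
            for q in range(n_queries):
                top_a, top_b = set(runs[a][q][:k]), set(runs[b][q][:k])
                for doc in top_a ^ top_b:
                    if (q, doc) not in judged:
                        weights[(q, doc)] = weights.get((q, doc), 0.0) + (1 - c)
        if not weights:
            break  # nothing left to judge
        best = max(weights, key=weights.get)
        judged[best] = oracle[best]  # the oracle plays the human assessor
    return judged, conf

# Hypothetical runs: one ranked list per query for each system, plus true gains.
runs = {"A": [["d1", "d2"], ["d5", "d6"]],
        "B": [["d2", "d3"], ["d6", "d7"]],
        "C": [["d3", "d4"], ["d7", "d8"]]}
oracle = {(0, "d1"): 2, (0, "d2"): 2, (0, "d3"): 0, (0, "d4"): 0,
          (1, "d5"): 2, (1, "d6"): 1, (1, "d7"): 0, (1, "d8"): 0}
judged, conf = mtc(runs, k=2, oracle=oracle)
```

The returned `judged` dictionary holds only the query-documents the loop actually asked about, and `conf` holds the final confidence for each system pair.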

Why MIREX 2011
largest edition so far
18 systems (153 pairwise comparisons)
100 queries and 6,322 judgments

Distribution of Gi
let us work with a uniform distribution for now

Confidence as Judgments are Made

correct bins: estimated sign is correct or
not significant anyway

high confidence
with considerably
less effort

Accuracy as Judgments are Made
estimated bins always
better than expected


estimated signs
highly correlated
with confidence


rankings with tau = 0.9 traditionally considered
equivalent (same as 95% accuracy)

high confidence
and
high accuracy
with considerably
less effort

Statistical Significance
MTC allows us to accurately estimate the ranking
but for the current set of queries
can we generalize to a general set of queries?

Not Trivial
we have the variance of the estimates
but not the sample variance

Work with Upper and Lower Bounds of ΔAG@k
Upper bound: best case for A
Lower bound: best case for B

$$\Delta AG@k = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

first term: known judgments
second term: unknown judgments retrieved by A, assigned the best similarity score $l^{+}$
third term: unknown judgments retrieved by B but not by A, assigned the worst similarity score $l^{-}$

*same for the lower bound



3 Rules
1. Assume best case for A (upper bound)
if A <<< B then conclude A <<< B

2. Assume best case for B (lower bound)
if B <<< A then conclude B <<< A

3. If in the best case for A we do not have A >>> B
and in the best case for B we do not have B >>> A
then conclude they are not significantly different
Problem
upper and lower bounds are very unrealistic

Incorporate a Heuristic
4. If the estimated difference is larger than t
naively conclude significance

Choose t Based on Power Analysis
t = effect-size detectable by a t-test with
• sample variance σ²=0.0615 (from previous MIREX editions)
• sample size n=100 (from previous MIREX editions)
• Type I Error rate α=0.05 (typical value)
• Type II Error rate β=0.15 (typical value)

t ≈ 0.067
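
A rough check of this threshold. The sketch assumes a one-sided test and a normal approximation to the t distribution; the slides state neither, so the function below is only an illustration.

```python
from math import sqrt
from statistics import NormalDist

def detectable_effect(variance, n, alpha=0.05, beta=0.15):
    """Smallest mean difference detectable with power 1 - beta (normal approximation)."""
    z = NormalDist()
    # One-sided critical value plus the quantile for the desired power.
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta)) * sqrt(variance / n)

t = detectable_effect(variance=0.0615, n=100)
```

Under these assumptions the result is roughly 0.066, close to the reported t ≈ 0.067; using t quantiles instead of normal quantiles would move it slightly upwards.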

Accuracy of the Significance Estimates

pretty good
around 95% confidence

rule 4 (heuristic) ends up
overestimating significance


rules 1 to 3 begin to apply
and correct overestimations



closer to
expected

never under 90%

significance
can be estimated
fairly well too

Introduce MTC to the MIR folks

Work out the Math
for MTC with AG@k

See How Well it would have Done
in AMS 2011
quite well actually!

Learn the true Distribution of Similarity Judgments
it's clearly not uniform
would give more accurate estimates with less effort
use previous AMS data or fit a model as we judge

Significance Testing with Incomplete Judgments
best-case scenarios are very unrealistic

Study Low-Cost Methodologies for other MIR Tasks

MTC Greatly Reduces the Effort for AMS (and SMS)
have MIREX volunteers incrementally create
brand new test collections for other tasks

Better Yet
study low-cost methodologies for the other tasks
Not Only for MIREX
private collections for in-house evaluations
no possibility of gathering large pools of annotators
low-cost becomes paramount

the MIR community
needs a paradigm shift
from a priori to a posteriori
evaluation methods
to reduce cost
and gain reliability
