Evaluation in
Audio Music Similarity

PhD dissertation
by
Julián Urbano
Picture by Javier García

Leganés, October 3rd 2013
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
2
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
3
Information Retrieval
• Automatic representation, storage and search of
unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music

• A user has an information need and uses an IR
system that retrieves the relevant or significant
information from a collection of documents

4
Information Retrieval Evaluation
• IR systems are based on models to estimate
relevance, implementing different techniques
• How good is my system? What system is better?
• Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements

• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
5
History of IR Evaluation research
[Timeline, 1960–2010, of evaluation efforts and campaigns: Cranfield 2, MEDLARS, SMART; SIGIR; TREC; INEX, CLEF, NTCIR; and in Music IR: ISMIR, MIREX, MusiCLEF, MSD Challenge.]
6
Audio Music Similarity
• A song, as an audio signal, is the input to the system
• Retrieve songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• (Most?) important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection

• Annual evaluation in MIREX
7
The two questions
• How good is my system?
– What does good mean?
– What is good enough?

• Is system A better than system B?
– What does better mean?
– How much better?

• Efficiency? Effectiveness? Ease?
10
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure
rate, frustration, ease to learn, ease to use …
– Their distributions describe user experience, fully

• User satisfaction is the bigger picture
– How likely is it that an arbitrary user, with an arbitrary
query (and with an arbitrary document collection) will
be satisfied by the system?

• This is the ultimate goal: the good, the better
11
The Cranfield Paradigm
• Estimate user-measure distributions
– Sample documents, queries and users
– Monitor user experience and behavior
– Representativeness, cost, ethics, privacy …

• Fix samples to allow reproducibility
– But cannot fix users and their behavior
– Remove users, but include a static user component,
fixed across experiments: ground truth judgments
– Still need to include the dynamics of the process: user
models behind effectiveness measures and scales
12
Test collections
• Our goal is the users:
user-measure = f(system)

• Cranfield measures systems:
system-effectiveness = f(system, measure, scale)

• Estimators of the distributions of user-measures
– Only source of variability is the systems themselves
– Reproducibility becomes easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
13
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– How well are effectiveness and satisfaction correlated?
– How good is good and how better is better?

• Reliability: how repeatable are the results?
– How large do samples have to be?
– What statistical methods should be used?

• Efficiency: how inexpensive is it to get valid and
reliable results?
– Can we estimate results with fewer judgments?
14
Goal of this dissertation

Study and improve
the validity, reliability and efficiency
of the methods used to evaluate AMS systems

Additionally, improve meta-evaluation methods
15
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions

• Reliability
• Efficiency
• Conclusions and Future Work

17
Assumption of Cranfield
• Systems with better effectiveness are perceived
by users as more useful, more satisfactory
• But different effectiveness measures and
relevance scales produce different distributions
– Which one is better to predict user satisfaction?

• Map system effectiveness onto user satisfaction,
experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will
find the results satisfactory?
– What if DCG@20 = 0.46?
18
Measures and scales
[Table: the 17 effectiveness measures studied (P@5, AP@5, RR@5, CGl@5, CGe@5, DCGl@5, DCGe@5, EDCGl@5, EDCGe@5, Ql@5, Qe@5, RBPl@5, RBPe@5, ERRl@5, ERRe@5, GAP@5, ADR@5), crossed with the relevance scales: the original MIREX Broad and Fine scales, artificial graded scales with nℒ=3, 4 and 5 levels, and artificial binary scales with thresholds ℓmin=20, 40, 60 and 80. Under the binary scales several measures collapse into others: CGl@5 and CGe@5 into P@5, DCGe@5 into DCGl@5, EDCGe@5 into EDCGl@5, Ql@5, Qe@5 and GAP@5 into AP@5, RBPe@5 into RBPl@5, and ERRe@5 into ERRl@5.]
19
Experimental design

20
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation

• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: neither system is satisfactory

21
Data
• Queries, documents and judgments from MIREX
• 4115 unique and artificial examples
• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions

• 113 unique subjects
22
Single system: how good is it?
• For 2045 examples (49%) users could not decide
which system was better

What do we expect?

23
Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction

24
Single system: how good is it?
• Users don’t pay attention to ranking?

25
Single system: how good is it?
• Exponential gain underestimates satisfaction

26
Single system: how good is it?
• Document utility independent of others

27
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one
system over the other one

What do we expect?

28
Two systems: which one is better?
• Large differences needed for users to note them

29
Two systems: which one is better?
• More relevance levels are better to discriminate

30
Two systems: which one is better?
• Cascade and navigational user models are not
appropriate

31
Two systems: which one is better?
• Users do prefer the (supposedly) worse system

32
Summary
• Effectiveness and satisfaction are clearly correlated
– But there is a bias of 20% because of user disagreement
– Room for improvement through personalization

• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ≈0.4 for users to agree with effectiveness
• Historically, only 20% of times in MIREX

• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than navigational and cascade
– The more relevance levels, the better
33
Measures and scales
[The measures × scales table from slide 19, revisited in light of the summary above.]
34
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions

• Reliability
• Efficiency
• Conclusions and Future Work

36
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions

• Reliability
• Efficiency
• Conclusions and Future Work

37
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7

• Easily for n users and a single query
– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21

• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials
38
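The n-user case above is a straightforward binomial computation, assuming users act independently: with P(Sat) = 0.7 for one arbitrary user, the probability that exactly 10 of 15 users are satisfied is C(15,10)·0.7¹⁰·0.3⁵ ≈ 0.21, matching the slide. A minimal sketch:

```python
from math import comb

def p_sat_n(n: int, k: int, p_sat: float) -> float:
    """Probability that exactly k of n independent users are satisfied,
    when each arbitrary user is satisfied with probability p_sat."""
    return comb(n, k) * p_sat**k * (1 - p_sat)**(n - k)

# Slide example: P(Sat15 = 10 | Ql@5 = 0.61), where P(Sat) = 0.7
print(round(p_sat_n(15, 10, 0.7), 2))  # → 0.21
```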
Expected probability of satisfaction
• Now we can compute point and interval estimates
of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness

39
System success
• If P(Sat) ≥ threshold, the system is successful
– Setting the threshold used to be rather arbitrary
– Now it is meaningful, in terms of user satisfaction

• Intuitively, we want the majority of users to find the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − F_{P(Sat)}(0.5)

• Improving queries for which we are bad is more worthwhile than further improving those for which we are already good
40
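The success criterion can be read directly off the empirical distribution of per-query satisfaction: P(Succ) is simply the fraction of queries whose P(Sat) exceeds 0.5. A sketch, with hypothetical per-query values:

```python
def p_success(p_sat_per_query, threshold=0.5):
    """P(Succ) = P(P(Sat) > threshold) = 1 - F_{P(Sat)}(threshold),
    estimated with the empirical cdf over a sample of queries."""
    n = len(p_sat_per_query)
    return sum(1 for p in p_sat_per_query if p > threshold) / n

# Hypothetical P(Sat) estimates for five queries
print(p_success([0.7, 0.4, 0.9, 0.55, 0.3]))  # → 0.6
```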
Distribution of P(Sat)
• Need to estimate the cumulative distribution
function of user satisfaction: FP(Sat)
• Not described by a typical distribution family
– ecdf converges, but what is a good sample size?
– Compare with Normal, Truncated Normal and Beta

• Compared on >2M random samples from MIREX
collections, at different query set sizes
• Goodness of fit measured with the Cramér-von Mises ω² statistic
41
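The goodness-of-fit comparison can be sketched by computing the Cramér-von Mises ω² statistic against a candidate cdf; here a Normal is fitted by moments to hypothetical per-query P(Sat) values (the stdlib NormalDist stands in for the Truncated Normal and Beta fits of the study):

```python
import statistics

def cramer_von_mises(sample, cdf):
    """Cramér-von Mises ω² statistic: the discrepancy between a sample
    and a hypothesized cdf; smaller values indicate a better fit."""
    xs = sorted(sample)
    n = len(xs)
    return 1 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
        for i, x in enumerate(xs, start=1)
    )

# Hypothetical per-query P(Sat) values; Normal fitted by moments
p_sat = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.59, 0.68]
fit = statistics.NormalDist(statistics.mean(p_sat), statistics.stdev(p_sat))
print(cramer_von_mises(p_sat, fit.cdf))
```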
Estimated distribution of P(Sat)
• More than ≈25 queries in the collection
– ecdf approximates better

• Less than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales

• Beta is always the best with the Fine scale
• The more levels in the relevance scale, the better
• Linear gain better than exponential gain
42
Intuition fails, again
• Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07

43
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries

44
Measures and scales
[Table: the measures and scales considered (P@5, AP@5, CGl@5, CGe@5, DCGl@5, DCGe@5, Ql@5, Qe@5, RBPl@5, RBPe@5, GAP@5), crossed with the original Broad and Fine scales, artificial graded scales with nℒ=4 and 5 levels, and artificial binary scales with ℓmin=20 and 40. Under the binary scales CGl@5 and CGe@5 collapse into P@5, DCGe@5 into DCGl@5, Ql@5, Qe@5 and GAP@5 into AP@5, and RBPe@5 into RBPl@5.]
45
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size

• Efficiency
• Conclusions and Future Work

48
Random error
• Test collections are just samples from larger,
possibly infinite, populations
• If we conclude system A is better than B, how
confident can we be?
– Δλ𝒬 is just an estimate of the population mean μΔλ

• Usually employ some statistical significance test
for differences in location
• If it is statistically significant, we have confidence
that the true difference is at least that large
49
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0 : μΔλ = 0
– H1 : μΔλ ≠ 0

• Run test, obtain p-value = P(Δλ ≥ Δλ𝒬 | H0)
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence

• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
50
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test

• Based on resampling
– Bootstrap test, permutation/randomization test

• They make certain assumptions about
distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that
assumptions are violated?
51
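As a concrete instance of the resampling family, a paired randomization (permutation) test on per-query score differences can be sketched as follows; the score lists are hypothetical:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=42):
    """Two-sided paired randomization test: under H0 the sign of each
    per-query difference is arbitrary, so flip signs at random and count
    how often the mean difference is at least as extreme as observed."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(mean) >= observed - 1e-12:
            hits += 1
    return hits / trials  # Monte Carlo p-value

# Hypothetical per-query effectiveness of systems A and B
a = [0.81, 0.70, 0.92, 0.75, 0.84, 0.79, 0.88, 0.73]
b = [0.52, 0.41, 0.63, 0.50, 0.55, 0.47, 0.60, 0.49]
print(randomization_test(a, b))  # small p-value: significant difference
```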
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates

• Safety
– Minimize Type I error rates
– Usually decreases power

• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
52
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections

• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels

• All systems and queries from MIREX 2007-2011
– >15M p-values
53
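The split-half design above can be sketched generically: split the query set at random, run the same test on both halves, and check whether the two significance decisions agree. Any p-value function can be plugged in; the one used in the example is a hypothetical stand-in:

```python
import random

def split_half_agreement(scores_a, scores_b, p_value, alpha=0.05, seed=7):
    """Randomly split the query set in two, simulating two different test
    collections, and report the significance decision on each half."""
    idx = list(range(len(scores_a)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    decisions = []
    for part in (idx[:half], idx[half:]):
        a = [scores_a[i] for i in part]
        b = [scores_b[i] for i in part]
        decisions.append(p_value(a, b) <= alpha)
    return decisions  # [True, True] or [False, False] means agreement

# Hypothetical scores and a stand-in p-value function
a = [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.9, 0.7]
b = [0.5, 0.4, 0.6, 0.50, 0.55, 0.45, 0.6, 0.5]
always_sig = lambda x, y: 0.01
print(split_half_agreement(a, b, always_sig))  # → [True, True]
```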
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the
most successful, depending on α level

54
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but
bootstrap is for usual levels

55
Optimal measure and scale
• Power: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Success: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Conflicts: very similar across measures

• Power: Fine, Broad and binary
• Success: Fine, Broad and binary
• Conflicts: very similar across scales

56
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size

• Efficiency
• Conclusions and Future Work

57
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?

• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes
and experimental designs

59
G-study: variance components
• Fully crossed experimental design: s × q

λq,A = λ + λA + λq + εq,A

σ² = σ²s + σ²q + σ²sq

• Estimated with Analysis of Variance
• If σ²s is small or σ²q is large, we need more queries
60
D-study: variance ratios
• Stability of absolute scores

Φ(nq) = σ²s / (σ²s + (σ²q + σ²e) / nq)

• Stability of relative scores

Eρ²(nq) = σ²s / (σ²s + σ²e / nq)

• We can easily estimate how many queries are needed to reach some level of stability (reliability)
61
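As the slides note, the variance components come from Analysis of Variance: for the fully crossed s × q design they follow from the expected mean squares of a two-way ANOVA without replication, and the D-study ratios above follow directly. A sketch on a hypothetical systems × queries score matrix:

```python
def variance_components(scores):
    """G-study for a fully crossed s x q design: estimate the system,
    query and residual variance components from expected mean squares
    (negative estimates are clipped to zero, as is conventional)."""
    ns, nq = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (ns * nq)
    rows = [sum(r) / nq for r in scores]
    cols = [sum(scores[i][j] for i in range(ns)) / ns for j in range(nq)]
    ms_s = nq * sum((r - grand) ** 2 for r in rows) / (ns - 1)
    ms_q = ns * sum((c - grand) ** 2 for c in cols) / (nq - 1)
    ms_e = sum((scores[i][j] - rows[i] - cols[j] + grand) ** 2
               for i in range(ns) for j in range(nq)) / ((ns - 1) * (nq - 1))
    var_s = max((ms_s - ms_e) / nq, 0.0)
    var_q = max((ms_q - ms_e) / ns, 0.0)
    return var_s, var_q, ms_e  # ms_e estimates the residual variance

def phi(var_s, var_q, var_e, nq):
    """D-study: stability of absolute scores with nq queries."""
    return var_s / (var_s + (var_q + var_e) / nq)

def e_rho2(var_s, var_q, var_e, nq):
    """D-study: stability of relative scores with nq queries."""
    return var_s / (var_s + var_e / nq)

# Hypothetical 3 systems x 4 queries effectiveness matrix
scores = [[0.52, 0.61, 0.48, 0.70],
          [0.71, 0.83, 0.65, 0.88],
          [0.40, 0.49, 0.38, 0.55]]
vs, vq, ve = variance_components(scores)
print(phi(vs, vq, ve, 100), e_rho2(vs, vq, ve, 100))
```

In a D-study one would increase nq in these ratios until the target stability (e.g. 0.95) is reached.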
Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
62
Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
63
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable

• Tested in MIREX 2012
– Apparently in 2013 too

64
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ2 = 0.93 to Eρ2 = 0.95

65
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability

66
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied

• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × h: q

67
Effect of assessor set size
• Broad scale: σ²s ≈ σ²h:q
• Fine scale: σ²s ≫ σ²h:q
• Always better to spend resources on queries

68
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user models?

• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
69
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size

• Efficiency
• Conclusions and Future Work

70
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
71
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes

• Model relevance probabilistically
• Relevance judgments are random variables over
the space of possible assignments of relevance
• Effectiveness measures are also probabilistic
72
Probabilistic evaluation
• Accuracy increases as we make judgments
– E[Rd] ← rd

• Reliability (confidence) increases too
– Var[Rd] → 0

• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop

• Judge as few documents as possible
73
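For a measure like CG@5, which is a sum of per-document gains, the probabilistic version is simple under the assumption that documents are independent: the expectation and variance of the score follow from each document's relevance distribution, and a judgment collapses that document's distribution to a single value. A sketch:

```python
def probabilistic_cg(doc_dists):
    """Expected value and variance of CG@k when the relevance R_d of each
    retrieved document is a random variable over relevance levels.
    Each distribution maps relevance level -> probability; a judged
    document has probability 1 on its known level (variance 0)."""
    mean = var = 0.0
    for dist in doc_dists:
        e1 = sum(level * p for level, p in dist.items())
        e2 = sum(level ** 2 * p for level, p in dist.items())
        mean += e1
        var += e2 - e1 ** 2  # documents assumed independent
    return mean, var

# Two unjudged documents and one already judged at level 2
dists = [{0: 0.2, 1: 0.5, 2: 0.3},
         {0: 0.6, 1: 0.3, 2: 0.1},
         {2: 1.0}]
m, v = probabilistic_cg(dists)
print(round(m, 2), round(v, 2))  # → 3.6 0.94
```

Judging the most uncertain documents first shrinks the variance fastest, which is the intuition behind the stopping rule above.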
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: P(Rd = ℓ | θd)
– For each document separately
– Ordinal Logistic Regression

• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
74
Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated to similarity

– Decent fit, R2 ≈ 0.35

• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is extremely correlated to similarity

– Excellent fit, R2 ≈ 0.91
75
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine

• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine

• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine

• Negligible under the current MIREX setting
76
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
77
Probabilistic effectiveness measures
• Effectiveness scores are also random variables
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence

• For measures based on ideal ranking (nDCGl@k
and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series

79
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%

80
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)

DCGl@5         Broad                    Fine
Confidence     In bin       Accuracy    In bin       Accuracy
[0.5, 0.6)     23 (6.5%)    0.826       22 (6.2%)    0.636
[0.6, 0.7)     14 (4%)      0.786       16 (4.5%)    0.812
[0.7, 0.8)     14 (4%)      0.571       11 (3.1%)    0.364
[0.8, 0.9)     22 (6.2%)    0.864       21 (6%)      0.762
[0.9, 0.95)    23 (6.5%)    0.87        19 (5.4%)    0.895
[0.95, 0.99)   24 (6.8%)    0.917       27 (7.7%)    0.926
[0.99, 1)      232 (65.9%)  0.996       236 (67%)    0.996
E[Accuracy]                 0.938                    0.921
81
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of differences and rank systems

• What documents should we judge?
– Those that are the most informative
– Measure-dependent
82
Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5

• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931

83
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of absolute effectiveness scores

• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
84
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)

• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance

85
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments

86
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct

• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy

• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05

87
Outline
• Introduction
• Validity
• Reliability
• Efficiency
  – Learning Relevance Distributions
  – Low-cost Evaluation
• Conclusions and Future Work

88
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
  – Conclusions
  – Future Work

89
Validity
• Cranfield tells us about systems, not about users
• Provide an empirical mapping from system effectiveness onto user satisfaction
• Room for personalization quantified at 20%
• Users need large differences to notice them
• Consider full distributions, not just averages
• Conclusions based on effectiveness tend to contradict conclusions based on user satisfaction
90
Reliability
• Different significance tests for different needs
– Bootstrap test is the most powerful
– Wilcoxon and t-test are the safest
– Wilcoxon and bootstrap test are the most exact

• Practical interpretation of p-values
• MIREX collections generally larger than needed
• Spend resources on queries, not on assessors
• User models with deeper cutoffs are feasible
• Employ G-Theory while building collections
91
Efficiency
• Probabilistic evaluation reduces cost dramatically
• Two models to estimate document relevance
• System rankings 92% accurate without judgments
• 2% of judgments to reach 95% confidence
• 25% of judgments to reduce error to 0.05

92
Measures and scales
• The best measure and scale depend on the situation
• But generally speaking
– CGl@5, DCGl@5 and RBPl@5
– Fine scale
– Model distributions as Beta

93
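Modeling per-query score distributions as Beta can be done by method of moments, matching the sample mean and variance. A sketch with hypothetical data; the thesis' actual fitting procedure may differ:

```python
def beta_method_of_moments(xs):
    """Fit Beta(alpha, beta) to scores in (0, 1) by method of moments:
    m(1-m)/v - 1 gives the implied concentration alpha + beta, which is
    then split proportionally to the mean. Requires v < m(1-m)."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    k = m * (1 - m) / v - 1  # total concentration alpha + beta
    return m * k, (1 - m) * k
```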
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
  – Conclusions
  – Future Work

94
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better
capture document utility
• Explicitly define judging guidelines

• Similar mapping for Text IR

96
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while
building test collections

97
Efficiency
• Better models to estimate document relevance
• Correct variance when only a few relevance judgments are available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights

98
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval

99
Evaluation in
Audio Music Similarity

PhD dissertation
by
Julián Urbano
Picture by Javier García

Leganés, October 3rd 2013

Mais conteúdo relacionado

Destaque

A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksJulián Urbano
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Julián Urbano
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Julián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationRichard Diamond
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondRichard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondRichard Diamond
 

Destaque (14)

A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Laurie Loomis Malcomson
Laurie Loomis MalcomsonLaurie Loomis Malcomson
Laurie Loomis Malcomson
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard Diamond
 

Semelhante a Evaluation in Audio Music Similarity

Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Metrics in usability testing and user experiences
Metrics in usability testing and user experiencesMetrics in usability testing and user experiences
Metrics in usability testing and user experiencesHim Chitchat
 
Information system audit
Information system audit Information system audit
Information system audit Jayant Dalvi
 
Preference Elicitation Interface
Preference Elicitation InterfacePreference Elicitation Interface
Preference Elicitation Interface晓愚 孟
 
e3-chap-09.ppt
e3-chap-09.ppte3-chap-09.ppt
e3-chap-09.pptKingSh2
 
Unit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptxUnit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptxssuser50f868
 
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...Alejandro Bellogin
 
Information Experience Lab, IE Lab at SISLT
Information Experience Lab, IE Lab at SISLTInformation Experience Lab, IE Lab at SISLT
Information Experience Lab, IE Lab at SISLTIsa Jahnke
 
Paper Prototype Evaluation
Paper Prototype EvaluationPaper Prototype Evaluation
Paper Prototype EvaluationDavid Lamas
 
Introduction to Usability Testing for Survey Research
Introduction to Usability Testing for Survey ResearchIntroduction to Usability Testing for Survey Research
Introduction to Usability Testing for Survey ResearchCaroline Jarrett
 
Recommending Scientific Papers: Investigating the User Curriculum
Recommending Scientific Papers: Investigating the User CurriculumRecommending Scientific Papers: Investigating the User Curriculum
Recommending Scientific Papers: Investigating the User CurriculumJonathas Magalhães
 

Semelhante a Evaluation in Audio Music Similarity (20)

Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Metrics in usability testing and user experiences
Metrics in usability testing and user experiencesMetrics in usability testing and user experiences
Metrics in usability testing and user experiences
 
Information system audit
Information system audit Information system audit
Information system audit
 
Preference Elicitation Interface
Preference Elicitation InterfacePreference Elicitation Interface
Preference Elicitation Interface
 
Paper prototype evaluation
Paper prototype evaluationPaper prototype evaluation
Paper prototype evaluation
 
Evaluation techniques
Evaluation techniquesEvaluation techniques
Evaluation techniques
 
E3 chap-09
E3 chap-09E3 chap-09
E3 chap-09
 
e3-chap-09.ppt
e3-chap-09.ppte3-chap-09.ppt
e3-chap-09.ppt
 
Human Computer Interaction Evaluation
Human Computer Interaction EvaluationHuman Computer Interaction Evaluation
Human Computer Interaction Evaluation
 
Unit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptxUnit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptx
 
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
 
Information Experience Lab, IE Lab at SISLT
Information Experience Lab, IE Lab at SISLTInformation Experience Lab, IE Lab at SISLT
Information Experience Lab, IE Lab at SISLT
 
E3 chap-09
E3 chap-09E3 chap-09
E3 chap-09
 
Usability requirements
Usability requirements Usability requirements
Usability requirements
 
Paper Prototype Evaluation
Paper Prototype EvaluationPaper Prototype Evaluation
Paper Prototype Evaluation
 
Introduction to Usability Testing for Survey Research
Introduction to Usability Testing for Survey ResearchIntroduction to Usability Testing for Survey Research
Introduction to Usability Testing for Survey Research
 
2014 Paper Prototype Evaluation by David Lamas
2014 Paper Prototype Evaluation by David Lamas2014 Paper Prototype Evaluation by David Lamas
2014 Paper Prototype Evaluation by David Lamas
 
Recommending Scientific Papers: Investigating the User Curriculum
Recommending Scientific Papers: Investigating the User CurriculumRecommending Scientific Papers: Investigating the User Curriculum
Recommending Scientific Papers: Investigating the User Curriculum
 
Hm 418 harris ch09 ppt
Hm 418 harris ch09 pptHm 418 harris ch09 ppt
Hm 418 harris ch09 ppt
 

Mais de Julián Urbano

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...Julián Urbano
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...Julián Urbano
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 

Mais de Julián Urbano (10)

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Evaluation in Audio Music Similarity

  • 1. Evaluation in Audio Music Similarity PhD dissertation by Julián Urbano Picture by Javier García Leganés, October 3rd 2013
  • 3. Outline • Introduction – Scope – The Cranfield Paradigm • • • • Validity Reliability Efficiency Conclusions and Future Work 3
  • 4. Information Retrieval • Automatic representation, storage and search of unstructured information – Traditionally textual information – Lately multimedia too: images, video, music • A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents 4
  • 5. Information Retrieval Evaluation • IR systems are based on models to estimate relevance, implementing different techniques • How good is my system? What system is better? • Answered with IR Evaluation experiments – “if you can’t measure it, you can’t improve it” – But we need to be able to trust our measurements • Research on IR Evaluation – Improve our methods to evaluate systems – Critical for the correct development of the field 5
  • 6. History of IR Evaluation research — [Timeline figure, 1960–2010: MEDLARS, Cranfield 2, SMART, SIGIR; later TREC, INEX, CLEF, NTCIR; most recently ISMIR, MIREX, MusiCLEF, MSD Challenge] 6
  • 11. Audio Music Similarity • Song as input to system, audio signal • Retrieve songs musically similar to it, by content • Resembles traditional Ad Hoc retrieval in Text IR • (most?) Important task in Music IR – Music recommendation – Playlist generation – Plagiarism detection • Annual evaluation in MIREX 7
  • 12. Outline • Introduction – Scope – The Cranfield Paradigm • Validity • Reliability • Efficiency • Conclusions and Future Work 8
  • 14. The two questions • How good is my system? – What does good mean? – What is good enough? • Is system A better than system B? – What does better mean? – How much better? • Efficiency? Effectiveness? Ease? 10
  • 15. Measure user experience • We are interested in user-measures – Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use … – Their distributions describe user experience, fully • User satisfaction is the bigger picture – How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system? • This is the ultimate goal: the good, the better 11
  • 16. The Cranfield Paradigm • Estimate user-measure distributions – Sample documents, queries and users – Monitor user experience and behavior – Representativeness, cost, ethics, privacy … • Fix samples to allow reproducibility – But cannot fix users and their behavior – Remove users, but include a static user component, fixed across experiments: ground truth judgments – Still need to include the dynamics of the process: user models behind effectiveness measures and scales 12
  • 17. Test collections • Our goal is the users: user-measure = f(system) • Cranfield measures systems: system-effectiveness = f(system, measure, scale) • Estimators of the distributions of user-measures – Only source of variability is the systems themselves – Reproducibility becomes easy – Experiments are inexpensive (collections are not) – Research becomes systematic 13
  • 18. Validity, Reliability and Efficiency • Validity: are we measuring what we want to? – How well are effectiveness and satisfaction correlated? – How good is good and how better is better? • Reliability: how repeatable are the results? – How large do samples have to be? – What statistical methods should be used? • Efficiency: how inexpensive is it to get valid and reliable results? – Can we estimate results with fewer judgments? 14
  • 19. Goal of this dissertation Study and improve the validity, reliability and efficiency of the methods used to evaluate AMS systems Additionally, improve meta-evaluation methods 15
  • 21. Outline • Introduction • Validity – System Effectiveness and User Satisfaction – Modeling Distributions • Reliability • Efficiency • Conclusions and Future Work 17
  • 22. Assumption of Cranfield • Systems with better effectiveness are perceived by users as more useful, more satisfactory • But different effectiveness measures and relevance scales produce different distributions – Which one is better to predict user satisfaction? • Map system effectiveness onto user satisfaction, experimentally – If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory? – What if DCG@20 = 0.46? 18
  • 23. Measures and scales — [Table: 17 effectiveness measures (P@5, AP@5, RR@5, CG@5, DCG@5, EDCG@5, Q@5, RBP@5 and ERR@5 with linear and exponential gain, GAP@5, ADR@5) crossed with relevance scales: the original Broad and Fine scales, artificial graded scales with nℒ=3, 4 and 5 levels, and artificial binary scales with thresholds ℓmin=20, 40, 60 and 80. Under the binary scales several measures become equivalent to others, e.g. CG@5 to P@5 and Q@5 to AP@5.] 19
  • 36. What can we infer? • Preference: difference noticed by user – Positive: user agrees with evaluation – Negative: user disagrees with evaluation • Non-preference: difference not noticed by user – Good: both systems are satisfactory – Bad: both systems are not satisfactory 21
  • 37. Data • Queries, documents and judgments from MIREX • 4115 unique and artificial examples • 432 unique queries, 5636 unique documents • Answers collected via Crowdsourcing – Quality control with trap questions • 113 unique subjects 22
  • 38. Single system: how good is it? • For 2045 examples (49%) users could not decide which system was better What do we expect? 23
  • 40. Single system: how good is it? • Large ℓmin thresholds underestimate satisfaction 24
  • 41. Single system: how good is it? • Users don’t pay attention to ranking? 25
  • 42. Single system: how good is it? • Exponential gain underestimates satisfaction 26
  • 43. Single system: how good is it? • Document utility independent of others 27
  • 44. Two systems: which one is better? • For 2090 examples (51%) users did prefer one system over the other one What do we expect? 28
  • 46. Two systems: which one is better? • Large differences needed for users to note them 29
  • 47. Two systems: which one is better? • More relevance levels are better to discriminate 30
  • 48. Two systems: which one is better? • Cascade and navigational user models are not appropriate 31
  • 49. Two systems: which one is better? • Users do prefer the (supposedly) worse system 32
  • 50. Summary • Effectiveness and satisfaction are clearly correlated – But there is a bias of 20% because of user disagreement – Room for improvement through personalization • Magnitude of differences does matter – Just looking at rankings is very naive • Be careful with statistical significance – Need Δλ≈0.4 for users to agree with effectiveness • Historically, only 20% of times in MIREX • Differences among measures and scales – Linear gain slightly better than exponential gain – Informational and positional user models better than navigational and cascade – The more relevance levels, the better 33
  • 53. Outline • Introduction • Validity – System Effectiveness and User Satisfaction – Modeling Distributions • Reliability • Efficiency • Conclusions and Future Work 36
  • 55. Evaluate in terms of user satisfaction • So far, arbitrary users for a single query – P(Sat | Ql@5 = 0.61) = 0.7 • Easily for n users and a single query – P(Sat15 = 10 | Ql@5 = 0.61) = 0.21 • What about a sample of queries 𝒬? – Map queries separately for the distribution of P(Sat) – For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials 38
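The step from a single user to n users is a binomial computation: assuming users are independent and each is satisfied with the single-user probability, a minimal sketch (the numbers are those quoted on the slide):

```python
from math import comb

def p_sat_n(n, k, p_sat):
    """Probability that exactly k of n independent users are satisfied,
    given a per-user satisfaction probability p_sat (binomial pmf)."""
    return comb(n, k) * p_sat**k * (1 - p_sat)**(n - k)

# Per the slide: P(Sat | Ql@5 = 0.61) = 0.7 for one user, hence
# P(Sat15 = 10 | Ql@5 = 0.61) = C(15,10) * 0.7^10 * 0.3^5
print(round(p_sat_n(15, 10, 0.7), 2))  # -> 0.21
```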
  • 56. Expected probability of satisfaction • Now we can compute point and interval estimates of the expected probability of satisfaction • Intuition fails when interpreting effectiveness 39
  • 57. System success • If P(Sat) ≥ threshold the system is successful – Setting the threshold was rather arbitrary – Now it is meaningful, in terms of user satisfaction • Intuitively, we want the majority of users to find the system satisfactory – P(Succ) = P(P(Sat) > 0.5) = 1 − F_P(Sat)(0.5) • Improving queries for which we are bad is more worthwhile than further improving those for which we are already good 40
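Under this definition, P(Succ) can be estimated from the empirical distribution of per-query satisfaction probabilities; a sketch with hypothetical P(Sat) values (not MIREX data):

```python
def p_success(per_query_p_sat, threshold=0.5):
    """P(Succ) = P(P(Sat) > threshold) = 1 - F_{P(Sat)}(threshold),
    estimated with the empirical cdf over a sample of queries."""
    return sum(1 for p in per_query_p_sat if p > threshold) / len(per_query_p_sat)

# Hypothetical per-query P(Sat) values: 5 of 8 queries exceed 0.5
sample = [0.9, 0.8, 0.55, 0.4, 0.3, 0.7, 0.2, 0.65]
print(p_success(sample))  # -> 0.625
```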
  • 58. Distribution of P(Sat) • Need to estimate the cumulative distribution function of user satisfaction: F_P(Sat) • Not described by a typical distribution family – ecdf converges, but what is a good sample size? – Compare with Normal, Truncated Normal and Beta • Compared on >2M random samples from MIREX collections, at different query set sizes • Goodness of fit measured with the Cramér–von Mises ω² statistic 41
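This kind of fit comparison can be sketched by computing the Cramér–von Mises statistic directly from its definition; the candidate below is a moment-fitted Normal and the data are hypothetical per-query P(Sat) values, not the dissertation's:

```python
import math

def cvm_omega2(sample, cdf):
    """Cramer-von Mises goodness-of-fit statistic of a sample against a
    candidate cdf; smaller values mean a better fit."""
    xs = sorted(sample)
    n = len(xs)
    return 1 / (12 * n) + sum(
        (cdf(x) - (2 * i - 1) / (2 * n)) ** 2 for i, x in enumerate(xs, 1)
    )

def normal_cdf(mu, sigma):
    return lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical per-query P(Sat) values; fit a Normal by moments and score it
p_sat = [0.55, 0.60, 0.62, 0.70, 0.71, 0.75, 0.80, 0.82, 0.90, 0.95]
mu = sum(p_sat) / len(p_sat)
sigma = (sum((x - mu) ** 2 for x in p_sat) / (len(p_sat) - 1)) ** 0.5
print(cvm_omega2(p_sat, normal_cdf(mu, sigma)))
```

The same function scores any candidate cdf (Normal, Truncated Normal, Beta), so families can be ranked by their ω² on the same sample.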
  • 59. Estimated distribution of P(Sat) • More than ≈25 queries in the collection – ecdf approximates better • Less than ≈25 queries in the collection – Normal for graded scales, ecdf for binary scales • Beta is always the best with the Fine scale • The more levels in the relevance scale, the better • Linear gain better than exponential gain 42
  • 60. Intuition fails, again • Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction – E[Δλ] = −0.002 – E[ΔP(Sat)] = 0.001 – E[ΔP(Succ)] = 0.07 43
  • 63. Historically, in MIREX • Systems are not as satisfactory as we thought • But they are more successful – Good (or bad) for some kinds of queries 44
  • 64. Measures and scales — [Table: the subset of measures used in this section (P@5, AP@5, CG@5, DCG@5, Q@5 and RBP@5 with linear and exponential gain, and GAP@5) crossed with the Broad and Fine scales, artificial graded scales (nℒ=4, 5) and artificial binary scales (ℓmin=20, 40).] 45
  • 67. Outline • Introduction • Validity • Reliability – Optimality of Statistical Significance Tests – Test Collection Size • Efficiency • Conclusions and Future Work 48
  • 68. Random error • Test collections are just samples from larger, possibly infinite, populations • If we conclude system A is better than B, how confident can we be? – Δλ_𝒬 is just an estimate of the population mean μ_Δλ • Usually employ some statistical significance test for differences in location • If it is statistically significant, we have confidence that the true difference is at least that large 49
  • 69. Statistical hypothesis testing • Set two mutually exclusive hypotheses – H0: μ_Δλ = 0 – H1: μ_Δλ ≠ 0 • Run test, obtain p-value = P(Δλ ≥ Δλ_𝒬 | H0) – p ≤ α: statistically significant, high confidence – p > α: statistically non-significant, low confidence • Possible errors in the binary decision – Type I: incorrectly reject H0 – Type II: incorrectly accept H0 50
  • 70. Statistical significance tests • (Non-)parametric tests – t-test, Wilcoxon test, Sign test • Based on resampling – Bootstrap test, permutation/randomization test • They make certain assumptions about distributions and sampling methods – Often violated in IR evaluation experiments – Which test behaves better, in practice, knowing that assumptions are violated? 51
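As an illustration of the resampling family, a paired randomization (sign-flipping) test on per-query score differences; the scores below are hypothetical, and this is a sketch rather than the exact MIREX procedure:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=42):
    """Two-sided paired randomization test. Under H0 the sign of each
    per-query difference is arbitrary, so flip signs at random and count
    how often the mean difference is at least as extreme as observed."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(flipped) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-query scores of two systems on the same 8 queries
a = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85, 0.70, 0.80]
b = [0.50, 0.40, 0.60, 0.50, 0.45, 0.55, 0.50, 0.40]
print(randomization_test(a, b))  # small p: difference unlikely under H0
```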
  • 71. Optimality criteria • Power – Achieve significance as often as possible (low Type II) – Usually increases Type I error rates • Safety – Minimize Type I error rates – Usually decreases power • Exactness – Maintain Type I error rate at α level – Permutation test is theoretically exact 52
  • 72. Experimental design • Randomly split query set in two • Evaluate all systems with both subsets – Simulating two different test collections • Compare p-values with both subsets – How well do statistical tests agree with themselves? – At different α levels • All systems and queries from MIREX 2007-2011 – >15M p-values 53
  • 73. Power and success • Bootstrap test is the most powerful • Wilcoxon, bootstrap and permutation are the most successful, depending on α level 54
  • 74. Conflicts • Wilcoxon and t-test are the safest at low α levels • Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels 55
  • 75. Optimal measure and scale • Power: CGl@5, GAP@5, DCGl@5 and RBPl@5 • Success: CGl@5, GAP@5, DCGl@5 and RBPl@5 • Conflicts: very similar across measures • Power: Fine, Broad and binary • Success: Fine, Broad and binary • Conflicts: very similar across scales 56
  • 78. Acceptable sample size • Reliability is higher with larger sample sizes – But it is also more expensive – What is an acceptable test collection size? • Answer with Generalizability Theory – G-Study: estimate variance components – D-Study: estimate reliability of different sample sizes and experimental designs 59
  • 79. G-study: variance components • Fully crossed experimental design: s × q – λ_q,A = λ + λ_A + λ_q + ε_qA – σ² = σ²_s + σ²_q + σ²_sq • Estimated with Analysis of Variance • If σ²_s is small or σ²_q is large, we need more queries 60
  • 91. D-study: variance ratios • Stability of absolute scores – Φ(n_q) = σ²_s / (σ²_s + (σ²_q + σ²_e)/n_q) • Stability of relative scores – Eρ²(n_q) = σ²_s / (σ²_s + σ²_e/n_q) • We can easily estimate how many queries are needed to reach some level of stability (reliability) 61
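A sketch of the D-study computations for these two ratios, with hypothetical variance components (σ²_s, σ²_q, σ²_e are assumed values for illustration, not the MIREX estimates):

```python
def phi(var_s, var_q, var_e, n_q):
    """Stability of absolute scores for a query set of size n_q."""
    return var_s / (var_s + (var_q + var_e) / n_q)

def e_rho2(var_s, var_e, n_q):
    """Stability of relative scores for a query set of size n_q."""
    return var_s / (var_s + var_e / n_q)

def queries_needed(var_s, var_q, var_e, target=0.95):
    """Smallest n_q such that Phi(n_q) >= target."""
    n = 1
    while phi(var_s, var_q, var_e, n) < target:
        n += 1
    return n

# Hypothetical variance components from a G-study
var_s, var_q, var_e = 0.010, 0.020, 0.035
print(phi(var_s, var_q, var_e, 100))        # absolute stability, 100 queries
print(e_rho2(var_s, var_e, 100))            # relative stability, 100 queries
print(queries_needed(var_s, var_q, var_e))  # -> 105
```

Relative stability is always at least as high as absolute stability, since its error term omits the query main effect σ²_q.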
  • 93. Effect of query set size • Average absolute stability Φ = 0.97 • ≈65 queries needed for Φ = 0.95, ≈100 in worst cases • Fine scale slightly better than Broad and binary scales • RBPl@5 and nDCGl@5 are the most stable 62
  • 94. Effect of query set size • Average relative stability Eρ² = 0.98 • ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases • Fine scale better than Broad and binary scales • CGl@5 and RBPl@5 are the most stable 63
  • 95. Effect of cutoff k • What if we use a deeper cutoff, k=10? – From 100 queries and k=5 to 50 queries and k=10 – Should still have stable scores – Judging effort should decrease – Rank-based measures should become more stable • Tested in MIREX 2012 – Apparently in 2013 too 64
  • 96. Effect of cutoff k • Judging effort reduced to 72% of the usual • Generally stable – From Φ = 0.81 to Φ = 0.83 – From Eρ2 = 0.93 to Eρ2 = 0.95 65
  • 97. Effect of cutoff k • Reliability given a fixed budget for judging? – k=10 allows us to use fewer queries, about 70% – Slightly reduced relative stability 66
  • 98. Effect of assessor set size • More assessors or simply more queries? – Judging effort is multiplied • Can be studied with MIREX 2006 data – 3 different assessors per query – Nested experimental design: s × (h : q) 67
  • 99. Effect of assessor set size • Broad scale: σ²_s ≈ σ²_h:q • Fine scale: σ²_s ≫ σ²_h:q • Always better to spend resources on queries 68
  • 100. Summary • MIREX collections generally larger than necessary • For fixed budget – More queries better than more assessors – More queries slightly better than deeper cutoff • Worth studying alternative user model? • Employ G-Theory while building the collection • Fine better than Broad, better than binary • CGl@5 and DCGl@5 best for relative stability • RBPl@5 and nDCGl@5 best for absolute stability 69
  • 102. Outline • Introduction • Validity • Reliability • Efficiency – Learning Relevance Distributions – Low-cost Evaluation • Conclusions and Future Work 71
  • 103. Probabilistic evaluation • The MIREX setting is still expensive – Need to judge all top k documents from all systems – Takes days, even weeks sometimes • Model relevance probabilistically • Relevance judgments are random variables over the space of possible assignments of relevance • Effectiveness measures are also probabilistic 72
  • 104. Probabilistic evaluation • Accuracy increases as we make judgments – E[R_d] ← r_d • Reliability increases too (confidence) – Var[R_d] ← 0 • Iteratively estimate relevance and effectiveness – If confidence is low, make judgments – If confidence is high, stop • Judge as few documents as possible 73
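A toy sketch of this idea for binary relevance and P@5 (the probabilities are made up, and the dissertation models graded relevance with richer estimators; this only shows how a judgment collapses a document's distribution and shrinks the variance of the effectiveness estimate):

```python
def expected_p_at_k(probs):
    """E and Var of P@k when each document's binary relevance R_d is an
    independent Bernoulli with probability probs[d]."""
    k = len(probs)
    e = sum(probs) / k
    var = sum(p * (1 - p) for p in probs) / k ** 2
    return e, var

# Hypothetical relevance probabilities for a system's top 5 results
probs = [0.9, 0.7, 0.5, 0.4, 0.2]
e, var = expected_p_at_k(probs)

# Judging a document collapses its distribution: E[R_d] <- r_d, Var[R_d] <- 0.
# Judge the most uncertain document first (p closest to 0.5) ...
d = max(range(len(probs)), key=lambda i: probs[i] * (1 - probs[i]))
probs[d] = 1.0  # ... and suppose the assessor marks it relevant
e2, var2 = expected_p_at_k(probs)
print(e, var)    # estimate before judging
print(e2, var2)  # after one judgment: the estimate tightens
```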
  • 105. Learning distributions of relevance • Uniform distribution is very uninformative • Historical distribution in MIREX has high variance • Estimate from a set of features: P(R_d = ℓ | θ_d) – For each document separately – Ordinal Logistic Regression • Three sets of features – Output-based, can always be used – Judgment-based, to exploit known judgments – Audio-based, to exploit musical similarity 74
  • 106. Learned models • Mout : can be used even without judgments – Similarity between systems’ outputs – Genre and artist metadata • Genre is highly correlated to similarity – Decent fit, R2 ≈ 0.35 • Mjud : can be used when there are judgments – Similarity between systems’ outputs – Known relevance of same system and same artist • Artist is extremely correlated to similarity – Excellent fit, R2 ≈ 0.91 75
  • 107. Estimation errors • Actual vs. predicted by Mout – 0.36 with Broad and 0.34 with Fine • Actual vs. predicted by Mjud – 0.14 with Broad and 0.09 with Fine • Among assessors in MIREX 2006 – 0.39 with Broad and 0.31 with Fine • Negligible under the current MIREX setting 76
  • 110. Probabilistic effectiveness measures • Effectiveness scores are also random variables • Different approaches to compute estimates – Deal with dependence of random variables – Different definitions of confidence • For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form – Approximated with Delta method and Taylor series 79
  • 111. Ranking without judgments 1. Estimate relevance with Mout 2. Estimate relative differences and rank systems • Average confidence in the rankings is 94% • Average accuracy of the ranking is 92% 80
  • 112. Ranking without judgments • Can we trust individual estimates? – Ideally, we want X% accuracy when X% confidence – Confidence slightly overestimated in [0.9, 0.99)

DCGl@5          Broad                    Fine
Confidence      In bin        Accuracy   In bin        Accuracy
[0.5, 0.6)      23 (6.5%)     0.826      22 (6.2%)     0.636
[0.6, 0.7)      14 (4%)       0.786      16 (4.5%)     0.812
[0.7, 0.8)      14 (4%)       0.571      11 (3.1%)     0.364
[0.8, 0.9)      22 (6.2%)     0.864      21 (6%)       0.762
[0.9, 0.95)     23 (6.5%)     0.870      19 (5.4%)     0.895
[0.95, 0.99)    24 (6.8%)     0.917      27 (7.7%)     0.926
[0.99, 1)       232 (65.9%)   0.996      236 (67%)     0.996
E[Accuracy]                   0.938                    0.921
81
  • 113. Relative estimates with judgments 1. Estimate relevance with Mout 2. Estimate relative differences and rank systems 3. While confidence is low (<95%) 1. Select a document and judge it 2. Update relevance estimates with Mjud when possible 3. Update estimates of differences and rank systems • What documents should we judge? – Those that are the most informative – Measure-dependent 82
  • 114. Relative estimates with judgments • Judging effort dramatically reduced – 1.3% with CGl@5, 9.7% with RBPl@5 • Average accuracy still 92%, but improved individually – 74% of estimates with >99% confidence, 99.9% accurate – Expected accuracy improves slightly from 0.927 to 0.931 83
  • 115. Absolute estimates with judgments 1. Estimate relevance with Mout 2. Estimate absolute effectiveness scores 3. While confidence is low (expected error >±0.05) 1. Select a document and judge it 2. Update relevance estimates with Mjud when possible 3. Update estimates of absolute effectiveness scores • What documents should we judge? – Those that reduce variance the most – Measure-dependent 84
  • 116. Absolute estimates with judgments • The stopping condition is overly confident – Virtually no judgments are even needed (supposedly) • But effectiveness is highly overestimated – Especially with nDCGl@5 and RBPl@5 – Mjud, and especially Mout, tend to overestimate relevance 85
  • 117. Absolute estimates with judgments • Practical fix: correct variance • Estimates are better, but at the cost of judging – Need between 15% and 35% of judgments 86
118. Summary

• Estimate the ranking of systems with no judgments
  – 92% accuracy on average, trustworthy individually
  – Statistically significant differences are always correct
• If we want more confidence, judge documents
  – As few as 2% are needed to reach 95% confidence
  – 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
  – Around 25% of judgments needed to ensure error <0.05
119. Outline

• Introduction
• Validity
• Reliability
• Efficiency
  – Learning Relevance Distributions
  – Low-cost Evaluation
• Conclusions and Future Work
121. Validity

• Cranfield tells us about systems, not about users
• Provide an empirical mapping from system effectiveness onto user satisfaction
• Room for personalization quantified at about 20%
• Users need large differences between systems to notice them
• Consider full distributions of satisfaction, not just averages
• Conclusions based on effectiveness tend to contradict conclusions based on user satisfaction
122. Reliability

• Different significance tests for different needs
  – The bootstrap test is the most powerful
  – Wilcoxon and the t-test are the safest
  – Wilcoxon and the bootstrap test are the most exact
• Practical interpretation of p-values
• MIREX collections are generally larger than needed
• Spend resources on queries, not on assessors
• User models with deeper cutoffs are feasible
• Employ G-Theory while building collections
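Of those tests, the paired bootstrap can be sketched as follows. This is a minimal shift-method implementation over per-query effectiveness scores, not necessarily the exact procedure used in the dissertation.

```python
import random

def bootstrap_pvalue(scores_a, scores_b, trials=10000, seed=0):
    """Paired bootstrap significance test for the mean difference in
    effectiveness between systems A and B over the same queries.
    Resamples queries with replacement and, after shifting to the null
    hypothesis of zero mean difference, counts resampled means at
    least as extreme as the observed one (two-sided)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(trials):
        mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if abs(mean - observed) >= abs(observed):
            extreme += 1
    return extreme / trials
```

Resampling queries rather than assuming a score distribution is what gives the bootstrap its power on small, skewed effectiveness samples.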
123. Efficiency

• Probabilistic evaluation reduces cost dramatically
• Two models to estimate document relevance
• System rankings are 92% accurate without judgments
• 2% of judgments to reach 95% confidence
• 25% of judgments to reduce error to 0.05
124. Measures and scales

• The best measure and scale depend on the situation
• But generally speaking:
  – CGl@5, DCGl@5 and RBPl@5
  – The Fine scale
  – Model distributions as Beta
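Fitting a Beta distribution to effectiveness scores can be sketched with a method-of-moments estimate, one simple option among several; the dissertation's fitting procedure may differ, and the sample below is synthetic.

```python
import random

def fit_beta(scores):
    """Method-of-moments estimates of Beta(alpha, beta) parameters
    from a sample of scores in (0, 1)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    # solve the Beta mean/variance equations for alpha and beta
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Sanity check on a synthetic sample drawn from a known Beta(2, 5)
rng = random.Random(42)
samples = [rng.betavariate(2.0, 5.0) for _ in range(20000)]
alpha, beta = fit_beta(samples)
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")
```

The Beta family is a natural choice here because effectiveness scores are bounded in [0, 1] and their distributions are often skewed.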
127. Validity

• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better capture document utility
• Explicitly define judging guidelines
• A similar mapping for Text IR
128. Reliability

• Corrections for multiple comparisons
• Methods to reliably estimate reliability while building test collections
129. Efficiency

• Better models to estimate document relevance
• Correct variance when only a few relevance judgments are available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
130. Conduct similar studies for the wealth of tasks in Music Information Retrieval
131. Evaluation in Audio Music Similarity

PhD dissertation by Julián Urbano
Picture by Javier García

Leganés, October 3rd 2013