Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval
@julian_urbano (University Carlos III of Madrid)
@m_schedl (Johannes Kepler University)
AdMIRe 2012 · Lyon, France · April 17th
Picture by ERdi43 (Wikipedia)
Problem
   evaluation of IR systems is costly
             Annotations
           time consuming
              expensive
               boring
           (Bad) Consequence
    small and biased test collections
  unlikely to change from year to year
                Solution
apply low-cost evaluation methodologies
nearly 2 decades of Meta-Evaluation in Text IR
some good practices inherited from here

Timeline, 1960 to 2011
Text IR:
  Cranfield 2 (1962-1966)
  MEDLARS (1966-1967)
  SMART (1961-1995)
  TREC (1992-today)
  NTCIR (1999-today)
  CLEF (2000-today)
Music IR:
  ISMIR (2000-today)
  MIREX (2005-today)

a lot of things have happened here!
Minimal Test Collections (MTC) [Carterette et al.]
        estimate the ranking of systems
 with very few judgments (high incompleteness)

   Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year
        to make thousands of judgments
  Year  Teams  Systems  Queries  Results  Judgments  Overlap
  2006      5        6       60    1,800      1,629      10%
  2007      8       12      100    6,000      4,832      19%
  2009      9       15      100    7,500      6,732      10%
  2010      5        8      100    4,000      2,737      32%
  2011     10       18      100    9,000      6,322      30%
evaluation with incomplete judgments
Basic Idea
treat similarity scores as random variables
that can be estimated with uncertainty

gain of an arbitrary document: G_i ~ multinomial

$$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$$

$$\mathcal{L}_{BROAD} = \{0, 1, 2\} \qquad \mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$$

whenever document i is judged with similarity l:

$$E[G_i] = l \qquad Var[G_i] = 0$$

*all variance formulas in the paper
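
As a minimal sketch of the idea above, the expected gain of an unjudged document can be computed directly from a distribution over similarity levels. The function names `gain_moments` and `uniform` are hypothetical, the uniform distribution is only a placeholder, and the variance shown is the textbook variance of a discrete random variable, which may not be written the same way as in the paper.

```python
def gain_moments(level_probs):
    """Return (E[G], Var[G]) for a dict mapping similarity level -> probability."""
    mean = sum(p * level for level, p in level_probs.items())
    var = sum(p * (level - mean) ** 2 for level, p in level_probs.items())
    return mean, var


def uniform(levels):
    """Uniform distribution over the given similarity levels (an assumed prior)."""
    levels = list(levels)
    return {level: 1 / len(levels) for level in levels}


# Unjudged documents: every level of the scale is still possible.
e_broad, v_broad = gain_moments(uniform(range(3)))     # BROAD = {0, 1, 2}
e_fine, v_fine = gain_moments(uniform(range(101)))     # FINE = {0, 1, ..., 100}

# Judged document with similarity 2: all probability mass on that level,
# so the expectation is the judgment itself and the variance is zero.
judged = gain_moments({2: 1.0})
```

Once a document is judged, its distribution collapses to a single level, which is what drives the uncertainty of the estimates down as judgments are made.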
AG@k is also treated as a random variable

$$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$$

the sum iterates all documents (in practice, only the top k retrieved)
A_i is the ranking at which document i was retrieved

Ultimate Goal
compute a good estimate with the least effort
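
A small sketch of how this expectation could be evaluated with only some documents judged. The helper `expected_ag_at_k`, the document identifiers, and the prior mean of 1.0 (the mean of a uniform distribution over the BROAD scale) are all assumptions made for illustration.

```python
def expected_ag_at_k(ranking, judgments, prior_mean, k):
    """E[AG@k] for one system and one query.

    ranking    -- documents in the order the system retrieved them
    judgments  -- dict of known gains for the documents judged so far
    prior_mean -- E[G_i] assumed for every unjudged document
    """
    top_k = ranking[:k]  # only the top k retrieved documents contribute
    return sum(judgments.get(doc, prior_mean) for doc in top_k) / k


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
judgments = {"d1": 2, "d2": 2}          # two of the top five are judged
estimate = expected_ag_at_k(ranking, judgments, prior_mean=1.0, k=5)
```

Here the estimate mixes two known gains with three prior means, so it moves toward the true AG@5 as more of the top five are judged.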
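Comparing Two Systems
what really matters is the sign of the difference

$$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$$

Evaluating Several Queries

$$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$$

the sum iterates all queries

The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging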
Distribution of AG@k
what are the possible assignments of similarity?

$$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$$

the sum iterates all possible permutations of k similarity assignments
ultimately depends on the distribution of G_i

Plain English
the ratio of similarity assignments s.t. AG@k = z

For Complex Measures or Large Similarity Scales
run Monte Carlo simulation
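
The two routes mentioned above can be sketched side by side: exact enumeration of every similarity assignment for a small scale, and Monte Carlo sampling when enumeration is too expensive. The function names, the number of trials and the uniform distribution over the BROAD scale are assumptions for illustration only.

```python
import random
from collections import Counter
from itertools import product


def exact_ag_at_k(level_probs, k):
    """Enumerate all assignments of similarity to k documents."""
    dist = {}
    for assignment in product(level_probs, repeat=k):
        prob = 1.0
        for level in assignment:
            prob *= level_probs[level]      # gains assumed independent
        z = sum(assignment) / k
        dist[z] = dist.get(z, 0.0) + prob
    return dist


def simulate_ag_at_k(level_probs, k, trials=20000, seed=0):
    """Approximate the same distribution by random sampling."""
    rng = random.Random(seed)
    levels = list(level_probs)
    weights = [level_probs[level] for level in levels]
    counts = Counter()
    for _ in range(trials):
        sample = rng.choices(levels, weights=weights, k=k)
        counts[sum(sample) / k] += 1
    return {z: c / trials for z, c in sorted(counts.items())}


broad = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
exact = exact_ag_at_k(broad, k=5)           # 3**5 = 243 assignments
simulated = simulate_ag_at_k(broad, k=5)
```

With a uniform distribution, `exact[1.0]` is just the fraction of the 243 assignments whose mean is 1, which matches the "plain English" reading of the formula.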
Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries

each numbered step is one query; each sample mean is AG@k for a single query
1. take a sample of k documents. Mean = X_1
2. take a sample of k documents. Mean = X_2
...
Q. take a sample of k documents. Mean = X_Q
Mean of sample means = X (mean AG@k over all queries)

Central Limit Theorem
as Q→∞, X approximates a normal distribution
regardless of the distribution of G
AG@k is Normally Distributed
use the normal cumulative distribution function Φ

$$P(\Delta AG@k \le 0) = \Phi\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$$

[Figure: density of AG@5 on the BROAD scale (x-axis 0.0 to 2.0) and on the FINE scale (x-axis 0 to 100)]
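
A hedged sketch of how the normal approximation could be used to decide whether to keep judging a pair of systems for a single query. The variance of the difference is computed here as the sum of the variances of the unjudged gains divided by k squared, which assumes independent gains; the paper's own variance formulas may differ. All names and the example data are hypothetical.

```python
from statistics import NormalDist


def delta_moments(top_a, top_b, judgments, prior_mean, prior_var, k):
    """E and Var of the AG@k difference between systems A and B for one query."""
    a, b = set(top_a[:k]), set(top_b[:k])
    mean = var = 0.0
    for doc in a ^ b:                  # documents retrieved by both cancel out
        sign = 1 if doc in a else -1
        if doc in judgments:
            mean += sign * judgments[doc]
        else:
            mean += sign * prior_mean
            var += prior_var           # only unjudged documents add uncertainty
    return mean / k, var / k ** 2


def prob_delta_nonpositive(mean, var):
    """P(difference <= 0) under the normal approximation."""
    if var == 0:
        return 1.0 if mean <= 0 else 0.0
    return NormalDist().cdf(-mean / var ** 0.5)


def keep_judging(p, alpha=0.05):
    """The stopping rule: judge another document while alpha < p < 1 - alpha."""
    return alpha < p < 1 - alpha


top_a = ["d1", "d2", "d3"]
top_b = ["d1", "d4", "d5"]
judgments = {"d2": 2, "d4": 0}
mean, var = delta_moments(top_a, top_b, judgments, 1.0, 2 / 3, k=3)
p = prob_delta_nonpositive(mean, var)
more = keep_judging(p)
```

In this toy case two judgments already push the probability that A is not better than B below 0.05, so the rule says to stop.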
Confidence as a Function of # Judgments

[Figure: confidence in the ranking of systems (50 to 100) against percent of judgments (0 to 100), annotated "we can stop judging", "or keep judging to be really confident", "or waste our time"]

what documents should we judge?
those that maximize the confidence
The Trick
documents retrieved by both systems are useless
       there is no need to judge them
 whatever Gi is, it is added and then subtracted

          Comparing Several Systems
  compute a weight wi for each query-document
     judge the document with largest effect

               wi in the Original MTC
      wi = largest weight across system pairs
reduces to # of system pairs affected by query-doc i
w_i Dependent on Confidence
if we are highly confident about a pair of systems
we do not need to judge another of their documents
even if it has the largest weight

$$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$$

the sum iterates system pairs with low confidence
the factor (1 - C_{A,B}) makes the weight inversely proportional to confidence

better results than traditional weights
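
One way the confidence-dependent weights could be computed for a single query is sketched below. Treating a pair as "low confidence" when its confidence does not exceed 1 - alpha is an assumption, as are the function name, the data layout and the example values.

```python
from itertools import combinations


def document_weights(tops, confidence, k, alpha=0.05):
    """Weight of each document for one query.

    tops       -- dict: system name -> ranked list of documents
    confidence -- dict: (system, system) pair -> current confidence C
    """
    weights = {}
    for a, b in combinations(sorted(tops), 2):
        c = confidence[(a, b)]
        if c > 1 - alpha:              # pair already resolved: skip it (assumed rule)
            continue
        # The squared indicator difference is 1 exactly when the document
        # is retrieved in the top k by one system of the pair but not the other.
        for doc in set(tops[a][:k]) ^ set(tops[b][:k]):
            weights[doc] = weights.get(doc, 0.0) + (1 - c)
    return weights


tops = {"A": ["d1", "d2"], "B": ["d1", "d3"], "C": ["d3", "d4"]}
confidence = {("A", "B"): 0.60, ("A", "C"): 0.99, ("B", "C"): 0.70}
weights = document_weights(tops, confidence, k=2)
```

Document d1 gets no weight from the pair (A, B) because both retrieve it, and the pair (A, C) contributes nothing because it is already resolved.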
MTC for AMS with AG@k
MTC for ΔAG@k

while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)
    $i^* \leftarrow \arg\max_i w_i$    (select the best document from all unjudged query-documents)
    judge query-document $i^*$ (obtain true $gain_{i^*}$)
    $E[G_{i^*}] \leftarrow gain_{i^*}$
    $Var[G_{i^*}] \leftarrow 0$
    update (increase confidence)
end while
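
The loop above can be simulated end to end on toy data. This is only a sketch under several assumptions: gains are independent with a uniform prior over the BROAD scale, the confidence of a pair is taken as max(p, 1 - p) with p = P(ΔAG@k ≤ 0), and the true gains come from a hypothetical `oracle` dictionary standing in for human judges.

```python
from itertools import combinations
from statistics import NormalDist

PRIOR_MEAN, PRIOR_VAR = 1.0, 2 / 3     # uniform prior over BROAD = {0, 1, 2}


def pair_confidence(runs, a, b, judged, k):
    """Assumed definition: C = max(p, 1 - p), with p = P(difference <= 0)."""
    mean = var = 0.0
    for q in runs[a]:
        top_a, top_b = set(runs[a][q][:k]), set(runs[b][q][:k])
        for doc in top_a ^ top_b:
            sign = 1 if doc in top_a else -1
            if (q, doc) in judged:
                mean += sign * judged[(q, doc)]
            else:
                mean += sign * PRIOR_MEAN
                var += PRIOR_VAR
    if var == 0:
        return 1.0                      # nothing left to learn for this pair
    # The 1/k and 1/|Q| factors cancel in the ratio, so they are omitted.
    p = NormalDist().cdf(-mean / var ** 0.5)
    return max(p, 1 - p)


def mtc(runs, oracle, k, alpha=0.05):
    judged = {}
    pairs = list(combinations(sorted(runs), 2))
    while True:
        conf = {pair: pair_confidence(runs, *pair, judged, k) for pair in pairs}
        if sum(conf.values()) / len(pairs) > 1 - alpha:
            break                       # average confidence is high enough
        weights = {}
        for (a, b), c in conf.items():
            for q in runs[a]:
                for doc in set(runs[a][q][:k]) ^ set(runs[b][q][:k]):
                    if (q, doc) not in judged:
                        weights[(q, doc)] = weights.get((q, doc), 0.0) + (1 - c)
        if not weights:
            break                       # every useful document is judged
        best = max(weights, key=weights.get)
        judged[best] = oracle[best]     # obtain the true gain
    return judged, conf


runs = {
    "A": {"q1": ["a1", "a2", "s1"], "q2": ["a3", "a4", "s2"]},
    "B": {"q1": ["b1", "b2", "s1"], "q2": ["b3", "b4", "s2"]},
}
oracle = {(q, d): (2 if d[0] == "a" else 0 if d[0] == "b" else 1)
          for system in runs.values() for q, docs in system.items() for d in docs}
judged, conf = mtc(runs, oracle, k=3)
```

In this example the shared documents s1 and s2 are never candidates, and the loop stops before all eight remaining documents have been judged.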
MTC in MIREX AMS 2011
Why MIREX 2011
            largest edition so far
   18 systems (153 pairwise comparisons)
      100 queries and 6,322 judgments


              Distribution of Gi
let us work with a uniform distribution for now
Confidence as Judgments are Made




correct bins: estimated sign is correct or
              not significant anyway
high confidence with considerably less effort
Accuracy as Judgments are Made
estimated bins always better than expected
estimated signs highly correlated with confidence
rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
high confidence and high accuracy with considerably less effort
Statistical Significance
MTC allows us to accurately estimate the ranking
        but for the current set of queries
 can we generalize to a general set of queries?


                   Not Trivial
     we have the variance of the estimates
         but not the sample variance
Work with Upper and Lower Bounds of ΔAG@k
Upper bound: best case for A
Lower bound: best case for B

$$\Delta AG@k = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

first sum: known judgments
second sum: unknown judgments retrieved by A, assigned the best similarity score l+
third sum: unknown judgments retrieved by B but not by A, assigned the worst similarity score l-

*same for the lower bound
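
A literal sketch of the upper bound for one query, term by term as annotated on the slide. The lower bound is obtained here by swapping the roles of A and B and negating, which is an assumption, since the slides only say it is "the same". Function names and example data are hypothetical.

```python
def upper_bound(top_a, top_b, judged, k, l_plus, l_minus):
    """Best case for A: the three sums of the upper-bound formula."""
    a, b = set(top_a[:k]), set(top_b[:k])
    # Known judgments: the gain counts for A, against B, or cancels out.
    known = sum(judged[d] * ((d in a) - (d in b)) for d in a | b if d in judged)
    # Unknown judgments retrieved by A get the best similarity score.
    best_for_a = sum(l_plus for d in a if d not in judged)
    # Unknown judgments retrieved by B but not by A get the worst score.
    worst_for_b = sum(l_minus for d in b - a if d not in judged)
    return (known + best_for_a - worst_for_b) / k


def lower_bound(top_a, top_b, judged, k, l_plus, l_minus):
    """Best case for B, assumed symmetric to the upper bound."""
    return -upper_bound(top_b, top_a, judged, k, l_plus, l_minus)


top_a = ["d1", "d2", "d3"]
top_b = ["d4", "d5", "d6"]
judged = {"d1": 2, "d4": 1}
upper = upper_bound(top_a, top_b, judged, k=3, l_plus=2, l_minus=0)   # BROAD scale
lower = lower_bound(top_a, top_b, judged, k=3, l_plus=2, l_minus=0)
```

With only two of six documents judged the interval is wide, which illustrates why such best-case bounds tend to be unrealistic early on.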
3 Rules
1. Assume best case for A (upper bound)
   if A <<< B then conclude A <<< B

2. Assume best case for B (lower bound)
   if B <<< A then conclude B <<< A

3. If in the best case for A we do not have A >>> B
   and in the best case for B we do not have B >>> A
   then conclude they are not significantly different
                     Problem
    upper and lower bounds are very unrealistic
Incorporate a Heuristic
4. If the estimated difference is larger than t
   naively conclude significance

Choose t Based on Power Analysis
t = effect-size detectable by a t-test with
    • sample variance σ² = 0.0615 (from previous MIREX editions)
    • sample size n = 100 (from previous MIREX editions)
    • Type I Error rate α = 0.05 (typical value)
    • Type II Error rate β = 0.15 (typical value)

t ≈ 0.067
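
The threshold can be roughly reproduced with a standard power calculation. The sketch below uses a one-sided test and the normal approximation to the t distribution; both are assumptions, since the slides do not state which variant was used. It yields about 0.066, close to the value on the slide.

```python
from statistics import NormalDist


def detectable_effect(variance, n, alpha=0.05, beta=0.15):
    """Smallest mean difference detectable with the given error rates.

    Normal approximation, one-sided test (assumed):
    effect = (z_{1-alpha} + z_{1-beta}) * sqrt(variance / n)
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta)) * (variance / n) ** 0.5


t = detectable_effect(variance=0.0615, n=100)
```

Using exact t quantiles instead of normal ones would make the threshold slightly larger, which is consistent with the rounded 0.067.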
Accuracy of the Significance Estimates
pretty good
around 95% confidence
rule 4 (heuristic) ends up overestimating significance
rules 1 to 3 begin to apply and correct overestimations
closer to expected
never under 90%
significance can be estimated fairly well too
what we did
Introduce MTC to the MIR folks

      Work out the Math
      for MTC with AG@k

See How Well it would have Done
         in AMS 2011
         quite well actually!
what now
Learn the true Distribution of Similarity Judgments
              it's clearly not uniform
would give more accurate estimates with less effort
 use previous AMS data or fit a model as we judge


 Significance Testing with Incomplete Judgments
      best-case scenarios are very unrealistic


Study Low-Cost Methodologies for other MIR Tasks
what for
MTC Greatly Reduces the Effort for AMS (and SMS)
  have MIREX volunteers incrementally create
   brand new test collections for other tasks

                   Better Yet
 study low-cost methodologies for the other tasks
                 Not Only for MIREX
    private collections for in-house evaluations
no possibility of gathering large pools of annotators
           low-cost becomes paramount
the MIR community needs a paradigm shift
from a priori to a posteriori evaluation methods
to reduce cost and gain reliability

Mais conteúdo relacionado

Destaque

On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
Julián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Julián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
Julián Urbano
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Richard Diamond
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard Diamond
Richard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard Diamond
Richard Diamond
 

Destaque (13)

Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard Diamond
 

Semelhante a Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Condition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating EquipmentCondition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating Equipment
Jordan McBain
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
Gianmario Spacagna
 
Presentation at SMI 2023
Presentation at SMI 2023Presentation at SMI 2023
Presentation at SMI 2023
Joaquim Jorge
 
Statistical quality__control_2
Statistical  quality__control_2Statistical  quality__control_2
Statistical quality__control_2
Tech_MX
 

Semelhante a Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval (20)

Resolution
ResolutionResolution
Resolution
 
Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)
 
1.2
1.21.2
1.2
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
Condition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating EquipmentCondition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating Equipment
 
Adam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the OddballsAdam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the Oddballs
 
Measurement_and_Units.pptx
Measurement_and_Units.pptxMeasurement_and_Units.pptx
Measurement_and_Units.pptx
 
Clustering Theory
Clustering TheoryClustering Theory
Clustering Theory
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
 
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)
 
Presentation at SMI 2023
Presentation at SMI 2023Presentation at SMI 2023
Presentation at SMI 2023
 
Input analysis
Input analysisInput analysis
Input analysis
 
Efficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingEfficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketching
 
Linear sort
Linear sortLinear sort
Linear sort
 
Statistical quality__control_2
Statistical  quality__control_2Statistical  quality__control_2
Statistical quality__control_2
 

Mais de Julián Urbano

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Julián Urbano
 

Mais de Julián Urbano (10)

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 

Último

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

  • 1. Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval @julian_urbano @m_schedl University Carlos III of Madrid Johannes Kepler University AdMIRe 2012 Picture by ERdi43 (Wikipedia) Lyon, France · April 17th
  • 2. Problem evaluation of IR systems is costly Annotations time consuming expensive boring (Bad) Consequence small and biased test collections unlikely to change from year to year Solution apply low-cost evaluation methodologies
  • 3. nearly 2 decades of Meta-Evaluation in Text IR some good practices inherited from here NTCIR CLEF Cranfield 2 MEDLARS SMART TREC (1999-today) (2000-today) (1962-1966) (1966-1967) (1992-today) (1961-1995) 1960 2011 ISMIR MIREX (2000-today) (2005-today) a lot of things have happened here!
  • 4. Minimal Test Collections (MTC) [Carterette at al.] estimate the ranking of systems with very few judgments (high incompleteness) Application in Audio Music Similarity (AMS) dozens of volunteers required by MIREX every year to make thousands of judgments Year Teams Systems Queries Results Judgments Overlap 2006 5 6 60 1,800 1,629 10% 2007 8 12 100 6,000 4,832 19% 2009 9 15 100 7,500 6,732 10% 2010 5 8 100 4,000 2,737 32% 2011 10 18 100 9,000 6,322 30%
  • 5. evaluation with incomplete judgments
  • 6. Basic Idea: treat similarity scores as random variables that can be estimated with uncertainty. Gain of an arbitrary document: $G_i \sim$ multinomial, with $E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$, where $\mathcal{L}_{BROAD} = \{0, 1, 2\}$ and $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$. Whenever document $i$ is judged with level $l$: $E[G_i] = l$ and $Var[G_i] = 0$. (*All variance formulas are in the paper.)
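
As a minimal sketch of the expectation on this slide, the snippet below computes $E[G_i]$ and the usual variance $E[(G_i - E[G_i])^2]$ for a gain distribution over a similarity scale. The uniform distribution and the function names are assumptions made for illustration; the slides defer the variance formulas to the paper.

```python
from fractions import Fraction

BROAD = (0, 1, 2)            # BROAD similarity scale from the slide
FINE = tuple(range(101))     # FINE similarity scale: 0, 1, ..., 100

def uniform(levels):
    """Assumed prior for illustration: every level is equally likely."""
    p = Fraction(1, len(levels))
    return {level: p for level in levels}

def judged(level):
    """Distribution of a judged document: all mass on the observed level."""
    return {level: Fraction(1)}

def expected_gain(dist):
    """E[G_i] = sum over levels of P(G_i = l) * l."""
    return sum(p * level for level, p in dist.items())

def gain_variance(dist):
    """Var[G_i] = sum over levels of P(G_i = l) * (l - E[G_i])^2."""
    mean = expected_gain(dist)
    return sum(p * (level - mean) ** 2 for level, p in dist.items())
```

Once a document is judged, `judged(level)` has zero variance, which matches the slide's $Var[G_i] = 0$.
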
  • 7. AG@k is also treated as a random variable: $E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$, where the sum iterates all documents (in practice, only the top k retrieved) and $A_i$ is the ranking at which document $i$ was retrieved. Ultimate Goal: compute a good estimate with the least effort.
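
The expectation of AG@k can be sketched in a few lines. The data layout (a dict from document id to rank, and a dict from document id to expected gain) is a hypothetical choice for this example, not something the slides specify.

```python
def expected_ag_at_k(rank, expected_gain, k):
    """E[AG@k] = (1/k) * sum of E[G_i] over documents ranked at k or better.

    rank: dict mapping document id -> rank at which system A retrieved it (1-based).
    expected_gain: dict mapping document id -> E[G_i] (true gain if judged).
    """
    return sum(expected_gain[doc] for doc, r in rank.items() if r <= k) / k
```

For example, with a judged gain of 2, an unjudged document with expected gain 1 and a judged gain of 0 in the top 3, the estimate is (2 + 1 + 0) / 3.
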
  • 8. Comparing Two Systems: what really matters is the sign of the difference, $E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot (I(A_i \le k) - I(B_i \le k))$. Evaluating Several Queries: $E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$ (iterate all queries). The Rationale: if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging.
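
A rough sketch of the difference between two systems and of the stopping rationale follows. The rank dictionaries and function names are assumptions for illustration; the probability passed to `keep_judging` is taken as given here.

```python
def expected_delta(rank_a, rank_b, expected_gain, k):
    """E[dAG@k] for one query: (1/k) * sum of E[G_i] * (I(A_i<=k) - I(B_i<=k))."""
    total = 0.0
    for doc in set(rank_a) | set(rank_b):
        in_a = rank_a.get(doc, k + 1) <= k   # I(A_i <= k); unretrieved counts as > k
        in_b = rank_b.get(doc, k + 1) <= k   # I(B_i <= k)
        total += expected_gain[doc] * (in_a - in_b)
    return total / k

def mean_expected_delta(per_query_deltas):
    """Average the per-query expectations over the query set Q."""
    return sum(per_query_deltas) / len(per_query_deltas)

def keep_judging(p_delta_le_zero, alpha):
    """The rationale: judge another document while alpha < P(dAG@k <= 0) < 1 - alpha."""
    return alpha < p_delta_le_zero < 1 - alpha
```
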
  • 9. Distribution of AG@k: what are the possible assignments of similarity? $P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$, which iterates all possible permutations of k similarity assignments and ultimately depends on the distribution of $G_i$. Plain English: the ratio of similarity assignments s.t. AG@k = z. For Complex Measures or Large Similarity Scales: run Monte Carlo simulation.
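
For a small scale and small k, the sum over similarity assignments can be enumerated exactly. The sketch below assumes the gains of different documents are independent, so that $P(\gamma_k)$ is a product of per-document probabilities; that independence and the function name are assumptions of this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def ag_distribution(dists):
    """Exact distribution of AG@k given one {level: probability} dict per top-k document.

    Enumerates every assignment of levels to the k documents and accumulates
    the probability of each resulting AG@k value.
    """
    k = len(dists)
    out = defaultdict(Fraction)
    for assignment in product(*(d.items() for d in dists)):
        prob = Fraction(1)
        total = 0
        for level, p in assignment:
            prob *= p          # P(gamma_k) under the independence assumption
            total += level
        out[Fraction(total, k)] += prob
    return dict(out)
```

The number of assignments grows as $|\mathcal{L}|^k$, which is why the slide falls back to Monte Carlo simulation for large scales.
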
  • 10. Actually, AG@k is a Special Case: let G be the similarity of the top k for all queries. AG@k for a single query is the mean of a sample of k documents: (1) take a sample of k documents, mean = $X_1$; (2) take a sample of k documents, mean = $X_2$; ...; (Q) take a sample of k documents, mean = $X_Q$. The mean of sample means, $\bar{X}$, is the mean AG@k over all queries. Central Limit Theorem: as $Q \to \infty$, $\bar{X}$ approximates a normal distribution regardless of the distribution of G.
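
The sampling argument can be illustrated with a small simulation. Everything here (uniform gains, the seed, the function names) is an assumption for the sake of the example; it only shows that the mean of sample means concentrates around the population mean with the spread the Central Limit Theorem predicts.

```python
import math
import random
from statistics import mean, pstdev

def mean_ag_samples(n_queries, k, levels, n_repeats, seed=0):
    """Simulate the mean AG@k over n_queries queries, n_repeats times."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_repeats):
        # One AG@k per query: the mean gain of k randomly drawn documents.
        per_query = [mean(rng.choice(levels) for _ in range(k))
                     for _ in range(n_queries)]
        samples.append(mean(per_query))
    return samples

def theoretical_sd(levels, k, n_queries):
    """Standard deviation of the mean AG@k if gains are i.i.d. uniform on levels."""
    mu = sum(levels) / len(levels)
    var = sum((l - mu) ** 2 for l in levels) / len(levels)
    return math.sqrt(var / (k * n_queries))
```
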
  • 11. AG@k is Normally Distributed: use the normal cumulative distribution function $\Phi$, $P(\Delta AG@k \le 0) = \Phi\left(\frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}}\right)$. [Figure: density of AG@5 on the BROAD scale (0 to 2) and on the FINE scale (0 to 100).]
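
Under the normal approximation, the probability that the difference is not positive only needs the standard normal CDF, which can be written with `math.erf`. Reading "confidence" as the larger of $P(\Delta AG@k \le 0)$ and its complement is an assumption of this sketch, chosen to agree with the stopping rule $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_delta_le_zero(mean, var):
    """P(dAG@k <= 0) = Phi(-E[dAG@k] / sqrt(Var[dAG@k])); requires var > 0."""
    return normal_cdf(-mean / math.sqrt(var))

def confidence(mean, var):
    """Assumed reading: confidence in the sign of the difference."""
    p = prob_delta_le_zero(mean, var)
    return max(p, 1.0 - p)
```
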
  • 12. Confidence as a Function of # Judgments. [Figure: confidence in the ranking of systems (50 to 100) against percent of judgments (0 to 100). Annotations: we can stop judging, or keep judging to be really confident, or waste our time.] What documents should we judge? Those that maximize the confidence.
  • 13. The Trick: documents retrieved by both systems are useless, so there is no need to judge them: whatever $G_i$ is, it is added and then subtracted. Comparing Several Systems: compute a weight $w_i$ for each query-document and judge the document with the largest effect. $w_i$ in the Original MTC: $w_i$ = largest weight across system pairs, which reduces to the # of system pairs affected by query-doc $i$.
  • 14. $w_i$ Dependent on Confidence: if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight. $w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot (I(A_i \le k) - I(B_i \le k))^2$: the weight is inversely proportional to confidence, and the sum iterates system pairs with low confidence. Better results than traditional weights.
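
A possible sketch of the confidence-dependent weight is shown below. The data layout, the function name and the `threshold` that decides which pairs count as already reliable (the set $\mathcal{R}$) are assumptions of this example, not taken from the slides.

```python
from itertools import combinations

def document_weights(top_k, confidence, threshold=0.95):
    """Weight of each query-document, summed over low-confidence system pairs.

    top_k: dict mapping system name -> set of (query, doc) items in its top k.
    confidence: dict mapping (A, B), with A < B, -> current confidence C_{A,B}.
    threshold: assumed cut-off above which a pair is treated as reliable.
    """
    weights = {}
    for a, b in combinations(sorted(top_k), 2):
        c = confidence[(a, b)]
        if c >= threshold:
            continue                      # pair already reliable: contributes nothing
        # (I(A_i<=k) - I(B_i<=k))^2 is 1 only for items retrieved by exactly one system
        for item in top_k[a] ^ top_k[b]:
            weights[item] = weights.get(item, 0.0) + (1.0 - c)
    return weights
```

The next document to judge would then be `max(weights, key=weights.get)` among the unjudged items.
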
  • 15. MTC for AMS with AG@k
  • 16. MTC for ΔAG@k:
       while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)
           $i^* \leftarrow \arg\max_i w_i$    (select the best document from all unjudged query-documents)
           judge query-document $i^*$ (obtain true $gain_{i^*}$)
           $E[G_{i^*}] \leftarrow gain_{i^*}$, $Var[G_{i^*}] \leftarrow 0$    (update: increase confidence)
       end while
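
The loop on this slide could be sketched as follows for the BROAD scale. This is an illustration under several assumptions: a uniform prior for unjudged gains, independent gains (so variances add), confidence read as $\max(p, 1 - p)$ with $p = P(\Delta AG@k \le 0)$, and a hypothetical `oracle` callable standing in for the human assessor.

```python
import math
from itertools import combinations

LEVELS = (0, 1, 2)                                   # BROAD scale
PRIOR_MEAN = sum(LEVELS) / len(LEVELS)               # assumed uniform prior
PRIOR_VAR = sum((l - PRIOR_MEAN) ** 2 for l in LEVELS) / len(LEVELS)

def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def differing(runs, a, b, q, k):
    """Items retrieved in the top k by exactly one of the two systems."""
    top_a, top_b = set(runs[a][q][:k]), set(runs[b][q][:k])
    return [(doc, 1 if doc in top_a else -1) for doc in top_a ^ top_b]

def confidence(runs, a, b, judged, k):
    n_queries = len(runs[a])
    mean = var = 0.0
    for q in range(n_queries):
        for doc, sign in differing(runs, a, b, q, k):
            if (q, doc) in judged:
                mean += sign * judged[(q, doc)]      # known gain, zero variance
            else:
                mean += sign * PRIOR_MEAN
                var += PRIOR_VAR
    mean /= k * n_queries
    var /= (k * n_queries) ** 2
    if var == 0.0:
        return 1.0                                   # nothing left to learn
    p = phi(-mean / math.sqrt(var))                  # P(dAG@k <= 0)
    return max(p, 1.0 - p)

def mtc(runs, oracle, k, alpha=0.05):
    """runs: system -> list (one per query) of ranked doc ids. Returns judgments and confidences."""
    pairs = list(combinations(sorted(runs), 2))
    judged = {}
    while True:
        conf = {(a, b): confidence(runs, a, b, judged, k) for a, b in pairs}
        if sum(conf.values()) / len(pairs) > 1 - alpha:
            break                                    # average confidence is high enough
        weights = {}
        for (a, b), c in conf.items():
            for q in range(len(runs[a])):
                for doc, _ in differing(runs, a, b, q, k):
                    if (q, doc) not in judged:
                        weights[(q, doc)] = weights.get((q, doc), 0.0) + (1.0 - c)
        if not weights:
            break
        best = max(weights, key=weights.get)         # i* <- argmax_i w_i
        judged[best] = oracle(best)                  # obtain the true gain
    return judged, conf
```

In a two-system example sharing one document in the top 2, the shared document is never sent to the oracle, and a looser `alpha` stops the loop after a single judgment.
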
  • 18. Why MIREX 2011: largest edition so far, with 18 systems (153 pairwise comparisons), 100 queries and 6,322 judgments. Distribution of $G_i$: let us work with a uniform distribution for now.
  • 19-21. Confidence as Judgments are Made. Correct bins: the estimated sign is correct, or not significant anyway.
  • 23. Accuracy as Judgments are Made estimated bins always better than expected
  • 24. Accuracy as Judgments are Made estimated signs highly correlated with confidence
  • 25. Accuracy as Judgments are Made rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
  • 26. high confidence and high accuracy with considerably less effort
  • 27. Statistical Significance: MTC allows us to accurately estimate the ranking, but for the current set of queries. Can we generalize to a general set of queries? Not Trivial: we have the variance of the estimates, but not the sample variance.
  • 28-30. Work with Upper and Lower Bounds of ΔAG@k. Upper bound: best case for A. Lower bound: best case for B. For the upper bound:
       $\Delta AG@k = \frac{1}{k} \sum_{i \in \pi} G_i \cdot (I(A_i \le k) - I(B_i \le k)) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$
     The first sum covers the known judgments ($\pi$). The second sum covers unknown judgments ($\bar{\pi}$) retrieved by A, which get the best similarity score $l^{+}$. The third sum covers unknown judgments retrieved by B but not by A, which get the worst similarity score $l^{-}$. (*Same for the lower bound.)
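
The three terms of the upper bound can be sketched for a single query as below. The lower bound is obtained here by swapping the roles of A and B, which is an assumed reading of "same for the lower bound"; the function name and data layout are also hypothetical.

```python
def delta_bounds(rank_a, rank_b, judged, k, l_best, l_worst=0):
    """Return (upper, lower) bounds of dAG@k for one query.

    rank_a, rank_b: dict mapping doc id -> rank (1-based) in each system.
    judged: dict mapping doc id -> known gain.
    l_best, l_worst: best and worst levels of the similarity scale.
    """
    upper = lower = 0.0
    for doc in set(rank_a) | set(rank_b):
        in_a = rank_a.get(doc, k + 1) <= k
        in_b = rank_b.get(doc, k + 1) <= k
        if doc in judged:
            term = judged[doc] * (in_a - in_b)       # known judgments
            upper += term
            lower += term
            continue
        # Upper bound (best case for A), following the three terms as transcribed.
        # Note: an unjudged document retrieved by both systems still adds l_best.
        if in_a:
            upper += l_best
        if in_b and not in_a:
            upper -= l_worst
        # Lower bound (best case for B), assumed to mirror the upper bound.
        if in_b:
            lower -= l_best
        if in_a and not in_b:
            lower += l_worst
    return upper / k, lower / k
```
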
  • 31. 3 Rules: (1) Assume the best case for A (upper bound): if A <<< B, then conclude A <<< B. (2) Assume the best case for B (lower bound): if B <<< A, then conclude B <<< A. (3) If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, then conclude they are not significantly different. Problem: upper and lower bounds are very unrealistic.
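
One way to read the three rules in code is sketched below, using a one-sample t statistic on the per-query bounds. The choice of test, the caller-supplied critical value and the extra "undecided" outcome (when no rule applies) are assumptions of this example.

```python
import math
from statistics import mean, stdev

def t_statistic(deltas):
    """One-sample t statistic of per-query differences against zero.

    Requires at least two queries and a non-zero standard deviation.
    """
    return mean(deltas) / (stdev(deltas) / math.sqrt(len(deltas)))

def three_rules(upper, lower, critical):
    """upper, lower: per-query upper and lower bounds of dAG@k (A minus B)."""
    t_up, t_lo = t_statistic(upper), t_statistic(lower)
    if t_up < -critical:
        return "A < B"             # rule 1: A is worse even in its best case
    if t_lo > critical:
        return "B < A"             # rule 2: B is worse even in its best case
    if t_up <= critical and t_lo >= -critical:
        return "not significant"   # rule 3: neither best case reaches significance
    return "undecided"             # assumed label: no rule applies yet
```

With 5 queries, a two-sided 5% test has 4 degrees of freedom and a critical value of about 2.776.
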
  • 32. Incorporate a Heuristic: (4) If the estimated difference is larger than t, naively conclude significance. Choose t Based on Power Analysis: t = effect-size detectable by a t-test with sample variance σ² = 0.0615 and sample size n = 100 (from previous MIREX editions), Type I Error rate α = 0.05 and Type II Error rate β = 0.15 (typical values). t ≈ 0.067.
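
The threshold can be roughly reproduced with a normal approximation to the power analysis. The one-sided form of the test is an assumption of this sketch (the slide does not say); under it the approximation gives about 0.0665, close to the slide's t ≈ 0.067, and the small gap is consistent with using t quantiles instead of normal ones.

```python
from math import sqrt
from statistics import NormalDist

def detectable_effect(variance, n, alpha, beta):
    """Smallest mean difference detectable with power 1 - beta.

    Normal approximation, one-sided test (assumed):
    (z_{1-alpha} + z_{1-beta}) * sqrt(variance / n).
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta)) * sqrt(variance / n)

def naively_significant(estimated_delta, t):
    """Rule 4: conclude significance when the estimated difference exceeds t."""
    return abs(estimated_delta) > t
```
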
  • 33. Accuracy of the Significance Estimates pretty good around 95% confidence rule 4 (heuristic) ends up overestimating significance
  • 34. Accuracy of the Significance Estimates rules 1 to 3 begin to apply and correct overestimations rule 4 (heuristic) ends up overestimating significance
  • 35. Accuracy of the Significance Estimates closer to expected never under 90%
  • 36. significance can be estimated fairly well too
  • 38. Introduce MTC to the MIR folks. Work out the Math for MTC with AG@k. See How Well it would have Done in AMS 2011: quite well actually!
  • 40. Learn the true Distribution of Similarity Judgments: it's clearly not uniform; would give more accurate estimates with less effort; use previous AMS data or fit a model as we judge. Significance Testing with Incomplete Judgments: best-case scenarios are very unrealistic. Study Low-Cost Methodologies for other MIR Tasks.
  • 42. MTC Greatly Reduces the Effort for AMS (and SMS): have MIREX volunteers incrementally create brand new test collections for other tasks. Better Yet: study low-cost methodologies for the other tasks. Not Only for MIREX: private collections for in-house evaluations have no possibility of gathering large pools of annotators, so low-cost becomes paramount.
  • 43. the MIR community needs a paradigm shift from a priori to a posteriori evaluation methods to reduce cost and gain reliability