Practical and Effective Design
                  of a Crowdsourcing Task for
               Unconventional Relevance Judging
                             Julián Urbano @julian_urbano
         Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
                                  University Carlos III of Madrid




                                                                                TREC 2011
Picture by Michael Dornbierer                              Gaithersburg, USA · November 18th
Task I

Crowdsourcing Individual
  Relevance Judgments
In a Nutshell
• Amazon Mechanical Turk, External HITs
• All 5 documents per set in a single HIT = 435 HITs
• $0.20 per HIT = $0.04 per document
ran out of time                    graded      slider      hterms
Hours to complete                  8.5         38          20.5
HITs submitted (overhead)          438 (+1%)   535 (+23%)  448 (+3%)
Submitted workers (just preview)   29 (102)    83 (383)    30 (163)
Average documents per worker       76          32          75
Total cost (including fees)        $95.7       $95.7       $95.7
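The figures above can be checked with a little arithmetic. The sketch below assumes the total cost is computed on the 435 planned HITs and that Amazon's fee is 10% of the rewards; the fee rate is not stated on the slide, it is simply the rate that reproduces the $95.7 total.

```python
# Cost and overhead check for one run.
# FEE_RATE is an assumption: the slide only gives the total including fees.
HITS = 435              # one HIT per set of documents
DOCS_PER_HIT = 5
REWARD_PER_HIT = 0.20   # dollars
FEE_RATE = 0.10         # assumed Amazon commission


def run_cost(hits=HITS, reward=REWARD_PER_HIT, fee_rate=FEE_RATE):
    """Total cost in dollars: rewards plus fees."""
    return hits * reward * (1 + fee_rate)


def overhead(submitted, planned=HITS):
    """Extra HITs submitted, as a fraction of the planned HITs."""
    return (submitted - planned) / planned


print(f"per document: ${REWARD_PER_HIT / DOCS_PER_HIT:.2f}")
print(f"total: ${run_cost():.2f}")
for run, submitted in [("graded", 438), ("slider", 535), ("hterms", 448)]:
    print(f"{run}: +{overhead(submitted):.0%}")
```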
Document Preprocessing
• Ensure smooth loading and safe rendering
  – Null hyperlinks
  – Embed all external resources
  – Remove CSS unrelated to style or layout
  – Remove unsafe HTML elements
  – Remove irrelevant HTML attributes
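A minimal sketch of part of this preprocessing, using only the Python standard library. It covers null hyperlinks, unsafe elements and event-handler attributes; embedding external resources and filtering CSS are left out. The list of unsafe tags and the choice of "#" as the null link are assumptions, since the slides do not give them. Comments and doctype declarations are silently dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumed list: the slides do not say which elements were treated as unsafe.
UNSAFE_TAGS = {"script", "iframe", "object", "applet"}


class Sanitizer(HTMLParser):
    """Rebuilds a page while dropping unsafe elements and nulling hyperlinks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0     # > 0 while inside an unsafe element
        self.in_style = False   # CSS text is copied verbatim, not escaped

    def handle_starttag(self, tag, attrs):
        if tag in UNSAFE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == "style":
            self.in_style = True
        kept = []
        for name, value in attrs:
            if name.startswith("on"):       # event handlers such as onclick
                continue
            if tag == "a" and name == "href":
                value = "#"                 # null hyperlink
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join([tag] + kept) + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit only the start tag.
        if tag not in UNSAFE_TAGS:
            self.handle_starttag(tag, attrs)
            self.in_style = False

    def handle_endtag(self, tag):
        if tag in UNSAFE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            if tag == "style":
                self.in_style = False
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.out.append(data if self.in_style else escape(data, quote=False))


def sanitize(html_text):
    """Return a cleaned copy of html_text."""
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = '<p onclick="x()">Hi <a href="http://example.com">link</a><script>alert(1)</script></p>'
print(sanitize(page))
```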




Display Mode (hterms run)
• With images




• Black & white, no images



Display Mode (and II)
• Previous experiment
  – Workers seem to prefer images and colors
  – But some definitely go for just text


• Allow them both, but images by default
• Black and white best with highlighting
  – 7 (24%) workers in graded
  – 21 (25%) in slider
  – 12 (40%) in hterms
HIT Design




Relevance Question
• graded: focus on binary labels



• Binary label
  – Bad = 0, Good = 1
  – Fair: different probabilities? Chose 1 too
• Ranking
  – Order by relevance, then by failures in
    Quality Control and then by time spent
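The label mapping and the three-level ordering can be sketched as a sort with a composite key. The binary mapping comes from the slide; the numeric grades used for ranking and the direction of each tie-breaker (fewer Quality Control failures first, then more time spent first) are assumptions, as are all names.

```python
from dataclasses import dataclass

# Binary mapping from the slide: Bad = 0, Good = 1, Fair = 1 too.
BINARY = {"bad": 0, "fair": 1, "good": 1}
# Assumed numeric grades for ranking (not given on the slide).
GRADE = {"bad": 0, "fair": 1, "good": 2}


@dataclass
class Judgment:
    doc_id: str
    label: str          # "bad", "fair" or "good"
    qc_failures: int    # Quality Control failures in the HIT
    seconds: float      # time spent on the document


def rank(judgments):
    """Order by relevance, then by QC failures, then by time spent.

    Assumed directions: more relevant first, fewer failures first,
    more time spent first.
    """
    return sorted(
        judgments,
        key=lambda j: (-GRADE[j.label], j.qc_failures, -j.seconds),
    )


docs = [
    Judgment("a", "fair", 1, 10.0),
    Judgment("b", "good", 0, 5.0),
    Judgment("c", "fair", 0, 8.0),
    Judgment("d", "fair", 0, 20.0),
]
print([j.doc_id for j in rank(docs)])
print([BINARY[j.label] for j in docs])
```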
Relevance Question (II)
• slider: focus on ranking


• Do not show handle at the beginning
  – Bias
  – Lazy indistinguishable from undecided


• Seemed unclear it was a slider
Relevance Question (III)

[Histogram of slider values: frequency (0 to 700) by slider value (0 to 100)]
Relevance Question (IV)
• Binary label
  – Threshold
  – Normalized between 0 and 100
  – Worker-Normalized (threshold = 0.4)
  – Set-Normalized
  – Set-Normalized Threshold
  – Cluster
• Ranking label
  – Implicit
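One of the options above, worker-normalized thresholding, could look like the sketch below. It assumes min-max rescaling of each worker's slider values to the 0 to 1 range and labels a document relevant when the rescaled value reaches 0.4; the slides give the threshold but not the exact normalization, so the details here are hypothetical.

```python
from collections import defaultdict


def worker_normalize(responses):
    """Rescale each worker's slider values to the 0..1 range.

    responses: list of (worker, doc, slider value in 0..100).
    Returns {(worker, doc): normalized value}.
    """
    by_worker = defaultdict(list)
    for worker, _, value in responses:
        by_worker[worker].append(value)

    normalized = {}
    for worker, doc, value in responses:
        lo, hi = min(by_worker[worker]), max(by_worker[worker])
        # A worker who always gave the same value cannot be rescaled:
        # fall back to the midpoint (an arbitrary choice).
        normalized[(worker, doc)] = 0.5 if hi == lo else (value - lo) / (hi - lo)
    return normalized


def binary_labels(normalized, threshold=0.4):
    """1 (relevant) when the normalized value reaches the threshold, else 0."""
    return {key: int(value >= threshold) for key, value in normalized.items()}


responses = [("w1", "d1", 20), ("w1", "d2", 60), ("w1", "d3", 100), ("w2", "d1", 50)]
print(binary_labels(worker_normalize(responses)))
```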
Relevance Question (and V)
• hterms: focus on ranking, seriously


• Still unclear?
[Two histograms of slider values: frequency (0 to 600) by slider value (0 to 100)]
Quality Control
• Worker Level: demographic filters
• Task Level: additional info/questions
  – Implicit: work time, behavioral patterns
  – Explicit: additional verifiable questions
• Process Level: trap questions, training

• Aggregation Level: consensus from redundancy
QC: Worker Level
• At least 100 total approved HITs
• At least 95% approved HITs
  – 98% in hterms
• Work in 50 HITs at most

• Also tried
  – Country
  – Master Qualifications
QC: Implicit Task Level
• Time spent in each document
  – Images and Text modes together


• Don’t use time reported by Amazon
  – Preview + Work time


• Time failure: less than 4.5 secs
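A small sketch of this check, assuming the task page itself records how many seconds each document was displayed in each mode (the data layout is hypothetical). Time in both modes is added up before comparing against the 4.5 second limit.

```python
TIME_FAILURE_SECS = 4.5   # limit from the slide


def time_spent(mode_seconds):
    """Total seconds on one document, images and text modes together."""
    return sum(mode_seconds.values())


def time_failures(hit):
    """Number of documents in a HIT viewed for less than the limit.

    hit: one dict per document, mapping display mode to seconds.
    """
    return sum(1 for doc in hit if time_spent(doc) < TIME_FAILURE_SECS)


hit = [
    {"images": 3.0, "text": 2.0},   # 5.0 s in total: fine
    {"images": 4.0},                # failure
    {"text": 12.5},
    {"images": 1.0, "text": 1.5},   # failure
    {"images": 30.0},
]
print(time_failures(hit))
```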


QC: Implicit Task Level (and II)


                      Time Spent (secs)
                graded      slider   hterms
        Min       3           3           3
        1st Q    10          14           11
       Median    15          23           19


QC: Explicit Task Level
• There is previous work with Wikipedia
  – Number of images
  – Headings
  – References
  – Paragraphs
• With music / video
  – Approximate song duration


• Impractical with arbitrary Web documents
QC: Explicit Task Level (II)
• Ideas
  – Spot nonsensical but syntactically correct sentences
     “the car bought a computer about eating the sea”
     • Not easy to find the right spot to insert it
     • Too annoying for clearly (non)relevant documents
  – Report what paragraph made them decide
     • Kinda useless without redundancy
     • Might be several answers


• Reading comprehension test
QC: Explicit Task Level (III)
• Previous experiment
  – Give us 5-10 keywords to describe the document
     • 4 AMT runs with different demographics
     • 4 faculty members
  – Nearly always gave the top 1-2 most frequent terms
     • Stemming and removing stop words


• Offered two sets of 5 keywords,
  choose the one better describing the document
QC: Explicit Task Level (and IV)
• Correct
  – 3 most frequent + 2 in the next 5
• Incorrect
  – 5 in the 25 least frequent


• Shuffle and random picks

• Keyword failure: chose the incorrect terms
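The construction of the two keyword sets can be sketched as follows. This is one reading of the slide: the correct set holds the 3 most frequent terms plus 2 picked at random from the next 5, the incorrect set holds 5 picked at random from the 25 least frequent terms, and both are shuffled. The stop word list is a tiny placeholder and stemming is omitted; both are assumptions of this sketch.

```python
import random
import re
from collections import Counter

# Tiny placeholder list; a real stop word list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that"}


def term_ranking(text):
    """Terms from most to least frequent, stop words removed (no stemming)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [term for term, _ in Counter(words).most_common()]


def keyword_sets(text, rng=random):
    """Return (correct, incorrect) sets of 5 keywords each."""
    ranked = term_ranking(text)
    if len(ranked) < 33:
        # The top 8 terms and the 25 least frequent ones must not overlap.
        raise ValueError("document too short for this scheme")
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    incorrect = rng.sample(ranked[-25:], 5)
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return correct, incorrect
```

A worker who picks the second set commits a keyword failure. Passing a seeded `random.Random` instance as `rng` makes the picks reproducible.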
QC: Process Level
• Previous NIST judgments as trap questions?

• No
  – Need previous judgments
  – Not expected to be balanced
  – Overhead cost
  – More complex process
  – Do not tell anything about non-trap examples

Reject Work and Block Workers
• Limit the number of failures in QC

Action         Failure    graded     slider      hterms
Reject HIT     Keyword    1          0           1
               Time       2          1           1
Block Worker   Keyword    1          1           1
               Time       2          1           1
Total HITs rejected       3 (1%)     100 (23%)   13 (3%)
Total Workers blocked     0 (0%)     40 (48%)    4 (13%)
Workers by Country
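Prev = Preview, Acc = Accept, Rej = Reject; "-" marks an empty cell.

                     |            graded             |            slider             |            hterms
Country              | Prev  Acc  Rej   %P   %A   %R | Prev  Acc  Rej   %P   %A   %R | Prev  Acc  Rej   %P   %A   %R
Australia            |    -    -    -    -    -    - |    -    -    -    -    -    - |    8    -    - 100%    -    -
Bangladesh           |    -    -    -    -    -    - |   15    3    2  75%  60%  40% |    -    -    -    -    -    -
Belgium              |    -    -    -    -    -    - |    -    -    -    -    -    - |    2    -    - 100%    -    -
Canada               |    2    -    - 100%    -    - |    1    -    - 100%    -    - |    1    -    - 100%    -    -
Croatia              |    -    -    -    -    -    - |    4    1    -  80% 100%   0% |    -    -    -    -    -    -
Egypt                |    -    -    -    -    -    - |    -    -    -    -    -    - |    1    -    - 100%    -    -
Finland              |   11   50    1  18%  98%   2% |    4    -    - 100%    -    - |    8   43    1  15%  98%   2%
France               |    -    -    -    -    -    - |    9   24    2  26%  92%   8% |    -    -    -    -    -    -
Germany              |    -    -    -    -    -    - |    1    -    - 100%    -    - |    -    -    -    -    -    -
Guatemala            |    -    -    -    -    -    - |    -    -    -    -    -    - |    1    -    - 100%    -    -
India                |  236  214    1  52% 100%   0% |  543  235   63  65%  79%  21% |  235  190    7  54%  96%   4%
Indonesia            |    -    -    -    -    -    - |    2    -    - 100%    -    - |    -    -    -    -    -    -
Jamaica              |    -    -    -    -    -    - |    8    5    -  62% 100%   0% |    -    -    -    -    -    -
Japan                |    -    -    -    -    -    - |    6    -    - 100%    -    - |    2    -    - 100%    -    -
Kenya                |    -    -    -    -    -    - |    2    -    - 100%    -    - |    -    -    -    -    -    -
Lebanon              |    -    -    -    -    -    - |    3    -    - 100%    -    - |    -    -    -    -    -    -
Lithuania            |    -    -    -    -    -    - |    -    -    -    -    -    - |    6    1    -  86% 100%   0%
Macedonia            |    -    -    -    -    -    - |    1    -    - 100%    -    - |    -    -    -    -    -    -
Moldova              |    -    -    -    -    -    - |    1    -    - 100%    -    - |    -    -    -    -    -    -
Netherlands          |    -    -    -    -    -    - |    -    -    -    -    -    - |    1    -    - 100%    -    -
Pakistan             |    1    -    - 100%    -    - |    4    1    -  80% 100%   0% |    1    -    - 100%    -    -
Philippines          |    -    -    -    -    -    - |    8    -    - 100%    -    - |    3    -    - 100%    -    -
Poland               |    -    -    -    -    -    - |    1    -    - 100%    -    - |    -    -    -    -    -    -
Portugal             |    -    -    -    -    -    - |    -    -    -    -    -    - |    2    -    - 100%    -    -
Romania              |    1    -    - 100%    -    - |    1    -    - 100%    -    - |    5    -    - 100%    -    -
Saudi Arabia         |    3    1    -  75% 100%   0% |    4    4    -  50% 100%   0% |    -    -    -    -    -    -
Slovenia             |    2    1    -  67% 100%   0% |    3    1    -  75% 100%   0% |    8    2    -  80% 100%   0%
Spain                |   16    -    - 100%    -    - |   15    -    - 100%    -    - |    9    -    - 100%    -    -
Switzerland          |    1    -    - 100%    -    - |    7    -    - 100%    -    - |    -    -    -    -    -    -
United Arab Emirates |    -    -    -    -    -    - |   27   12    1  68%  92%   8% |   35    9    -  80% 100%   0%
United Kingdom       |    8    3    -  73% 100%   0% |   18   18    2  47%  90%  10% |    8   16    -  33% 100%   0%
United States        |  246  166    1  60%  99%   1% |  381  110   28  73%  80%  20% |  242  174    5  57%  97%   3%
Yugoslavia           |    -    -    -    -    -    - |   17   14    2  52%  88%  13% |    1    -    - 100%    -    -
Total (average %)    |  527  435    3  77%  99%   1% | 1086  428  100  83%  90%  10% |  579  435   13  85%  99%   1%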
Results (unofficial)
                       Consensus Truth
             Acc    Rec Prec Spec AP           NDCG
    Median .761    .752 .789 .700 .798         .831
    graded .667    .702 .742 .651 .731         .785
     slider .659   .678 .710 .632 .778         .819
    hterms .725    .725 .781 .726 .818         .846
                           NIST Truth
             Acc    Rec   Prec Spec      AP    NDCG
    Median .623    .729   .773 .536     .931   .922
    graded .748    .802   .841 .632     .922   .958
     slider .690   .720   .821 .607     .889   .935
    hterms .731    .737   .857 .728     .894   .932
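For reference, the four label-based columns (Acc, Rec, Prec, Spec) follow the standard definitions of accuracy, recall, precision and specificity over binary labels. The sketch below computes them from a list of truth labels and a list of judged labels; the sample data is made up, and the ranking measures AP and NDCG are not covered.

```python
def binary_metrics(truth, predicted):
    """Accuracy, recall, precision and specificity for 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "Acc": ratio(tp + tn, len(pairs)),
        "Rec": ratio(tp, tp + fn),
        "Prec": ratio(tp, tp + fp),
        "Spec": ratio(tn, tn + fp),
    }


# Made-up labels, only to show the call.
truth = [1, 1, 1, 0, 0, 1]
judged = [1, 0, 1, 0, 1, 1]
print(binary_metrics(truth, judged))
```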
Task II

Aggregating Multiple
Relevance Judgments
Good and Bad Workers
• Bad ones in politics might still be good in sports

• Topic categories to distinguish
  – Type: Closed, limited, navigational, open-ended, etc.
  – Subject: politics, people, shopping, etc.
  – Rareness: topic keywords in Wordnet?
  – Readability: Flesch test
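Assuming "Flesch test" refers to the Flesch Reading Ease score, readability can be computed from three counts. How words, sentences and syllables are counted is left out here; the function only applies the published formula.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# 100 words in 5 sentences with 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))
```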
GetAnotherLabel
• Input
  – Some known labels
  – Worker responses
• Output
  – Expected label of unknowns
  – Expected quality for each worker
  – Confusion matrix for each worker



Step-Wise GetAnotherLabel
• For each worker wi compute expected quality qi
  on all topics and quality qij on each topic type tj.
• For topics in tj, use only workers with qij>qi

• We didn’t use all known labels by good workers
  to compute their expected quality (and final
  label), but only labels in the topic category

• Rareness seemed to work slightly better
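The worker selection step can be sketched as a filter over the quality estimates. The estimates themselves would come from GetAnotherLabel; here they are plain dictionaries with made-up values, and all names are hypothetical.

```python
def select_workers(overall_quality, topic_quality):
    """Keep, for each topic type, the workers that do better there than overall.

    overall_quality: {worker: q_i}
    topic_quality:   {topic_type: {worker: q_ij}}
    Returns {topic_type: set of workers with q_ij > q_i}.
    """
    return {
        topic_type: {
            worker
            for worker, q_ij in per_worker.items()
            if q_ij > overall_quality[worker]
        }
        for topic_type, per_worker in topic_quality.items()
    }


# Made-up quality estimates, only to show the call.
overall = {"w1": 0.8, "w2": 0.6}
by_topic = {
    "rare": {"w1": 0.7, "w2": 0.9},
    "common": {"w1": 0.85, "w2": 0.6},
}
print(select_workers(overall, by_topic))
```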
Train Rule and SVM Models
• Relevant-to-nonrelevant ratio
  – Unbiased majority voting
• For all workers, average correct-to-incorrect
  ratio when saying relevant/nonrelevant
• For all workers, average posterior probability of
  relevant/nonrelevant
  – Based on the confusion matrix from
    GetAnotherLabel
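The first feature can be sketched as below. The add-one smoothing, which avoids dividing by zero when nobody votes nonrelevant, and the tie handling in the vote are assumptions; the slides do not say how either case was handled.

```python
def vote_ratio(votes, smoothing=1.0):
    """Relevant-to-nonrelevant ratio for one document.

    votes: list of 0/1 worker labels. Smoothing is an assumption of this sketch.
    """
    relevant = sum(votes)
    nonrelevant = len(votes) - relevant
    return (relevant + smoothing) / (nonrelevant + smoothing)


def majority_vote(votes):
    """1 if relevant votes outnumber nonrelevant ones, else 0.

    Ties go to nonrelevant, which is an arbitrary choice.
    """
    return int(vote_ratio(votes) > 1.0)


print(vote_ratio([1, 1, 0]), majority_vote([1, 1, 0]))
```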
Results (unofficial)
                       Consensus Truth
             Acc    Rec Prec Spec AP           NDCG
    Median .811    .818 .830 .716 .806         .921
     rule .854     .818 .943 .915 .791         .904
     svm .847      .798 .953 .931 .855         .958
    wordnet .630   .663 .730 .574 .698         .823
                           NIST Truth
             Acc    Rec   Prec Spec      AP    NDCG
    Median .640    .754   .625 .560     .111   .359
     rule .699     .754   .679 .644     .166   .415
     svm .714      .750   .700 .678     .082   .331
    wordnet .571   .659   .560 .484     .060   .299
Sum Up
• Really work the task design
  – “Make it simple, but not simpler” (A. Einstein)
  – Make sure they understand it before scaling up
• Find good QC methods at the explicit task level
  for arbitrary Web pages
  – Was our question too obvious?
• Pretty decent judgments compared to NIST’s
• Look at the whole picture: system rankings
• Study long-term reliability of Crowdsourcing
  – You can’t prove God doesn’t exist
  – You can’t prove Crowdsourcing works

More Related Content

Viewers also liked

Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsJulián Urbano
 
Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Using the Shape of Music to Compute the similarity between Symbolic Musical P...Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Using the Shape of Music to Compute the similarity between Symbolic Musical P...Julián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationRichard Diamond
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondRichard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondRichard Diamond
 

Viewers also liked (11)

Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 
Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Using the Shape of Music to Compute the similarity between Symbolic Musical P...Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Using the Shape of Music to Compute the similarity between Symbolic Musical P...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard Diamond
 

Similar to The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper

Process Improvement survey sample
Process Improvement survey sampleProcess Improvement survey sample
Process Improvement survey sampleMel Parrish
 
Chattanooga sme oee down time presentation
Chattanooga sme oee down time presentationChattanooga sme oee down time presentation
Chattanooga sme oee down time presentationJames Mansfield
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesMiguel Araújo
 
Полезные метрики покрытия. Практический опыт и немного теории
Полезные метрики покрытия. Практический опыт и немного теорииПолезные метрики покрытия. Практический опыт и немного теории
Полезные метрики покрытия. Практический опыт и немного теорииSQALab
 
12251984 pss7
12251984 pss712251984 pss7
12251984 pss712251984
 
Consumer Awareness of Privacy Issues on the Internet [Report]
Consumer Awareness of Privacy Issues on the Internet [Report]Consumer Awareness of Privacy Issues on the Internet [Report]
Consumer Awareness of Privacy Issues on the Internet [Report]Social Samosa
 
Machine Learning Training in Phagwara
Machine Learning Training in PhagwaraMachine Learning Training in Phagwara
Machine Learning Training in PhagwaraE2MATRIX
 
Machine Learning Training in Chandigarh
Machine Learning Training in ChandigarhMachine Learning Training in Chandigarh
Machine Learning Training in ChandigarhE2MATRIX
 
Machine Learning Training in Ludhiana
Machine Learning Training in LudhianaMachine Learning Training in Ludhiana
Machine Learning Training in LudhianaE2MATRIX
 
Machine Learning Training in Mohali
Machine Learning Training in MohaliMachine Learning Training in Mohali
Machine Learning Training in MohaliE2MATRIX
 
Machine Learning Training in Jalandhar
Machine Learning Training in JalandharMachine Learning Training in Jalandhar
Machine Learning Training in JalandharE2MATRIX
 
IHC 2011 - Widgets Internship
IHC 2011 - Widgets InternshipIHC 2011 - Widgets Internship
IHC 2011 - Widgets InternshipEduardo Oliveira
 
Machine Learning Training in Amritsar
Machine Learning Training in AmritsarMachine Learning Training in Amritsar
Machine Learning Training in AmritsarE2MATRIX
 
Software Development And Delivery Metrics That Matter
Software Development And Delivery Metrics That MatterSoftware Development And Delivery Metrics That Matter
Software Development And Delivery Metrics That MatterWilliam Simms
 

Similar to The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper (20)

SQC Guest Lecture- Starbucks
SQC Guest Lecture- StarbucksSQC Guest Lecture- Starbucks
SQC Guest Lecture- Starbucks
 
Self-Adaptation of Online Recommender Systems via Feed-Forward Controllers
Self-Adaptation of Online Recommender Systems via Feed-Forward ControllersSelf-Adaptation of Online Recommender Systems via Feed-Forward Controllers
Self-Adaptation of Online Recommender Systems via Feed-Forward Controllers
 
Process Improvement survey sample
Process Improvement survey sampleProcess Improvement survey sample
Process Improvement survey sample
 
Chattanooga sme oee down time presentation
Chattanooga sme oee down time presentationChattanooga sme oee down time presentation
Chattanooga sme oee down time presentation
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated Databases
 
Полезные метрики покрытия. Практический опыт и немного теории
Полезные метрики покрытия. Практический опыт и немного теорииПолезные метрики покрытия. Практический опыт и немного теории
Полезные метрики покрытия. Практический опыт и немного теории
 
12251488 pss7
12251488 pss712251488 pss7
12251488 pss7
 
12251488 pss7
12251488 pss712251488 pss7
12251488 pss7
 
12251984 pss7
12251984 pss712251984 pss7
12251984 pss7
 
Consumer Awareness of Privacy Issues on the Internet [Report]
Consumer Awareness of Privacy Issues on the Internet [Report]Consumer Awareness of Privacy Issues on the Internet [Report]
Consumer Awareness of Privacy Issues on the Internet [Report]
 
Machine Learning Training in Phagwara
Machine Learning Training in PhagwaraMachine Learning Training in Phagwara
Machine Learning Training in Phagwara
 
Machine Learning Training in Chandigarh
Machine Learning Training in ChandigarhMachine Learning Training in Chandigarh
Machine Learning Training in Chandigarh
 
Machine Learning Training in Ludhiana
Machine Learning Training in LudhianaMachine Learning Training in Ludhiana
Machine Learning Training in Ludhiana
 
Machine Learning Training in Mohali
Machine Learning Training in MohaliMachine Learning Training in Mohali
Machine Learning Training in Mohali
 
Machine Learning Training in Jalandhar
Machine Learning Training in JalandharMachine Learning Training in Jalandhar
Machine Learning Training in Jalandhar
 
IHC 2011 - Widgets Internship
IHC 2011 - Widgets InternshipIHC 2011 - Widgets Internship
IHC 2011 - Widgets Internship
 
Machine Learning Training in Amritsar
Machine Learning Training in AmritsarMachine Learning Training in Amritsar
Machine Learning Training in Amritsar
 
Software Development And Delivery Metrics That Matter
Software Development And Delivery Metrics That MatterSoftware Development And Delivery Metrics That Matter
Software Development And Delivery Metrics That Matter
 
Staisticsii
StaisticsiiStaisticsii
Staisticsii
 
Quality management
Quality managementQuality management
Quality management
 

More from Julián Urbano

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...Julián Urbano
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...Julián Urbano
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 

More from Julián Urbano (10)

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 

The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper

  • 1. Practical and Effective Design of a Crowdsourcing Task for Unconventional Relevance Judging Julián Urbano @julian_urbano Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns University Carlos III of Madrid TREC 2011 Picture by Michael Dornbierer Gaithersburg, USA · November 18th
  • 2. Task I Crowdsourcing Individual Relevance Judgments
  • 3. In a Nutshell • Amazon Mechanical Turk, External HITs • All 5 documents per set in a single HIT = 435 HITs • $0.20 per HIT = $0.04 per document
      ran out of time                     graded      slider       hterms
      Hours to complete                   8.5         38           20.5
      HITs submitted (overhead)           438 (+1%)   535 (+23%)   448 (+3%)
      Submitted workers (just preview)    29 (102)    83 (383)     30 (163)
      Average documents per worker        76          32           75
      Total cost (including fees)         $95.7       $95.7        $95.7
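The arithmetic behind the "In a Nutshell" figures can be checked with a short sketch. The 10% platform fee (`FEE_RATE`) is an assumption, not stated on the slide; it is chosen because it reproduces the reported $95.7 total. The `overhead` helper is a hypothetical name for the percentage of HITs submitted beyond the 435 needed.

```python
HITS = 435            # one HIT per set of 5 documents (from the slide)
PRICE_PER_HIT = 0.20  # dollars (from the slide)
DOCS_PER_HIT = 5
FEE_RATE = 0.10       # ASSUMPTION: platform fee, not stated on the slide

price_per_doc = PRICE_PER_HIT / DOCS_PER_HIT
total_cost = HITS * PRICE_PER_HIT * (1 + FEE_RATE)


def overhead(submitted, needed=HITS):
    """Fraction of HITs submitted beyond the number actually needed."""
    return (submitted - needed) / needed


print(f"${price_per_doc:.2f} per document, ${total_cost:.1f} in total")
for run, submitted in [("graded", 438), ("slider", 535), ("hterms", 448)]:
    print(f"{run}: +{overhead(submitted):.0%}")
```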
  • 4. Document Preprocessing • Ensure smooth loading and safe rendering – Null hyperlinks – Embed all external resources – Remove CSS unrelated to style or layout – Remove unsafe HTML elements – Remove irrelevant HTML attributes
  • 5. Display Mode hterms run • With images • Black & white, no images
  • 6. Display Mode (and II) • Previous experiment – Workers seem to prefer images and colors – But some definitely go for just text • Allow them both, but images by default • Black and white best with highlighting – 7 (24%) workers in graded – 21 (25%) in slider – 12 (40%) in hterms
  • 8. Relevance Question • graded: focus on binary labels • Binary label – Bad = 0, Good = 1 – Fair: different probabilities? Chose 1 too • Ranking – Order by relevance, then by failures in Quality Control and then by time spent
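The three-level ordering for the graded run can be sketched as a single sort with a composite key. Everything beyond the slide's wording is an assumption here: the `Judgment` record and its field names are hypothetical, the grade scale (Bad 0, Fair 1, Good 2) is illustrative, and the sort directions (higher grade first, fewer QC failures first, longer time first) are a guess at the intended tie-breaking.

```python
from typing import NamedTuple

# Binary mapping from the slide: Bad = 0, Good = 1, and Fair was mapped to 1 too.
BINARY = {"bad": 0, "fair": 1, "good": 1}
# ASSUMPTION: an ordinal scale used only for ranking.
GRADE = {"bad": 0, "fair": 1, "good": 2}


class Judgment(NamedTuple):
    doc_id: str
    answer: str         # "bad", "fair" or "good"
    qc_failures: int    # failures in Quality Control for this judgment
    seconds: float      # time spent on the document


def rank_documents(judgments):
    """Order by relevance, then by QC failures, then by time spent."""
    return sorted(
        judgments,
        key=lambda j: (-GRADE[j.answer], j.qc_failures, -j.seconds),
    )


docs = [
    Judgment("a", "fair", 0, 10.0),
    Judgment("b", "good", 1, 5.0),
    Judgment("c", "good", 0, 3.0),
    Judgment("d", "good", 0, 8.0),
]
ranking = [j.doc_id for j in rank_documents(docs)]
```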
  • 9. Relevance Question (II) • slider: focus on ranking • Do not show handle at the beginning – Bias – Lazy indistinguishable from undecided • Seemed unclear it was a slider
  • 10. Relevance Question (III) [Histogram: frequency of slider values, from 0 to 100]
  • 11. Relevance Question (IV) • Binary label – Threshold – Normalized between 0 and 100 – Worker-Normalized threshold = 0.4 – Set-Normalized – Set-Normalized Threshold – Cluster • Ranking label – Implicit
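One way to read the worker-normalized threshold is sketched below: each worker's slider values are min-max rescaled to [0, 1] and anything at or above 0.4 is labeled relevant. This is an interpretation, not the authors' confirmed procedure; the function names, the use of min-max scaling, the `>=` comparison and the handling of a worker who always gives the same value are all assumptions.

```python
def worker_normalize(values):
    """Min-max rescale one worker's slider values (0 to 100) to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # ASSUMPTION: a constant worker carries no ranking signal; use the midpoint.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold=0.4):
    """Binary labels from one worker's slider values (threshold from the slide)."""
    return [1 if v >= threshold else 0 for v in worker_normalize(values)]


labels = binarize([10, 50, 90])
```

The same pattern would apply per set of 5 documents instead of per worker for the set-normalized variants.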
  • 12. Relevance Question (and V) • hterms: focus on ranking, seriously • Still unclear? [Two histograms: frequency of slider values, from 0 to 100]
  • 13. Quality Control • Worker Level: demographic filters • Task Level: additional info/questions – Implicit: work time, behavioral patterns – Explicit: additional verifiable questions • Process Level: trap questions, training • Aggregation Level: consensus from redundancy
  • 14. QC: Worker Level • At least 100 total approved HITs • At least 95% approved HITs – 98% in hterms • Work in 50 HITs at most • Also tried – Country – Master Qualifications
  • 15. QC: Implicit Task Level • Time spent in each document – Images and Text modes together • Don’t use time reported by Amazon – Preview + Work time • Time failure: less than 4.5 secs
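A minimal sketch of the time check, assuming the HIT page logs its own timing events as `(doc_id, mode, seconds)` tuples (a hypothetical format). Time in the images and text display modes is summed per document, and a document counts as a time failure when the total is below the 4.5 seconds given on the slide.

```python
TIME_FAILURE_SECS = 4.5  # threshold from the slide


def seconds_per_document(events):
    """Sum the time spent on each document across both display modes."""
    totals = {}
    for doc_id, _mode, seconds in events:
        totals[doc_id] = totals.get(doc_id, 0.0) + seconds
    return totals


def time_failures(events):
    """Documents whose combined viewing time is under the threshold."""
    totals = seconds_per_document(events)
    return sorted(d for d, s in totals.items() if s < TIME_FAILURE_SECS)


events = [
    ("d1", "images", 2.0),
    ("d1", "text", 3.0),
    ("d2", "images", 4.0),
    ("d3", "text", 4.5),
]
failed = time_failures(events)
```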
  • 16. QC: Implicit Task Level (and II)
      Time Spent (secs)   graded   slider   hterms
      Min                 3        3        3
      1st Q               10       14       11
      Median              15       23       19
  • 17. QC: Explicit Task Level • There is previous work with Wikipedia – Number of images – Headings – References – Paragraphs • With music / video – Approximate song duration • Impractical with arbitrary Web documents
  • 18. QC: Explicit Task Level (II) • Ideas – Spot nonsensical but syntactically correct sentences “the car bought a computer about eating the sea” • Not easy to find the right spot to insert it • Too annoying for clearly (non)relevant documents – Report what paragraph made them decide • Kinda useless without redundancy • Might be several answers • Reading comprehension test
  • 19. QC: Explicit Task Level (III) • Previous experiment – Give us 5-10 keywords to describe the document • 4 AMT runs with different demographics • 4 faculty members – Nearly always gave the top 1-2 most frequent terms • Stemming and removing stop words • Offered two sets of 5 keywords, choose the one better describing the document
  • 20. QC: Explicit Task Level (and IV) • Correct – 3 most frequent + 2 in the next 5 • Incorrect – 5 in the 25 least frequent • Shuffle and random picks • Keyword failure: chose the incorrect terms
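The construction of the two keyword sets can be sketched as follows, assuming the document has already been tokenized, stemmed and stripped of stop words. The function name, the return format and the minimum of 33 distinct terms (so that the two sets cannot overlap) are illustrative assumptions rather than details from the slides.

```python
import random
from collections import Counter


def keyword_sets(terms, rng=random):
    """Build the 'correct' and 'incorrect' sets of 5 keywords for one document.

    terms: list of already stemmed, stop-word-free tokens (assumed input).
    """
    ranked = [t for t, _ in Counter(terms).most_common()]
    if len(ranked) < 33:
        # ASSUMPTION: require enough distinct terms to keep the sets disjoint.
        raise ValueError("not enough distinct terms")
    # Correct: the 3 most frequent terms plus 2 random picks from the next 5.
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect: 5 random picks from the 25 least frequent terms.
    incorrect = rng.sample(ranked[-25:], 5)
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    options = [("correct", correct), ("incorrect", incorrect)]
    rng.shuffle(options)  # randomize which set is shown first
    return options


# Toy document: term t0 is the most frequent, t39 the least.
toy_terms = [f"t{i}" for i in range(40) for _ in range(40 - i)]
options = dict(keyword_sets(toy_terms, random.Random(0)))
```

A worker who picks the set labeled "incorrect" would be recorded as a keyword failure.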
  • 21. QC: Process Level • Previous NIST judgments as trap questions? • No – Need previous judgments – Not expected to be balanced – Overhead cost – More complex process – Do not tell anything about non-trap examples
  • 22. Reject Work and Block Workers • Limit the number of failures in QC
      Action          Failure   graded   slider      hterms
      Reject HIT      Keyword   1        0           1
                      Time      2        1           1
      Block Worker    Keyword   1        1           1
                      Time      2        1           1
      Total HITs rejected       3 (1%)   100 (23%)   13 (3%)
      Total Workers blocked     0 (0%)   40 (48%)    4 (13%)
  • 23. Workers by Country
      Columns, repeated three times: Preview Accept Reject %P %A %R
      Australia: 8 100%
      Bangladesh: 15 3 2 75% 60% 40%
      Belgium: 2 100%
      Canada: 2 100%; 1 100%; 1 100%
      Croatia: 4 1 80% 100% 0%
      Egypt: 1 100%
      Finland: 11 50 1 18% 98% 2%; 4 100%; 8 43 1 15% 98% 2%
      France: 9 24 2 26% 92% 8%
      Germany: 1 100%
      Guatemala: 1 100%
      India: 236 214 1 52% 100% 0%; 543 235 63 65% 79% 21%; 235 190 7 54% 96% 4%
      Indonesia: 2 100%
      Jamaica: 8 5 62% 100% 0%
      Japan: 6 100%; 2 100%
      Kenya: 2 100%
      Lebanon: 3 100%
      Lithuania: 6 1 86% 100% 0%
      Macedonia: 1 100%
      Moldova: 1 100%
      Netherlands: 1 100%
      Pakistan: 1 100%; 4 1 80% 100% 0%; 1 100%
      Philippines: 8 100%; 3 100%
      Poland: 1 100%
      Portugal: 2 100%
      Romania: 1 100%; 1 100%; 5 100%
      Saudi Arabia: 3 1 75% 100% 0%; 4 4 50% 100% 0%
      Slovenia: 2 1 67% 100% 0%; 3 1 75% 100% 0%; 8 2 80% 100% 0%
      Spain: 16 100%; 15 100%; 9 100%
      Switzerland: 1 100%; 7 100%
      United Arab Emirates: 27 12 1 68% 92% 8%; 35 9 80% 100% 0%
      United Kingdom: 8 3 73% 100% 0%; 18 18 2 47% 90% 10%; 8 16 33% 100% 0%
      United States: 246 166 1 60% 99% 1%; 381 110 28 73% 80% 20%; 242 174 5 57% 97% 3%
      Yugoslavia: 17 14 2 52% 88% 13%; 1 100%
      Average: 527 435 3 77% 99% 1%; 1086 428 100 83% 90% 10%; 579 435 13 85% 99% 1%
  • 24. Results (unofficial)
      Consensus Truth   Acc    Rec    Prec   Spec   AP     NDCG
      Median            .761   .752   .789   .700   .798   .831
      graded            .667   .702   .742   .651   .731   .785
      slider            .659   .678   .710   .632   .778   .819
      hterms            .725   .725   .781   .726   .818   .846

      NIST Truth        Acc    Rec    Prec   Spec   AP     NDCG
      Median            .623   .729   .773   .536   .931   .922
      graded            .748   .802   .841   .632   .922   .958
      slider            .690   .720   .821   .607   .889   .935
      hterms            .731   .737   .857   .728   .894   .932
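For reference, the four binary-label columns (Acc, Rec, Prec, Spec) follow the standard confusion-matrix definitions. The sketch below computes them from two parallel lists of 0/1 labels; the function name and the choice to return `nan` for an empty denominator are assumptions, and it does not reproduce the track's official scoring scripts.

```python
def binary_metrics(truth, predicted):
    """Accuracy, recall, precision and specificity for 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # ASSUMPTION: undefined ratios are reported as nan.
        return num / den if den else float("nan")

    return {
        "acc": ratio(tp + tn, len(pairs)),
        "rec": ratio(tp, tp + fn),   # relevant documents labeled relevant
        "prec": ratio(tp, tp + fp),  # labeled relevant that really are
        "spec": ratio(tn, tn + fp),  # nonrelevant documents labeled nonrelevant
    }


scores = binary_metrics([1, 1, 1, 0], [1, 1, 0, 0])
```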
  • 26. Good and Bad Workers • Bad ones in politics might still be good in sports • Topic categories to distinguish – Type: Closed, limited, navigational, open-ended, etc. – Subject: politics, people, shopping, etc. – Rareness: topic keywords in Wordnet? – Readability: Flesch test
  • 27. GetAnotherLabel • Input – Some known labels – Worker responses • Output – Expected label of unknowns – Expected quality for each worker – Confusion matrix for each worker
  • 28. Step-Wise GetAnotherLabel • For each worker wi compute expected quality qi on all topics and quality qij on each topic type tj. • For topics in tj, use only workers with qij>qi • We didn’t use all known labels by good workers to compute their expected quality (and final label), but only labels in the topic category • Rareness seemed to work slightly better
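The worker-selection rule of the step-wise procedure can be sketched as a filter: for each topic type, keep only the workers whose quality on that type exceeds their overall quality. The quality values are assumed to come from a prior GetAnotherLabel run; the function name and the dictionary layout are hypothetical.

```python
def select_workers(overall_quality, topic_quality):
    """Map each topic type t_j to the workers with q_ij > q_i.

    overall_quality: {worker: q_i}
    topic_quality:   {worker: {topic_type: q_ij}}
    """
    selected = {}
    for worker, per_type in topic_quality.items():
        q_i = overall_quality[worker]
        for topic_type, q_ij in per_type.items():
            if q_ij > q_i:
                selected.setdefault(topic_type, set()).add(worker)
    return selected


overall = {"w1": 0.70, "w2": 0.60}
by_type = {
    "w1": {"politics": 0.55, "sports": 0.85},
    "w2": {"politics": 0.75, "sports": 0.60},
}
chosen = select_workers(overall, by_type)
```

In this toy example the worker who is weak on politics is still kept for sports, which is the behavior the "Good and Bad Workers" slide argues for.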
  • 29. Train Rule and SVM Models • Relevant-to-nonrelevant ratio – Unbiased majority voting • For all workers, average correct-to-incorrect ratio when saying relevant/nonrelevant • For all workers, average posterior probability of relevant/nonrelevant – Based on the confusion matrix from GetAnotherLabel
  • 30. Results (unofficial)
      Consensus Truth   Acc    Rec    Prec   Spec   AP     NDCG
      Median            .811   .818   .830   .716   .806   .921
      rule              .854   .818   .943   .915   .791   .904
      svm               .847   .798   .953   .931   .855   .958
      wordnet           .630   .663   .730   .574   .698   .823

      NIST Truth        Acc    Rec    Prec   Spec   AP     NDCG
      Median            .640   .754   .625   .560   .111   .359
      rule              .699   .754   .679   .644   .166   .415
      svm               .714   .750   .700   .678   .082   .331
      wordnet           .571   .659   .560   .484   .060   .299
  • 32. • Really work the task design – “Make it simple, but not simpler” (A. Einstein) – Make sure they understand it before scaling up • Find good QC methods at the explicit task level for arbitrary Web pages – Was our question too obvious? • Pretty decent judgments compared to NIST’s • Look at the whole picture: system rankings • Study long-term reliability of Crowdsourcing – You can’t prove God doesn’t exist – You can’t prove Crowdsourcing works