Practical and Effective Design of a Crowdsourcing Task for Unconventional Relevance Judging
Julián Urbano @julian_urbano
Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
University Carlos III of Madrid
TREC 2011
Gaithersburg, USA · November 18th
Picture by Michael Dornbierer
Task I

Crowdsourcing Individual
  Relevance Judgments
In a Nutshell
• Amazon Mechanical Turk, External HITs
• All 5 documents per set in a single HIT = 435 HITs
• $0.20 per HIT = $0.04 per document
ran out of time
                                   graded      slider       hterms
Hours to complete                  8.5         38           20.5
HITs submitted (overhead)          438 (+1%)   535 (+23%)   448 (+3%)
Submitted workers (just preview)   29 (102)    83 (383)     30 (163)
Average documents per worker       76          32           75
Total cost (including fees)        $95.7       $95.7        $95.7
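The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the 10% fee rate is an assumption inferred from the totals ($87 in rewards against a total of $95.7), not something the slide states.

```python
HITS = 435            # HITs needed per run (from the slide)
PRICE_PER_HIT = 0.20  # dollars (from the slide)
DOCS_PER_HIT = 5      # documents per HIT (from the slide)
FEE_RATE = 0.10       # ASSUMPTION: inferred from $95.7 / $87, not stated

# Price per document and total cost including the assumed fee.
price_per_document = PRICE_PER_HIT / DOCS_PER_HIT
total_cost = HITS * PRICE_PER_HIT * (1 + FEE_RATE)


def overhead(submitted, needed=HITS):
    """Fraction of extra HITs submitted beyond the HITs needed."""
    return (submitted - needed) / needed


print(round(price_per_document, 2))   # dollars per document
print(round(total_cost, 1))           # dollars per run
for run, submitted in [("graded", 438), ("slider", 535), ("hterms", 448)]:
    print(run, f"{overhead(submitted):+.0%}")
```

The total is the same in all three runs even though the number of submitted HITs differs; in each run, submitted HITs minus the rejected HITs reported later in the deck equals 435.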
Document Preprocessing
• Ensure smooth loading and safe rendering
  – Null hyperlinks
  – Embed all external resources
  – Remove CSS unrelated to style or layout
  – Remove unsafe HTML elements
  – Remove irrelevant HTML attributes
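A minimal sketch of three of the steps above (null hyperlinks, drop unsafe elements, drop irrelevant attributes), using only Python's standard `html.parser`. The tag and attribute lists are illustrative assumptions, not the lists used for the track, and embedding external resources and CSS filtering are left out.

```python
from html import escape
from html.parser import HTMLParser

# ASSUMPTION: illustrative lists, not the ones used by the authors.
UNSAFE_TAGS = {"script", "iframe", "object", "form"}
KEPT_ATTRS = {"href", "src", "alt", "class", "style"}


class Sanitizer(HTMLParser):
    """Re-emits HTML without unsafe elements and unwanted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an unsafe element

    def _open_tag(self, tag, attrs):
        kept = []
        for name, value in attrs:
            if name not in KEPT_ATTRS:
                continue  # drop irrelevant attributes (e.g. onclick)
            if tag == "a" and name == "href":
                value = "#"  # null the hyperlink
            kept.append(f' {name}="{escape(value or "", quote=True)}"')
        return f"<{tag}{''.join(kept)}>"

    def handle_starttag(self, tag, attrs):
        if tag in UNSAFE_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no matching end tag.
        if tag not in UNSAFE_TAGS and not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag in UNSAFE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = '<a href="http://example.com" onclick="x()">link</a><script>alert(1)</script><p>text</p>'
print(sanitize(page))
```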




Display Mode                 hterms run
• With images




• Black & white, no images



Display Mode (and II)
• Previous experiment
  – Workers seem to prefer images and colors
  – But some definitely go for just text


• Allow them both, but images by default
• Black and white best with highlighting
  – 7 (24%) workers in graded
  – 21 (25%) in slider
  – 12 (40%) in hterms
HIT Design




Relevance Question
• graded: focus on binary labels



• Binary label
  – Bad = 0, Good = 1
  – Fair: different probabilities? Chose 1 too
• Ranking
  – Order by relevance, then by failures in
    Quality Control and then by time spent
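The label mapping and the three-level ordering for the graded run could be sketched as below. The sort directions are assumptions (fewer Quality Control failures first, then more time spent first); the slide only gives the order of the criteria. Field names are hypothetical.

```python
# Binary mapping from the slide: Bad = 0, Good = 1, and Fair mapped to 1 too.
BINARY = {"bad": 0, "fair": 1, "good": 1}
# Graded relevance used for ranking (ASSUMPTION: bad < fair < good).
GRADE = {"bad": 0, "fair": 1, "good": 2}


def rank(judgments):
    """Order documents by relevance, then QC failures, then time spent.

    ASSUMPTION: ties are broken by fewer QC failures first and then by
    more seconds spent first; the slide does not state the directions.
    """
    return sorted(
        judgments,
        key=lambda j: (-GRADE[j["label"]], j["qc_failures"], -j["seconds"]),
    )


judgments = [
    {"doc": "d1", "label": "fair", "qc_failures": 0, "seconds": 12},
    {"doc": "d2", "label": "good", "qc_failures": 1, "seconds": 30},
    {"doc": "d3", "label": "good", "qc_failures": 0, "seconds": 8},
    {"doc": "d4", "label": "bad", "qc_failures": 0, "seconds": 20},
    {"doc": "d5", "label": "good", "qc_failures": 0, "seconds": 15},
]
ranking = [j["doc"] for j in rank(judgments)]
binary = {j["doc"]: BINARY[j["label"]] for j in judgments}
print(ranking, binary)
```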
Relevance Question (II)
• slider: focus on ranking


• Do not show handle at the beginning
  – Bias
  – Lazy indistinguishable from undecided


• Seemed unclear it was a slider
Relevance Question (III)

[Figure: histogram of slider values; x-axis: slider value (0 to 100), y-axis: frequency (0 to 700)]
Relevance Question (IV)
• Binary label
  – Threshold
  – Normalized between 0 and 100
  – Worker-Normalized         threshold =   0.4
  – Set-Normalized
  – Set-Normalized Threshold
  – Cluster
• Ranking label
  – Implicit
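One of the options listed, worker normalization followed by a threshold, might look like the sketch below. Min-max scaling per worker and applying the 0.4 threshold to the scaled value are both assumptions; the slide names the options but does not define them.

```python
def worker_normalize(values):
    """Min-max scale one worker's slider values to [0, 1].

    ASSUMPTION: min-max scaling; the slide does not say how the
    worker normalization was done.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # worker gave the same value everywhere
    return [(v - lo) / (hi - lo) for v in values]


def binary_labels(values, threshold=0.4):
    """Label a document relevant (1) when its normalized value reaches the threshold."""
    return [1 if v >= threshold else 0 for v in worker_normalize(values)]


# A worker who only ever used the 20 to 80 part of the slider.
print(binary_labels([20, 50, 80, 35]))
```

The ranking label needs no such step, since the slider values already induce an order.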
Relevance Question (and V)
• hterms: focus on ranking, seriously


• Still unclear?
[Figure: two histograms of slider values; x-axis: slider value (0 to 100), y-axis: frequency (0 to 600)]
Quality Control
• Worker Level: demographic filters
• Task Level: additional info/questions
  – Implicit: work time, behavioral patterns
  – Explicit: additional verifiable questions
• Process Level: trap questions, training

• Aggregation Level: consensus from redundancy
QC: Worker Level
• At least 100 total approved HITs
• At least 95% approved HITs
  – 98% in hterms
• Work in 50 HITs at most

• Also tried
  – Country
  – Master Qualifications
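The three worker-level filters can be written as a single predicate. This is a plain-Python sketch with hypothetical parameter names; on Mechanical Turk the first two would normally be expressed as qualification requirements rather than checked by hand.

```python
MIN_APPROVED_HITS = 100
MIN_APPROVAL_RATE = {"graded": 0.95, "slider": 0.95, "hterms": 0.98}
MAX_HITS_PER_WORKER = 50


def is_eligible(run, approved_hits, approval_rate, hits_done_in_run):
    """True if the worker may take one more HIT in the given run."""
    return (
        approved_hits >= MIN_APPROVED_HITS
        and approval_rate >= MIN_APPROVAL_RATE[run]
        and hits_done_in_run < MAX_HITS_PER_WORKER
    )


print(is_eligible("graded", approved_hits=250, approval_rate=0.96, hits_done_in_run=10))
print(is_eligible("hterms", approved_hits=250, approval_rate=0.96, hits_done_in_run=10))
```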
QC: Implicit Task Level
• Time spent in each document
  – Images and Text modes together


• Don’t use time reported by Amazon
  – Preview + Work time


• Time failure: less than 4.5 secs
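Counting time failures from per-document timings is straightforward. The sketch below assumes the seconds per document were measured by the task page itself, since the time reported by Amazon mixes preview and work time.

```python
TIME_FAILURE_SECS = 4.5  # threshold from the slide


def time_failures(seconds_per_document):
    """Number of documents in a HIT judged in less than the threshold."""
    return sum(1 for s in seconds_per_document if s < TIME_FAILURE_SECS)


# Hypothetical timings for the 5 documents of one HIT.
print(time_failures([3, 10, 4.4, 15, 23]))
```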


QC: Implicit Task Level (and II)


                      Time Spent (secs)
                graded      slider   hterms
        Min       3           3           3
        1st Q    10          14           11
       Median    15          23           19


QC: Explicit Task Level
• There is previous work with Wikipedia
  – Number of images
  – Headings
  – References
  – Paragraphs
• With music / video
  – Approximate song duration


• Impractical with arbitrary Web documents
QC: Explicit Task Level (II)
• Ideas
  – Spot nonsensical but syntactically correct sentences
     “the car bought a computer about eating the sea”
     • Not easy to find the right spot to insert it
     • Too annoying for clearly (non)relevant documents
  – Report what paragraph made them decide
     • Kinda useless without redundancy
     • Might be several answers


• Reading comprehension test
QC: Explicit Task Level (III)
• Previous experiment
  – Give us 5-10 keywords to describe the document
     • 4 AMT runs with different demographics
     • 4 faculty members
  – Nearly always gave the top 1-2 most frequent terms
     • Stemming and removing stop words


• Offered two sets of 5 keywords,
  choose the one better describing the document
QC: Explicit Task Level (and IV)
• Correct
  – 3 most frequent + 2 in the next 5
• Incorrect
  – 5 in the 25 least frequent


• Shuffle and random picks

• Keyword failure: chose the incorrect terms
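The construction of the two keyword sets can be sketched as follows. The stop-word list is a tiny illustrative one and stemming is omitted, so this is an approximation of the procedure described, not the original implementation.

```python
import random
import re
from collections import Counter

# ASSUMPTION: tiny illustrative stop-word list; stemming is omitted.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def keyword_sets(text, rng=random):
    """Return (correct, incorrect) sets of 5 keywords each for a document."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    ranked = [term for term, _ in Counter(terms).most_common()]
    if len(ranked) < 33:
        raise ValueError("need at least 33 distinct terms to keep the sets apart")
    # Correct: the 3 most frequent terms plus 2 random picks from the next 5.
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect: 5 random picks from the 25 least frequent terms.
    incorrect = rng.sample(ranked[-25:], 5)
    # Shuffle so that position does not give the answer away.
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return correct, incorrect
```

A worker who picks the second set for a document is charged with one keyword failure.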
QC: Process Level
• Previous NIST judgments as trap questions?

• No
  – Need previous judgments
  – Not expected to be balanced
  – Overhead cost
  – More complex process
  – Do not tell anything about non-trap examples

Reject Work and Block Workers
• Limit the number of failures in QC

Action          Failure    graded     slider      hterms
Reject HIT      Keyword    1          0           1
Reject HIT      Time       2          1           1
Block Worker    Keyword    1          1           1
Block Worker    Time       2          1           1
Total HITs rejected        3 (1%)     100 (23%)   13 (3%)
Total Workers blocked      0 (0%)     40 (48%)    4 (13%)
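The limits in the table could drive a decision rule like the one below. Two things are assumptions: that each value is the maximum number of tolerated failures (so an action triggers when a count exceeds it), and that exceeding either limit is enough. The slide does not say over which unit (HIT or worker history) the failures are counted.

```python
# Limits copied from the table; interpretation as "maximum tolerated
# failures" is an ASSUMPTION.
LIMITS = {
    "reject_hit": {
        "keyword": {"graded": 1, "slider": 0, "hterms": 1},
        "time": {"graded": 2, "slider": 1, "hterms": 1},
    },
    "block_worker": {
        "keyword": {"graded": 1, "slider": 1, "hterms": 1},
        "time": {"graded": 2, "slider": 1, "hterms": 1},
    },
}


def actions(run, keyword_failures, time_failures):
    """List the actions triggered by the given failure counts."""
    triggered = []
    for action, limits in LIMITS.items():
        if (keyword_failures > limits["keyword"][run]
                or time_failures > limits["time"][run]):
            triggered.append(action)
    return triggered


print(actions("slider", keyword_failures=1, time_failures=0))
print(actions("graded", keyword_failures=1, time_failures=2))
```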
Workers by Country
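                      | graded                                  | slider                                  | hterms
                      | Preview Accept Reject    %P    %A    %R | Preview Accept Reject    %P    %A    %R | Preview Accept Reject    %P    %A    %R
Australia             |       -      -      -     -     -     - |       -      -      -     -     -     - |       8      -      -  100%     -     -
Bangladesh            |       -      -      -     -     -     - |      15      3      2   75%   60%   40% |       -      -      -     -     -     -
Belgium               |       -      -      -     -     -     - |       -      -      -     -     -     - |       2      -      -  100%     -     -
Canada                |       2      -      -  100%     -     - |       1      -      -  100%     -     - |       1      -      -  100%     -     -
Croatia               |       -      -      -     -     -     - |       4      1      -   80%  100%    0% |       -      -      -     -     -     -
Egypt                 |       -      -      -     -     -     - |       -      -      -     -     -     - |       1      -      -  100%     -     -
Finland               |      11     50      1   18%   98%    2% |       4      -      -  100%     -     - |       8     43      1   15%   98%    2%
France                |       -      -      -     -     -     - |       9     24      2   26%   92%    8% |       -      -      -     -     -     -
Germany               |       -      -      -     -     -     - |       1      -      -  100%     -     - |       -      -      -     -     -     -
Guatemala             |       -      -      -     -     -     - |       -      -      -     -     -     - |       1      -      -  100%     -     -
India                 |     236    214      1   52%  100%    0% |     543    235     63   65%   79%   21% |     235    190      7   54%   96%    4%
Indonesia             |       -      -      -     -     -     - |       2      -      -  100%     -     - |       -      -      -     -     -     -
Jamaica               |       -      -      -     -     -     - |       8      5      -   62%  100%    0% |       -      -      -     -     -     -
Japan                 |       -      -      -     -     -     - |       6      -      -  100%     -     - |       2      -      -  100%     -     -
Kenya                 |       -      -      -     -     -     - |       2      -      -  100%     -     - |       -      -      -     -     -     -
Lebanon               |       -      -      -     -     -     - |       3      -      -  100%     -     - |       -      -      -     -     -     -
Lithuania             |       -      -      -     -     -     - |       -      -      -     -     -     - |       6      1      -   86%  100%    0%
Macedonia             |       -      -      -     -     -     - |       1      -      -  100%     -     - |       -      -      -     -     -     -
Moldova               |       -      -      -     -     -     - |       1      -      -  100%     -     - |       -      -      -     -     -     -
Netherlands           |       -      -      -     -     -     - |       -      -      -     -     -     - |       1      -      -  100%     -     -
Pakistan              |       1      -      -  100%     -     - |       4      1      -   80%  100%    0% |       1      -      -  100%     -     -
Philippines           |       -      -      -     -     -     - |       8      -      -  100%     -     - |       3      -      -  100%     -     -
Poland                |       -      -      -     -     -     - |       1      -      -  100%     -     - |       -      -      -     -     -     -
Portugal              |       -      -      -     -     -     - |       -      -      -     -     -     - |       2      -      -  100%     -     -
Romania               |       1      -      -  100%     -     - |       1      -      -  100%     -     - |       5      -      -  100%     -     -
Saudi Arabia          |       3      1      -   75%  100%    0% |       4      4      -   50%  100%    0% |       -      -      -     -     -     -
Slovenia              |       2      1      -   67%  100%    0% |       3      1      -   75%  100%    0% |       8      2      -   80%  100%    0%
Spain                 |      16      -      -  100%     -     - |      15      -      -  100%     -     - |       9      -      -  100%     -     -
Switzerland           |       1      -      -  100%     -     - |       7      -      -  100%     -     - |       -      -      -     -     -     -
United Arab Emirates  |       -      -      -     -     -     - |      27     12      1   68%   92%    8% |      35      9      -   80%  100%    0%
United Kingdom        |       8      3      -   73%  100%    0% |      18     18      2   47%   90%   10% |       8     16      -   33%  100%    0%
United States         |     246    166      1   60%   99%    1% |     381    110     28   73%   80%   20% |     242    174      5   57%   97%    3%
Yugoslavia            |       -      -      -     -     -     - |      17     14      2   52%   88%   13% |       1      -      -  100%     -     -
Average               |     527    435      3   77%   99%    1% |    1086    428    100   83%   90%   10% |     579    435     13   85%   99%    1%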
Results (unofficial)
                       Consensus Truth
             Acc    Rec Prec Spec AP           NDCG
    Median .761    .752 .789 .700 .798         .831
    graded .667    .702 .742 .651 .731         .785
     slider .659   .678 .710 .632 .778         .819
    hterms .725    .725 .781 .726 .818         .846
                           NIST Truth
             Acc    Rec   Prec Spec      AP    NDCG
    Median .623    .729   .773 .536     .931   .922
    graded .748    .802   .841 .632     .922   .958
     slider .690   .720   .821 .607     .889   .935
    hterms .731    .737   .857 .728     .894   .932
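For reference, the four binary measures in these tables (Acc, Rec, Prec, Spec) follow the standard confusion-matrix definitions, sketched below with made-up counts. AP and NDCG are ranking measures and are not covered here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard accuracy, recall, precision and specificity.

    tp/fp/tn/fn are counts of true/false positives/negatives, taking
    'relevant' as the positive class.
    """
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),
        "rec": tp / (tp + fn),
        "prec": tp / (tp + fp),
        "spec": tn / (tn + fp),
    }


# Hypothetical counts, not taken from the track data.
metrics = binary_metrics(tp=40, fp=10, tn=30, fn=20)
print(metrics)
```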
Task II

Aggregating Multiple
Relevance Judgments
Good and Bad Workers
• Bad ones in politics might still be good in sports

• Topic categories to distinguish
  – Type: Closed, limited, navigational, open-ended, etc.
  – Subject: politics, people, shopping, etc.
  – Rareness: topic keywords in WordNet?
  – Readability: Flesch test
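The Flesch Reading Ease score is a fixed formula over word, sentence and syllable counts. The sketch below takes the counts as inputs, because how the topics were tokenized and syllabified is not described in the slides.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return (
        206.835
        - 1.015 * (n_words / n_sentences)
        - 84.6 * (n_syllables / n_words)
    )


# Hypothetical counts for a short text.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```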
GetAnotherLabel
• Input
  – Some known labels
  – Worker responses
• Output
  – Expected label of unknowns
  – Expected quality for each worker
  – Confusion matrix for each worker



Step-Wise GetAnotherLabel
• For each worker w_i, compute expected quality q_i
  on all topics and quality q_ij on each topic type t_j.
• For topics in t_j, use only workers with q_ij > q_i

• We didn’t use all known labels by good workers
  to compute their expected quality (and final
  label), but only labels in the topic category

• Rareness seemed to work slightly better
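The worker selection step can be sketched as below, taking the qualities as given inputs. In the procedure described they would come from GetAnotherLabel; the worker names, topic types and values here are hypothetical.

```python
def select_workers(overall_quality, quality_by_type):
    """For each topic type t_j, keep workers w_i with q_ij > q_i.

    overall_quality: {worker: q_i}
    quality_by_type: {worker: {topic_type: q_ij}}
    """
    selected = {}
    for worker, by_type in quality_by_type.items():
        for topic_type, q_ij in by_type.items():
            if q_ij > overall_quality[worker]:
                selected.setdefault(topic_type, []).append(worker)
    return selected


overall = {"w1": 0.70, "w2": 0.60}
by_type = {
    "w1": {"rare": 0.80, "common": 0.65},
    "w2": {"rare": 0.55, "common": 0.75},
}
chosen = select_workers(overall, by_type)
print(chosen)
```

Each worker is kept only for the topic types where they do better than their own overall quality, which matches the observation that a worker who is bad in one category may still be good in another.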
Train Rule and SVM Models
• Relevant-to-nonrelevant ratio
  – Unbiased majority voting
• For all workers, average correct-to-incorrect
  ratio when saying relevant/nonrelevant
• For all workers, average posterior probability of
  relevant/nonrelevant
  – Based on the confusion matrix from
    GetAnotherLabel
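A posterior probability can be derived from a worker's confusion matrix with Bayes' rule, as sketched below. The matrix layout, the prior and the numbers are assumptions for illustration; the slides do not say how the posterior feature was computed.

```python
def posterior_relevant(confusion, prior_relevant, said):
    """P(document is relevant | worker said `said`), by Bayes' rule.

    confusion[true_label][said_label] = P(worker says said_label | true_label)
    ASSUMPTION: this layout and a fixed prior; both are illustrative.
    """
    p_rel = confusion["relevant"][said] * prior_relevant
    p_non = confusion["nonrelevant"][said] * (1 - prior_relevant)
    return p_rel / (p_rel + p_non)


# Hypothetical confusion matrix for one worker.
confusion = {
    "relevant": {"relevant": 0.8, "nonrelevant": 0.2},
    "nonrelevant": {"relevant": 0.3, "nonrelevant": 0.7},
}
p = posterior_relevant(confusion, prior_relevant=0.5, said="relevant")
print(round(p, 3))
```

Averaging this value over all workers who judged a document gives one feature per document of the kind listed above.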
Results (unofficial)
                       Consensus Truth
             Acc    Rec Prec Spec AP           NDCG
    Median .811    .818 .830 .716 .806         .921
     rule .854     .818 .943 .915 .791         .904
     svm .847      .798 .953 .931 .855         .958
    wordnet .630   .663 .730 .574 .698         .823
                           NIST Truth
             Acc    Rec   Prec Spec      AP    NDCG
    Median .640    .754   .625 .560     .111   .359
     rule .699     .754   .679 .644     .166   .415
     svm .714      .750   .700 .678     .082   .331
    wordnet .571   .659   .560 .484     .060   .299
Sum Up
• Really work the task design
  – “Make it simple, but not simpler” (A. Einstein)
  – Make sure they understand it before scaling up
• Find good QC methods at the explicit task level
  for arbitrary Web pages
  – Was our question too obvious?
• Pretty decent judgments compared to NIST’s
• Look at the whole picture: system rankings
• Study long-term reliability of Crowdsourcing
  – You can’t prove God doesn’t exist
  – You can’t prove Crowdsourcing works

The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper

  • 1. Practical and Effective Design of a Crowdsourcing Task for Unconventional Relevance Judging Julián Urbano @julian_urbano Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns University Carlos III of Madrid TREC 2011 Picture by Michael Dornbierer Gaithersburg, USA · November 18th
  • 2. Task I Crowdsourcing Individual Relevance Judgments
  • 3. In a Nutshell • Amazon Mechanical Turk, External HITs • All 5 documents per set in a sigle HIT = 435 HITs • $0.20 per HIT = $0.04 per document ran out of time graded slider hterms Hours to complete 8.5 38 20.5 HITs submitted (overhead) 438 (+1%) 535 (+23%) 448 (+3%) Submitted workers (just preview) 29 (102) 83 (383) 30 (163) Average documents per worker 76 32 75 Total cost (including fees) $95.7 $95.7 $95.7
  • 4. Document Preprocessing • Ensure smooth loading and safe rendering – Null hyperlinks – Embed all external resources – Remove CSS unrelated to style or layout – Remove unsafe HTML elements – Remove irrelevant HTML attributes 4
  • 5. Display Mode hterms run • With images • Black & white, no images 5
  • 6. Display Mode (and II) • Previous experiment – Workers seem to prefer images and colors – But some definitelly go for just text • Allow them both, but images by default • Black and white best with highlighting – 7 (24%) workers in graded – 21 (25%) in slider – 12 (40%) in hterms 6
  • 8. Relevance Question • graded: focus on binary labels • Binary label – Bad = 0, Good = 1 – Fair: different probabilities? Chose 1 too • Ranking – Order by relevance, then by failures in Quality Control and then by time spent
  • 9. Relevance Question (II) • slider: focus on ranking • Do not show handle at the beginning – Bias – Lazy indistinguishable from undecided • Seemed unclear it was a slider
  • 10. Relevance Question (III) 100 200 300 400 500 600 700 Frequency 0 0 20 40 60 80 100 slider value 10
  • 11. Relevance Question (IV) • Binary label – Threshold – Normalized between 0 and 100 – Worker-Normalized threshold = 0.4 – Set-Normalized – Set-Normalized Threshold – Cluster • Ranking label – Implicit
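One way to read the worker-normalized threshold is: rescale each worker's slider values to the range 0 to 1, then label as relevant anything at or above 0.4. The sketch below follows that reading, which is an assumption; min-max rescaling, the `>=` comparison and the 0.5 fallback for a worker who always gives the same value are all choices made here, not details from the slide.

```python
def worker_normalize(values):
    """Min-max rescale one worker's slider values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant worker carries no ranking information.
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def binary_labels(values, threshold=0.4):
    """Binary relevance labels (1 = relevant) from one worker's sliders."""
    return [1 if v >= threshold else 0 for v in worker_normalize(values)]


labels = binary_labels([20, 40, 60, 80, 100])
```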
  • 12. Relevance Question (and V) • hterms: focus on ranking, seriously • Still unclear? [Two histograms: frequency of slider values (0 to 100)]
  • 13. Quality Control • Worker Level: demographic filters • Task Level: additional info/questions – Implicit: work time, behavioral patterns – Explicit: additional verifiable questions • Process Level: trap questions, training • Aggregation Level: consensus from redundancy
  • 14. QC: Worker Level • At least 100 total approved HITs • At least 95% approved HITs – 98% in hterms • Work in 50 HITs at most • Also tried – Country – Master Qualifications
  • 15. QC: Implicit Task Level • Time spent in each document – Images and Text modes together • Don’t use time reported by Amazon – Preview + Work time • Time failure: less than 4.5 secs
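The time check can be sketched as follows, assuming the HIT page itself records, for each document, the seconds spent in images mode and in text mode. The input format and the function name are hypothetical; only the 4.5-second limit and the rule of adding both modes come from the slide.

```python
TIME_LIMIT_SECS = 4.5  # from the slide: less than 4.5 secs is a time failure


def count_time_failures(times_per_document):
    """Count documents judged too quickly.

    times_per_document: list of (seconds_in_images_mode, seconds_in_text_mode)
    pairs, one per document in the HIT; both modes are added together.
    """
    return sum(
        1
        for images_secs, text_secs in times_per_document
        if images_secs + text_secs < TIME_LIMIT_SECS
    )


failures = count_time_failures([(3.0, 1.0), (2.0, 3.0), (10.0, 0.0)])
```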
  • 16. QC: Implicit Task Level (and II) • Time spent in secs (graded | slider | hterms) – Min: 3 | 3 | 3 – 1st Q: 10 | 14 | 11 – Median: 15 | 23 | 19
  • 17. QC: Explicit Task Level • There is previous work with Wikipedia – Number of images – Headings – References – Paragraphs • With music / video – Approximate song duration • Impractical with arbitrary Web documents
  • 18. QC: Explicit Task Level (II) • Ideas – Spot nonsensical but syntactically correct sentences “the car bought a computer about eating the sea” • Not easy to find the right spot to insert it • Too annoying for clearly (non)relevant documents – Report what paragraph made them decide • Kinda useless without redundancy • Might be several answers • Reading comprehension test
  • 19. QC: Explicit Task Level (III) • Previous experiment – Give us 5-10 keywords to describe the document • 4 AMT runs with different demographics • 4 faculty members – Nearly always gave the top 1-2 most frequent terms • Stemming and removing stop words • Offered two sets of 5 keywords, choose the one better describing the document
  • 20. QC: Explicit Task Level (and IV) • Correct – 3 most frequent + 2 in the next 5 • Incorrect – 5 in the 25 least frequent • Shuffle and random picks • Keyword failure: chose the incorrect terms
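The construction of the two keyword sets can be sketched as below. The tiny stop-word list, the regex tokenizer and the minimum of 33 distinct terms (so that the two sets cannot overlap) are assumptions for illustration, and stemming is omitted for brevity.

```python
import random
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # toy list
MIN_TERMS = 8 + 25  # top 8 and bottom 25 must not overlap (assumed guard)


def keyword_sets(text, rng):
    """Return (correct, incorrect) sets of 5 keywords for a document."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower())
             if t not in STOP_WORDS]
    ranked = [term for term, _ in Counter(terms).most_common()]
    if len(ranked) < MIN_TERMS:
        raise ValueError("document has too few distinct terms")
    # Correct: the 3 most frequent terms plus 2 random picks from the next 5.
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect: 5 random picks among the 25 least frequent terms.
    incorrect = rng.sample(ranked[-25:], 5)
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return correct, incorrect
```

Passing a seeded `random.Random` instance as `rng` makes the picks reproducible, which helps when a rejected worker asks which set was shown.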
  • 21. QC: Process Level • Previous NIST judgments as trap questions? • No – Need previous judgments – Not expected to be balanced – Overhead cost – More complex process – Tell nothing about non-trap examples
  • 22. Reject Work and Block Workers • Limit the number of failures in QC • Failure limits (graded | slider | hterms) – Reject HIT, Keyword: 1 | 0 | 1 – Reject HIT, Time: 2 | 1 | 1 – Block Worker, Keyword: 1 | 1 | 1 – Block Worker, Time: 2 | 1 | 1 • Total HITs rejected: 3 (1%) | 100 (23%) | 13 (3%) • Total Workers blocked: 0 (0%) | 40 (48%) | 4 (13%)
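The failure limits could be applied as in the sketch below. It assumes each number is the maximum number of failures tolerated, so an action is triggered only when a count exceeds its limit; the slide does not say whether the comparison is strict, nor over what scope failures are counted for blocking. The `LIMITS` layout and function name are hypothetical.

```python
# Limits copied from the slide, per run and per action.
LIMITS = {
    "graded": {"reject": {"keyword": 1, "time": 2},
               "block": {"keyword": 1, "time": 2}},
    "slider": {"reject": {"keyword": 0, "time": 1},
               "block": {"keyword": 1, "time": 1}},
    "hterms": {"reject": {"keyword": 1, "time": 1},
               "block": {"keyword": 1, "time": 1}},
}


def exceeds_limit(run, action, keyword_failures, time_failures):
    """True if either failure count is above the limit (assumed strict)."""
    limits = LIMITS[run][action]
    return (keyword_failures > limits["keyword"]
            or time_failures > limits["time"])


reject_slider = exceeds_limit("slider", "reject", 1, 0)
reject_graded = exceeds_limit("graded", "reject", 1, 2)
```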
  • 23. Workers by Country Preview Accept Reject % P %A %R Preview Accept Reject % P %A %R Preview Accept Reject % P % A %R Australia 8 100% Bangladesh 15 3 2 75% 60% 40% Belgium 2 100% Canada 2 100% 1 100% 1 100% Croatia 4 1 80% 100% 0% Egypt 1 100% Finland 11 50 1 18% 98% 2% 4 100% 8 43 1 15% 98% 2% France 9 24 2 26% 92% 8% Germany 1 100% Guatemala 1 100% India 236 214 1 52% 100% 0% 543 235 63 65% 79% 21% 235 190 7 54% 96% 4% Indonesia 2 100% Jamaica 8 5 62% 100% 0% Japan 6 100% 2 100% Kenya 2 100% Lebanon 3 100% Lithuania 6 1 86% 100% 0% Macedonia 1 100% Moldova 1 100% Netherlands 1 100% Pakistan 1 100% 4 1 80% 100% 0% 1 100% Philippines 8 100% 3 100% Poland 1 100% Portugal 2 100% Romania 1 100% 1 100% 5 100% Saudi Arabia 3 1 75% 100% 0% 4 4 50% 100% 0% Slovenia 2 1 67% 100% 0% 3 1 75% 100% 0% 8 2 80% 100% 0% Spain 16 100% 15 100% 9 100% Switzerland 1 100% 7 100% United Arab Emirates 27 12 1 68% 92% 8% 35 9 80% 100% 0% United Kingdom 8 3 73% 100% 0% 18 18 2 47% 90% 10% 8 16 33% 100% 0% United States 246 166 1 60% 99% 1% 381 110 28 73% 80% 20% 242 174 5 57% 97% 3% Yugoslavia 17 14 2 52% 88% 13% 1 100% Average 527 435 3 77% 99% 1% 1086 428 100 83% 90% 10% 579 435 13 85% 99% 1%
  • 24. Results (unofficial) • Consensus Truth (Acc | Rec | Prec | Spec | AP | NDCG) – Median: .761 | .752 | .789 | .700 | .798 | .831 – graded: .667 | .702 | .742 | .651 | .731 | .785 – slider: .659 | .678 | .710 | .632 | .778 | .819 – hterms: .725 | .725 | .781 | .726 | .818 | .846 • NIST Truth (Acc | Rec | Prec | Spec | AP | NDCG) – Median: .623 | .729 | .773 | .536 | .931 | .922 – graded: .748 | .802 | .841 | .632 | .922 | .958 – slider: .690 | .720 | .821 | .607 | .889 | .935 – hterms: .731 | .737 | .857 | .728 | .894 | .932
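For reference, the four binary-label measures in the results (Acc, Rec, Prec, Spec) follow the standard confusion-matrix definitions, sketched below. The function and the toy labels are illustrative only, not the track's official scoring code; AP and NDCG are ranking measures and are not covered here.

```python
def binary_metrics(truth, predicted):
    """Accuracy, recall, precision and specificity for 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Guard against empty classes; 0.0 is an arbitrary convention here.
        return num / den if den else 0.0

    return {
        "acc": ratio(tp + tn, len(pairs)),
        "rec": ratio(tp, tp + fn),
        "prec": ratio(tp, tp + fp),
        "spec": ratio(tn, tn + fp),
    }


scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```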
  • 26. Good and Bad Workers • Bad ones in politics might still be good in sports • Topic categories to distinguish – Type: Closed, limited, navigational, open-ended, etc. – Subject: politics, people, shopping, etc. – Rareness: topic keywords in Wordnet? – Readability: Flesch test
  • 27. GetAnotherLabel • Input – Some known labels – Worker responses • Output – Expected label of unknowns – Expected quality for each worker – Confusion matrix for each worker
  • 28. Step-Wise GetAnotherLabel • For each worker wi compute expected quality qi on all topics and quality qij on each topic type tj. • For topics in tj, use only workers with qij>qi • We didn’t use all known labels by good workers to compute their expected quality (and final label), but only labels in the topic category • Rareness seemed to work slightly better
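The worker-selection rule of the step-wise procedure can be sketched as follows, assuming the expected qualities qi and qij have already been computed (for instance from a GetAnotherLabel run). The dictionaries, worker identifiers and function name are hypothetical.

```python
def workers_for_topic_type(overall_quality, quality_by_type, topic_type):
    """Keep workers whose quality on this topic type beats their overall one.

    overall_quality: {worker: q_i}
    quality_by_type: {worker: {topic_type: q_ij}}
    """
    selected = []
    for worker, q_i in overall_quality.items():
        q_ij = quality_by_type.get(worker, {}).get(topic_type)
        # Workers with no estimate for this topic type are left out.
        if q_ij is not None and q_ij > q_i:
            selected.append(worker)
    return selected


overall = {"w1": 0.70, "w2": 0.80, "w3": 0.60}
by_type = {
    "w1": {"rare": 0.75, "common": 0.65},
    "w2": {"rare": 0.70, "common": 0.90},
    "w3": {"common": 0.60},
}
rare_workers = workers_for_topic_type(overall, by_type, "rare")
```

Only the responses of the selected workers would then be fed back into the label estimation for topics of that type.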
  • 29. Train Rule and SVM Models • Relevant-to-nonrelevant ratio – Unbiased majority voting • For all workers, average correct-to-incorrect ratio when saying relevant/nonrelevant • For all workers, average posterior probability of relevant/nonrelevant – Based on the confusion matrix from GetAnotherLabel
  • 30. Results (unofficial) • Consensus Truth (Acc | Rec | Prec | Spec | AP | NDCG) – Median: .811 | .818 | .830 | .716 | .806 | .921 – rule: .854 | .818 | .943 | .915 | .791 | .904 – svm: .847 | .798 | .953 | .931 | .855 | .958 – wordnet: .630 | .663 | .730 | .574 | .698 | .823 • NIST Truth (Acc | Rec | Prec | Spec | AP | NDCG) – Median: .640 | .754 | .625 | .560 | .111 | .359 – rule: .699 | .754 | .679 | .644 | .166 | .415 – svm: .714 | .750 | .700 | .678 | .082 | .331 – wordnet: .571 | .659 | .560 | .484 | .060 | .299
  • 32. • Really work the task design – “Make it simple, but not simpler” (A. Einstein) – Make sure they understand it before scaling up • Find good QC methods at the explicit task level for arbitrary Web pages – Was our question too obvious? • Pretty decent judgments compared to NIST’s • Look at the whole picture: system rankings • Study long-term reliability of Crowdsourcing – You can’t prove God doesn’t exist – You can’t prove Crowdsourcing works