This notebook paper describes our participation in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level, based on a simple reading comprehension test. For the second task we submitted another three runs: one with a stepwise execution of the GetAnotherLabel algorithm by Ipeirotis et al., and two others with a rule-based and an SVM-based model. We also comment on several topics regarding the Track design and evaluation methods.
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper
1. Practical and Effective Design of a Crowdsourcing Task for Unconventional Relevance Judging
Julián Urbano @julian_urbano
Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
University Carlos III of Madrid
TREC 2011
Gaithersburg, USA · November 18th
Picture by Michael Dornbierer
3. In a Nutshell
• Amazon Mechanical Turk, External HITs
• All 5 documents per set in a single HIT = 435 HITs
• $0.20 per HIT = $0.04 per document
                                   graded     slider      hterms
Hours to complete                  8.5*       38          20.5
HITs submitted (overhead)          438 (+1%)  535 (+23%)  448 (+3%)
Submitted workers (just preview)   29 (102)   83 (383)    30 (163)
Average documents per worker       76         32          75
Total cost (including fees)        $95.7      $95.7       $95.7
* the graded run ran out of time
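(Arithmetic check, assuming Amazon's then-standard 10% commission: 435 HITs × $0.20 = $87.00, and $87.00 × 1.10 = $95.70, which matches the totals above.)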
4. Document Preprocessing
• Ensure smooth loading and safe rendering
– Null hyperlinks
– Embed all external resources
– Remove CSS unrelated to style or layout
– Remove unsafe HTML elements
– Remove irrelevant HTML attributes
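A minimal sketch of this sanitization step, assuming BeautifulSoup and illustrative tag/attribute lists (the runs' actual pipeline is not reproduced here; embedding external resources as data URIs is omitted for brevity):

from bs4 import BeautifulSoup

UNSAFE_TAGS = ["script", "iframe", "object", "embed", "form"]   # assumed list
KEPT_ATTRS = {"src", "href", "style", "alt", "width", "height"}  # assumed list

def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove unsafe HTML elements entirely
    for tag in soup.find_all(UNSAFE_TAGS):
        tag.decompose()
    for tag in soup.find_all(True):
        # Null hyperlinks so workers cannot navigate away
        if tag.name == "a":
            tag["href"] = "#"
        # Drop attributes irrelevant to style or layout
        for attr in list(tag.attrs):
            if attr not in KEPT_ATTRS:
                del tag[attr]
    return str(soup)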
5. Display Mode (hterms run)
• With images
• Black & white, no images
6. Display Mode (and II)
• Previous experiment
– Workers seemed to prefer images and colors
– But some definitely go for just text
• Allow them both, but images by default
• Black and white best with highlighting
– 7 (24%) workers in graded
– 21 (25%) in slider
– 12 (40%) in hterms
8. Relevance Question
• graded: focus on binary labels
• Binary label
– Bad = 0, Good = 1
– Fair: different probabilities? We chose 1 too
• Ranking
– Order by relevance, then by failures in Quality Control, and then by time spent (see the sketch below)
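A hypothetical sketch of that tie-breaking, with made-up field names; the direction of the time tie-break is our assumption, reading longer dwell time as more careful judging:

docs = [
    {"id": "d1", "label": 1, "qc_failures": 0, "seconds": 42.0},
    {"id": "d2", "label": 1, "qc_failures": 1, "seconds": 30.0},
    {"id": "d3", "label": 0, "qc_failures": 0, "seconds": 12.5},
]

# Higher label first, then fewer QC failures, then more time spent
ranking = sorted(docs, key=lambda d: (-d["label"], d["qc_failures"], -d["seconds"]))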
9. Relevance Question (II)
• slider: focus on ranking
• Do not show handle at the beginning
– Bias
– Lazy workers indistinguishable from undecided ones
• Seemed unclear it was a slider
12. Relevance Question (and V)
• hterms: focus on ranking, seriously
• Still unclear?
[Figure: two histograms of slider value (0–100) vs. frequency, one per panel]
13. Quality Control
• Worker Level: demographic filters
• Task Level: additional info/questions
– Implicit: work time, behavioral patterns
– Explicit: additional verifiable questions
• Process Level: trap questions, training
• Aggregation Level: consensus from redundancy
14. QC: Worker Level
• At least 100 total approved HITs
• At least 95% approved HITs
– 98% in hterms
• Work in 50 HITs at most
• Also tried
– Country
– Master Qualifications
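For illustration, a sketch of the first two filters as MTurk qualification requirements, written against today's boto3 client (the 2011 runs used the MTurk tooling of the time; the HIT parameters below are made up, and the 50-HIT cap has no built-in qualification, so the External HIT itself would enforce it):

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Built-in system qualification types for HIT count and approval rate
requirements = [
    {
        "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [100],  # at least 100 total approved HITs
    },
    {
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],   # at least 95% approved (98 in hterms)
    },
]

question = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/hit</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

mturk.create_hit(
    Title="Judge the relevance of 5 Web pages",  # illustrative values
    Description="Read 5 documents and judge their relevance to a topic",
    Reward="0.20",
    MaxAssignments=1,
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=question,
    QualificationRequirements=requirements,
)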
15. QC: Implicit Task Level
• Time spent in each document
– Images and Text modes together
• Don’t use time reported by Amazon
– Preview + Work time
• Time failure: less than 4.5 secs
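A sketch of the time check, assuming the External HIT logs its own per-document dwell times (the data layout is hypothetical):

TIME_FAILURE_THRESHOLD = 4.5  # seconds per document

def time_failures(doc_times):
    """doc_times: seconds spent on each of the 5 documents in a HIT,
    adding together the time in the Images and Text display modes."""
    return sum(1 for secs in doc_times if secs < TIME_FAILURE_THRESHOLD)

assert time_failures([3.0, 10.2, 15.8, 4.4, 23.1]) == 2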
16. QC: Implicit Task Level (and II)
Time Spent (secs)   graded   slider   hterms
Min                 3        3        3
1st Q               10       14       11
Median              15       23       19
17. QC: Explicit Task Level
• There is previous work with Wikipedia
– Number of images
– Headings
– References
– Paragraphs
• With music / video
– Approximate song duration
• Impractical with arbitrary Web documents
18. QC: Explicit Task Level (II)
• Ideas
– Spot nonsensical but syntactically correct sentences
“the car bought a computer about eating the sea”
• Not easy to find the right spot to insert it
• Too annoying for clearly (non)relevant documents
– Report what paragraph made them decide
• Kinda useless without redundancy
• Might be several answers
• Reading comprehension test
19. QC: Explicit Task Level (III)
• Previous experiment
– Give us 5-10 keywords to describe the document
• 4 AMT runs with different demographics
• 4 faculty members
– Nearly always gave the top 1-2 most frequent terms
• Stemming and removing stop words
• Offered two sets of 5 keywords; choose the one that better describes the document
20. QC: Explicit Task Level (and IV)
• Correct
– 3 most frequent + 2 in the next 5
• Incorrect
– 5 in the 25 least frequent
• Shuffle and random picks
• Keyword failure: chose the incorrect terms
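A sketch of how the two keyword sets could be built from term frequencies; the stop list is a toy stand-in, stemming is omitted for brevity, and the document is assumed to have enough distinct terms:

import random
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # toy list

def keyword_sets(text, rng=random):
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    ranked = [t for t, _ in Counter(terms).most_common()]
    # Correct set: the 3 most frequent terms plus 2 picked from the next 5
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect set: 5 terms picked from the 25 least frequent
    incorrect = rng.sample(ranked[-25:], 5)
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return correct, incorrect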
21. QC: Process Level
• Previous NIST judgments as trap questions?
• No
– Need previous judgments
– Not expected to be balanced
– Overhead cost
– More complex process
– Tell us nothing about non-trap examples
22. Reject Work and Block Workers
• Limit the number of failures in QC
Action         Failure   graded   slider     hterms
Reject HIT     Keyword   1        0          1
               Time      2        1          1
Block Worker   Keyword   1        1          1
               Time      2        1          1

Total HITs rejected      3 (1%)   100 (23%)  13 (3%)
Total Workers blocked    0 (0%)   40 (48%)   4 (13%)
26. Good and Bad Workers
• Bad ones in politics might still be good in sports
• Topic categories to distinguish
– Type: Closed, limited, navigational, open-ended, etc.
– Subject: politics, people, shopping, etc.
– Rareness: topic keywords in WordNet?
– Readability: Flesch test
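The Flesch Reading Ease score referenced above, sketched with a rough vowel-group syllable heuristic:

import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)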
27. GetAnotherLabel
• Input
– Some known labels
– Worker responses
• Output
– Expected label of unknowns
– Expected quality for each worker
– Confusion matrix for each worker
28. Step-Wise GetAnotherLabel
• For each worker w_i, compute expected quality q_i on all topics and quality q_ij on each topic type t_j
• For topics in t_j, use only workers with q_ij > q_i
• We didn't use all known labels by good workers to compute their expected quality (and final label), but only labels in the topic category
• Rareness seemed to work slightly better
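A sketch of that filtering step; expected_quality is a hypothetical helper standing in for GetAnotherLabel's per-worker quality estimate:

def stepwise_filter(responses, categories, expected_quality):
    """responses: {(worker, topic): label}; categories: {topic: category}."""
    workers = {w for w, _ in responses}
    # q_i: expected quality of each worker over all topics
    overall = {w: expected_quality(w, responses) for w in workers}
    kept = {}
    for (w, topic), label in responses.items():
        cat = categories[topic]
        cat_responses = {(wi, t): l for (wi, t), l in responses.items()
                         if categories[t] == cat}
        # q_ij: expected quality of the worker on this topic category
        if expected_quality(w, cat_responses) > overall[w]:
            kept[(w, topic)] = label
    return kept  # re-run GetAnotherLabel per category on this subset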
29. Train Rule and SVM Models
• Relevant-to-nonrelevant ratio
– Unbiased majority voting
• For all workers, average correct-to-incorrect ratio when saying relevant/nonrelevant
• For all workers, average posterior probability of relevant/nonrelevant
– Based on the confusion matrix from GetAnotherLabel
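A sketch of how such features could feed the SVM model, using scikit-learn with placeholder numbers (not our data); the feature extraction itself is stubbed:

import numpy as np
from sklearn.svm import SVC

def features(document_responses):
    """One row per document:
    [relevant-to-nonrelevant ratio under unbiased majority voting,
     avg correct-to-incorrect ratio of workers saying relevant,
     avg correct-to-incorrect ratio of workers saying nonrelevant,
     avg posterior P(relevant) from GetAnotherLabel's confusion matrices]"""
    ...

# Placeholder training rows with known labels (1 = relevant)
X = np.array([[1.8, 3.2, 0.9, 0.71],
              [0.4, 1.1, 2.5, 0.22]])
y = np.array([1, 0])

model = SVC(kernel="rbf").fit(X, y)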
32.
• Really work the task design
– “Make it simple, but not simpler” (A. Einstein)
– Make sure they understand it before scaling up
• Find good QC methods at the explicit task level for arbitrary Web pages
– Was our question too obvious?
• Pretty decent judgments compared to NIST’s
• Look at the whole picture: system rankings
• Study long-term reliability of Crowdsourcing
– You can’t prove God doesn’t exist
– You can’t prove Crowdsourcing works