This notebook paper describes our participation in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level, based on a simple reading comprehension test. For the second task we submitted another three runs: one with a step-wise execution of the GetAnotherLabel algorithm by Ipeirotis et al., and two others with a rule-based and an SVM-based model. We also comment on several topics regarding the Track design and evaluation methods.
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper
1. Practical and Effective Design of a Crowdsourcing Task for Unconventional Relevance Judging
Julián Urbano @julian_urbano
Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
University Carlos III of Madrid
TREC 2011
Gaithersburg, USA · November 18th (picture by Michael Dornbierer)
3. In a Nutshell
• Amazon Mechanical Turk, External HITs
• All 5 documents per set in a single HIT = 435 HITs
• $0.20 per HIT = $0.04 per document
                                         graded (ran out of time)   slider       hterms
Hours to complete                        8.5                        38           20.5
HITs submitted (overhead)                438 (+1%)                  535 (+23%)   448 (+3%)
Workers who submitted (just previewed)   29 (102)                   83 (383)     30 (163)
Average documents per worker             76                         32           75
Total cost (including fees)              $95.7                      $95.7        $95.7
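The totals are consistent with paying the reward for the 435 planned HITs plus Amazon's 10% commission at the time: 435 × $0.20 = $87.00, and $87.00 × 1.10 = $95.70.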
4. Document Preprocessing
• Ensure smooth loading and safe rendering
– Null hyperlinks
– Embed all external resources
– Remove CSS unrelated to style or layout
– Remove unsafe HTML elements
– Remove irrelevant HTML attributes
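A minimal sketch of this kind of preprocessing, assuming BeautifulSoup; the tag and attribute lists are illustrative rather than the ones actually used, and embedding external resources as data URIs is omitted:

    from bs4 import BeautifulSoup

    UNSAFE_TAGS = ["script", "iframe", "object", "embed", "form"]        # illustrative list
    KEEP_ATTRS = {"href", "src", "alt", "style", "colspan", "rowspan"}   # illustrative list

    def preprocess(html):
        soup = BeautifulSoup(html, "html.parser")
        # Null hyperlinks so workers cannot navigate away from the document
        for a in soup.find_all("a"):
            a["href"] = "#"
        # Remove unsafe HTML elements
        for tag in soup.find_all(UNSAFE_TAGS):
            tag.decompose()
        # Remove irrelevant HTML attributes (event handlers, ids, classes, ...)
        for tag in soup.find_all(True):
            for attr in list(tag.attrs):
                if attr not in KEEP_ATTRS:
                    del tag.attrs[attr]
        return str(soup)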
5. Display Mode (hterms run)
• With images
• Black & white, no images
6. Display Mode (and II)
• Previous experiment
– Workers seem to prefer images and colors
– But some definitely go for just text
• Allow them both, but images by default
• Black and white best with highlighting
– 7 (24%) workers in graded
– 21 (25%) in slider
– 12 (40%) in hterms
8. Relevance Question
• graded: focus on binary labels
• Binary label
– Bad = 0, Good = 1
– Fair: map with different probabilities? We chose 1 as well
• Ranking
– Order by relevance, then by failures in Quality Control, and then by time spent
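A sketch of how that ordering could be computed; the sort directions for the tie-breakers (fewer QC failures first, more time spent first) are an assumption, since the slide only gives the order of the criteria:

    # Per document: (relevance label, QC failures, seconds spent); values are made up
    judgments = {
        "doc1": (1, 0, 22.0),
        "doc2": (1, 1, 15.5),
        "doc3": (0, 0, 9.0),
    }
    # Most relevant first, then fewer QC failures, then more time spent
    ranking = sorted(judgments,
                     key=lambda d: (-judgments[d][0], judgments[d][1], -judgments[d][2]))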
9. Relevance Question (II)
• slider: focus on ranking
• Do not show the slider handle at the beginning
– Showing it would bias workers toward the initial position
– Otherwise, lazy workers are indistinguishable from undecided ones
• But it seemed unclear to workers that it was a slider
12. Relevance Question (and V)
• hterms: focus on ranking, seriously
• Still unclear?
[Two histograms of the slider values chosen by workers: frequency vs. slider value, 0–100]
13. Quality Control
• Worker Level: demographic filters
• Task Level: additional info/questions
– Implicit: work time, behavioral patterns
– Explicit: additional verifiable questions
• Process Level: trap questions, training
• Aggregation Level: consensus from redundancy
14. QC: Worker Level
• At least 100 total approved HITs
• At least 95% approved HITs
– 98% in hterms
• Each worker allowed to work on at most 50 HITs
• Also tried
– Country
– Masters Qualification
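For reference, a sketch of how the first two filters could be expressed with the current MTurk API through boto3 (the runs used External HITs with the 2011 API, so this is only illustrative); the system qualification type IDs are the ones documented by AWS, but should be verified before use:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # At least 100 approved HITs and at least 95% approval rate
    worker_filters = [
        {"QualificationTypeId": "00000000000000000040",    # NumberHITsApproved
         "Comparator": "GreaterThanOrEqualTo", "IntegerValues": [100]},
        {"QualificationTypeId": "000000000000000000L0",    # PercentAssignmentsApproved
         "Comparator": "GreaterThanOrEqualTo", "IntegerValues": [95]},
    ]
    # Passed as QualificationRequirements= when calling mturk.create_hit(...)
    # The 50-HITs-per-worker cap has no built-in qualification; it must be enforced externally.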
15. QC: Implicit Task Level
• Time spent in each document
– Time in the Images and Text modes added together
• Do not use the time reported by Amazon
– It includes both preview and work time
• Time failure: less than 4.5 seconds
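A minimal sketch of the time-failure check, assuming the per-document times have already been measured client-side and added across display modes:

    TIME_THRESHOLD = 4.5  # seconds, the threshold from the slide

    def time_failures(seconds_per_doc):
        """seconds_per_doc maps each document id to the total seconds spent on it."""
        return [doc for doc, secs in seconds_per_doc.items() if secs < TIME_THRESHOLD]

    time_failures({"doc1": 3.0, "doc2": 12.4, "doc3": 25.1})  # -> ["doc1"]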
16. QC: Implicit Task Level (and II)
Time Spent (seconds)   graded   slider   hterms
Min                    3        3        3
1st Quartile           10       14       11
Median                 15       23       19
17. QC: Explicit Task Level
• There is previous work with Wikipedia
– Number of images
– Headings
– References
– Paragraphs
• With music / video
– Approximate song duration
• Impractical with arbitrary Web documents
18. QC: Explicit Task Level (II)
• Ideas
– Spot nonsensical but syntactically correct sentences
“the car bought a computer about eating the sea”
• Not easy to find the right spot to insert it
• Too annoying for clearly (non)relevant documents
– Report what paragraph made them decide
• Kinda useless without redundancy
• Might be several answers
• Reading comprehension test
19. QC: Explicit Task Level (III)
• Previous experiment
– Give us 5-10 keywords to describe the document
• 4 AMT runs with different demographics
• 4 faculty members
– Nearly always gave the top 1-2 most frequent terms
• Counting terms after stemming and removing stop words
• Offer two sets of 5 keywords; the worker chooses the one that better describes the document
20. QC: Explicit Task Level (and IV)
• Correct
– 3 most frequent + 2 in the next 5
• Incorrect
– 5 in the 25 least frequent
• Shuffle and random picks
• Keyword failure: choosing the incorrect set of keywords
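A sketch of how the two keyword sets could be built from a document's text; stemming is omitted and the stop word list is only illustrative:

    import random
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}  # illustrative

    def keyword_sets(text, rng=random):
        terms = [t for t in text.lower().split() if t.isalpha() and t not in STOP_WORDS]
        freqs = Counter(terms).most_common()
        # Correct set: the 3 most frequent terms plus 2 drawn from the next 5
        correct = [t for t, _ in freqs[:3]] + rng.sample([t for t, _ in freqs[3:8]], 2)
        # Incorrect set: 5 terms drawn from the 25 least frequent
        incorrect = rng.sample([t for t, _ in freqs[-25:]], 5)
        rng.shuffle(correct)
        rng.shuffle(incorrect)
        return correct, incorrect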
21. QC: Process Level
• Previous NIST judgments as trap questions?
• No
– Need previous judgments
– Not expected to be balanced
– Overhead cost
– More complex process
– Trap questions tell nothing about the non-trap examples
22. Reject Work and Block Workers
• Limit the number of failures in QC
Action         Failure   graded   slider   hterms
Reject HIT     Keyword   1        0        1
               Time      2        1        1
Block Worker   Keyword   1        1        1
               Time      2        1        1

Total HITs rejected      3 (1%)   100 (23%)   13 (3%)
Total Workers blocked    0 (0%)   40 (48%)    4 (13%)
26. Good and Bad Workers
• Bad ones in politics might still be good in sports
• Topic categories to distinguish
– Type: Closed, limited, navigational, open-ended, etc.
– Subject: politics, people, shopping, etc.
– Rareness: topic keywords in WordNet?
– Readability: Flesch reading-ease test
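For the readability feature, the Flesch reading-ease score can be computed as 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words); a rough sketch with a heuristic syllable counter, not the exact procedure used in the runs:

    import re

    def count_syllables(word):
        # Rough heuristic: count groups of consecutive vowels
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        n = max(1, len(words))
        syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)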
27. GetAnotherLabel
• Input
– Some known labels
– Worker responses
• Output
– Expected label of unknowns
– Expected quality for each worker
– Confusion matrix for each worker
28. Step-Wise GetAnotherLabel
• For each worker w_i, compute the expected quality q_i over all topics and the quality q_ij on each topic category t_j
• For the topics in t_j, use only the workers with q_ij > q_i
• We did not use all the known labels by good workers to compute their expected quality (and the final labels), but only the labels in the topic category
• Rareness seemed to work slightly better
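A rough sketch of this step-wise procedure; run_gal() below is only a naive stand-in (majority voting plus accuracy on the known labels) for the real GetAnotherLabel tool, and the data layout is hypothetical:

    from collections import Counter, defaultdict

    def run_gal(labels, gold):
        """Stand-in for GetAnotherLabel: labels is a list of (worker, example, label)
        tuples, gold maps some examples to known labels. Returns the expected label
        per example and an expected quality per worker."""
        votes, correct, total = defaultdict(Counter), Counter(), Counter()
        for worker, example, label in labels:
            votes[example][label] += 1
            if example in gold:
                total[worker] += 1
                correct[worker] += int(label == gold[example])
        expected = {e: c.most_common(1)[0][0] for e, c in votes.items()}
        quality = {w: correct[w] / total[w] for w in total if total[w] > 0}
        return expected, quality

    def stepwise_gal(labels, gold, topic_category):
        _, q_all = run_gal(labels, gold)           # quality q_i over all topics
        final = {}
        for cat in set(topic_category.values()):
            in_cat = [(w, e, l) for (w, e, l) in labels if topic_category[e] == cat]
            gold_cat = {e: g for e, g in gold.items() if topic_category[e] == cat}
            _, q_cat = run_gal(in_cat, gold_cat)   # quality q_ij on this category
            # Keep only the workers that do better on this category than overall
            kept = [(w, e, l) for (w, e, l) in in_cat
                    if q_cat.get(w, 0.0) > q_all.get(w, 0.0)]
            expected, _ = run_gal(kept, gold_cat)
            final.update(expected)
        return final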
29. Train Rule and SVM Models
• Relevant-to-nonrelevant ratio
– With unbiased majority voting
• For all workers, average correct-to-incorrect ratio when saying relevant/nonrelevant
• For all workers, average posterior probability of relevant/nonrelevant
– Based on the confusion matrices from GetAnotherLabel
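A sketch of how such per-example features could feed an SVM, with scikit-learn's SVC standing in for whatever SVM implementation was actually used and with made-up feature values:

    from sklearn.svm import SVC

    # One row per example with a known label:
    # [relevant-to-nonrelevant ratio, avg correct-to-incorrect ratio, avg posterior of relevant]
    X_train = [[1.5, 2.0, 0.8],
               [0.4, 0.9, 0.3],
               [2.2, 1.7, 0.9]]
    y_train = [1, 0, 1]

    model = SVC(kernel="rbf").fit(X_train, y_train)
    model.predict([[1.1, 1.4, 0.7]])  # expected label for an example with unknown relevance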
32. • Really work the task design
– “Make it simple, but not simpler” (A. Einstein)
– Make sure they understand it before scaling up
• Find good QC methods at the explicit task level for arbitrary Web pages
– Was our question too obvious?
• Pretty decent judgments compared to NIST’s
• Look at the whole picture: system rankings
• Study long-term reliability of Crowdsourcing
– You can’t prove God doesn’t exist
– You can’t prove Crowdsourcing works