Adventures in Crowdsourcing:
Research at UT Austin & Beyond




Matt Lease
School of Information, University of Texas at Austin
ml@ischool.utexas.edu · @mattlease
Outline
     • Foundations
     • Work at UT Austin
     • A Few Roadblocks
           – Workflow Design
           – Sensitive data
           – Regulation
           – Fraud
           – Ethics

August 23, 2012          Matt Lease - ml@ischool.utexas.edu   2
Amazon Mechanical Turk (MTurk)




     • Marketplace for crowd labor (microtasks)
     • Created in 2005 (still in “beta”)
     • On-demand, scalable, 24/7 global workforce

Labeling Data (“Gold Rush”)
Snow et al. (EMNLP 2008)
   • MTurk annotation for 5 Tasks
         – Affect recognition
         – Word similarity
         – Recognizing textual entailment
         – Event temporal ordering
         – Word sense disambiguation
   • 22K labels for US $26
   • High agreement between
     consensus labels and
     gold-standard labels
Sorokin & Forsyth (CVPR 2008)
     • MTurk for Computer Vision
     • 4K labels for US $60




Kittur, Chi, & Suh (CHI 2008)

 • MTurk for User Studies

 • “…make creating believable invalid responses as
   effortful as completing the task in good faith.”




Alonso et al. (SIGIR Forum 2008)
     • MTurk for Information Retrieval (IR)
           – Judge relevance of search engine results
     • Various follow-on studies (design, quality, cost)




Social & Behavioral Sciences
  • A Guide to Behavioral Experiments
    on Mechanical Turk
        – W. Mason and S. Suri (2010). SSRN online.
  • Crowdsourcing for Human Subjects Research
        – L. Schmidt (CrowdConf 2010)
  • Crowdsourcing Content Analysis for Behavioral Research:
    Insights from Mechanical Turk
        – Conley & Tosti-Kharas (2010). Academy of Management
  • Amazon's Mechanical Turk : A New Source of
    Inexpensive, Yet High-Quality, Data?
        – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
What about data quality?
• Many CS papers on statistical methods
  – Online vs. offline, feature-based vs. content-agnostic
  – Worker calibration, noise vs. bias, weighted voting
  – Work in my lab by Jung, Kumar, Ryu, & Tang
• Human factors also matter!
  – Instructions, design, interface, interaction
  – Names, relationship, reputation
  – Fair pay, hourly vs. per-task, recognition, advancement
  – For contrast with MTurk, consider Kochhar (2010)
• See Lease, HComp’11
Kovashka & Lease, CrowdConf’10




Grady & Lease, 2010 (Search Eval.)




Noisy Supervised Classification
       Kumar & Lease, 2011a

  Our 1st study of aggregation (Fall’10)
     Simple idea, simulated workers
  Highlights concepts & open questions
Problem
 • Crowd labels tend to be noisy
 • Can reduce uncertainty via wisdom of crowds
       – Collect & aggregate multiple labels per example
 • How do we maximize learning per unit of labeling effort?
       – Label a new example?
       – Get another label for an already-labeled example?


 See: Sheng, Provost & Ipeirotis, KDD’08

Setup
   • Task: Binary classification
   • Learner: C4.5 decision tree
   • Given
         – An initial seed set of single-labeled examples (64)
         – An unlimited pool of unlabeled examples
   • Cost model
         – Fixed unit cost for labeling any example
         – Unlabeled examples are freely obtained
   • Goal: Maximize learning rate (for labeling effort)
Compare 3 methods: SL, MV, & NB
  • Single labeling (SL): label a new example

  • Multi-Labeling: get another label for an already-labeled example
        – Majority Vote (MV): consensus by simple vote

        – Naïve Bayes (NB): weight vote by annotator accuracy




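The two aggregation rules compared above can be sketched in a few lines of Python. This is a minimal illustration for binary labels, not the study's actual implementation: MV takes a simple plurality, while NB weights each vote by the log-odds of that worker's accuracy.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Consensus on a binary example by simple plurality (MV)."""
    return Counter(labels).most_common(1)[0][0]

def naive_bayes_vote(labels, accuracies):
    """Weight each worker's binary vote by the log-odds of their
    accuracy (NB); accuracies must lie strictly in (0, 1)."""
    score = 0.0  # positive favors label 1, negative favors label 0
    for y, p in zip(labels, accuracies):
        w = math.log(p / (1.0 - p))
        score += w if y == 1 else -w
    return 1 if score > 0 else 0
```

With one accurate worker against two near-random ones, NB can overrule the majority: `naive_bayes_vote([1, 0, 0], [0.9, 0.55, 0.55])` returns 1 where `majority_vote([1, 0, 0])` returns 0.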
Assumptions
     • Example selection: random
           – From pool for SL, from seed set for multi-labeling

     • Fixed commitment to a single method a priori
     • Balanced classes (accuracy, uniform prior)
     • Annotator accuracies are known to system
          – In practice, must estimate these from gold data (Snow et al. ’08) or via EM (Dawid & Skene ’79)

Simulation
     • Each annotator
           – Has parameter p (prob. of producing correct label)
           – Generates exactly one label
     • Uniform distribution of accuracies U(min,max)
     • Generative model for simulation
           – Pick an example x (with true label y*) at random
           – Draw annotator accuracy p ~ U(min,max)
           – Generate label y ~ P(y | p, y*)

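The generative model above is small enough to state directly as code; this sketch follows the slide's three steps (draw a worker accuracy p ~ U(min, max), then emit the true binary label with probability p):

```python
import random

def simulated_label(y_true, acc_min, acc_max, rng=random):
    """One simulated annotation of a binary example:
    draw annotator accuracy p ~ U(acc_min, acc_max), then
    return the true label with probability p, else its flip."""
    p = rng.uniform(acc_min, acc_max)
    return y_true if rng.random() < p else 1 - y_true
```

At the extremes the behavior is deterministic: with accuracy range (1.0, 1.0) the true label always comes back, and with range (0.0, 0.0) it is always flipped.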
Evaluation
  • Data: four datasets from the UCI ML Repository
        (http://archive.ics.uci.edu/ml/datasets.html)
        – Mushroom
        – Spambase
        – Tic-Tac-Toe
        – Chess: King-Rook vs. King-Pawn
  • Same trends across all 4, so we report first 2
  • Random 70 / 30 split of data for seed+pool / test
  • Repeat each run 10 times and average results

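The data split above can be sketched as follows (an illustrative helper, not the study's code; function name is my own):

```python
import random

def seed_pool_test_split(examples, test_frac=0.3, rng=random):
    """Random 70/30 split into (seed + pool) vs. held-out test,
    mirroring the evaluation setup described above."""
    data = list(examples)
    rng.shuffle(data)
    cut = round(len(data) * (1.0 - test_frac))
    return data[:cut], data[cut:]
```

Each run would redo this split (and the simulated labeling), with results averaged over 10 repetitions as the slide states.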
p ~ U(0.6, 1.0)
  • Fairly accurate annotators (mean = 0.8)
  • Little uncertainty -> little gain from multi-labeling




p ~ U(0.4, 0.6)
     • Very noisy (mean = 0.5, random coin flip)
     • SL and MV learning rates are flat
     • NB wins by weighting more accurate workers




p ~ U(0.1, 0.7)
     • Worsen accuracies further (mean = 0.4)
     • NB virtually unchanged
     • SL and MV predictions become anti-correlated
           – We should actually flip their predictions…




Label flipping
   • Is NB doing better due to how it uses accuracy,
     or simply because it’s using more information?
   • Average accuracy < 50% --> label usually wrong
          – NB implicitly captures this; SL and MV do not
    • Label flipping: puts all methods on an even footing
   • Simple case of bias vs. noise
         – Issue is not whether correlated or anti-correlated
         – Issue is strength of correlation

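Label flipping can be sketched as a one-line pre-processing step (an assumed formulation for binary labels: flip a worker's labels whenever their estimated accuracy is below chance):

```python
def flip_labels(labels, accuracies):
    """A binary-label worker with accuracy below 0.5 is
    anti-correlated with truth, so flipping their labels recovers
    the signal; what matters is the strength of correlation
    (distance from 0.5), not its sign."""
    return [y if p >= 0.5 else 1 - y
            for y, p in zip(labels, accuracies)]
```

For example, `flip_labels([1, 0, 1], [0.9, 0.2, 0.4])` keeps the first label and flips the last two, yielding `[1, 1, 0]`.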
p ~ U(0.1, 0.7)
[Figure: learning curves on the Mushroom and Spambase datasets, plotting SL, MV, and NB accuracy (%) against the number of labels (64–4096), without and with label flipping.]
Summary of study
     • Detecting anti-correlated (bad) workers
       more important than the model used
     • Open Questions
           – When accuracies are estimated (noisy)?
           – With actual error distribution (real data)?
           – With different learners or tasks (e.g. ranking)?
           – With dynamic choice of new example or re-label?
           – With active learning example selection?
           – With imbalanced classes?
Snapshots




Noisy Learning to Rank
Kumar & Lease, 2011b
Semi-Supervised Repeated Labeling
     Tang & Lease, CIR’11




Smart Crowd Filter
     • Ryu & Lease, ASIS&T’11
     • Active Learning
           – Train Multi-class SVM to estimate P(Y|X)
           – Estimate average P(Y|X) for each worker
           – Filter out workers below threshold
     • Explore/Exploit (unexpected/expected labels)



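The filtering step above can be sketched as follows. This is a simplified reading of the slide, not the paper's code: assume `prob_model(x, y)` returns the trained model's estimate of P(y | x), and drop workers whose submitted labels the model finds implausible on average.

```python
def filter_workers(worker_labels, prob_model, threshold=0.5):
    """worker_labels maps worker id -> list of (example, label) pairs;
    prob_model(x, y) is the trained model's estimated P(y | x).
    Keep workers whose mean label probability meets the threshold."""
    kept = {}
    for worker, pairs in worker_labels.items():
        mean_p = sum(prob_model(x, y) for x, y in pairs) / len(pairs)
        if mean_p >= threshold:
            kept[worker] = pairs
    return kept
```

The explore/exploit tension on the slide shows up in the threshold: a worker giving unexpected labels may be noisy (filter them) or may be correcting the model (keep them).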
Z-score Weighted Filtering & Voting




     Jung & Lease, HComp’11

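One plausible reading of the slide title, offered purely as an illustrative assumption rather than the paper's exact method: standardize each worker's agreement rate with the consensus into a z-score, filter workers far below the mean, and weight the remaining votes by how far above the cutoff they sit.

```python
import statistics

def zscore_weights(agreement_rates, cutoff=-1.0):
    """agreement_rates maps worker id -> rate of agreement with
    consensus. Workers more than |cutoff| standard deviations
    below the mean get weight 0; others vote with weight equal
    to their z-score's distance above the cutoff."""
    values = list(agreement_rates.values())
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero spread
    return {w: max((r - mu) / sd - cutoff, 0.0)
            for w, r in agreement_rates.items()}
```

With rates {0.9, 0.8, 0.3}, the third worker falls below one standard deviation under the mean and is zeroed out, while the other two keep positive weights ordered by agreement.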
Inferring Missing Judgments

  Jung & Lease, 2012




Jung & Lease, HComp’12




Social Network + Crowdsourcing
     • Klinger & Lease, ASIS&T’11




Website Usability (Liu et al., 2012)




Designing & Optimizing Workflows




Workflow Management
     • How should we balance automation vs.
       human computation? Who does what?

     • Who’s the right person for the job?

     • Juggling constraints on budget, scheduling,
       quality, effort …


What about sensitive data?
• Not all data can be publicly disclosed
  – User data (e.g. AOL query log, Netflix ratings)
  – Intellectual property
  – Legal confidentiality
• Need to restrict who is in your crowd
  – Separate channel (workforce) from technology
  – Hot question for adoption at enterprise level



What about regulation?
• Wolfson & Lease (ASIS&T’11)
• As usual, technology is ahead of the law
  – employment law
  – patent inventorship
  – data security and the Federal Trade Commission
  – copyright ownership
  – securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
  – Understand risks of “just-in-time compliance”

What about fraud?
• Some reports of robot “workers” on MTurk
  – Artificial Artificial Artificial Intelligence
  – Violates terms of service
• Why not just use a captcha?




Captcha Fraud




• Severity?




Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
Fraud via Crowds
Wang et al., WWW’12
• “…not only do malicious crowd-sourcing
  systems exist, but they are rapidly growing…”




Robert Sim, MSR Summit’12




Identifying Workers (Uniquely)
   • Need for identifiable workers
         – Repeated labeling
         – Recognizing “Master Workers”
   • Today
         – Platforms assign IDs intended to be unique
         – Problem in practice, esp. with multiple platforms
         – Sybil attacks
   • Identity value
         – If workers interchangeable, identities are disposable
          – If workers are distinguished, identities become valuable
         – Reduce some types of attacks, increase others
What about ethics?
Fort, Adda, and Cohen (2011)
• “…opportunities for our community to deliberately
  value ethics above cost savings.”
• Suggest we focus on unpaid games; narrow solution

Silberman, Irani, and Ross (2010)
• “How should we… conceptualize the role of these
   people who we ask to power our computing?”
• Power dynamics between parties
• “Abstraction hides detail”
Davis et al. (2010) The HPU.








HPU: “Abstraction hides detail”




Digital Dirty Jobs
•   The Googler who Looked at the Worst of the Internet
•   Policing the Web’s Lurid Precincts
•   Facebook content moderation
•   The dirty job of keeping Facebook clean




• Even linguistic annotators report stress &
  nightmares from reading news articles!
What about freedom?
• Vision: empowering worker freedom:
  – work whenever you want for whomever you want

• Risk: people being compelled to perform work
  – As crowdsourcing grows, greater $$$ at stake
  – Digital sweat shops? Digital slaves?
  – Prisoners used for gold farming
  – We really don’t know much today
  – Traction? Human Trafficking at MSR Summit’12

Thank You!
Students: Past & Present
 –   Catherine Grady (iSchool)
 –   Hyunjoon Jung (iSchool)
 –   Jorn Klinger (Linguistics)
 –   Adriana Kovashka (CS)
 –   Abhimanu Kumar (CS)
                                         ir.ischool.utexas.edu/crowd
 –   Hohyon Ryu (iSchool)
 –   Wei Tang (CS)
 –   Stephen Wolfson (iSchool)
Support
 – John P. Commons Fellowship
 – Temple Fellowship

TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Recently uploaded (20)

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Adventures in Crowdsourcing: Research at UT Austin & Beyond

  • 1. Adventures in Crowdsourcing: Research at UT Austin & Beyond Matt Lease School of Information @mattlease University of Texas at Austin ml@ischool.utexas.edu
  • 2. Outline • Foundations • Work at UT Austin • A Few Roadblocks – Workflow Design – Sensitive data – Regulation – Fraud – Ethics August 23, 2012 Matt Lease - ml@ischool.utexas.edu 2
  • 3. Amazon Mechanical Turk (MTurk) • Marketplace for crowd labor (microtasks) • Created in 2005 (still in “beta”) • On-demand, scalable, 24/7 global workforce
  • 5. Snow et al. (EMNLP 2008) • MTurk annotation for 5 Tasks – Affect recognition – Word similarity – Recognizing textual entailment – Event temporal ordering – Word sense disambiguation • 22K labels for US $26 • High agreement between consensus labels and gold-standard labels
  • 6. Sorokin & Forsyth (CVPR 2008) • MTurk for Computer Vision • 4K labels for US $60
  • 7. Kittur, Chi, & Suh (CHI 2008) • MTurk for User Studies • “…make creating believable invalid responses as effortful as completing the task in good faith.”
  • 8. Alonso et al. (SIGIR Forum 2008) • MTurk for Information Retrieval (IR) – Judge relevance of search engine results • Various follow-on studies (design, quality, cost)
  • 9. Social & Behavioral Sciences • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online. • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010) • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management • Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
  • 10.
  • 11. What about data quality? • Many CS papers on statistical methods – Online vs. offline, feature-based vs. content-agnostic – Worker calibration, noise vs. bias, weighted voting – Work in my lab by Jung, Kumar, Ryu, & Tang • Human factors also matter! – Instructions, design, interface, interaction – Names, relationship, reputation – Fair pay, hourly vs. per-task, recognition, advancement – For contrast with MTurk, consider Kochhar (2010) • See Lease, HComp’11
  • 12.
  • 13. Kovashka & Lease, CrowdConf’10
  • 14. Grady & Lease, 2010 (Search Eval.)
  • 15. Noisy Supervised Classification Kumar and Lease, 2011(a) Our 1st study of aggregation (Fall’10) Simple idea, simulated workers Highlights concepts & open questions
  • 16. Problem • Crowd labels tend to be noisy • Can reduce uncertainty via wisdom of crowds – Collect & aggregate multiple labels per example • How do we maximize learning (labeling effort)? – Label a new example? – Get another label for an already-labeled example? See: Sheng, Provost & Ipeirotis, KDD’08
  • 17. Setup • Task: Binary classification • Learner: C4.5 decision tree • Given – An initial seed set of single-labeled examples (64) – An unlimited pool of unlabeled examples • Cost model – Fixed unit cost for labeling any example – Unlabeled examples are freely obtained • Goal: Maximize learning rate (for labeling effort)
  • 18. Compare 3 methods: SL, MV, & NB • Single labeling (SL): label a new example • Multi-Labeling: get another label for pool – Majority Vote (MV): consensus by simple vote – Naïve Bayes (NB): weight vote by annotator accuracy
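The two aggregation rules compared on slide 18 (simple majority vote vs. accuracy-weighted Naïve Bayes voting) can be sketched as follows for binary labels. This is a minimal illustration, not the paper's implementation; the function names are my own, and worker accuracies are assumed known (as stated on slide 19):

```python
import math

def majority_vote(labels):
    """MV: consensus of binary labels (0/1) by simple vote."""
    return int(sum(labels) >= len(labels) / 2)

def naive_bayes_vote(labels, accuracies):
    """NB: weight each worker's binary vote by the log-odds of their accuracy.
    Workers with accuracy p < 0.5 get a negative weight, so their votes
    are implicitly flipped."""
    score = 0.0
    for y, p in zip(labels, accuracies):
        w = math.log(p / (1 - p))     # accurate workers get large positive weight
        score += w if y == 1 else -w
    return int(score >= 0)

labels = [1, 1, 0]
print(majority_vote(labels))                       # 1: two of three say "1"
print(naive_bayes_vote(labels, [0.55, 0.6, 0.9]))  # 0: the accurate worker outvotes the noisy pair
```

Note how NB's negative weights explain the next slides: a worker whose accuracy is below 0.5 contributes evidence *against* their own vote.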
  • 19. Assumptions • Example selection: random – From pool for SL, from seed set for multi-labeling • Fixed commitment to a single method a priori • Balanced classes (accuracy, uniform prior) • Annotator accuracies are known to system – In practice, must estimate these: from gold data (Snow et al. ’08) or EM (Dawid & Skene ’79)
  • 20. Simulation • Each annotator – Has parameter p (prob. of producing correct label) – Generates exactly one label • Uniform distribution of accuracies U(min,max) • Generative model for simulation – Pick an example x (with true label y*) at random – Draw annotator accuracy p ~ U(min,max) – Generate label y ~ P(y | p, y*)
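The generative model on slide 20 is straightforward to reproduce; a minimal sketch (function name illustrative), assuming binary labels in {0, 1}:

```python
import random

def simulate_label(true_label, acc_min=0.6, acc_max=1.0, rng=random):
    """One simulated annotator: draw accuracy p ~ U(acc_min, acc_max),
    then emit the true label with probability p, else the opposite label."""
    p = rng.uniform(acc_min, acc_max)
    correct = rng.random() < p
    return true_label if correct else 1 - true_label

rng = random.Random(0)
labels = [simulate_label(1, 0.6, 1.0, rng) for _ in range(1000)]
print(sum(labels) / len(labels))  # close to 0.8, the mean annotator accuracy
```

Varying (acc_min, acc_max) reproduces the three regimes on the following slides: U(0.6, 1.0), U(0.4, 0.6), and U(0.1, 0.7).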
  • 21. Evaluation • Data: datasets from UCI ML Repository – Mushroom – Spambase http://archive.ics.uci.edu/ml/datasets.html – Tic-Tac-Toe – Chess: King-Rook vs. King-Pawn • Same trends across all 4, so we report first 2 • Random 70 / 30 split of data for seed+pool / test • Repeat each run 10 times and average results
  • 22. p ~ U(0.6, 1.0) • Fairly accurate annotators (mean = 0.8) • Little uncertainty -> little gain from multi-labeling
  • 23. p ~ U(0.4, 0.6) • Very noisy (mean = 0.5, random coin flip) • SL and MV learning rates are flat • NB wins by weighting more accurate workers
  • 24. p ~ U(0.1, 0.7) • Worsen accuracies further (mean = 0.4) • NB virtually unchanged • SL and MV predictions become anti-correlated – We should actually flip their predictions…
  • 25. Label flipping • Is NB doing better due to how it uses accuracy, or simply because it’s using more information? • Average accuracy < 50% --> label usually wrong – NB implicitly captures; SL and MV do not • Label flipping: put all methods on even-footing • Simple case of bias vs. noise – Issue is not whether correlated or anti-correlated – Issue is strength of correlation
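Label flipping as described on slide 25 amounts to inverting the aggregate prediction whenever the crowd's mean accuracy is below 0.5, since the consensus label is then anti-correlated with truth. A minimal sketch for majority voting (names illustrative):

```python
def flipped_majority_vote(labels, mean_accuracy):
    """If average annotator accuracy < 0.5, the majority label is
    anti-correlated with the truth, so invert it (binary labels 0/1)."""
    vote = int(sum(labels) >= len(labels) / 2)
    return 1 - vote if mean_accuracy < 0.5 else vote

print(flipped_majority_vote([0, 0, 1], mean_accuracy=0.4))  # 1: majority said 0, but flipped
print(flipped_majority_vote([0, 0, 1], mean_accuracy=0.8))  # 0: accurate crowd, vote kept
```

This puts MV (and SL) on an even footing with NB, whose negative log-odds weights perform the flip implicitly.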
  • 26. p ~ U(0.1, 0.7), without vs. with label flipping [charts: SL, MV, and NB accuracy (%) vs. number of labels (64–4096) on the Mushroom and Spambase datasets]
  • 27. Summary of study • Detecting anti-correlated (bad) workers more important than the model used • Open Questions – When accuracies are estimated (noisy)? – With actual error distribution (real data)? – With different learners or tasks (e.g. ranking)? – With dynamic choice of new example or re-label? – With active learning example selection? – With imbalanced classes?
  • 28. Snapshots
  • 29. Noisy Learning to Rank Kumar & Lease 2011b
  • 30. Semi-Supervised Repeated Labeling Tang & Lease, CIR’11
  • 31. Smart Crowd Filter • Ryu & Lease, ASIS&T’11 • Active Learning – Train Multi-class SVM to estimate P(Y|X) – Estimate average P(Y|X) for each worker – Filter out workers below threshold • Explore/Exploit (unexpected/expected labels)
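The filtering step on slide 31 might look like the sketch below, assuming a trained classifier has already supplied probability estimates P(y|x); the data structures and names here are illustrative, not from Ryu & Lease:

```python
def filter_workers(worker_labels, prob_estimates, threshold=0.5):
    """Keep workers whose labels agree, on average, with the model's
    estimated class probabilities; filter out the rest.

    worker_labels: {worker_id: [(example_id, label), ...]}
    prob_estimates: {(example_id, label): estimated P(label | example)}
    """
    kept = set()
    for worker, votes in worker_labels.items():
        avg = sum(prob_estimates[(x, y)] for x, y in votes) / len(votes)
        if avg >= threshold:
            kept.add(worker)
    return kept

probs = {("x1", "A"): 0.9, ("x1", "B"): 0.1, ("x2", "A"): 0.3, ("x2", "B"): 0.7}
votes = {"w1": [("x1", "A"), ("x2", "B")],   # agrees with the model: avg 0.8
         "w2": [("x1", "B"), ("x2", "A")]}   # disagrees with the model: avg 0.2
print(filter_workers(votes, probs))  # {'w1'}
```

The explore/exploit point on the slide concerns which labels to solicit: unexpected labels (low P(y|x)) are informative for the model, while expected labels confirm worker reliability.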
  • 32. Z-score Weighted Filtering & Voting Jung & Lease, HComp’11
  • 33. Inferring Missing Judgments Jung & Lease, 2012
  • 34. Jung & Lease, HComp’12
  • 35. Social Network + Crowdsourcing • Klinger & Lease, ASIS&T’11
  • 36. Website Usability (Liu et al., 2012)
  • 37.
  • 38. Designing & Optimizing Workflows
  • 39. Workflow Management • How should we balance automation vs. human computation? Who does what? • Who’s the right person for the job? • Juggling constraints on budget, scheduling, quality, effort …
  • 40. What about sensitive data? • Not all data can be publicly disclosed – User data (e.g. AOL query log, Netflix ratings) – Intellectual property – Legal confidentiality • Need to restrict who is in your crowd – Separate channel (workforce) from technology – Hot question for adoption at enterprise level
  • 41. What about regulation? • Wolfson & Lease (ASIS&T’11) • As usual, technology is ahead of the law – employment law – patent inventorship – data security and the Federal Trade Commission – copyright ownership – securities regulation of crowdfunding • Take-away: don’t panic, but be mindful – Understand risks of “just-in-time compliance”
  • 42. What about fraud? • Some reports of robot “workers” on MTurk – Artificial Artificial Artificial Intelligence – Violates terms of service • Why not just use a captcha?
  • 44. Requester Fraud on MTurk “Do not do any HITs that involve: filling in CAPTCHAs; secret shopping; test our web page; test zip code; free trial; click my link; surveys or quizzes (unless the requester is listed with a smiley in the Hall of Fame/Shame); anything that involves sending a text message; or basically anything that asks for any personal information at all—even your zip code. If you feel in your gut it’s not on the level, IT’S NOT. Why? Because they are scams...”
  • 46. Wang et al., WWW’12 • “…not only do malicious crowd-sourcing systems exist, but they are rapidly growing…”
  • 47. Robert Sim, MSR Summit’12
  • 48. Identifying Workers (Uniquely) • Need for identifiable workers – Repeated labeling – Recognizing “Master Workers” • Today – Platforms assign IDs intended to be unique – Problem in practice, esp. with multiple platforms – Sybil attacks • Identity value – If workers interchangeable, identities are disposable – If workers are distinguished, identities become valuable – Reduce some types of attacks, increase others
  • 49. What about ethics? Fort, Adda, and Cohen (2011) • “…opportunities for our community to deliberately value ethics above cost savings.” • Suggest we focus on unpaid games; narrow solution Silberman, Irani, and Ross (2010) • “How should we… conceptualize the role of these people who we ask to power our computing?” • Power dynamics between parties • “Abstraction hides detail”
  • 50. Davis et al. (2010). The HPU.
  • 52. Digital Dirty Jobs • The Googler who Looked at the Worst of the Internet • Policing the Web’s Lurid Precincts • Facebook content moderation • The dirty job of keeping Facebook clean • Even linguistic annotators report stress & nightmares from reading news articles!
  • 53. What about freedom? • Vision: empowering worker freedom: – work whenever you want for whomever you want • Risk: people being compelled to perform work – As crowdsourcing grows, greater $$$ at stake – Digital sweat shops? Digital slaves? – Prisoners used for gold farming – We really don’t know much today – Traction? Human Trafficking at MSR Summit’12
  • 54. Thank You! Students: Past & Present – Catherine Grady (iSchool) – Hyunjoon Jung (iSchool) – Jorn Klinger (Linguistics) – Adriana Kovashka (CS) – Abhimanu Kumar (CS) ir.ischool.utexas.edu/crowd – Hohyon Ryu (iSchool) – Wei Tang (CS) – Stephen Wolfson (iSchool) Support – John P. Commons Fellowship – Temple Fellowship Matt Lease - ml@ischool.utexas.edu - @mattlease