IAC 2024 - IA Fast Track to Search Focused AI Solutions
Crowdsourcing: From Aggregation to Search Engine Evaluation
1. Statistical Crowdsourcing: From Aggregating Judgments to Search Engine Evaluation
Matt Lease
School of Information, University of Texas at Austin
ml@utexas.edu | @mattlease
ir.ischool.utexas.edu
3. Roadmap
• What are Crowdsourcing & Human Computation? 4-16
– A great research area for iSchools: something for everyone!
• Benchmarking Statistical Consensus Methods 18-26
• Psychometrics & Crowds for Relevance Judging 28-35
Matt Lease <ml@utexas.edu>
4. Crowdsourcing
• Jeff Howe. WIRED, June 2006
• Rise of digital work & internet
empowers a global workforce
via open call solicitations
• New application of principles
from open source movement
6. Amazon Mechanical Turk (MTurk)
• Online marketplace for paid labor since 2005
• On-demand, elastic, 24/7 global workforce
• API integrates human labor with computation
7. A New Scale of Labeled Data for AI
Snow et al., EMNLP 2008
• MTurk labels for 5 NLP tasks
• 22K labels for only $26
• While individual annotations are noisy, aggregated
consensus labels show high agreement
with expert labels (“gold”)
8. AI + Human Computation =
A new breed of hybrid intelligent systems
PlateMate (Noronha et al., UIST’11)
9. Social & Behavioral Sciences
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
11. Ethics of Crowdsourcing?
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
12. Who are the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
14. Safeguarding Participant Data
• “What are the characteristics of MTurk workers?... the MTurk
system is set up to strictly protect workers’ anonymity….”
16. Crowdsourcing & the Law: Independent
Contractors vs. Employees
• Wolfson & Lease, ASIS&T’11
• Some platforms classify online contributors as
independent contractors (vs. employees)
• While employment is legally defined (e.g., by the FLSA
and past court decisions), the definition is leaky
• It seems unlikely Congress will provide clarity
• Class action litigation pending in the courts
17. Roadmap
• What are Crowdsourcing & Human Computation? 4-16
• Benchmarking Statistical Consensus Methods 18-26
• Psychometrics & Crowds for Relevance Judging 28-35
18. Science of Measurement & Benchmarks
• “If you cannot measure it, you cannot improve it.”
• Drive field innovation by clear challenge tasks
– e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
• Many things we can learn
– What is the current state-of-the-art?
– How do current methods compare?
– What works, what doesn’t, and why?
– How has the field progressed over time?
19. Finding Consensus in Human Computation
• For an objective labeling task, how do we
resolve disagreement between responses?
• Simple baseline: majority voting
• Research pre-dates crowdsourcing
– Dawid and Skene’79, Smyth et al., ’95
• One of the most studied problems in HCOMP
– Laymen likely to err more than experts
– Methods in many areas: ML, Vision, NLP, IR, DB, …
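The majority-voting baseline mentioned above can be sketched in a few lines of Python (a minimal illustration; the function and variable names here are my own, not from any cited work):

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Resolve each item's label by simple plurality over worker responses.

    labels_by_item: dict mapping item id -> list of worker labels.
    Ties break toward the label seen first (Counter preserves insertion order).
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Three workers judge two documents; "d2" gets a 2-1 split.
votes = {"d1": ["rel", "rel", "rel"], "d2": ["rel", "nonrel", "rel"]}
print(majority_vote(votes))  # {'d1': 'rel', 'd2': 'rel'}
```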
25. Findings
• Majority voting never best, but rarely much worse
• No method performs far better than others
• Each method often best for some condition
– e.g., original dataset method was designed for
• DS (Dawid &amp; Skene) &amp; RY (Raykar et al.) tend to
perform best; RY adds priors to DS
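For concreteness, the Dawid &amp; Skene (’79) approach cited earlier can be sketched as a small EM loop in pure Python. This is an illustrative toy under assumed names and smoothing constants, not the implementation evaluated in the benchmark:

```python
def dawid_skene(votes, n_classes, iters=20):
    """EM sketch of Dawid & Skene (1979): jointly estimate per-item label
    posteriors and per-worker confusion matrices.

    votes: list of (item, worker, label) triples with integer ids/labels.
    Returns dict: item -> list of class posterior probabilities.
    """
    items = sorted({i for i, _, _ in votes})
    workers = sorted({w for _, w, _ in votes})
    # Initialize posteriors from per-item vote fractions (soft majority vote).
    post = {i: [0.0] * n_classes for i in items}
    for i, _, l in votes:
        post[i][l] += 1.0
    for i in items:
        s = sum(post[i])
        post[i] = [p / s for p in post[i]]
    for _ in range(iters):
        # M-step: class priors and worker confusion matrices (with smoothing).
        prior = [sum(post[i][c] for i in items) / len(items)
                 for c in range(n_classes)]
        conf = {w: [[1e-2] * n_classes for _ in range(n_classes)]
                for w in workers}
        for i, w, l in votes:
            for c in range(n_classes):
                conf[w][c][l] += post[i][c]
        for w in workers:
            for c in range(n_classes):
                s = sum(conf[w][c])
                conf[w][c] = [x / s for x in conf[w][c]]
        # E-step: re-score each item from priors and worker confusions.
        post = {i: list(prior) for i in items}
        for i, w, l in votes:
            for c in range(n_classes):
                post[i][c] *= conf[w][c][l]
        for i in items:
            s = sum(post[i])
            post[i] = [p / s for p in post[i]]
    return post
```

Each iteration re-estimates class priors and per-worker confusion matrices from the current soft labels, then re-scores every item; workers whose estimated confusion matrix is near-diagonal thereby get more weight than majority voting would give them.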
26. Why Don’t We See Bigger Gains?
• Of course contributions aren’t just empirical…
• Maybe gold is too noisy to detect improvement?
– Cormack & Kolcz’09, Klebanov & Beigman’10
• Might we see bigger differences from
– Different tasks/scenarios?
– Better benchmark tests?
– Different methods or tuning?
• We invite community contributions!
27. Roadmap
• What are Crowdsourcing & Human Computation? 4-16
• Benchmarking Statistical Consensus Methods 18-26
• Psychometrics & Crowds for Relevance Judging 28-35
28. Multidimensional Relevance Modeling
via Psychometrics and Crowdsourcing
Joint work with
Yinglong Zhang, Jin Zhang, and Jacek Gwizdka
Paper @ SIGIR 2014
29. Background: Evaluating IR Systems
• Classic Cranfield method (Cleverdon et al., 1966)
– Given a document collection & set of queries
– Judge documents for topical relevance to each query
– Evaluate on these queries & documents
• Problem: Scaling manual data labeling is difficult
• Idea: try Crowdsourcing
– Alonso et al. (SIGIR Forum 2008)
– Grady & Lease, 2010
– TREC 2011-2013 Crowdsourcing Track
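Given Cranfield-style judgments, systems are scored with standard rank metrics. As one example, average precision for a single query can be computed as below (a minimal sketch; names are illustrative, not from the slide):

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one query: sum of precision values at the
    ranks of relevant documents, divided by the number of judged-relevant
    documents for the query."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs d1 and d3 retrieved at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
print(round(average_precision(["d1", "d2", "d3"], {"d1", "d3"}), 3))  # 0.833
```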
30. But Problems are Deeper
• User relevance > simple topical relevance
– The Great Divide in IR: systems-centered vs. user-centered
– What other factors to model, & what is their relative
importance? Long history of studies, little consensus.
– Dearth of labeled data for training/evaluating systems
• Even trusted assessors often disagree on “simple”
topical relevance judgments
– Often attributed to subjectivity, but can we do better?
• How do we ensure quality of subjective data?
– Largely unstudied in the HCOMP community to date
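Disagreement between trusted assessors, as noted above, is often quantified with a chance-corrected agreement statistic; a common choice is Cohen's kappa (my illustration; the slide itself does not prescribe a statistic):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two assessors'
    parallel label sequences (undefined if chance agreement is exactly 1)."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two assessors agree on 3 of 4 binary relevance judgments: kappa = 0.5.
print(cohens_kappa(["rel", "rel", "non", "rel"],
                   ["rel", "non", "non", "rel"]))  # 0.5
```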
31. Psychology to the Rescue!
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
33. Key Ideas from Psychometrics
• Use standard survey techniques for collecting
multi-dimensional relevance judgments
– Ask repeated, similar questions, & change polarity
• Analyze data via Structural Equation Modeling
– cousin to graphical models in statistics/AI
– Posit questions associated with latent factors
– Use Exploratory Factor Analysis to determine factors
& question associations, then prune questions
– Use Confirmatory Factor Analysis to assess
correlations, test significance, and compare models
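SEM, EFA, and CFA are typically run in dedicated statistical tools, but two of the survey-design steps named above, reverse-coding polarity-flipped items and checking internal consistency (Cronbach's alpha), can be sketched in plain Python (a 1-5 Likert scale and all names here are assumptions of mine):

```python
def reverse_code(responses, scale_min=1, scale_max=5):
    """Flip a negatively-worded Likert item so all items point the same way."""
    return [scale_max + scale_min - r for r in responses]

def cronbach_alpha(item_scores):
    """Cronbach's alpha: internal-consistency reliability of items presumed
    to measure one latent factor. item_scores: rows = items, cols = respondents."""
    k, n = len(item_scores), len(item_scores[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(var(item) for item in item_scores) / var(totals))

# Two perfectly aligned items yield alpha = 1.0.
print(cronbach_alpha([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]))  # 1.0
```

Low alpha after reverse-coding is one signal that a question should be pruned before the confirmatory stage, which is the pruning step the slide describes.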
35. Future Directions
• Strong foundation for ongoing positivist research on
alternative relevance factors
– For different user groups, search scenarios, etc.
– Need more data to support normative claims
• Train/test operational systems for varying factors
• Improve judging agreement by making task more
natural and/or assessing impact of latent factors
• Intra-subject vs. inter-subject aggregation?
• SEM vs. graphical modeling?
• Other methods for ensuring subjective data quality?
36. The Future of Crowd Work, CSCW’13
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
39. A Few Moral Dilemmas
• A “fair” price for online work in a global economy?
– Is it better to pay nothing (e.g., volunteers, gamification)
rather than pay something small for valuable work?
• Are we obligated to inform people how their
participation / work products will be used?
– If my IRB doesn’t require me to obtain informed consent,
is there some other moral obligation to do so?
• A worker finds his ID posted in a researcher’s online
source code and asks that it be removed. This can’t
be done without recreating the repo, which many
people use. What should be done?
40. Ethical Crowdsourcing
• Assume researchers have good intentions, and
so issues of gross negligence are rare
– Withholding promised pay after work performed
– Not obtaining or complying with IRB oversight
• Instead, the great challenge is how to recognize our
impacts &amp; take appropriate actions in a complex world
– Educating ourselves takes time &amp; effort
– Failing to educate ourselves could cause harm to others
• How can we strike a reasonable balance between
complete apathy vs. being overly alarmist?
41. • Contribute to society and human well-being
• Avoid harm to others
• Be honest and trustworthy
• Be fair and take action not to discriminate
• Respect the privacy of others
COMPLIANCE WITH THE CODE. As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with
membership in the ACM
42. CS2008 Curriculum Update (ACM, IEEE)
There is reasonably wide agreement that this topic of legal, social,
professional and ethical [issues] should feature in all computing degrees.
…financial and economic imperatives …Which approaches are less
expensive and is this sensible? With the advent of outsourcing and
off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning
exploitation… Such matters ought to feature in courses on legal,
ethical and professional practice.
if ethical considerations are covered only in the standalone course and
not “in context,” it will reinforce the false notion that technical processes
are void of ethical issues. Thus it is important that several traditional
courses include modules that analyze ethical considerations in the
context of the technical subject matter … It would be explicitly against
the spirit of the recommendations to have only a standalone course.
43. “Contribute to society and human
well-being; avoid harm to others”
• Do we have a moral obligation to try to ascertain
conditions under which work is performed? Or the
impact we have upon those performing the work?
• Do we feel differently when work is performed by
– Political refugees? Children? Prisoners? Disabled?
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
– Does it matter why someone accepts offered work?
44. Some Notable Prior Research
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these people
who we ask to power our computing?”
– “abstraction hides detail” - some details may be worth
keeping conspicuously present (Jessica Hullman)
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders of
innovative technologies, rather than employers unconcerned
with working conditions.”
– “…human computation currently relies on worker invisibility.”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately value
ethics above cost savings.”
45. Power Asymmetry on MTurk
• Mistakes happen, such as wrongly rejecting work – e.g., error by
new student, software bug, poor instructions, noisy gold, etc.
• How do we balance the harm caused by our mistakes to workers
(our liability) vs. our cost/effort of preventing such mistakes?
46. Task Decomposition
• By minimizing context, greater task efficiency &amp;
accuracy can often be achieved in practice
– e.g. “Can you name who is in this photo?”
• Much research on ways to streamline work
and decompose complex tasks
47. Context & Informed Consent
• Assume we wish to obtain informed consent
• Without context, consent cannot be informed
– Zittrain, Ubiquitous human computing (2008)
48. Consequences of Human Computation
as a Panacea where AI Falls Short
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
49. What about Freedom?
• Crowdsourcing vision: empowering freedom
– work whenever you want for whomever you want
• Risk: people compelled to perform work
– Chinese prisoners farming gold online
– Digital sweat shops? Digital slaves?
– We know relatively little today about work conditions
– How might we monitor and mitigate the risk/growth of
crowd work inflicting harm on at-risk populations?
– Traction? Human Trafficking at MSR Summit’12
51. Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
52. Additional References
• Irani, Lilly C. The Ideological Work of Microwork. In preparation,
draft available online.
• Adda, Gilles, et al. Crowdsourcing for language resource
development: Critical analysis of Amazon Mechanical Turk
overpowering use. Proceedings of the 5th Language and Technology
Conference (LTC). 2011.
• Adda, Gilles, and Joseph J. Mariani. Economic, Legal and Ethical
analysis of Crowdsourcing for Speech Processing. (2013).
• Harris, Christopher G., and Padmini Srinivasan. Crowdsourcing and
Ethics. Security and Privacy in Social Networks. 67-83. 2013.
• Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side
to Crowdsourcing. IEEE 3rd conference on social computing
(socialcom). 2011.
• Horton, John J. The condition of the Turking class: Are online
employers fair and honest?. Economics Letters 111.1 (2011): 10-12.
53. Additional References (2)
• Bederson, B. B., &amp; Quinn, A. J. Web Workers Unite! Addressing Challenges
of Online Laborers. In CHI 2011 Human Computation Workshop, 97-106.
• Bederson, B. B., &amp; Quinn, A. J. Participation in Human Computation. In
CHI 2011 Human Computation Workshop.
• Felstiner, Alek. Working the Crowd: Employment and Labor Law in the
Crowdsourcing Industry. Berkeley J. Employment &amp; Labor Law 32.1. 2011.
• Felstiner, Alek. Sweatshop or Paper Route?: Child Labor Laws and
In-Game Work. CrowdConf (2010).
• Larson, Martha. Toward Responsible and Sustainable Crowdsourcing.
Blog post + slides from Dagstuhl, September 2013.
• Vili Lehdonvirta and Paul Mezier. Identity and Self-Organization in
Unstructured Work. Unpublished working paper. 16 October 2013.
• Zittrain, Jonathan. Minds for Sale. YouTube.