Crowdsourcing for Search Evaluation and
      Social-Algorithmic Search

                        Matthew Lease
                        University of Texas at Austin

                        Omar Alonso
                        Microsoft

                        August 12, 2012




August 12, 2012                                         1
Topics
• Crowd-powered data collection & applications
      – Evaluation: relevance judging, interactive studies, log data
      – Training: e.g., active learning (e.g. learning to rank)
      – Search: answering, verification, collaborations, physical
• Crowdsourcing & human computation
• Crowdsourcing platforms
• Incentive Engineering & Demographics
• Designing for Crowds & Quality assurance
• Future Challenges
• Broader Issues and the Dark Side
August 12, 2012                                                        2
What is Crowdsourcing?
• Let’s start with an example and work back
  toward a more general definition
• Example: Amazon’s Mechanical Turk (MTurk)
• Goal
      – See a concrete example of real crowdsourcing
      – Ground later discussion of abstract concepts
      – Provide a specific example with which we will
        contrast other forms of crowdsourcing

August 12, 2012                                         3
Human Intelligence Tasks (HITs)




August 12, 2012                          4
August 12, 2012   5
Jane saw the man with the binoculars




August 12, 2012                           6
Traditional Data Collection
• Setup data collection software / harness
• Recruit participants / annotators / assessors
• Pay a flat fee for experiment or hourly wage

• Characteristics
      –    Slow
      –    Expensive
      –    Difficult and/or Tedious
      –    Sample Bias…

August 12, 2012                                   7
“Hello World” Demo
• Let’s create and run a simple MTurk HIT
• This is a teaser highlighting concepts
      – Don’t worry about details; we’ll revisit them
• Goal
      – See a concrete example of real crowdsourcing
      – Ground our later discussion of abstract concepts
      – Provide a specific example with which we will
        contrast other forms of crowdsourcing

August 12, 2012                                            8
DEMO


August 12, 2012   9
Flip a coin
• Please flip a coin and report the results
• Two questions
      1. Coin type?
      2. Heads or tails?
• Results
            Coin type      Count           Result         Count
            Dollar            56           Heads             57
            Euro              11           Tails             43
            Other             30
            (blank)            3
            Grand Total      100           Grand Total      100
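A minimal sketch of how such counts could be tallied from the downloaded batch results CSV. The answer column names (Answer.cointype, Answer.toss) are assumptions; they depend on the form field names used in the HIT.

import csv
from collections import Counter

coin_types, tosses = Counter(), Counter()
with open("Batch_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # "Answer.<field>" columns hold the submitted form values
        coin_types[row.get("Answer.cointype", "").strip() or "(blank)"] += 1
        tosses[row.get("Answer.toss", "").strip() or "(blank)"] += 1

print("Coin type:", dict(coin_types))
print("Toss:     ", dict(tosses))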


August 12, 2012                                          10
NOW WHAT CAN I DO WITH IT?


August 12, 2012                 11
PHASE 1:
   COLLECTING & LABELING DATA

August 12, 2012                 12
Data is King!
• Massive free Web data
  changed how we train
  learning systems
  – Banko and Brill (2001).
    Human Language Tech.
  – Halevy et al. (2009). IEEE
    Intelligent Systems.

 • Crowds provide new access to cheap & labeled
   Big Data. But quality also matters!
  August 12, 2012                                 13
NLP: Snow et al. (EMNLP 2008)
• MTurk annotation for 5 Tasks
    – Affect recognition
    – Word similarity
    – Recognizing textual entailment
    – Event temporal ordering
    – Word sense disambiguation
• 22K labels for US $26
• High agreement between
  consensus labels and
  gold-standard labels
August 12, 2012                        14
Computer Vision:
     Sorokin & Forsyth (CVPR 2008)
• 4K labels for US $60




August 12, 2012                       15
IR: Alonso et al. (SIGIR Forum 2008)
• MTurk for Information Retrieval (IR)
      – Judge relevance of search engine results
• Many follow-on studies (design, quality, cost)




August 12, 2012                                    16
User Studies: Kittur, Chi, & Suh (CHI 2008)



• “…make creating believable invalid responses as
  effortful as completing the task in good faith.”




 August 12, 2012                                 17
Social & Behavioral Sciences
• A Guide to Behavioral Experiments
  on Mechanical Turk
    – W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
    – L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
  Insights from Mechanical Turk
    – Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk : A New Source of
  Inexpensive, Yet High-Quality, Data?
    – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
    – see also: Amazon Mechanical Turk Guide for Social Scientists
 August 12, 2012                                                     18
August 12, 2012   19
Remote Usability Testing
• Liu, Bias, Lease, and Kuipers, ASIS&T, 2012
• Compares remote usability testing using MTurk and
  CrowdFlower (not uTest) vs. traditional on-site testing
• Advantages
      – More (Diverse) Participants
      – High Speed
      – Low Cost
• Disadvantages
      –    Lower Quality Feedback
      –    Less Interaction
      –    Greater need for quality control
      –    Less Focused User Groups
August 12, 2012                                             20
August 12, 2012   21
NLP Example – Dialect Identification




August 12, 2012                            22
NLP Example – Machine Translation
• Manual evaluation on translation quality is
  slow and expensive
• High agreement between non-experts and
  experts
• $0.10 to translate a sentence


            C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality
                             Using Amazon’s Mechanical Turk”, EMNLP 2009.


August 12, 2012                                                                             23
Computer Vision – Painting Similarity




                     Kovashka & Lease, CrowdConf’10

August 12, 2012                                       24
IR Example – Relevance and ads




August 12, 2012                        25
IR Example – Product Search




August 12, 2012                          26
IR Example – Snippet Evaluation
•    Study on summary lengths
•    Determine preferred result length
•    Asked workers to categorize web queries
•    Asked workers to evaluate snippet quality
•    Payment between $0.01 and $0.05 per HIT


    M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008.




August 12, 2012                                                                                                             27
IR Example – Relevance Assessment
•    Replace TREC-like relevance assessors with MTurk?
•    Selected topic “space program” (011)
•    Modified original 4-page instructions from TREC
•    Workers more accurate than original assessors!
•    40% provided justification for each answer


    O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop
    on the Future of IR Evaluation, 2009.



    August 12, 2012                                                                                                                28
IR Example – Timeline Annotation
• Workers annotate timeline on politics, sports, culture
• Given a timex (1970s, 1982, etc.) suggest an event
• Given an event (Vietnam, World cup, etc.) suggest a timex




 K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010




 August 12, 2012                                                                                                            29
COLLECTING DATA WITH OTHER
   CROWDS & OTHER INCENTIVES

August 12, 2012                 30
Why Eytan Adar hates MTurk Research
      (CHI 2011 CHC Workshop)
• Overly-narrow focus on MTurk
    – Identify general vs. platform-specific problems
    – Academic vs. Industrial problems
• Inattention to prior work in other disciplines
• Turkers aren’t Martians
    – Just human behavior (more later…)



 August 12, 2012                                        31
ESP Game (Games With a Purpose)
L. Von Ahn and L. Dabbish (2004)




August 12, 2012                    32
reCaptcha




L. von Ahn et al. (2008). In Science.
August 12, 2012                         33
Human Sensing and Monitoring
• Sullivan et al. (2009). Bio. Conservation (142):10
• Keynote by Steve Kelling at ASIS&T 2011




 August 12, 2012                                  34
Page Hunt
   •    Learning to map from web pages to queries
   •    Human computation game to elicit data
   •    Home grown system (no AMT)
   •    Try it!
             pagehunt.msrlivelabs.com


See also:
• H. Ma. et al. “Improving Search Engines Using Human Computation Games”, CIKM 2009.
• Law et al. SearchWar. HCOMP 2009.
• Bennett et al. Picture This. HCOMP 2009.
   August 12, 2012                                                                     35
Tracking Sentiment in Online Media
Brew et al., PAIS 2010
• Volunteer-crowd
• Judge in exchange for
  access to rich content
• Balance system needs
  with user interest
• Daily updates to non-
  stationary distribution
August 12, 2012                           36
PHASE 2: FROM DATA COLLECTION
   TO HUMAN COMPUTATION

August 12, 2012                    37
Human Computation
• What was old is new

• Crowdsourcing: A New Branch
  of Computer Science
  – D.A. Grier, March 29, 2011


• Tabulating the heavens:
  computing the Nautical
  Almanac in 18th-century
  England - M. Croarken’03
                                  [pictured book: Princeton University Press, 2005]
  August 12, 2012                                          38
The Mechanical Turk




Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804)


         J. Pontin. Artificial Intelligence, With Help From
         the Humans. New York Times (March 25, 2007)

August 12, 2012                                                         39
The Human Processing Unit (HPU)
• Davis et al. (2010)







August 12, 2012                 40
Human Computation
• Having people do stuff instead of computers
• Investigates use of people to execute certain
  computations for which capabilities of current
  automated methods are more limited
• Explores the metaphor of computation for
  characterizing attributes, capabilities, and
  limitations of human performance in executing
  desired tasks
• Computation is required, crowd is not
• von Ahn’s Thesis (2005), Law & von Ahn (2011)
August 12, 2012                                    41
APPLYING HUMAN COMPUTATION:
   CROWD-POWERED APPLICATIONS

August 12, 2012                  42
Crowd-Assisted Search: “Amazon Remembers”




 August 12, 2012                      43
Crowd-Assisted Search (2)

• Yan et al., MobiSys’10




• CrowdTerrier
  (McCreadie et al., SIGIR’12)


 August 12, 2012                               44/11
Translation by monolingual speakers
• C. Hu, CHI 2009




August 12, 2012                           45
Soylent: A Word Processor with a Crowd Inside

 • Bernstein et al., UIST 2010




 August 12, 2012                          46
fold.it
S. Cooper et al. (2010)




Alice G. Walton. Online Gamers Help Solve Mystery of
Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.
August 12, 2012                                        47
PlateMate (Noronha et al., UIST’11)




August 12, 2012                        48/11
Image Analysis and more: Eatery




August 12, 2012                       49
VizWiz
Bigham et al. (UIST 2010)




August 12, 2012                         50/11
August 12, 2012   51/11
Crowd Sensing: Waze




August 12, 2012                         52
THE SOCIAL SIDE OF SEARCH

August 12, 2012             53
People are more than HPUs
• Why is Facebook popular? People are social.
• Information needs are contextually grounded in
  our social experiences and social networks
• The value of social search may be more than
  the relevance of the search results
• Our social networks also embody additional
  knowledge about us, our needs, and the world
The social dimension complements computation
August 12, 2012                              54
Community Q&A




August 12, 2012                   55/53
August 12, 2012   56
Complex Information Needs
 Who is Rahm Emanuel, Obama's Chief of Staff?
 How have dramatic shifts in terrorists resulted in an
equally dramatic shift in terrorist organizations?
 How do I find what events were in the news on my sons
birthday?
 Do you think the current drop in the Stock Market is
related to Obama's election to President?
 Why are prisoners on death row given final medicals?
 Should George Bush attack Iran's nuclear facility
before he leaves office?
 Why are people against gay marriage?
 Does anyone know anything interesting that happened
nation wide in 2008?
 Should the fact that a prisoner has cancer have any
bearing on an appeal for bail?
  August 12, 2012   Source: Yahoo! Answers, “News & Events”, Nov. 6 2008   57
Community Q&A
• Ask the village vs. searching the archive
• Posting and waiting can be slow
      – Find similar questions already answered
• Best answer (winner-take-all) vs. voting
• Challenges
      – Questions shorter than documents
      – Questions not queries, colloquial, errors
      – Latency & quality (e.g. question routing)
• Cf. work by Bruce Croft & students
August 12, 2012                                     58
Horowitz & Kamvar, WWW’10
• Routing: Trust vs. Authority
• Social networks vs. search engines
   – See also: Morris & Teevan, HCIC’12
  August 12, 2012                         59
Social Network integration
• Facebook Questions (with Bing)
• Google+ (acquired Aardvark)
• Twitter (cf. Paul, Hong, and Chi, ICWSM’11)




August 12, 2012                                 60
Search Buddies
Hecht et al. ICWSM 2012; Morris MSR Talk




August 12, 2012                            61
{where to go on vacation}




     Search engine:
     • Tons of results
     • Read title + snippet + URL
     • Explore a few pages in detail

     Crowd answers:
     • MTurk: 50 answers, $1.80
     • Quora: 2 answers
     • Y! Answers: 2 answers
     • FB: 1 answer
August 12, 2012                                                      62
{where to go on vacation}
  Countries
                                              Cities




August 12, 2012                                        63
{where to go on vacation}
• Let’s execute the same query on different days
                  Execution #1       Execution #2       Execution #3
                  Las Vegas      3   Kerala         6   Las Vegas          4
                  Hawaii         2   Goa            4   Himachal pradesh   3
                  Kerala         2   Ooty           3   Mauritius          2
                  Key West       2   Switzerland    3   Ooty               2
                  Orlando        2   Agra           2
                                     kodaikanal     2
                                     New Zealand    2


•    Tables show places with frequency >= 2
•    Every execution uses the same template & 50 workers
•    Completion time was roughly the same across executions
•    Results may differ
•    Related work: Zhang et al., CHI 2012
August 12, 2012                                                                64
SO WHAT IS CROWDSOURCING?

August 12, 2012                           65
August 12, 2012   66
From Outsourcing to Crowdsourcing
• Take a job traditionally
  performed by a known agent
  (often an employee)
• Outsource it to an undefined,
  generally large group of
  people via an open call
• New application of principles
  from open source movement
• Evolving & broadly defined ...
 August 12, 2012                         67
Crowdsourcing models
•    Micro-tasks & citizen science
•    Co-Creation
•    Open Innovation, Contests
•    Prediction Markets
•    Crowd Funding and Charity
•    “Gamification” (not serious gaming)
•    Transparent
•    cQ&A, Social Search, and Polling
•    Physical Interface/Task
August 12, 2012                            68
What is Crowdsourcing?
• A set of mechanisms and methods for scaling &
  directing crowd activities to achieve some goal(s)
• Enabled by internet-connectivity
• Many related topics/areas:
      –    Human computation (next slide…)
      –    Collective intelligence
      –    Crowd/Social computing
      –    Wisdom of Crowds
      –    People services, Human Clouds, Peer-production, …
August 12, 2012                                                69
What is not crowdsourcing?
• Post-hoc use of pre-existing crowd data
      – Data mining
      – Visual analytics
• Use of one or few people
      – Mixed-initiative design
      – Active learning
• Conducting a survey or poll… (*)


August 12, 2012                             70
Crowdsourcing Key Questions
• What are the goals?
      – Purposeful directing of human activity

• How can you incentivize participation?
      – Incentive engineering
      – Who are the target participants?

• Which model(s) are most appropriate?
      – How to adapt them to your context and goals?
August 12, 2012                                        71
What do you want to accomplish?
• Create
• Execute task/computation
• Fund
• Innovate and/or discover
• Learn
• Monitor
• Predict
August 12, 2012                 72
INCENTIVE ENGINEERING


August 12, 2012            73
Who are
the workers?


• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010.
   The New Demographics of Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
 August 12, 2012                                       74
MTurk Demographics
• 2008-2009 studies found
  less global and diverse
  than previously thought
      – US
      – Female
      – Educated
      – Bored
      – Money is secondary

August 12, 2012                        75
2010 shows increasing diversity
47% US, 34% India, 19% other (P. Ipeirotis. March 2010)




 August 12, 2012                                       76
Why should your crowd participate?
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige (leaderboards, badges)
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

Multiple incentives can often operate in parallel (*caveat)
August 12, 2012                                           77
Example: Wikipedia
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                        78
Example: DuoLingo
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                       79
Example:
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                    80
Example: ESP
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                    81
Example: fold.it
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                      82
Example: FreeRice
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                       83
Example: cQ&A
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                    84
Example: reCaptcha
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)               Is there an existing human
                                   activity you can harness
• Learn something new              for another purpose?

• Obtain something else
• Create self-serving resource

August 12, 2012                                            85
Example: Mechanical Turk
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

August 12, 2012                    86
How Much to Pay?
• Price commensurate with task effort
      – Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback
• Ethics & market-factors: W. Mason and S. Suri, 2010.
      – e.g. non-profit Samasource contracts workers in refugee camps
      – Predict right price given market & task: Wang et al. CSDM’11
• Uptake & time-to-completion vs. Cost & Quality
      – Too little $$, no interest or slow – too much $$, attract spammers
      – Real problem is lack of reliable QA substrate
• Accuracy & quantity
      – More pay = more work, not better (W. Mason and D. Watts, 2009)
• Heuristics: start small, watch uptake and bargaining feedback
• Worker retention (“anchoring”)
See also: L.B. Chilton et al. KDD-HCOMP 2010.
   August 12, 2012                                                        87
Dan Pink – YouTube video
“The Surprising Truth about what Motivates us”




August 12, 2012                              88
PLATFORMS


August 12, 2012   89
Mechanical What?




August 12, 2012                      90
Does anyone really use it? Yes!




   http://www.mturk-tracker.com (P. Ipeirotis’10)

From 1/09 – 4/10, 7M HITs from 10K requestors
worth $500,000 USD (significant under-estimate)
 August 12, 2012                                  91
MTurk: The Requester
•    Sign up with your Amazon account
•    Amazon payments
•    Purchase prepaid HITs
•    There is no minimum or up-front fee
•    MTurk collects a 10% commission
•    The minimum commission charge is $0.005 per HIT
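A back-of-the-envelope budget sketch applying the fee rules above (10% commission, $0.005 minimum charge). The batch size, redundancy, and reward are made-up numbers, and the minimum charge is applied per assignment here as a simplification.

# Sketch: estimate total cost of a batch under the 2012 fee rules above.
def batch_cost(num_hits, assignments_per_hit, reward_per_assignment):
    worker_pay = num_hits * assignments_per_hit * reward_per_assignment
    # 10% commission, with the $0.005 minimum applied per assignment here
    fee = num_hits * assignments_per_hit * max(0.10 * reward_per_assignment, 0.005)
    return worker_pay + fee

# Example: 1,000 documents, 5 judgments each, $0.02 per judgment.
print("Total cost: $%.2f" % batch_cost(1000, 5, 0.02))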




August 12, 2012                                        92
MTurk Dashboard
• Three tabs
      – Design
      – Publish
      – Manage
• Design
      – HIT Template
• Publish
      – Make work available
• Manage
      – Monitor progress


August 12, 2012                     93
August 12, 2012   94
MTurk: Dashboard - II




August 12, 2012                           95
MTurk API
•    Amazon Web Services API
•    Rich set of services
•    Command line tools
•    More flexibility than dashboard




August 12, 2012                        96
MTurk Dashboard vs. API
• Dashboard
      – Easy to prototype
      – Setup and launch an experiment in a few minutes
• API
      – Ability to integrate AMT as part of a system (see the sketch below)
      – Ideal if you want to run experiments regularly
      – Schedule tasks
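As one example of API-level integration, here is a sketch using the boto 2 Python library (listed later under "Tools and Packages for MTurk") against the MTurk sandbox. The credentials, URL, title, and prices are placeholders, and the parameter names are from memory of the boto 2 bindings; verify them against the documentation of the version you install.

# Sketch only: create a HIT programmatically via boto 2's MTurk bindings.
import datetime
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import ExternalQuestion

conn = MTurkConnection(aws_access_key_id="YOUR_KEY",
                       aws_secret_access_key="YOUR_SECRET",
                       host="mechanicalturk.sandbox.amazonaws.com")  # sandbox first

question = ExternalQuestion(external_url="https://example.com/judge.html",
                            frame_height=600)

conn.create_hit(question=question,
                title="Judge the relevance of a web page",
                description="Read a query and a page, then give a binary judgment.",
                keywords="relevance, search, judgment",
                reward=0.02,                                # dollars per assignment
                max_assignments=5,                          # redundant judgments
                duration=datetime.timedelta(minutes=10))    # time allowed per assignment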


August 12, 2012                                          97
CrowdFlower
• Multiple Channels
• Gold-based tests
• Only pay for
  “trusted” judgments




  August 12, 2012       98
CloudFactory
•    Information below from Mark Sears (Oct. 18, 2011)
•    Cloud Labor API
      – Tools to design virtual assembly lines
      – workflows with multiple tasks chained together
•    Focus on self-serve tools for people to easily design crowd-powered assembly lines
     that can be easily integrated into software applications
•    Interfaces: command-line, RESTful API, and Web
•    Each “task station” can have either a human or robot worker assigned
      – web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can
        be combined with human computation
•    Many built-in "best practices"
      – “Tournament Stations” where multiple results are compared by other cloud workers until
        confidence in the best answer is reached
      – “Improver Stations” have workers improve and correct work by other workers
      – Badges are earned by cloud workers passing tests created by requesters
      – Training and tools to create skill tests will be flexible
      – Algorithms to detect and kick out spammers/cheaters/lazy/bad workers


August 12, 2012                                                                                     99
More Crowd Labor Platforms
•     Clickworker
•     CloudCrowd
•     CrowdSource
•     DoMyStuff
•     Humanoid (by Matt Swanson et al.)
•     Microtask
•     MobileWorks (by Anand Kulkarni)
•     myGengo
•     SmartSheet
•     vWorker
•     Industry heavy-weights
       –    Elance
       –    Liveops
       –    oDesk
       –    uTest
• and more…

    August 12, 2012                         100
Platform alternatives
• Why MTurk
      – Amazon brand, lots of research papers
      – Speed, price, diversity, payments
• Why not
      – Crowdsourcing != MTurk
      – Spam, no analytics, must build tools for worker & task quality
• Microsoft Universal Human Relevance System (UHRS)
• How to build your own crowdsourcing platform
      –    Back-end
      –    Template language for creating experiments
      –    Scheduler
      –    Payments?
August 12, 2012                                                  101
Why Micro-Tasks?
• Easy, cheap and fast
• Ready-to use infrastructure, e.g.
      – MTurk payments, workforce, interface widgets
      – CrowdFlower quality control mechanisms, etc.
      – Many others …
• Allows early, iterative, frequent trials
      – Iteratively prototype and test new ideas
      – Try new tasks, test when you want & as you go
• Many successful examples of use reported
August 12, 2012                                         102
Micro-Task Issues
• Process
      – Task design, instructions, setup, iteration
• Choose crowdsourcing platform (or roll your own)
• Human factors
      – Payment / incentives, interface and interaction design,
        communication, reputation, recruitment, retention
• Quality Control / Data Quality
      – Trust, reliability, spam detection, consensus labeling


August 12, 2012                                                  103
WORKFLOW DESIGN


August 12, 2012      104
PlateMate - Architecture




August 12, 2012                              105
Kulkarni et al.,
CSCW 2012

Turkomatic




  August 12, 2012   106
CrowdForge: Workers perform a task
       or further decompose it




       Kittur et al., CHI 2011


August 12, 2012                     107
Kittur et al., CrowdWeaver, CSCW 2012




August 12, 2012                    108
DESIGNING FOR CROWDS


August 12, 2012           109
August 12, 2012   110
Typical Workflow
•    Define and design what to test
•    Sample data
•    Design the experiment
•    Run experiment
•    Collect data and analyze results
•    Quality control



August 12, 2012                         111
Development Framework
• Incremental approach
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks




August 12, 2012                             112
Survey Design
•    One of the most important parts
•    Part art, part science
•    Instructions are key
•    Prepare to iterate




August 12, 2012                        113
Questionnaire Design
• Ask the right questions
• Workers may not be IR experts so don’t
  assume the same understanding in terms of
  terminology
• Show examples
• Hire a technical writer
      – Engineer writes the specification
      – Writer communicates

August 12, 2012                               114
UX Design
• Time to apply all those usability concepts
• Generic tips
      – Experiment should be self-contained.
      – Keep it short and simple. Brief and concise.
      – Be very clear with the relevance task.
      – Engage with the worker. Avoid boring stuff.
      – Always ask for feedback (open-ended question) in
        an input box.

August 12, 2012                                       115
UX Design - II
•    Presentation
•    Document design
•    Highlight important concepts
•    Colors and fonts
•    Need to grab attention
•    Localization



August 12, 2012                     116
Examples - I
• Asking too much, task not clear, “do NOT/reject”
• Worker has to do a lot of stuff




August 12, 2012                                      117
Example - II
• Lot of work for a few cents
• Go here, go there, copy, enter, count …




August 12, 2012                             118
A Better Example
• All information is available
      – What to do
      – Search result
      – Question to answer




August 12, 2012                       119
August 12, 2012   120
Form and Metadata
• Form with a closed question (binary relevance) and
  open-ended question (user feedback)
• Clear title, useful keywords
• Workers need to find your task




August 12, 2012                                       121
Relevance Judging – Example I




August 12, 2012                         122
Relevance Judging – Example II




August 12, 2012                              123
Implementation
• Similar to a UX
• Build a mock up and test it with your team
      – Yes, you need to judge some tasks
• Incorporate feedback and run a test on MTurk
  with a very small data set
      – Time the experiment
      – Do people understand the task?
• Analyze results
      – Look for spammers
      – Check completion times (see the sketch below)
• Iterate and modify accordingly
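A sketch of the completion-time check above. It assumes the MTurk batch results CSV with its WorkerId and WorkTimeInSeconds columns; the 25% threshold is an arbitrary illustrative choice.

# Sketch: flag workers whose median completion time is suspiciously low.
import csv
from collections import defaultdict
from statistics import median

times = defaultdict(list)
with open("Batch_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        times[row["WorkerId"]].append(float(row["WorkTimeInSeconds"]))

overall = median(t for ts in times.values() for t in ts)
for worker, ts in times.items():
    if median(ts) < 0.25 * overall:     # far faster than the batch as a whole
        print("Review worker %s: median %.0fs vs overall %.0fs"
              % (worker, median(ts), overall))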
August 12, 2012                                  124
Implementation – II
• Introduce quality control
      – Qualification test
      – Gold answers (honey pots)
•    Adjust passing grade and worker approval rate
•    Run experiment with new settings & same data
•    Scale on data
•    Scale on workers

August 12, 2012                                125
Experiment in Production
•    Lots of tasks on MTurk at any moment
•    Need to grab attention
•    Importance of experiment metadata
•    When to schedule
      – Split a large task into batches and keep a single
        batch in the system at a time
      – Always review feedback from batch n before
        uploading n+1
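A sketch of the batching discipline just described: one batch live at a time, and feedback from batch n reviewed before batch n+1 is uploaded. publish_batch() and feedback_ok() are hypothetical hooks standing in for your own upload and review steps (dashboard, API, or manual inspection).

# Sketch: schedule a large task as sequential batches with a review gate.
def run_in_batches(items, batch_size, publish_batch, feedback_ok):
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for n, batch in enumerate(batches, start=1):
        publish_batch(batch)                 # only this batch is in the system
        if not feedback_ok(n):               # review worker feedback for batch n
            print("Stopping after batch %d; fix the task design first." % n)
            break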

August 12, 2012                                             126
Other design principles
• Text alignment
• Legibility
• Reading level: complexity of words and sentences
• Attractiveness (worker’s attention & enjoyment)
• Multi-cultural / multi-lingual
• Who is the audience (e.g. target worker community)
      – Special needs communities (e.g. simple color blindness)
• Parsimony
• Cognitive load: mental rigor needed to perform task
• Exposure effect
August 12, 2012                                                   127
The human side
• As a worker
    –    I hate when instructions are not clear
    –    I’m not a spammer – I just don’t get what you want
    –    Boring task
    –    A good pay is ideal but not the only condition for engagement
• As a requester
    – Attrition
    – Balancing act: a task that would produce the right results and
      is appealing to workers
    – I want your honest answer for the task
    – I want qualified workers; system should do some of that for me
• Managing crowds and tasks is a daily activity
    – more difficult than managing computers
 August 12, 2012                                                  128
Things that work
•    Qualification tests
•    Honey-pots
•    Good content and good presentation
•    Economy of attention
•    Things to improve
      – Manage workers at different levels of expertise,
        including spammers and borderline cases.
      – Mix different pools of workers based on different
        profile and expertise levels.

August 12, 2012                                         129
Things that need work
• UX and guidelines
      – Help the worker
      – Cost of interaction
•    Scheduling and refresh rate
•    Exposure effect
•    Sometimes we just don’t agree
•    How crowdsourcable is your task

August 12, 2012                           130
RELEVANCE JUDGING & CROWDSOURCING

August 12, 2012                         131
August 12, 2012   132
Motivating Example: Relevance Judging

• Relevance of search results is difficult to judge
      – Highly subjective
      – Expensive to measure
• Professional editors commonly used
• Potential benefits of crowdsourcing
      – Scalability (time and cost)
      – Diversity of judgments


August 12, 2012                                  133
August 12, 2012   134
Started with a joke …




August 12, 2012         135
Results for {idiot} at WSDM 2011
February 2011: 5/7 (R), 2/7 (NR)
    –   Most of the time those TV reality stars have absolutely no talent. They do whatever
        they can to make a quick dollar. Most of the time the reality tv stars don not have
        a mind of their own.   R
    –   Most are just celebrity wannabees. Many have little or no talent, they just want
        fame. R
    –   I can see this one going both ways. A particular sort of reality star comes to
        mind, though, one who was voted off Survivor because he chose not to use his
        immunity necklace. Sometimes the label fits, but sometimes it might be unfair. R
    –   Just because someone else thinks they are an "idiot", doesn't mean that is what the
        word means. I don't like to think that any one person's photo would be used to
        describe a certain term.   NR
    –   While some reality-television stars are genuinely stupid (or cultivate an image of
        stupidity), that does not mean they can or should be classified as "idiots." Some
        simply act that way to increase their TV exposure and potential earnings. Other
        reality-television stars are really intelligent people, and may be considered as
        idiots by people who don't like them or agree with them. It is too subjective an
        issue to be a good result for a search engine. NR
    –   Have you seen the knuckledraggers on reality television? They should be required to
        change their names to idiot after appearing on the show. You could put numbers
        after the word idiot so we can tell them apart. R
    –   Although I have not followed too many of these shows, those that I have encountered
        have for a great part a very common property. That property is that most of the
        participants involved exhibit a shallow self-serving personality that borders on
        social pathological behavior. To perform or act in such an abysmal way could only
        be an act of an idiot. R
 August 12, 2012                                                                      136
Two Simple Examples of MTurk
1. Ask workers to classify a query
2. Ask workers to judge document relevance

Steps
• Define high-level task
• Design & implement interface & backend
• Launch, monitor progress, and assess work
• Iterate design

August 12, 2012                               137
Query Classification Task
•    Ask the user to classify a query
•    Show a form that contains a few categories
•    Upload a few queries (~20; input file sketched below)
•    Use 3 workers
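A sketch of preparing the input file for such a template-based HIT. It assumes the template references a ${query} placeholder, so the CSV needs a matching "query" header; the example queries are illustrative.

# Sketch: build the input file for a query-classification HIT template.
import csv

queries = ["world cup 2010 schedule", "jaguar", "cheap flights to austin"]
with open("query_classification_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])              # must match the ${query} placeholder
    writer.writerows([q] for q in queries)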




August 12, 2012                                   138
DEMO


August 12, 2012   139
August 12, 2012   140
Relevance Judging Task
• Use a few documents from a standard
  collection used for evaluating search engines
• Ask user to make binary judgments
• Modification: graded judging
• Use 5 workers




August 12, 2012                                   141
DEMO


August 12, 2012   142
Content quality
• People like to work on things that they like
• TREC ad-hoc vs. INEX
      – TREC experiments took twice as long to complete
      – INEX (Wikipedia), TREC (LA Times, FBIS)
• Topics
      – INEX: Olympic games, movies, salad recipes, etc.
      – TREC: cosmic events, Schengen agreement, etc.
• Content and judgments according to modern times
      – Airport security docs are pre 9/11
      – Antarctic exploration (global warming )

August 12, 2012                                            143
Content quality - II
• Document length
• Randomize content
• Avoid worker fatigue
      – Judging 100 documents on the same subject can
        be tiring, leading to decreasing quality




August 12, 2012                                         144
Presentation
• People scan documents for relevance cues
• Document design
• Highlighting no more than 10%




August 12, 2012                              145
Presentation - II




August 12, 2012                       146
Relevance justification
• Why settle for a label?
• Let workers justify answers
      – cf. Zaidan et al. (2007) “annotator rationales”
• INEX
      – 22% of assignments with comments
• Must be optional
• Let’s see how people justify

August 12, 2012                                           147
“Relevant” answers
 [Salad Recipes]
 Doesn't mention the word 'salad', but the recipe is one that could be considered a
    salad, or a salad topping, or a sandwich spread.
 Egg salad recipe
 Egg salad recipe is discussed.
 History of salad cream is discussed.
 Includes salad recipe
 It has information about salad recipes.
 Potato Salad
 Potato salad recipes are listed.
 Recipe for a salad dressing.
 Salad Recipes are discussed.
 Salad cream is discussed.
 Salad info and recipe
 The article contains a salad recipe.
 The article discusses methods of making potato salad.
 The recipe is for a dressing for a salad, so the information is somewhat narrow for
    the topic but is still potentially relevant for a researcher.
 This article describes a specific salad. Although it does not list a specific recipe,
    it does contain information relevant to the search topic.
 gives a recipe for tuna salad
 relevant for tuna salad recipes
 relevant to salad recipes
 this is on-topic for salad recipes




August 12, 2012                                                                    148
“Not relevant” answers
[Salad Recipes]
About gaming not salad recipes.
Article is about Norway.
Article is about Region Codes.
Article is about forests.
Article is about geography.
Document is about forest and trees.
Has nothing to do with salad or recipes.
Not a salad recipe
Not about recipes
Not about salad recipes
There is no recipe, just a comment on how salads fit into meal formats.
There is nothing mentioned about salads.
While dressings should be mentioned with salads, this is an article on one specific
    type of dressing, no recipe for salads.
article about a swiss tv show
completely off-topic for salad recipes
not a salad recipe
not about salad recipes
totally off base



August 12, 2012                                                                       149
August 12, 2012   150
Feedback length

• Workers will justify answers
• Has to be optional for good
  feedback
• In E51, mandatory comments
  – Length dropped
  – “Relevant” or “Not Relevant”



  August 12, 2012                     151
Was the task difficult?
• Ask workers to rate difficulty of a search topic
• 50 topics; 5 workers, $0.01 per task




August 12, 2012                                  152
QUALITY ASSURANCE


August 12, 2012        153
When to assess quality of work
• Beforehand (prior to main task activity)
      – How: “qualification tests” or similar mechanism
      – Purpose: screening, selection, recruiting, training
• During
      – How: assess labels as worker produces them
             • Like random checks on a manufacturing line
      – Purpose: calibrate, reward/penalize, weight
• After
      – How: compute accuracy metrics post-hoc
      – Purpose: filter, calibrate, weight, retain (HR)
      – E.g. Jung & Lease (2011), Tang & Lease (2011), ...
August 12, 2012                                               154
How do we measure work quality?
• Compare worker’s label vs.
      – Known (correct, trusted) label
      – Other workers’ labels
             • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or
               Multiple Workers? Sept. 2010.
      – Model predictions of the above
             • Model the labels (Ryu & Lease, ASIS&T11)
             • Model the workers (Chen et al., AAAI’10)
• Verify worker’s label
      – Yourself
      – Tiered approach (e.g. Find-Fix-Verify)
             • Quinn and B. Bederson’09, Bernstein et al.’10
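A sketch combining the first two comparisons on this slide: each worker's accuracy against known (gold) labels and agreement with the majority label of the other workers. The data structures (labels[item][worker] and gold[item]) are illustrative.

# Sketch: per-worker accuracy vs. gold and agreement with peers.
from collections import Counter

def worker_quality(labels, gold):
    stats = {}  # worker -> (accuracy vs gold, agreement with peer majority)
    workers = {w for votes in labels.values() for w in votes}
    for w in workers:
        gold_hits = gold_total = peer_hits = peer_total = 0
        for item, votes in labels.items():
            if w not in votes:
                continue
            if item in gold:                            # honey pot / trusted label
                gold_total += 1
                gold_hits += votes[w] == gold[item]
            others = [l for u, l in votes.items() if u != w]
            if others:                                  # redundant labels available
                peer_total += 1
                peer_hits += votes[w] == Counter(others).most_common(1)[0][0]
        stats[w] = (gold_hits / gold_total if gold_total else None,
                    peer_hits / peer_total if peer_total else None)
    return stats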
August 12, 2012                                                          155
Typical Assumptions
• Objective truth exists
      – no minority voice / rare insights
      – Can relax this to model “truth distribution”
• Automatic answer comparison/evaluation
      – What about free text responses? Hope from NLP…
             • Automatic essay scoring
             • Translation (BLEU: Papineni, ACL’2002)
             • Summarization (Rouge: C.Y. Lin, WAS’2004)
      – Have people do it (yourself or find-verify crowd, etc.)
August 12, 2012                                            156
Distinguishing Bias vs. Noise
• Ipeirotis (HComp 2010)
• People often have consistent, idiosyncratic
  skews in their labels (bias)
      – E.g. I like action movies, so they get higher ratings
• Once detected, systematic bias can be
  calibrated for and corrected (yeah!)
• Noise, however, seems random & inconsistent
      – this is the real issue we want to focus on

August 12, 2012                                            157
Comparing to known answers
• AKA: gold, honey pot, verifiable answer, trap
• Assumes you have known answers
• Cost vs. Benefit
      – Producing known answers (experts?)
      – % of work spent re-producing them
• Finer points
      – Controls against collusion
      – What if workers recognize the honey pots?

August 12, 2012                                     158
Comparing to other workers
•    AKA: consensus, plurality, redundant labeling
•    Well-known metrics for measuring agreement
•    Cost vs. Benefit: % of work that is redundant
•    Finer points
      – Is consensus “truth” or systematic bias of group?
      – What if no one really knows what they’re doing?
             • Low-agreement across workers indicates problem is with the
               task (or a specific example), not the workers
      – Risk of collusion
• Sheng et al. (KDD 2008)
August 12, 2012                                                       159
Comparing to predicted label
• Ryu & Lease, ASIS&T11 (CrowdConf’11 poster)
• Catch-22 extremes
      – If model is really bad, why bother comparing?
      – If model is really good, why collect human labels?
• Exploit model confidence
      – Trust predictions proportional to confidence
      – What if model very confident and wrong?
• Active learning
      – Time sensitive: Accuracy / confidence changes
August 12, 2012                                          160
Compare to predicted worker labels
• Chen et al., AAAI’10
• Avoid inefficiency of redundant labeling
      – See also: Dekel & Shamir (COLT’2009)
• Train a classifier for each worker
• For each example labeled by a worker
      – Compare to predicted labels for all other workers
• Issues
      • Sparsity: workers have to stick around to train model…
      • Time-sensitivity: New workers & incremental updates?

August 12, 2012                                                  161
Methods for measuring agreement
• What to look for
      – Agreement, reliability, validity
• Inter-agreement level
      – Agreement between judges
      – Agreement between judges and the gold set
• Some statistics
      –    Percentage agreement
      –    Cohen’s kappa (2 raters; see the sketch below)
      –    Fleiss’ kappa (any number of raters)
      –    Krippendorff’s alpha
• With majority vote, what if 2 say relevant, 3 say not?
      – Use expert to break ties (Kochhar et al, HCOMP’10; GQR)
      – Collect more judgments as needed to reduce uncertainty
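A worked sketch of two of the statistics listed above, percentage agreement and Cohen's kappa for two raters, computed directly from two aligned label lists (no packages; the label values are illustrative). An R version using the irr/psy packages appears on the "Sample code" slide below.

# Sketch: percentage agreement and Cohen's kappa for two raters.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected chance agreement from each rater's marginal label frequencies
    expected = sum((c1[l] / n) * (c2[l] / n) for l in set(c1) | set(c2))
    return observed, (observed - expected) / (1 - expected)

r1 = ["R", "R", "N", "R", "N", "N", "R", "R"]
r2 = ["R", "N", "N", "R", "N", "R", "R", "R"]
agreement, kappa = cohens_kappa(r1, r2)
print("Agreement %.2f, kappa %.2f" % (agreement, kappa))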
August 12, 2012                                               162
Inter-rater reliability
• Lots of research
• Statistics books cover most of the material
• Three categories based on the goals
      – Consensus estimates
      – Consistency estimates
      – Measurement estimates




August 12, 2012                                 163
Sample code
      – R packages psy and irr
      # Fleiss’ kappa (any number of raters) via the irr package
      > library(psy)
      > library(irr)
      > my_data <- read.delim(file="test.txt",
          header=TRUE, sep="\t")    # one column per rater
      > kappam.fleiss(my_data, exact=FALSE)

      # Cohen’s kappa (2 raters) via the psy package
      > my_data2 <- read.delim(file="test2.txt",
          header=TRUE, sep="\t")
      > ckappa(my_data2)




August 12, 2012                                   164
k coefficient
• Different interpretations of k
• For practical purposes you need to be >= moderate
• Results may vary
            k                  Interpretation
            <0                 Poor agreement
            0.01 – 0.20        Slight agreement
            0.21 – 0.40        Fair agreement
            0.41 – 0.60        Moderate agreement
            0.61 – 0.80        Substantial agreement
            0.81 – 1.00        Almost perfect agreement


August 12, 2012                                           165
Detection Theory
• Sensitivity measures
      – High sensitivity: good ability to discriminate
      – Low sensitivity: poor ability
                    Stimulus         “Yes”          “No”
                    Class
                    S2 (signal)      Hits           Misses
                    S1 (noise)       False alarms   Correct
                                                    rejections

             Hit rate H = P(“yes” | S2)
             False alarm rate F = P(“yes” | S1)
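A sketch of the two rates above, computed from a worker's yes/no judgments against gold labels, treating gold-relevant documents as the signal class S2. The example data are illustrative.

# Sketch: hit rate H = P("yes" | S2) and false alarm rate F = P("yes" | S1).
def hit_and_false_alarm_rates(worker_says_yes, gold_is_signal):
    pairs = list(zip(worker_says_yes, gold_is_signal))
    signal = [says for says, is_sig in pairs if is_sig]       # S2 trials
    noise = [says for says, is_sig in pairs if not is_sig]    # S1 trials
    return sum(signal) / len(signal), sum(noise) / len(noise)

H, F = hit_and_false_alarm_rates([1, 1, 0, 1, 0, 0], [1, 1, 1, 0, 0, 0])
print("H = %.2f, F = %.2f" % (H, F))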


August 12, 2012                                                 166
August 12, 2012   167
Finding Consensus
• When multiple workers disagree on the
  correct label, how do we resolve this?
      – Simple majority vote (or average and round)
      – Weighted majority vote (e.g. naive Bayes); see the sketch below
• Many papers from machine learning…
• If wide disagreement, likely there is a bigger
  problem which consensus doesn’t address
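A sketch of the two aggregation rules above: simple majority with ties surfaced explicitly, and a weighted vote where each worker's ballot counts in proportion to an accuracy estimate (a crude stand-in for the naive-Bayes-style weighting mentioned). The votes and accuracy numbers are illustrative.

# Sketch: simple majority vote and accuracy-weighted vote.
from collections import Counter, defaultdict

def majority_vote(votes):                      # votes: worker -> label
    counts = Counter(votes.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                            # tie: break with an expert or more labels
    return counts[0][0]

def weighted_vote(votes, accuracy):            # accuracy: worker -> estimate in (0, 1)
    scores = defaultdict(float)
    for worker, label in votes.items():
        scores[label] += accuracy.get(worker, 0.5)
    return max(scores, key=scores.get)

votes = {"w1": "R", "w2": "N", "w3": "R", "w4": "N", "w5": "N"}
print(majority_vote(votes))                                       # "N"
print(weighted_vote(votes, {"w1": 0.9, "w2": 0.6, "w3": 0.8,
                            "w4": 0.55, "w5": 0.5}))              # "R"

On this toy input the two rules disagree: the raw majority says "N", but the two most accurate workers both said "R", so the weighted vote flips the outcome.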


August 12, 2012                                       168
Quality Control on MTurk
• Rejecting work & Blocking workers (more later…)
    – Requestors don’t want bad PR or complaint emails
    – Common practice: always pay, block as needed
• Approval rate: easy to use, but value?
    – P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5
      and 5 Minutes. Oct. 2010
    – Many requestors don’t ever reject…
• Qualification test
    – Pre-screen workers’ capabilities & effectiveness
    – Example and pros/cons in next slides…
• Geographic restrictions
• Mechanical Turk Masters (June 23, 2011)
    – Recent addition, degree of benefit TBD…
  August 12, 2012                                                  169
August 12, 2012   170
Quality Control in General
• Extremely important part of the experiment
• Approach as “overall” quality; not just for workers
• Bi-directional channel
   – You may think the worker is doing a bad job.
   – The same worker may think you are a lousy requester.




 August 12, 2012                                     171
Tools and Packages for MTurk
• QA infrastructure layers atop MTurk promote
  useful separation-of-concerns from task
      – TurkIt
             • Quik Turkit provides nearly realtime services
      –    Turkit-online (??)
      –    Get Another Label (& qmturk)
      –    Turk Surveyor
      –    cv-web-annotation-toolkit (image labeling)
      –    Soylent
      –    Boto (python library)
             • Turkpipe: submit batches of jobs using the command line.
• More needed…
August 12, 2012                                                           172
A qualification test snippet
<Question>
  <QuestionIdentifier>question1</QuestionIdentifier>
  <QuestionContent>
    <Text>Carbon monoxide poisoning is</Text>
  </QuestionContent>
  <AnswerSpecification>
    <SelectionAnswer>
        <StyleSuggestion>radiobutton</StyleSuggestion>
           <Selections>
             <Selection>
                <SelectionIdentifier>1</SelectionIdentifier>
                <Text>A chemical technique</Text>
             </Selection>
             <Selection>
                <SelectionIdentifier>2</SelectionIdentifier>
                <Text>A green energy treatment</Text>
             </Selection>
             <Selection>
                 <SelectionIdentifier>3</SelectionIdentifier>
                 <Text>A phenomena associated with sports</Text>
             </Selection>
             <Selection>
                 <SelectionIdentifier>4</SelectionIdentifier>
                 <Text>None of the above</Text>
             </Selection>
           </Selections>
    </SelectionAnswer>
  </AnswerSpecification>
</Question>
  August 12, 2012                                                   173
Qualification tests: pros and cons
• Advantages
      – Great tool for controlling quality
      – Adjust passing grade
• Disadvantages
      –    Extra cost to design and implement the test
      –    May turn off workers, hurt completion time
      –    Refresh the test on a regular basis
      –    Hard to verify subjective tasks like judging relevance
• Try creating task-related questions to get worker
  familiar with task before starting task in earnest
August 12, 2012                                                     174
More on quality control & assurance
• HR issues: recruiting, selection, & retention
      – e.g., post/tweet, design a better qualification test,
        bonuses, …
• Collect more redundant judgments…
      – at some point defeats cost savings of
        crowdsourcing
      – 5 workers is often sufficient



August 12, 2012                                            175
Robots and Captchas
• Some reports of robots on MTurk
      – E.g. McCreadie et al. (2011)
      – violation of terms of service
      – Artificial artificial artificial intelligence
• Captchas seem ideal, but…
      – Some HITs abuse Turkers by having them solve captchas
        so that robots can access web resources
      – Turker wisdom is therefore to avoid such HITs
• What to do?
      –    Use standard captchas, notify workers
      –    Block robots in other ways (e.g. external HITs)
      –    Catch robots through standard QC, response times
      –    Use HIT-specific captchas (Kazai et al., 2011)
August 12, 2012                                                  176
Other quality heuristics
• Justification/feedback as quasi-captcha
      – Successfully proven in past experiments
      – Should be optional
      – Automatically verifying feedback was written by a
        person may be difficult (classic spam detection task)
• Broken URL/incorrect object
      – Leave an outlier in the data set
      – Workers will tell you
      – If somebody answers “excellent” on a graded
        relevance test for a broken URL => probably a spammer (see the sketch below)
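A sketch of the broken-URL trap above: any worker who grades the known-broken document highly is flagged for review. The item id, grade values, and judgments are illustrative.

# Sketch: flag workers who grade a known-broken URL as relevant/excellent.
TRAP_ITEM = "doc_broken_url"
SUSPICIOUS_GRADES = {"relevant", "excellent"}

def flag_trap_failures(judgments):             # judgments: (worker, item, grade)
    return sorted({worker for worker, item, grade in judgments
                   if item == TRAP_ITEM and grade.lower() in SUSPICIOUS_GRADES})

sample = [("w1", "doc_broken_url", "Excellent"),
          ("w2", "doc_broken_url", "Not relevant"),
          ("w3", "doc_17", "Relevant")]
print(flag_trap_failures(sample))              # ['w1']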

August 12, 2012                                                 177
Dealing with bad workers
• Pay for “bad” work instead of rejecting it?
   – Pro: preserve reputation, admit if poor design at fault
   – Con: promote fraud, undermine approval rating system
• Use bonus as incentive
   – Pay the minimum $0.01 and $0.01 for bonus
   – Better than rejecting a $0.02 task
• If spammer “caught”, block from future tasks
   – May be easier to always pay, then block as needed

 August 12, 2012                                         178
Worker feedback
• Real feedback received via email after rejection
• Worker XXX
     I did. If you read these articles most of them have
     nothing to do with space programs. I’m not an idiot.

• Worker XXX
     As far as I remember there wasn't an explanation about
     what to do when there is no name in the text. I believe I
     did write a few comments on that, too. So I think you're
     being unfair rejecting my HITs.




August 12, 2012                                             179
Real email exchange with worker after rejection
WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes
of time ,power and lot more and gave me 2 rejections at least you may keep it
pending. please show some respect to turkers

REQUESTER: I'm sorry about the rejection. However, in the directions given in the
hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5
webpages below *AND* complete a minimum of three HITs.

Unfortunately, because you only completed two hits, we had to reject those hits.
We do this because we need a certain amount of data on which to make decisions
about judgment quality. I'm sorry if this caused any distress. Feel free to contact me
if you have any additional questions or concerns.

WORKER: I understood the problems. At that time my kid was crying and i went to
look after. that's why i responded like that. I was very much worried about a hit
being rejected. The real fact is that i haven't seen that instructions of 5 web page
and started doing as i do the dolores labs hit, then someone called me and i went
to attend that call. sorry for that and thanks for your kind concern.
  August 12, 2012                                                               180
Exchange with worker
• Worker XXX
     Thank you. I will post positive feedback for you at
     Turker Nation.

Me: was this a sarcastic comment?

•    I took a chance by accepting some of your HITs to see if
     you were a trustworthy author. My experience with you
     has been favorable so I will put in a good word for you
     on that website. This will help you get higher quality
     applicants in the future, which will provide higher
     quality work, which might be worth more to you, which
     hopefully means higher HIT amounts in the future.




August 12, 2012                                            181
Build Your Reputation as a Requestor
• Word of mouth effect
      – Workers trust the requester (pay on time, clear
        explanation if there is a rejection)
      – Experiments tend to go faster
      – Announce forthcoming tasks (e.g. tweet)
• Disclose your real identity?



August 12, 2012                                           182
Other practical tips
• Sign up as worker and do some HITs
• “Eat your own dog food”
• Monitor discussion forums
• Address feedback (e.g., poor guidelines,
  payments, passing grade, etc.)
• Everything counts!
      – Overall design only as strong as weakest link


August 12, 2012                                         183
Conclusions
• But one may say “this is all good but looks like
  a ton of work”
• The original goal: data is king
• Data quality and experimental designs are
  preconditions to make sure we get the right
  stuff
• Data will later be used for rankers, ML
  models, evaluations, etc.
• Don’t cut corners
August 12, 2012                                 184
THE ROAD AHEAD


August 12, 2012                    185
What about sensitive data?
• Not all data can be publicly disclosed
      – User data (e.g. AOL query log, Netflix ratings)
      – Intellectual property
      – Legal confidentiality
• Need to restrict who is in your crowd
      – Separate channel (workforce) from technology
      – Hot question for adoption at enterprise level


August 12, 2012                                           186
Wisdom of Crowds (WoC)
Requires
• Diversity
• Independence
• Decentralization
• Aggregation

Input: large, diverse sample
     (to increase likelihood of overall pool quality)
Output: consensus or selection (aggregation)
August 12, 2012                                  187
WoC vs. Ensemble Learning
• Combine multiple models to improve performance
  over any constituent model
   – Can use many weak learners to make a strong one
   – Compensate for poor models with extra computation
• Works better with diverse, independent learners
• cf. NIPS 2010-2011 Workshops
   – Computational Social Science & the Wisdom of Crowds
• More investigation needed of traditional feature-
  based machine learning & ensemble methods for
  consensus labeling with crowdsourcing
  August 12, 2012                                   188
Active Learning
• Minimize number of labels to achieve goal
  accuracy rate of classifier
      – Select examples to label to maximize learning
• Vijayanarasimhan and Grauman (CVPR 2011)
      – Simple margin criterion: select maximally uncertain
        examples to label next (see the sketch below)

      – Finding which examples are uncertain can be
        computationally intensive (workers have to wait)
      – Use locality-sensitive hashing to find uncertain
        examples in sub-linear time
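A sketch of the simple margin criterion: rank unlabeled examples by how close the current model's probability is to 0.5 and send the most uncertain ones to the crowd. predict_proba() is a stand-in for whatever classifier is in use; the sub-linear LSH lookup from the paper is not shown.

# Sketch: pick the k examples the current model is least sure about.
def most_uncertain(unlabeled, predict_proba, k=100):
    return sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Toy usage with a fake "model" that already knows some probabilities.
probs = {"doc_a": 0.97, "doc_b": 0.52, "doc_c": 0.08, "doc_d": 0.46}
print(most_uncertain(list(probs), probs.get, k=2))   # ['doc_b', 'doc_d']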
August 12, 2012                                         189
Active Learning (2)
• V&G report each learning iteration ~ 75 min
      – 15 minutes for model training & selection
      – 60 minutes waiting for crowd labels
• Leaving workers idle may lose them, slowing
  uptake and completion times
• Keep workers occupied
      – Mason and Suri (2010): paid waiting room
      – Laws et al. (EMNLP 2011): parallelize labeling and
        example selection via a producer-consumer model (sketched below)
              • Workers consume examples, produce labels
              • Model consumes labels, produces examples
August 12, 2012                                           190
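
A hedged sketch of the producer-consumer idea, in the spirit of (not copied from) Laws et al.: a model thread keeps a queue of examples topped up so that simulated worker threads are never left idle. Queue sizes, sleep times, and labels are placeholders:

    import queue
    import threading
    import time

    examples = queue.Queue(maxsize=10)   # model -> workers
    labels = queue.Queue()               # workers -> model

    def model_loop(pool):
        """Producer: keep the example queue filled (a real system would re-rank by uncertainty)."""
        for ex in pool:
            examples.put(ex)
        examples.put(None)               # sentinel: no more work

    def worker_loop(worker_id):
        """Consumer: simulate a crowd worker labeling whatever is available."""
        while True:
            ex = examples.get()
            if ex is None:
                examples.put(None)       # pass the sentinel on to other workers
                break
            time.sleep(0.01)             # stand-in for human labeling latency
            labels.put((ex, f"label-from-{worker_id}"))

    pool = [f"example-{i}" for i in range(5)]
    threads = [threading.Thread(target=model_loop, args=(pool,))]
    threads += [threading.Thread(target=worker_loop, args=(w,)) for w in ("w1", "w2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    while not labels.empty():
        print(labels.get())
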
Query execution
•    So you want to combine CPU + HPU in a DB?
•    Crowd can answer difficult queries
•    Query processing with human computation
•    Long term goal
      – When to switch from CPU to HPU and vice versa (a toy routing sketch follows)




August 12, 2012                                         191
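
One illustrative way to picture the CPU-to-HPU switch inside query processing: evaluate a predicate automatically when a classifier is confident, and fall back to the crowd otherwise. All function names, scores, and thresholds below are hypothetical, not an actual hybrid-database API:

    def cpu_predicate(row):
        """Automatic check with a confidence score (stand-in for a classifier)."""
        score = row.get("model_score", 0.5)
        return score >= 0.5, abs(score - 0.5) * 2   # (decision, confidence in [0, 1])

    def hpu_predicate(row):
        """Placeholder for posting a HIT and aggregating crowd answers."""
        print(f"Posting HIT: does row {row['id']} satisfy the predicate?")
        return True                                  # would come back from the crowd

    def hybrid_filter(rows, confidence_threshold=0.8):
        """Route each row to CPU or HPU based on automatic confidence."""
        kept = []
        for row in rows:
            decision, confidence = cpu_predicate(row)
            if confidence < confidence_threshold:    # CPU unsure -> switch to HPU
                decision = hpu_predicate(row)
            if decision:
                kept.append(row)
        return kept

    rows = [{"id": 1, "model_score": 0.95}, {"id": 2, "model_score": 0.55}]
    print(hybrid_filter(rows))
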
MapReduce with human computation
• Commonalities
      – Large task divided into smaller sub-problems
      – Work distributed among worker nodes (workers)
      – Collect all answers and combine them
      – Varying performance of heterogeneous
        CPUs/HPUs
• Variations
      – Human response latency / size of “cluster”
      – Some tasks are not suitable for this kind of decomposition (a toy sketch follows)

August 12, 2012                                      192
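
A toy sketch of the MapReduce analogy: "map" splits a large labeling task into micro-HITs for simulated workers, and "reduce" combines their answers. A real deployment would post HITs to a platform and also have to tolerate stragglers, missing answers, and variable HPU latency:

    from collections import Counter

    def map_to_workers(document, chunk_size=3):
        """Map: split a big task (label every sentence) into micro-HIT-sized chunks."""
        sentences = document.split(". ")
        return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

    def simulated_worker(chunk):
        """Stand-in for a crowd worker: returns one label per sentence."""
        return ["relevant" if "crowd" in s.lower() else "not_relevant" for s in chunk]

    def reduce_answers(all_labels):
        """Reduce: combine per-chunk answers into an overall tally."""
        return Counter(label for chunk_labels in all_labels for label in chunk_labels)

    doc = "Crowdsourcing is growing. The weather is nice. Crowd workers label data. Lunch was good"
    chunks = map_to_workers(doc, chunk_size=2)
    print(reduce_answers(simulated_worker(c) for c in chunks))  # Counter({'relevant': 2, 'not_relevant': 2})
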
A Few Questions
• How should we balance automation vs.
  human computation? Which does what?

• Who’s the right person for the job?

• How do we handle complex tasks? Can we
  decompose them into smaller tasks? How?


August 12, 2012                             193
Research problems – operational
• Methodology
      – Budget, people, documents, queries, presentation,
        incentives, etc.
      – Scheduling
      – Quality
• What’s the best “mix” of HC for a task?
• What are the tasks suitable for HC?
• Can I crowdsource my task?
      – Eickhoff and de Vries, WSDM 2011 CSDM Workshop

August 12, 2012                                       194
More problems
• Human factors vs. outcomes
• Editors vs. workers
• Pricing tasks
• Predicting worker quality from observable
  properties (e.g. task completion time; see the sketch below)
• HIT / Requester ranking or recommendation
• Expert search: who are the right workers given the
  task's nature and constraints?
• Ensemble methods for Crowd Wisdom consensus
August 12, 2012                                195
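
As a toy illustration of the worker-quality item above (cf. Ryu & Lease, ASIS&T 2011, who used SVMs for crowdworker filtering), a hedged sketch that predicts reliability from observable features such as task completion time. The training data and features are invented, scikit-learn is an assumed dependency, and any real model would need careful validation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data: [seconds spent on HIT, agreement rate with other workers]
    X = np.array([[5, 0.40], [7, 0.50], [45, 0.90], [60, 0.85], [8, 0.45], [50, 0.95]])
    y = np.array([0, 0, 1, 1, 0, 1])   # 0 = low-quality worker, 1 = reliable worker

    model = LogisticRegression().fit(X, y)

    # Score a new worker who rushed through the task
    new_worker = np.array([[6, 0.5]])
    print("P(reliable) =", model.predict_proba(new_worker)[0, 1])
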
Problems: crowds, clouds and algorithms
• Infrastructure
     – Current platforms are very rudimentary
     – No tools for data analysis
• Dealing with uncertainty (propagate rather than mask)
     –   Temporal and labeling uncertainty
     –   Learning algorithms
     –   Search evaluation
     –   Active learning (which example is likely to be labeled correctly)
• Combining CPU + HPU
     – Human Remote Call?
     – Procedural vs. declarative?
     – Integration points with enterprise systems
 August 12, 2012                                                             196
Algorithms
• Bandit problems: explore/exploit trade-offs (see the ε-greedy sketch below)
• Optimizing the amount of work assigned to workers
      – Humans have limited throughput
      – Harder to scale than machines
• Selecting the right crowds
• Stopping rule



August 12, 2012                          197
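
A toy ε-greedy sketch for the explore/exploit item above: treat each crowd (or worker pool) as a bandit arm and route most HITs to the pool with the best observed quality so far, while still exploring occasionally. Pools, qualities, and rewards are all made up, and a stopping rule (deciding when enough labels have been collected) is omitted:

    import random

    random.seed(0)

    # Hypothetical "true" quality of three worker pools (unknown to the algorithm)
    true_quality = {"pool_A": 0.6, "pool_B": 0.8, "pool_C": 0.5}

    counts = {arm: 0 for arm in true_quality}
    mean_reward = {arm: 0.0 for arm in true_quality}

    def choose_arm(epsilon=0.1):
        """Explore with probability epsilon; otherwise exploit the best pool so far."""
        if random.random() < epsilon or all(c == 0 for c in counts.values()):
            return random.choice(list(true_quality))
        return max(mean_reward, key=mean_reward.get)

    for _ in range(200):                      # 200 simulated HIT assignments
        arm = choose_arm()
        reward = 1.0 if random.random() < true_quality[arm] else 0.0   # 1 = correct label
        counts[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / counts[arm]  # incremental mean

    print(counts)        # most assignments should end up going to pool_B
    print(mean_reward)
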
BROADER CONSIDERATIONS:
     ETHICS, ECONOMICS, REGULATION

August 12, 2012                      198
What about ethics?
• Silberman, Irani, and Ross (2010)
    – “How should we… conceptualize the role of these
      people who we ask to power our computing?”
    – Power dynamics between parties
           • What are the consequences for a worker
             when your actions harm their reputation?
    – “Abstraction hides detail”

• Fort, Adda, and Cohen (2011)
    – “…opportunities for our community to deliberately
      value ethics above cost savings.”
 August 12, 2012                                        199
Example: SamaSource




August 12, 2012                         200
Davis et al. (2010) The HPU.




                            HPU




August 12, 2012                             201
HPU: “Abstraction hides detail”
• Not just turning a mechanical crank




August 12, 2012                          202
Micro-tasks & Task Decomposition
• Small, simple tasks can be completed faster by
  reducing extraneous context and detail
      – e.g. “Can you name who is in this photo?”




• Current workflow research investigates how to
  decompose complex tasks into simpler ones
August 12, 2012                                     203
Context & Informed Consent




• What is the larger task I’m contributing to?
• Who will benefit from it and how?
August 12, 2012                                  204
What about regulation?
• Wolfson & Lease (ASIS&T 2011)
• As usual, technology is ahead of the law
   – employment law
   – patent inventorship
   – data security and the Federal Trade Commission
   – copyright ownership
   – securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
   – Understand risks of “just-in-time compliance”

 August 12, 2012                                      205
Digital Dirty Jobs
• NY Times: Policing the Web’s Lurid Precincts
• Gawker: Facebook
  content moderation
• CultureDigitally: The dirty job
  of keeping Facebook clean




August 12, 2012                                  206
Jeff Howe Vision vs. Reality?
• Vision of empowering worker freedom:
      – work whenever you want for whomever you want
• When $$$ is at stake, populations at risk may
  be compelled to perform work by others
      – Digital sweat shops? Digital slaves?
      – We really don’t know (and need to learn more…)
      – Traction? Human Trafficking at MSR Summit’12



August 12, 2012                                          207
A DARKER SIDE TO CROWDSOURCING
     & HUMAN COMPUTATION

August 12, 2012              208
Putting the shoe on the other foot:
                     Spam




August 12, 2012                             209
What about trust?
• Some reports of robot “workers” on MTurk
      – E.g. McCreadie et al. (2011)
      – Violates terms of service
• Why not just use a captcha?




August 12, 2012                              210
Captcha Fraud




August 12, 2012                   211
Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
August 12, 2012                                    212
Defeating CAPTCHAs with crowds




August 12, 2012                 213
Gaming the System: SEO, etc.
WWW’12




August 12, 2012            215
Robert Sim, MSR Summit’12




August 12, 2012                         216
Conclusions
•    Crowdsourcing works and is here to stay
•    Fast turnaround, easy to experiment, cheap
•    Still have to design the experiments carefully!
•    Usability considerations
•    Worker quality
•    User feedback extremely useful



August 12, 2012                                    217
Conclusions - II
• Lots of opportunities to improve current platforms
• Integration with current systems
• While MTurk was first to market in the micro-task vertical,
  many other vendors are emerging with different
  affordances or value-added features

• Many open research problems …



August 12, 2012                                         218
Conclusions – III
• Important to know your limitations and be
  ready to collaborate
• Lots of different skills and expertise required
      – Social/behavioral science
      – Human factors
      – Algorithms
      – Economics
      – Distributed systems
      – Statistics

August 12, 2012                                     219
REFERENCES & RESOURCES

August 12, 2012                            220
Surveys
• Ipeirotis, Panagiotis G., R. Chandrasekar, and P. Bennett. (2009).
  “A report on the human computation workshop (HComp).” ACM
  SIGKDD Explorations Newsletter 11(2).

• Alex Quinn and Ben Bederson. Human Computation: A Survey
  and Taxonomy of a Growing Field. In Proceedings of CHI 2011.

• Law and von Ahn (2011). Human Computation




   August 12, 2012                                            221
2013 Events Planned
Research events
• 1st year of HComp as AAAI conference
• 2nd annual Collective Intelligence?

Industrial Events
• 4th CrowdConf (San Francisco, Fall)
• 1st Crowdsourcing Week (Singapore, April)

August 12, 2012                               222
TREC Crowdsourcing Track
• Year 1 (2011) – horizontals
      –    Task 1 (hci): collect crowd relevance judgments
      –    Task 2 (stats): aggregate judgments
      –    Organizers: Kazai & Lease
      –    Sponsors: Amazon, CrowdFlower

• Year 2 (2012) – content types
      –    Task 1 (text): judge relevance
      –    Task 2 (images): judge relevance
      –    Organizers: Ipeirotis, Kazai, Lease, & Smucker
      –    Sponsors: Amazon, CrowdFlower, MobileWorks
August 12, 2012                                              223
2012 Workshops & Conferences
•   AAAI: Human Computation (HComp) (July 22-23)
•   AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
•   ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
•   AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
•   CHI: CrowdCamp (May 5-6)
•   CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
•   Collective Intelligence (April 18-20)
•   CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)
•   CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
•   EC: Social Computing and User Generated Content Workshop (June 7)
•   ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
•   ICEC: Harnessing Collective Intelligence with Games (September)
•   ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
•   ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
•   KDD: Workshop on Crowdsourcing and Data Mining (August 12)
•   Multimedia: Crowdsourcing for Multimedia (Nov 2)
•   SocialCom: Social Media for Human Computation (September 6)
•   TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
•   WWW: CrowdSearch: Crowdsourcing Web search (April 17)
     August 12, 2012                                                                           224
Journal Special Issues 2012

 – Springer’s Information Retrieval (articles now online):
   Crowdsourcing for Information Retrieval

 – IEEE Internet Computing (articles now online):
   Crowdsourcing (Sept./Oct. 2012)

 – Hindawi’s Advances in Multimedia Journal: Multimedia
   Semantics Analysis via Crowdsourcing Geocontext

August 12, 2012                                        225
2011 Workshops & Conferences
•   AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
•   ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
•   Crowdsourcing Technologies for Language and Cognition Studies (July 27)
•   CHI-CHC: Crowdsourcing and Human Computation (May 8)
•   CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
•   CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
•   Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
•   EC: Workshop on Social Computing and User Generated Content (June 5)
•   ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
•   Interspeech: Crowdsourcing for speech processing (August)
•   NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
•   SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
•   TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)
•   UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
•   WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
    August 12, 2012                                                                           226
2011 Tutorials and Keynotes
•   By Omar Alonso and/or Matthew Lease
     –   CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
     –   CrowdConf: Crowdsourcing for Research and Engineering
     –   IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
     –   WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
     –   SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)

•   AAAI: Human Computation: Core Research Questions and State of the Art
     –   Edith Law and Luis von Ahn, August 7
•   ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
    Conservation
     –   Steve Kelling, October 10, ebird
•   EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
     –   Winter Mason and Siddharth Suri, June 5
•   HCIC: Quality Crowdsourcing for Human Computer Interaction Research
     –   Ed Chi, June 14-18 (at HCIC)
     –   Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
•   Multimedia: Frontiers in Multimedia Search
     –   Alan Hanjalic and Martha Larson, Nov 28
•   VLDB: Crowdsourcing Applications and Platforms
     –   Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
•   WWW: Managing Crowdsourced Human Computation
     –   Panos Ipeirotis and Praveen Paritosh
    August 12, 2012                                                                                          227
Thank You!
Crowdsourcing news & information:
  ir.ischool.utexas.edu/crowd

For further questions, contact us at:
  omar.alonso@microsoft.com
  ml@ischool.utexas.edu




Cartoons by Mateo Burtch (buta@sonic.net)
August 12, 2012                             228
Additional Literature Reviews
• Man-Ching Yuen, Irwin King, and Kwong-Sak
  Leung. A Survey of Crowdsourcing Systems.
  SocialCom 2011.
• A. Doan, R. Ramakrishnan, A. Halevy.
  Crowdsourcing Systems on the World-Wide
  Web. Communications of the ACM, 2011.



August 12, 2012                               229
More Books
                  July 2010, kindle-only: “This book introduces you to the
                  top crowdsourcing sites and outlines step by step with
                  photos the exact process to get started as a requester on
                  Amazon Mechanical Turk.“




August 12, 2012                                                      230
Resources
A Few Blogs
 Behind Enemy Lines (P.G. Ipeirotis, NYU)
 Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
 CrowdFlower Blog
 http://experimentalturk.wordpress.com
 Jeff Howe

A Few Sites
 The Crowdsortium
 Crowdsourcing.org
 CrowdsourceBase (for workers)
 Daily Crowdsource

MTurk Forums and Resources
 Turker Nation: http://turkers.proboards.com
 http://www.turkalert.com (and its blog)
 Turkopticon: report/avoid shady requestors
 Amazon Forum for MTurk
August 12, 2012                                                   231
Bibliography
   J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
   Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
   Bederson, B.B., Hu, C., & Resnik, P. Translation by Iterative Collaboration between Monolingual Users, Proceedings of Graphics
    Interface (GI 2010), 39-46.
   N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
   C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
   P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
   J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human
    in the Loop (ACVHL), June 2010.
   M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
   D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
   S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
   J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
   P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
   J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
   P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
    Workshop on Active Learning and NLP, 2009.
   B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
   P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
   P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
   P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)

     August 12, 2012                                                                                                           232
Bibliography (2)
    A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.
    Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
    Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
    K. Krippendorff. "Content Analysis", Sage Publications, 2003
    G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
    T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
     2009.
    W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
    J. Nielsen. “Usability Engineering”, Morgan-Kaufman, 1994.
    A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
    J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
     Demographics in Amazon Mechanical Turk”. CHI 2010.
    F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
    R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
     for Natural Language Tasks”. EMNLP-2008.
    V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
     KDD 2008.
    S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
    L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
    L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.

August 12, 2012                                                                                                      233
Bibliography (3)
    Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
     Clustering on Teachers. AAAI 2010.
    Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
    Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
     EMNLP 2011.
    C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
     summarization branches out (WAS), 2004.
    C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011.
    Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
    Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
     Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
    S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
     Crawled Data and Crowds. CVPR 2011.
    Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.




August 12, 2012                                                                                         234
Recent Work
•   Della Penna, N, and M D Reid. (2012). “Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold
    Standard.” in Proceedings of Collective Intelligence. Arxiv preprint arXiv:1204.3511.
•   Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). “ZenCrowd: leveraging probabilistic reasoning and
    crowdsourcing techniques for large-scale entity linking.” 21st Annual Conference on the World Wide Web (WWW).
•   Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). “A probabilistic framework to learn from multiple
    annotators with time-varying accuracy.” in SIAM International Conference on Data Mining (SDM), 826-837.
•   Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). “Efficiently learning the accuracy of labeling sources for
    selective sampling.” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
    data mining (KDD), 259-268.
•   Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational
    Linguistics, 37(2):413–420.
•   Ghosh, A, Satyen Kale, and Preston McAfee. (2012). “Who Moderates the Moderators? Crowdsourcing Abuse Detection
    in User-Generated Content.” in Proceedings of the 12th ACM conference on Electronic commerce.
•   Ho, C J, and J W Vaughan. (2012). “Online Task Assignment in Crowdsourcing Markets.” in Twenty-Sixth AAAI Conference
    on Artificial Intelligence.
•   Jung, Hyun Joon, and Matthew Lease. (2012). “Inferring Missing Relevance Judgments from Crowd Workers via
    Probabilistic Matrix Factorization.” in Proceedings of the 36th international ACM SIGIR conference on Research and
    development in information retrieval.
•   Kamar, E, S Hacker, and E Horvitz. (2012). “Combining Human and Machine Intelligence in Large-scale Crowdsourcing.” in
    Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
•   Karger, D R, S Oh, and D Shah. (2011). “Budget-optimal task allocation for reliable crowdsourcing systems.” Arxiv preprint
    arXiv:1110.3564.
•   Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). “An Analysis of Human Factors and Label Accuracy in
    Crowdsourcing Relevance Judgments.” Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
    August 12, 2012                                                                                                  235
Recent Work (2)
•   Lin, C.H. and Mausam and Weld, D.S. (2012). “Crowdsourcing Control: Moving Beyond Multiple Choice.” in
    Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
•   Liu, C, and Y M Wang. (2012). “TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple
    Ratings.” in Proceedings of the 29th International Conference on Machine Learning (ICML).
•   Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). “Crowdsourcing for Usability Testing.” in
    Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
•   Ramesh, A, A Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.
•   Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). “Learning From Crowds.” Journal
    of Machine Learning Research 11:1297-1322.
•   Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). “Supervised
    learning from multiple experts: whom to trust when everyone lies a bit.” in Proceedings of the 26th Annual
    International Conference on Machine Learning (ICML), 889-896.
•   Raykar, Vikas C, and Shipeng Yu. (2012). “Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling
    Tasks.” Journal of Machine Learning Research 13:491-518.
•   Wauthier, Fabian L., and Michael I. Jordan. (2012). “Bayesian Bias Mitigation for Crowdsourcing.” in Advances in neural
    information processing systems (NIPS).
•   Weld, D.S., Mausam, and Dai, P. (2011). “Execution control for crowdsourcing.” in Proceedings of the 24th ACM
    symposium adjunct on User interface software and technology (UIST).
•   Weld, D.S., Mausam, and Dai, P. (2011). “Human Intelligence Needs Artificial Intelligence.” in Proceedings of the 3rd
    Human Computation Workshop (HCOMP) at AAAI.
•   Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). “The Multidimensional Wisdom of
    Crowds.” in Advances in Neural Information Processing Systems (NIPS), 2424-2432.
•   Welinder, Peter, and Pietro Perona. (2010). “Online crowdsourcing: rating annotators and obtaining cost-effective
    labels.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
•    Whitehill, J, P Ruvolo, T Wu, J Bergsma, and J Movellan. (2009). “Whose Vote Should Count More: Optimal Integration
    of Labels from Labelers of Unknown Expertise.” in Advances in Neural Information Processing Systems (NIPS).
•   Yan, Y, and R Rosales. (2011). “Active learning from crowds.” in Proceedings of the 28th Annual International
    Conference on Machine Learning (ICML).
     August 12, 2012                                                                                                236
Crowdsourcing in IR: 2008-2010
   2008
          O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2.

   2009
          O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation.
          P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
          G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR.
          G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation.
          Law et al. “SearchWar”. HCOMP.
          H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009.

   2010
          SIGIR Workshop on Crowdsourcing for Search Evaluation.
          O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR.
          K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”, ECIR.
          C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk.
          Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov . “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR.
          G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf.
          G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”., Workshop on Very Large Digital Libraries (VLDL)
          Stephanie Nowak and Stefan Ruger. How Reliable are Annotations via Crowdsourcing? MIR.
          Jean-François Paiement, Dr. James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf.
          Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf.
          T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys pp. 77--90, 2010.




     August 12, 2012                                                                                                                                           237
Crowdsourcing in IR: 2011
   WSDM Workshop on Crowdsourcing for Search and Data Mining.
   SIGIR Workshop on Crowdsourcing for Information Retrieval
   1st TREC Crowdsourcing Track


   O. Alonso and R. Baeza-Yates. “Design and Implementation of Relevance Assessments using Crowdsourcing”, ECIR 2011.
   Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran. “Repeatable and
    Reliable Search System Evaluation using Crowd-Sourcing”. SIGIR 2011.
   Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, Kuo-Wei Chang. “Region-Based Landmark Discovery by
    Crowdsourcing Geo-Referenced Photos.” SIGIR 2011.
   Hyun Joon Jung, Matthew Lease . “Improving Consensus Accuracy via Z-score and Weighted Voting”. HCOMP 2011.
   G. Kasneci, J. Van Gael, D. Stern, and T. Graepel, CoBayes: Bayesian Knowledge Corroboration with Assessors of
    Unknown Areas of Expertise, WSDM 2011.
   Gabriella Kazai. “In Search of Quality in Crowdsourcing for Search Engine Evaluation”, ECIR 2011.
   Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling. “Crowdsourcing for Book Search Evaluation: Impact of Quality
    on Comparative System Ranking.” SIGIR 2011.
   Abhimanu Kumar, Matthew Lease . “Learning to Rank From a Noisy Crowd”. SIGIR 2011.
   Edith Law, Paul N. Bennett, and Eric Horvitz. “The Effects of Choice in Routing Relevance Judgments”. SIGIR 2011.


     August 12, 2012                                                                                                    238

 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Crowdsourcing for Search Evaluation and Social-Algorithmic Search

  • 15. Computer Vision: Sorokin & Forsyth (CVPR 2008) • 4K labels for US $60 August 12, 2012 15
  • 16. IR: Alonso et al. (SIGIR Forum 2008) • MTurk for Information Retrieval (IR) – Judge relevance of search engine results • Many follow-on studies (design, quality, cost) August 12, 2012 16
  • 17. User Studies: Kittur, Chi, & Suh (CHI 2008) • “…make creating believable invalid responses as effortful as completing the task in good faith.” August 12, 2012 17
  • 18. Social & Behavioral Sciences • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online. • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010) • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management • Amazon's Mechanical Turk : A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5. – see also: Amazon Mechanical Turk Guide for Social Scientists August 12, 2012 18
  • 20. Remote Usability Testing • Liu, Bias, Lease, and Kuipers, ASIS&T, 2012 • Compares remote usability testing using MTurk and CrowdFlower (not uTest) vs. traditional on-site testing • Advantages – More (Diverse) Participants – High Speed – Low Cost • Disadvantages – Lower Quality Feedback – Less Interaction – Greater need for quality control – Less Focused User Groups August 12, 2012 20
  • 22. NLP Example – Dialect Identification August 12, 2012 22
  • 23. NLP Example – Machine Translation • Manual evaluation on translation quality is slow and expensive • High agreement between non-experts and experts • $0.10 to translate a sentence C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009. August 12, 2012 23
  • 24. Computer Vision – Painting Similarity Kovashka & Lease, CrowdConf’10 August 12, 2012 24
  • 25. IR Example – Relevance and ads August 12, 2012 25
  • 26. IR Example – Product Search August 12, 2012 26
  • 27. IR Example – Snippet Evaluation • Study on summary lengths • Determine preferred result length • Asked workers to categorize web queries • Asked workers to evaluate snippet quality • Payment between $0.01 and $0.05 per HIT M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008. August 12, 2012 27
  • 28. IR Example – Relevance Assessment • Replace TREC-like relevance assessors with MTurk? • Selected topic “space program” (011) • Modified original 4-page instructions from TREC • Workers more accurate than original assessors! • 40% provided justification for each answer O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop on the Future of IR Evaluation, 2009. August 12, 2012 28
  • 29. IR Example – Timeline Annotation • Workers annotate timeline on politics, sports, culture • Given a timex (1970s, 1982, etc.) suggest something • Given an event (Vietnam, World cup, etc.) suggest a timex K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010 August 12, 2012 29
  • 30. COLLECTING DATA WITH OTHER CROWDS & OTHER INCENTIVES August 12, 2012 30
  • 31. Why Eytan Adar hates MTurk Research (CHI 2011 CHC Workshop) • Overly-narrow focus on MTurk – Identify general vs. platform-specific problems – Academic vs. Industrial problems • Inattention to prior work in other disciplines • Turks aren’t Martians – Just human behavior (more later…) August 12, 2012 31
  • 32. ESP Game (Games With a Purpose) L. Von Ahn and L. Dabbish (2004) August 12, 2012 32
  • 33. reCaptcha L. von Ahn et al. (2008). In Science. August 12, 2012 33
  • 34. Human Sensing and Monitoring • Sullivan et al. (2009). Bio. Conservation (142):10 • Keynote by Steve Kelling at ASIS&T 2011 August 12, 2012 34
  • 35. Learning to map from web pages to queries • Human computation game to elicit data • Home grown system (no AMT) • Try it! pagehunt.msrlivelabs.com See also: • H. Ma. et al. “Improving Search Engines Using Human Computation Games”, CIKM 2009. • Law et al. SearchWar. HCOMP 2009. • Bennett et al. Picture This. HCOMP 2009. August 12, 2012 35
  • 36. Tracking Sentiment in Online Media Brew et al., PAIS 2010 • Volunteer-crowd • Judge in exchange for access to rich content • Balance system needs with user interest • Daily updates to non- stationary distribution August 12, 2012 36
  • 37. PHASE 2: FROM DATA COLLECTION TO HUMAN COMPUTATION August 12, 2012 37
  • 38. Human Computation • What was old is new • Crowdsourcing: A New Branch of Computer Science – D.A. Grier, March 29, 2011 • Tabulating the heavens: computing the Nautical Almanac in 18th-century England - M. Croarken’03 Princeton University Press, 2005 August 12, 2012 38
  • 39. The Mechanical Turk Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804) J. Pontin. Artificial Intelligence, With Help From the Humans. New York Times (March 25, 2007) August 12, 2012 39
  • 40. The Human Processing Unit (HPU) • Davis et al. (2010) HPU August 12, 2012 40
  • 41. Human Computation • Having people do stuff instead of computers • Investigates use of people to execute certain computations for which capabilities of current automated methods are more limited • Explores the metaphor of computation for characterizing attributes, capabilities, and limitations of human performance in executing desired tasks • Computation is required, crowd is not • von Ahn’s Thesis (2005), Law & von Ahn (2011) August 12, 2012 41
  • 42. APPLYING HUMAN COMPUTATION: CROWD-POWERED APPLICATIONS August 12, 2012 42
  • 43. Crowd-Assisted Search: “Amazon Remembers” August 12, 2012 43
  • 44. Crowd-Assisted Search (2) • Yan et al., MobiSys’10 • CrowdTerrier (McCreadie et al., SIGIR’12) August 12, 2012 44
  • 45. Translation by monolingual speakers • C. Hu, CHI 2009 August 12, 2012 45
  • 46. Soylent: A Word Processor with a Crowd Inside • Bernstein et al., UIST 2010 August 12, 2012 46
  • 47. fold.it S. Cooper et al. (2010) Alice G. Walton. Online Gamers Help Solve Mystery of Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011. August 12, 2012 47
  • 48. PlateMate (Noronha et al., UIST’10) August 12, 2012 48
  • 49. Image Analysis and more: Eatery August 12, 2012 49
  • 50. VizWiz • Bigham et al. (UIST 2010) August 12, 2012 50
  • 53. THE SOCIAL SIDE OF SEARCH August 12, 2012 53
  • 54. People are more than HPUs • Why is Facebook popular? People are social. • Information needs are contextually grounded in our social experiences and social networks • The value of social search may be more than the relevance of the search results • Our social networks also embody additional knowledge about us, our needs, and the world The social dimension complements computation August 12, 2012 54
  • 57. Complex Information Needs  Who is Rahm Emanuel, Obama's Chief of Staff?  How have dramatic shifts in terrorists resulted in an equally dramatic shift in terrorist organizations?  How do I find what events were in the news on my sons birthday?  Do you think the current drop in the Stock Market is related to Obama's election to President?  Why are prisoners on death row given final medicals?  Should George Bush attack Iran's nuclear facility before he leaves office?  Why are people against gay marriage?  Does anyone know anything interesting that happened nation wide in 2008?  Should the fact that a prisoner has cancer have any bearing on an appeal for bail? August 12, 2012 Source: Yahoo! Answers, “News & Events”, Nov. 6 2008 57
  • 58. Community Q&A • Ask the village vs. searching the archive • Posting and waiting can be slow – Find similar questions already answered • Best answer (winner-take-all) vs. voting • Challenges – Questions shorter than documents – Questions not queries, colloquial, errors – Latency & quality (e.g. question routing) • Cf. work by Bruce Croft & students August 12, 2012 58
  • 59. Horowitz & Kamvar, WWW’10 • Routing: Trust vs. Authority • Social networks vs. search engines – See also: Morris & Teevan, HCIC’12 August 12, 2012 59
  • 60. Social Network integration • Facebook Questions (with Bing) • Google+ (acquired Aardvark) • Twitter (cf. Paul, Hong, and Chi, ICWSM’11) August 12, 2012 60
  • 61. Search Buddies Hecht et al. ICWSM 2012; Morris MSR Talk August 12, 2012 61
  • 62. {where to go on vacation} • Tons of results – Read title + snippet + URL – Explore a few pages in detail • MTurk: 50 answers, $1.80 • Quora: 2 answers • Y! Answers: 2 answers • FB: 1 answer August 12, 2012 62
  • 63. {where to go on vacation} Countries Cities August 12, 2012 63
  • 64. {where to go on vacation} • Let’s execute the same query on different days – Execution #1: Las Vegas 3, Hawaii 2, Kerala 2, Key West 2, Orlando 2, New Zealand 2 – Execution #2: Kerala 6, Goa 4, Ooty 3, Switzerland 3, Agra 2 – Execution #3: Las Vegas 4, Himachal Pradesh 3, Mauritius 2, Ooty 2, Kodaikanal 2 • Table shows places with frequency >= 2 • Every execution uses same template & 50 workers • Completion time more or less the same • Results may differ • Related work: Zhang et al., CHI 2012 August 12, 2012 64
  • 65. SO WHAT IS CROWDSOURCING? August 12, 2012 65
  • 67. From Outsourcing to Crowdsourcing • Take a job traditionally performed by a known agent (often an employee) • Outsource it to an undefined, generally large group of people via an open call • New application of principles from open source movement • Evolving & broadly defined ... August 12, 2012 67
  • 68. Crowdsourcing models • Micro-tasks & citizen science • Co-Creation • Open Innovation, Contests • Prediction Markets • Crowd Funding and Charity • “Gamification” (not serious gaming) • Transparent • cQ&A, Social Search, and Polling • Physical Interface/Task August 12, 2012 68
  • 69. What is Crowdsourcing? • A set of mechanisms and methods for scaling & directing crowd activities to achieve some goal(s) • Enabled by internet-connectivity • Many related topics/areas: – Human computation (next slide…) – Collective intelligence – Crowd/Social computing – Wisdom of Crowds – People services, Human Clouds, Peer-production, … August 12, 2012 69
  • 70. What is not crowdsourcing? • Post-hoc use of pre-existing crowd data – Data mining – Visual analytics • Use of one or few people – Mixed-initiative design – Active learning • Conducting a survey or poll… (*) August 12, 2012 70
  • 71. Crowdsourcing Key Questions • What are the goals? – Purposeful directing of human activity • How can you incentivize participation? – Incentive engineering – Who are the target participants? • Which model(s) are most appropriate? – How to adapt them to your context and goals? August 12, 2012 71
  • 72. What do you want to accomplish? • Create • Execute task/computation • Fund • Innovate and/or discover • Learn • Monitor • Predict August 12, 2012 72
  • 74. Who are the workers? • A. Baio, November 2008. The Faces of Mechanical Turk. • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk • J. Ross, et al. Who are the Crowdworkers?... CHI 2010. August 12, 2012 74
  • 75. MTurk Demographics • 2008-2009 studies found less global and diverse than previously thought – US – Female – Educated – Bored – Money is secondary August 12, 2012 75
  • 76. 2010 shows increasing diversity 47% US, 34% India, 19% other (P. Ipeirotis, March 2010) August 12, 2012 76
  • 77. Why should your crowd participate? • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige (leaderboards, badges) • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource Multiple incentives can often operate in parallel (*caveat) August 12, 2012 77
  • 78. Example: Wikipedia • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 78
  • 79. Example: DuoLingo • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 79
  • 80. Example: • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 80
  • 81. Example: ESP • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 81
  • 82. Example: fold.it • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 82
  • 83. Example: FreeRice • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 83
  • 84. Example: cQ&A • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 84
  • 85. Example: reCaptcha • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource • Is there an existing human activity you can harness for another purpose? August 12, 2012 85
  • 86. Example: Mechanical Turk • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource August 12, 2012 86
  • 87. How Much to Pay? • Price commensurate with task effort – Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback • Ethics & market-factors: W. Mason and S. Suri, 2010. – e.g. non-profit SamaSource contracts workers in refugee camps – Predict right price given market & task: Wang et al. CSDM’11 • Uptake & time-to-completion vs. Cost & Quality – Too little $$, no interest or slow – too much $$, attract spammers – Real problem is lack of reliable QA substrate • Accuracy & quantity – More pay = more work, not better (W. Mason and D. Watts, 2009) • Heuristics: start small, watch uptake and bargaining feedback • Worker retention (“anchoring”) See also: L.B. Chilton et al. KDD-HCOMP 2010. August 12, 2012 87
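A quick back-of-the-envelope check of the pricing guidance above, as a small Python sketch (the task duration and target hourly rate are made-up illustrative numbers, not figures from the tutorial):

    # If a yes/no HIT takes ~30 seconds and pays $0.02, what hourly rate does
    # that imply, and what reward would reach a given target rate?

    def implied_hourly_rate(reward_usd, seconds_per_hit):
        """Effective hourly wage at this reward and working pace."""
        return reward_usd * 3600.0 / seconds_per_hit

    def reward_for_target_rate(target_hourly_usd, seconds_per_hit):
        """Reward per HIT needed to reach a target hourly rate."""
        return target_hourly_usd * seconds_per_hit / 3600.0

    print(implied_hourly_rate(0.02, 30))                  # 2.4 USD/hour
    print(round(reward_for_target_rate(6.0, 30), 3))      # 0.05 USD per HIT

Timing a small pilot batch gives the seconds-per-HIT estimate; the reward can then be adjusted before scaling up.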
  • 88. Dan Pink – YouTube video “The Surprising Truth about what Motivates us” August 12, 2012 88
  • 91. Does anyone really use it? Yes! http://www.mturk-tracker.com (P. Ipeirotis’10) From 1/09 – 4/10, 7M HITs from 10K requestors worth $500,000 USD (significant under-estimate) August 12, 2012 91
  • 92. MTurk: The Requester • Sign up with your Amazon account • Amazon payments • Purchase prepaid HITs • There is no minimum or up-front fee • MTurk collects a 10% commission • The minimum commission charge is $0.005 per HIT August 12, 2012 92
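The fee arithmetic above is easy to sanity-check before launching a batch. A minimal Python sketch, assuming the 10% commission and $0.005 minimum charge per HIT quoted on this slide (current fee schedules may differ):

    def batch_cost(reward_per_assignment, num_hits, assignments_per_hit,
                   commission_rate=0.10, min_commission=0.005):
        """Total batch cost: worker reward plus per-assignment commission."""
        fee = max(reward_per_assignment * commission_rate, min_commission)
        return (reward_per_assignment + fee) * num_hits * assignments_per_hit

    # 1,000 HITs, 5 workers each, $0.02 per judgment:
    print(batch_cost(0.02, 1000, 5))   # $0.025 per assignment -> $125.00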
  • 93. MTurk Dashboard • Three tabs – Design – Publish – Manage • Design – HIT Template • Publish – Make work available • Manage – Monitor progress August 12, 2012 93
  • 95. MTurk: Dashboard - II August 12, 2012 95
  • 96. MTurk API • Amazon Web Services API • Rich set of services • Command line tools • More flexibility than dashboard August 12, 2012 96
  • 97. MTurk Dashboard vs. API • Dashboard – Easy to prototype – Setup and launch an experiment in a few minutes • API – Ability to integrate AMT as part of a system – Ideal if you want to run experiments regularly – Schedule tasks August 12, 2012 97
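For the API route, the sketch below shows roughly what programmatic HIT creation looks like. It uses boto3's MTurk client (a later Python SDK than the Boto library listed further down); the sandbox endpoint, reward, HIT metadata, and ExternalQuestion URL are placeholder values, so treat this as an illustration rather than a drop-in script:

    import boto3

    # ExternalQuestion: the task UI is served from your own (placeholder) URL.
    QUESTION_XML = """
    <ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.com/relevance-judging</ExternalURL>
      <FrameHeight>600</FrameHeight>
    </ExternalQuestion>
    """

    # Point at the requester sandbox while iterating on the task design.
    mturk = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )

    hit = mturk.create_hit(
        Title="Judge the relevance of a web page to a query",
        Description="Read a query and a web page, then indicate whether the page is relevant.",
        Keywords="search, relevance, judging",
        Reward="0.02",
        MaxAssignments=5,                  # redundant judgments per document
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=86400,
        Question=QUESTION_XML,
    )
    print(hit["HIT"]["HITId"])

The same client can then retrieve submitted assignments, approve or reject work, and pay bonuses, which is what makes scheduled, repeatable experiments practical.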
  • 98. • Multiple Channels • Gold-based tests • Only pay for “trusted” judgments August 12, 2012 98
  • 99. CloudFactory • Information below from Mark Sears (Oct. 18, 2011) • Cloud Labor API – Tools to design virtual assembly lines – workflows with multiple tasks chained together • Focus on self-serve tools for people to easily design crowd-powered assembly lines that can be easily integrated into software applications • Interfaces: command-line, RESTful API, and Web • Each “task station” can have either a human or robot worker assigned – web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can be combined with human computation • Many built-in "best practices" – “Tournament Stations” where multiple results are compared by other cloud workers until confidence in the best answer is reached – “Improver Stations” have workers improve and correct work by other workers – Badges are earned by cloud workers passing tests created by requesters – Training and tools to create skill tests will be flexible – Algorithms to detect and kick out spammers/cheaters/lazy/bad workers August 12, 2012 99
  • 100. More Crowd Labor Platforms • Clickworker • CloudCrowd • CrowdSource • DoMyStuff • Humanoid (by Matt Swason et al.) • Microtask • MobileWorks (by Anand Kulkarni ) • myGengo • SmartSheet • vWorker • Industry heavy-weights – Elance – Liveops – oDesk – uTest • and more… August 12, 2012 100
  • 101. Platform alternatives • Why MTurk – Amazon brand, lots of research papers – Speed, price, diversity, payments • Why not – Crowdsourcing != MTurk – Spam, no analytics, must build tools for worker & task quality • Microsoft Universal Human Relevance System (UHRS) • How to build your own crowdsourcing platform – Back-end – Template language for creating experiments – Scheduler – Payments? August 12, 2012 101
  • 102. Why Micro-Tasks? • Easy, cheap and fast • Ready-to use infrastructure, e.g. – MTurk payments, workforce, interface widgets – CrowdFlower quality control mechanisms, etc. – Many others … • Allows early, iterative, frequent trials – Iteratively prototype and test new ideas – Try new tasks, test when you want & as you go • Many successful examples of use reported August 12, 2012 102
  • 103. Micro-Task Issues • Process – Task design, instructions, setup, iteration • Choose crowdsourcing platform (or roll your own) • Human factors – Payment / incentives, interface and interaction design, communication, reputation, recruitment, retention • Quality Control / Data Quality – Trust, reliability, spam detection, consensus labeling August 12, 2012 103
  • 106. Kulkarni et al., CSCW 2012 Turkomatic August 12, 2012 106
  • 107. CrowdForge: Workers perform a task or further decompose them Kittur et al., CHI 2011 August 12, 2012 107
  • 108. Kittur et al., CrowdWeaver, CSCW 2012 August 12, 2012 108
  • 111. Typical Workflow • Define and design what to test • Sample data • Design the experiment • Run experiment • Collect data and analyze results • Quality control August 12, 2012 111
  • 112. Development Framework • Incremental approach • Measure, evaluate, and adjust as you go • Suitable for repeatable tasks August 12, 2012 112
  • 113. Survey Design • One of the most important parts • Part art, part science • Instructions are key • Prepare to iterate August 12, 2012 113
  • 114. Questionnaire Design • Ask the right questions • Workers may not be IR experts so don’t assume the same understanding in terms of terminology • Show examples • Hire a technical writer – Engineer writes the specification – Writer communicates August 12, 2012 114
  • 115. UX Design • Time to apply all those usability concepts • Generic tips – Experiment should be self-contained. – Keep it short and simple. Brief and concise. – Be very clear with the relevance task. – Engage with the worker. Avoid boring stuff. – Always ask for feedback (open-ended question) in an input box. August 12, 2012 115
  • 116. UX Design - II • Presentation • Document design • Highlight important concepts • Colors and fonts • Need to grab attention • Localization August 12, 2012 116
  • 117. Examples - I • Asking too much, task not clear, “do NOT/reject” • Worker has to do a lot of stuff August 12, 2012 117
  • 118. Example - II • Lot of work for a few cents • Go here, go there, copy, enter, count … August 12, 2012 118
  • 119. A Better Example • All information is available – What to do – Search result – Question to answer August 12, 2012 119
  • 121. Form and Metadata • Form with a close question (binary relevance) and open-ended question (user feedback) • Clear title, useful keywords • Workers need to find your task August 12, 2012 121
  • 122. Relevance Judging – Example I August 12, 2012 122
  • 123. Relevance Judging – Example II August 12, 2012 123
  • 124. Implementation • Similar to a UX • Build a mock up and test it with your team – Yes, you need to judge some tasks • Incorporate feedback and run a test on MTurk with a very small data set – Time the experiment – Do people understand the task? • Analyze results – Look for spammers – Check completion times • Iterate and modify accordingly August 12, 2012 124
  • 125. Implementation – II • Introduce quality control – Qualification test – Gold answers (honey pots) • Adjust passing grade and worker approval rate • Run experiment with new settings & same data • Scale on data • Scale on workers August 12, 2012 125
  • 126. Experiment in Production • Lots of tasks on MTurk at any moment • Need to grab attention • Importance of experiment metadata • When to schedule – Split a large task into batches and have 1 single batch in the system – Always review feedback from batch n before uploading n+1 August 12, 2012 126
  • 127. Other design principles • Text alignment • Legibility • Reading level: complexity of words and sentences • Attractiveness (worker’s attention & enjoyment) • Multi-cultural / multi-lingual • Who is the audience (e.g. target worker community) – Special needs communities (e.g. simple color blindness) • Parsimony • Cognitive load: mental rigor needed to perform task • Exposure effect August 12, 2012 127
  • 128. The human side • As a worker – I hate when instructions are not clear – I’m not a spammer – I just don’t get what you want – Boring task – A good pay is ideal but not the only condition for engagement • As a requester – Attrition – Balancing act: a task that would produce the right results and is appealing to workers – I want your honest answer for the task – I want qualified workers; system should do some of that for me • Managing crowds and tasks is a daily activity – more difficult than managing computers August 12, 2012 128
  • 129. Things that work • Qualification tests • Honey-pots • Good content and good presentation • Economy of attention • Things to improve – Manage workers in different levels of expertise including spammers and potential cases. – Mix different pools of workers based on different profile and expertise levels. August 12, 2012 129
  • 130. Things that need work • UX and guidelines – Help the worker – Cost of interaction • Scheduling and refresh rate • Exposure effect • Sometimes we just don’t agree • How crowdsourcable is your task August 12, 2012 130
  • 131. RELEVANCE JUDGING & CROWDSOURCING August 12, 2012 131
  • 133. Motivating Example: Relevance Judging • Relevance of search results is difficult to judge – Highly subjective – Expensive to measure • Professional editors commonly used • Potential benefits of crowdsourcing – Scalability (time and cost) – Diversity of judgments August 12, 2012 133
  • 135. Started with a joke … August 12, 2012 135
  • 136. Results for {idiot} at WSDM 2011 February 2011: 5/7 (R), 2/7 (NR) – Most of the time those TV reality stars have absolutely no talent. They do whatever they can to make a quick dollar. Most of the time the reality tv stars don not have a mind of their own. R – Most are just celebrity wannabees. Many have little or no talent, they just want fame. R – I can see this one going both ways. A particular sort of reality star comes to mind, though, one who was voted off Survivor because he chose not to use his immunity necklace. Sometimes the label fits, but sometimes it might be unfair. R – Just because someone else thinks they are an "idiot", doesn't mean that is what the word means. I don't like to think that any one person's photo would be used to describe a certain term. NR – While some reality-television stars are genuinely stupid (or cultivate an image of stupidity), that does not mean they can or should be classified as "idiots." Some simply act that way to increase their TV exposure and potential earnings. Other reality-television stars are really intelligent people, and may be considered as idiots by people who don't like them or agree with them. It is too subjective an issue to be a good result for a search engine. NR – Have you seen the knuckledraggers on reality television? They should be required to change their names to idiot after appearing on the show. You could put numbers after the word idiot so we can tell them apart. R – Although I have not followed too many of these shows, those that I have encountered have for a great part a very common property. That property is that most of the participants involved exhibit a shallow self-serving personality that borders on social pathological behavior. To perform or act in such an abysmal way could only be an act of an idiot. R August 12, 2012 136
  • 137. Two Simple Examples of MTurk 1. Ask workers to classify a query 2. Ask workers to judge document relevance Steps • Define high-level task • Design & implement interface & backend • Launch, monitor progress, and assess work • Iterate design August 12, 2012 137
  • 138. Query Classification Task • Ask the user to classify a query • Show a form that contains a few categories • Upload a few queries (~20) • Use 3 workers August 12, 2012 138
  • 141. Relevance Judging Task • Use a few documents from a standard collection used for evaluating search engines • Ask user to make binary judgments • Modification: graded judging • Use 5 workers August 12, 2012 141
  • 143. Content quality • People like to work on things that they like • TREC ad-hoc vs. INEX – TREC experiments took twice as long to complete – INEX (Wikipedia), TREC (LA Times, FBIS) • Topics – INEX: Olympic games, movies, salad recipes, etc. – TREC: cosmic events, Schengen agreement, etc. • Content and judgments according to modern times – Airport security docs are pre 9/11 – Antarctic exploration (global warming) August 12, 2012 143
  • 144. Content quality - II • Document length • Randomize content • Avoid worker fatigue – Judging 100 documents on the same subject can be tiring, leading to decreasing quality August 12, 2012 144
  • 145. Presentation • People scan documents for relevance cues • Document design • Highlighting no more than 10% August 12, 2012 145
  • 146. Presentation - II August 12, 2012 146
  • 147. Relevance justification • Why settle for a label? • Let workers justify answers – cf. Zaidan et al. (2007) “annotator rationales” • INEX – 22% of assignments with comments • Must be optional • Let’s see how people justify August 12, 2012 147
  • 148. “Relevant” answers [Salad Recipes] Doesn't mention the word 'salad', but the recipe is one that could be considered a salad, or a salad topping, or a sandwich spread. Egg salad recipe Egg salad recipe is discussed. History of salad cream is discussed. Includes salad recipe It has information about salad recipes. Potato Salad Potato salad recipes are listed. Recipe for a salad dressing. Salad Recipes are discussed. Salad cream is discussed. Salad info and recipe The article contains a salad recipe. The article discusses methods of making potato salad. The recipe is for a dressing for a salad, so the information is somewhat narrow for the topic but is still potentially relevant for a researcher. This article describes a specific salad. Although it does not list a specific recipe, it does contain information relevant to the search topic. gives a recipe for tuna salad relevant for tuna salad recipes relevant to salad recipes this is on-topic for salad recipes August 12, 2012 148
  • 149. “Not relevant” answers [Salad Recipes] About gaming not salad recipes. Article is about Norway. Article is about Region Codes. Article is about forests. Article is about geography. Document is about forest and trees. Has nothing to do with salad or recipes. Not a salad recipe Not about recipes Not about salad recipes There is no recipe, just a comment on how salads fit into meal formats. There is nothing mentioned about salads. While dressings should be mentioned with salads, this is an article on one specific type of dressing, no recipe for salads. article about a swiss tv show completely off-topic for salad recipes not a salad recipe not about salad recipes totally off base August 12, 2012 149
  • 151. Feedback length • Workers will justify answers • Has to be optional for good feedback • In E51, mandatory comments – Length dropped – “Relevant” or “Not Relevant” August 12, 2012 151
  • 152. Was the task difficult? • Ask workers to rate difficulty of a search topic • 50 topics; 5 workers, $0.01 per task August 12, 2012 152
  • 154. When to assess quality of work • Beforehand (prior to main task activity) – How: “qualification tests” or similar mechanism – Purpose: screening, selection, recruiting, training • During – How: assess labels as worker produces them • Like random checks on a manufacturing line – Purpose: calibrate, reward/penalize, weight • After – How: compute accuracy metrics post-hoc – Purpose: filter, calibrate, weight, retain (HR) – E.g. Jung & Lease (2011), Tang & Lease (2011), ... August 12, 2012 154
  • 155. How do we measure work quality? • Compare worker’s label vs. – Known (correct, trusted) label – Other workers’ labels • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers? Sept. 2010. – Model predictions of the above • Model the labels (Ryu & Lease, ASIS&T11) • Model the workers (Chen et al., AAAI’10) • Verify worker’s label – Yourself – Tiered approach (e.g. Find-Fix-Verify) • Quinn and B. Bederson’09, Bernstein et al.’10 August 12, 2012 155
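As a concrete illustration of the gold-comparison option, here is a minimal Python sketch (worker IDs, documents, and labels are invented) that scores each worker only on the trusted items mixed into the batch:

    from collections import defaultdict

    gold = {"doc1": "R", "doc2": "NR", "doc3": "R"}        # trusted (honey-pot) labels
    labels = [                                             # (worker, item, label)
        ("w1", "doc1", "R"), ("w1", "doc2", "NR"), ("w1", "doc3", "R"),
        ("w2", "doc1", "NR"), ("w2", "doc2", "R"), ("w2", "doc3", "R"),
    ]

    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in labels:
        if item in gold:                   # only gold items count toward the score
            seen[worker] += 1
            hits[worker] += (label == gold[item])

    accuracy = {w: hits[w] / seen[w] for w in seen}
    print(accuracy)                        # {'w1': 1.0, 'w2': 0.333...}

The resulting per-worker accuracies can feed the filtering, calibration, or weighting steps mentioned above (and the weighted vote sketched a few slides later).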
  • 156. Typical Assumptions • Objective truth exists – no minority voice / rare insights – Can relax this to model “truth distribution” • Automatic answer comparison/evaluation – What about free text responses? Hope from NLP… • Automatic essay scoring • Translation (BLEU: Papineni, ACL’2002) • Summarization (Rouge: C.Y. Lin, WAS’2004) – Have people do it (yourself or find-verify crowd, etc.) August 12, 2012 156
  • 157. Distinguishing Bias vs. Noise • Ipeirotis (HComp 2010) • People often have consistent, idiosyncratic skews in their labels (bias) – E.g. I like action movies, so they get higher ratings • Once detected, systematic bias can be calibrated for and corrected (yeah!) • Noise, however, seems random & inconsistent – this is the real issue we want to focus on August 12, 2012 157
  • 158. Comparing to known answers • AKA: gold, honey pot, verifiable answer, trap • Assumes you have known answers • Cost vs. Benefit – Producing known answers (experts?) – % of work spent re-producing them • Finer points – Controls against collusion – What if workers recognize the honey pots? August 12, 2012 158
  • 159. Comparing to other workers • AKA: consensus, plurality, redundant labeling • Well-known metrics for measuring agreement • Cost vs. Benefit: % of work that is redundant • Finer points – Is consensus “truth” or systematic bias of group? – What if no one really knows what they’re doing? • Low-agreement across workers indicates problem is with the task (or a specific example), not the workers – Risk of collusion • Sheng et al. (KDD 2008) August 12, 2012 159
  • 160. Comparing to predicted label • Ryu & Lease, ASIS&T11 (CrowdConf’11 poster) • Catch-22 extremes – If model is really bad, why bother comparing? – If model is really good, why collect human labels? • Exploit model confidence – Trust predictions proportional to confidence – What if model very confident and wrong? • Active learning – Time sensitive: Accuracy / confidence changes August 12, 2012 160
  • 161. Compare to predicted worker labels • Chen et al., AAAI’10 • Avoid inefficiency of redundant labeling – See also: Dekel & Shamir (COLT’2009) • Train a classifier for each worker • For each example labeled by a worker – Compare to predicted labels for all other workers • Issues • Sparsity: workers have to stick around to train model… • Time-sensitivity: New workers & incremental updates? August 12, 2012 161
  • 162. Methods for measuring agreement • What to look for – Agreement, reliability, validity • Inter-agreement level – Agreement between judges – Agreement between judges and the gold set • Some statistics – Percentage agreement – Cohen’s kappa (2 raters) – Fleiss’ kappa (any number of raters) – Krippendorff’s alpha • With majority vote, what if 2 say relevant, 3 say not? – Use expert to break ties (Kochhar et al, HCOMP’10; GQR) – Collect more judgments as needed to reduce uncertainty August 12, 2012 162
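For two raters, percentage agreement and Cohen's kappa are easy to compute by hand. A small worked example in Python with made-up relevance labels (R / N), complementing the R snippet two slides down:

    from collections import Counter

    j1 = list("RRRNNRNRRN")      # judge 1, 10 documents
    j2 = list("RRNNNRNRNN")      # judge 2, same documents

    n = len(j1)
    p_o = sum(a == b for a, b in zip(j1, j2)) / n      # observed agreement

    # expected agreement if the judges labeled independently at their own rates
    c1, c2 = Counter(j1), Counter(j2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(j1) | set(j2))

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(p_o, 2), round(p_e, 2), round(kappa, 2))   # 0.8 0.48 0.62

Raw agreement looks high (0.8), but kappa corrects for chance agreement and lands at 0.62, which the interpretation table below would call substantial.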
  • 163. Inter-rater reliability • Lots of research • Statistics books cover most of the material • Three categories based on the goals – Consensus estimates – Consistency estimates – Measurement estimates August 12, 2012 163
  • 164. Sample code – R packages psy and irr
    > library(psy)
    > library(irr)
    > my_data <- read.delim(file="test.txt", header=TRUE, sep="\t")
    > kappam.fleiss(my_data, exact=FALSE)
    > my_data2 <- read.delim(file="test2.txt", header=TRUE, sep="\t")
    > ckappa(my_data2)
    August 12, 2012 164
  • 165. k coefficient • Different interpretations of k • For practical purposes you need to be >= moderate • Results may vary
    k              Interpretation
    < 0            Poor agreement
    0.01 – 0.20    Slight agreement
    0.21 – 0.40    Fair agreement
    0.41 – 0.60    Moderate agreement
    0.61 – 0.80    Substantial agreement
    0.81 – 1.00    Almost perfect agreement
    August 12, 2012 165
  • 166. Detection Theory • Sensitivity measures – High sensitivity: good ability to discriminate – Low sensitivity: poor ability
    Stimulus class    “Yes”           “No”
    S2 (signal)       Hits            Misses
    S1 (noise)        False alarms    Correct rejections
    Hit rate H = P(“yes”|S2); False alarm rate F = P(“yes”|S1)
    August 12, 2012 166
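A tiny numeric illustration of the two rates (counts invented):

    # "yes" responses on signal (S2) and noise (S1) trials for one worker
    yes_on_s2, n_s2 = 40, 50
    yes_on_s1, n_s1 = 10, 50

    H = yes_on_s2 / n_s2      # hit rate,         P("yes" | S2) = 0.8
    F = yes_on_s1 / n_s1      # false-alarm rate, P("yes" | S1) = 0.2
    print(H, F)

A worker who says “yes” to nearly everything shows both a high H and a high F; sensitivity comes from the gap between the two.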
  • 168. Finding Consensus • When multiple workers disagree on the correct label, how do we resolve this? – Simple majority vote (or average and round) – Weighted majority vote (e.g. naive bayes) • Many papers from machine learning… • If wide disagreement, likely there is a bigger problem which consensus doesn’t address August 12, 2012 168
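A minimal sketch of both aggregation rules, with invented votes and per-worker weights (e.g., accuracies estimated from gold checks):

    from collections import defaultdict

    votes = [("w1", "R"), ("w2", "NR"), ("w3", "R"), ("w4", "NR"), ("w5", "NR")]
    weight = {"w1": 0.90, "w2": 0.60, "w3": 0.85, "w4": 0.55, "w5": 0.50}

    plain, weighted = defaultdict(int), defaultdict(float)
    for worker, label in votes:
        plain[label] += 1
        weighted[label] += weight.get(worker, 0.5)   # unknown workers get a neutral weight

    print(max(plain, key=plain.get))        # NR (3 of 5 raw votes)
    print(max(weighted, key=weighted.get))  # R  (1.75 vs 1.65 after weighting)

The flip between the two answers is the point: once worker reliability is taken into account, a numeric minority of trusted workers can outvote a majority of weak ones.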
  • 169. Quality Control on MTurk • Rejecting work & Blocking workers (more later…) – Requestors don’t want bad PR or complaint emails – Common practice: always pay, block as needed • Approval rate: easy to use, but value? – P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes. Oct. 2010 – Many requestors don’t ever reject… • Qualification test – Pre-screen workers’ capabilities & effectiveness – Example and pros/cons in next slides… • Geographic restrictions • Mechanical Turk Masters (June 23, 2011) – Recent addition, degree of benefit TBD… August 12, 2012 169
  • 171. Quality Control in General • Extremely important part of the experiment • Approach as “overall” quality; not just for workers • Bi-directional channel – You may think the worker is doing a bad job. – The same worker may think you are a lousy requester. August 12, 2012 171
  • 172. Tools and Packages for MTurk • QA infrastructure layers atop MTurk promote useful separation-of-concerns from task – TurkIt • Quik Turkit provides nearly realtime services – Turkit-online (??) – Get Another Label (& qmturk) – Turk Surveyor – cv-web-annotation-toolkit (image labeling) – Soylent – Boto (python library) • Turkpipe: submit batches of jobs using the command line. • More needed… August 12, 2012 172
  • 173. A qualification test snippet
    <Question>
      <QuestionIdentifier>question1</QuestionIdentifier>
      <QuestionContent>
        <Text>Carbon monoxide poisoning is</Text>
      </QuestionContent>
      <AnswerSpecification>
        <SelectionAnswer>
          <StyleSuggestion>radiobutton</StyleSuggestion>
          <Selections>
            <Selection>
              <SelectionIdentifier>1</SelectionIdentifier>
              <Text>A chemical technique</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>2</SelectionIdentifier>
              <Text>A green energy treatment</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>3</SelectionIdentifier>
              <Text>A phenomena associated with sports</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>4</SelectionIdentifier>
              <Text>None of the above</Text>
            </Selection>
          </Selections>
        </SelectionAnswer>
      </AnswerSpecification>
    </Question>
    August 12, 2012 173
  • 174. Qualification tests: pros and cons • Advantages – Great tool for controlling quality – Adjust passing grade • Disadvantages – Extra cost to design and implement the test – May turn off workers, hurt completion time – Refresh the test on a regular basis – Hard to verify subjective tasks like judging relevance • Try creating task-related questions to get worker familiar with task before starting task in earnest August 12, 2012 174
  • 175. More on quality control & assurance • HR issues: recruiting, selection, & retention – e.g., post/tweet, design a better qualification test, bonuses, … • Collect more redundant judgments… – at some point defeats cost savings of crowdsourcing – 5 workers is often sufficient August 12, 2012 175
  • 176. Robots and Captchas • Some reports of robots on MTurk – E.g. McCreadie et al. (2011) – violation of terms of service – Artificial artificial artificial intelligence • Captchas seem ideal, but… – There is abuse of robots using turkers to solve captchas so they can access web resources – Turker wisdom is therefore to avoid such HITs • What to do? – Use standard captchas, notify workers – Block robots other ways (e.g. external HITs) – Catch robots through standard QC, response times – Use HIT-specific captchas (Kazai et al., 2011) August 12, 2012 176
  • 177. Other quality heuristics • Justification/feedback as quasi-captcha – Successfully proven in past experiments – Should be optional – Automatically verifying feedback was written by a person may be difficult (classic spam detection task) • Broken URL/incorrect object – Leave an outlier in the data set – Workers will tell you – If somebody answers “excellent” on a graded relevance test for a broken URL => probably spammer August 12, 2012 177
  • 178. Dealing with bad workers • Pay for “bad” work instead of rejecting it? – Pro: preserve reputation, admit if poor design at fault – Con: promote fraud, undermine approval rating system • Use bonus as incentive – Pay the minimum $0.01 and $0.01 for bonus – Better than rejecting a $0.02 task • If spammer “caught”, block from future tasks – May be easier to always pay, then block as needed August 12, 2012 178
  • 179. Worker feedback • Real feedback received via email after rejection • Worker XXX I did. If you read these articles most of them have nothing to do with space programs. I’m not an idiot. • Worker XXX As far as I remember there wasn't an explanation about what to do when there is no name in the text. I believe I did write a few comments on that, too. So I think you're being unfair rejecting my HITs. August 12, 2012 179
  • 180. Real email exchange with worker after rejection WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes of time ,power and lot more and gave me 2 rejections at least you may keep it pending. please show some respect to turkers REQUESTER: I'm sorry about the rejection. However, in the directions given in the hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5 webpages below *AND* complete a minimum of three HITs. Unfortunately, because you only completed two hits, we had to reject those hits. We do this because we need a certain amount of data on which to make decisions about judgment quality. I'm sorry if this caused any distress. Feel free to contact me if you have any additional questions or concerns. WORKER: I understood the problems. At that time my kid was crying and i went to look after. that's why i responded like that. I was very much worried about a hit being rejected. The real fact is that i haven't seen that instructions of 5 web page and started doing as i do the dolores labs hit, then someone called me and i went to attend that call. sorry for that and thanks for your kind concern. August 12, 2012 180
  • 181. Exchange with worker • Worker XXX Thank you. I will post positive feedback for you at Turker Nation. Me: was this a sarcastic comment? • I took a chance by accepting some of your HITs to see if you were a trustworthy author. My experience with you has been favorable so I will put in a good word for you on that website. This will help you get higher quality applicants in the future, which will provide higher quality work, which might be worth more to you, which hopefully means higher HIT amounts in the future. August 12, 2012 181
  • 182. Build Your Reputation as a Requestor • Word of mouth effect – Workers trust the requester (pay on time, clear explanation if there is a rejection) – Experiments tend to go faster – Announce forthcoming tasks (e.g. tweet) • Disclose your real identity? August 12, 2012 182
  • 183. Other practical tips • Sign up as worker and do some HITs • “Eat your own dog food” • Monitor discussion forums • Address feedback (e.g., poor guidelines, payments, passing grade, etc.) • Everything counts! – Overall design only as strong as weakest link August 12, 2012 183
  • 184. Conclusions • But one may say “this is all good but looks like a ton of work” • The original goal: data is king • Data quality and experimental designs are preconditions to make sure we get the right stuff • Data will later be used for rankers, ML models, evaluations, etc. • Don’t cut corners August 12, 2012 184
  • 185. THE ROAD AHEAD August 12, 2012 185
  • 186. What about sensitive data? • Not all data can be publicly disclosed – User data (e.g. AOL query log, Netflix ratings) – Intellectual property – Legal confidentiality • Need to restrict who is in your crowd – Separate channel (workforce) from technology – Hot question for adoption at enterprise level August 12, 2012 186
  • 187. Wisdom of Crowds (WoC) Requires • Diversity • Independence • Decentralization • Aggregation Input: large, diverse sample (to increase likelihood of overall pool quality) Output: consensus or selection (aggregation) August 12, 2012 187
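A minimal sketch of the aggregation step described above, assuming independent workers each give one categorical label per item (the labels and function name are illustrative, not from the tutorial):

```python
# Aggregate independent worker labels for one item by simple majority vote.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among a worker pool's answers."""
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical judgments from five workers for one query-document pair.
worker_labels = ["relevant", "relevant", "not relevant", "relevant", "not relevant"]
print(majority_vote(worker_labels))  # -> "relevant"
```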
• 188. WoC vs. Ensemble Learning • Combine multiple models to improve performance over any constituent model – Can use many weak learners to make a strong one – Compensate for poor models with extra computation • Works better with diverse, independent learners • cf. NIPS 2010-2011 Workshops – Computational Social Science & the Wisdom of Crowds • More investigation needed of traditional feature-based machine learning & ensemble methods for consensus labeling with crowdsourcing August 12, 2012 188
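One ensemble-flavored variant of consensus labeling, sketched under the assumption that we already have per-worker accuracy estimates (e.g., from gold questions); workers are treated like weak learners whose votes are weighted:

```python
# Weighted vote: each worker's vote counts in proportion to estimated accuracy.
from collections import defaultdict

def weighted_vote(votes, accuracy):
    """votes: dict worker_id -> label; accuracy: dict worker_id -> value in [0,1]."""
    scores = defaultdict(float)
    for worker, label in votes.items():
        scores[label] += accuracy.get(worker, 0.5)  # unknown workers count as chance
    return max(scores, key=scores.get)

votes = {"w1": "relevant", "w2": "not relevant", "w3": "relevant"}
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.55}
print(weighted_vote(votes, accuracy))  # -> "relevant"
```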
  • 189. Active Learning • Minimize number of labels to achieve goal accuracy rate of classifier – Select examples to label to maximize learning • Vijayanarasimhan and Grauman (CVPR 2011) – Simple margin criteria: select maximally uncertain examples to label next – Finding which examples are uncertain can be computationally intensive (workers have to wait) – Use locality-sensitive hashing to find uncertain examples in sub-linear time August 12, 2012 189
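A sketch of the simple margin criterion mentioned above: pick the unlabeled examples whose top two class probabilities are closest, i.e. the most uncertain ones. The toy classifier output below is a placeholder; it is not the Vijayanarasimhan & Grauman system, which additionally uses locality-sensitive hashing to find these examples quickly.

```python
import numpy as np

def select_most_uncertain(probs, k=5):
    """probs: (n_examples, n_classes) predicted class probabilities."""
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # small margin = uncertain
    return np.argsort(margin)[:k]                # indices to send to the crowd next

probs = np.random.dirichlet([1, 1], size=100)    # toy binary predictions
print(select_most_uncertain(probs, k=3))
```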
  • 190. Active Learning (2) • V&G report each learning iteration ~ 75 min – 15 minutes for model training & selection – 60 minutes waiting for crowd labels • Leaving workers idle may lose them, slowing uptake and completion times • Keep workers occupied – Mason and Suri (2010): paid waiting room – Laws et al. (EMNLP 2011): parallelize labeling and example selection via producer-consumer model • Workers consume examples, produce labels • Model consumes label, produces examples August 12, 2012 190
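A minimal sketch of the producer-consumer idea from Laws et al. (2011): the model keeps a queue of candidate examples topped up so crowd workers never sit idle while retraining and selection run. Worker behavior and timing here are simulated placeholders.

```python
import queue, threading, time, random

examples = queue.Queue(maxsize=10)   # model -> workers
labels = queue.Queue()               # workers -> model

def model_producer(n=30):
    for i in range(n):
        examples.put(f"example-{i}")     # in practice: next most uncertain item
        time.sleep(0.01)                 # stand-in for selection/retraining cost

def crowd_consumer():
    while True:
        item = examples.get()
        if item is None:                 # sentinel: no more work
            break
        labels.put((item, random.choice(["relevant", "not relevant"])))

threading.Thread(target=model_producer, daemon=True).start()
worker = threading.Thread(target=crowd_consumer, daemon=True)
worker.start()

time.sleep(1)
examples.put(None)                       # stop the simulated worker
worker.join()
print(labels.qsize(), "labels collected")
```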
  • 191. Query execution • So you want to combine CPU + HPU in a DB? • Crowd can answer difficult queries • Query processing with human computation • Long term goal – When to switch from CPU to HPU and vice versa August 12, 2012 191
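One way to picture the CPU/HPU switch is a hybrid predicate evaluator: answer with the machine when it is confident, otherwise fall back to the crowd. The classifier, threshold, and ask_crowd() below are illustrative stand-ins, not a real crowd or database API.

```python
def machine_predicate(row):
    # Toy "classifier": returns (answer, confidence).
    return ("match" in row["text"], 0.55)

def ask_crowd(row):
    # Stand-in for posting a HIT and aggregating worker judgments.
    return True

def hybrid_filter(rows, threshold=0.8):
    kept = []
    for row in rows:
        answer, confidence = machine_predicate(row)
        if confidence < threshold:       # CPU unsure -> switch to HPU
            answer = ask_crowd(row)
        if answer:
            kept.append(row)
    return kept

print(hybrid_filter([{"text": "a match here"}, {"text": "unclear case"}]))
```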
  • 192. MapReduce with human computation • Commonalities – Large task divided into smaller sub-problems – Work distributed among worker nodes (workers) – Collect all answers and combine them – Varying performance of heterogeneous CPUs/HPUs • Variations – Human response latency / size of “cluster” – Some tasks are not suitable August 12, 2012 192
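A sketch of how the MapReduce pattern maps onto human computation: split a large task into micro-HITs ("map"), gather redundant worker answers, then combine them ("reduce"). post_hit() below simulates assigning a sub-task to a worker; it is not a real platform call.

```python
from collections import Counter
import random

def split_task(documents):                # "input splits"
    return [{"doc": d} for d in documents]

def post_hit(subtask):                    # "map" step, executed by an HPU worker
    return random.choice(["relevant", "not relevant"])

def reduce_answers(answers):              # "reduce" step: consensus over workers
    return Counter(answers).most_common(1)[0][0]

docs = ["doc1", "doc2", "doc3"]
per_doc = {}
for sub in split_task(docs):
    votes = [post_hit(sub) for _ in range(3)]   # 3 redundant assignments
    per_doc[sub["doc"]] = reduce_answers(votes)
print(per_doc)
```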
  • 193. A Few Questions • How should we balance automation vs. human computation? Which does what? • Who’s the right person for the job? • How do we handle complex tasks? Can we decompose them into smaller tasks? How? August 12, 2012 193
• 194. Research problems – operational • Methodology – Budget, people, documents, queries, presentation, incentives, etc. – Scheduling – Quality • What's the best "mix" of HC for a task? • Which tasks are suitable for HC? • Can I crowdsource my task? – Eickhoff and de Vries, WSDM 2011 CSDM Workshop August 12, 2012 194
• 195. More problems • Human factors vs. outcomes • Editors vs. workers • Pricing tasks • Predicting worker quality from observable properties (e.g. task completion time) • HIT / Requester ranking or recommendation • Expert search: who are the right workers given task nature and constraints? • Ensemble methods for Crowd Wisdom consensus August 12, 2012 195
  • 196. Problems: crowds, clouds and algorithms • Infrastructure – Current platforms are very rudimentary – No tools for data analysis • Dealing with uncertainty (propagate rather than mask) – Temporal and labeling uncertainty – Learning algorithms – Search evaluation – Active learning (which example is likely to be labeled correctly) • Combining CPU + HPU – Human Remote Call? – Procedural vs. declarative? – Integration points with enterprise systems August 12, 2012 196
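A small sketch of "propagate rather than mask": instead of collapsing worker votes into one hard label, keep a soft label (an empirical distribution) that downstream learning or evaluation can weight by. The data below is illustrative.

```python
from collections import Counter

def soft_label(votes):
    """Return a label -> probability dict from raw worker votes."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

votes = ["relevant", "relevant", "not relevant"]
print(soft_label(votes))  # e.g. {'relevant': 0.67, 'not relevant': 0.33} (rounded)
# A learner can then use P(relevant) as an example weight rather than
# treating the item as certainly relevant.
```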
  • 197. Algorithms • Bandit problems; explore-exploit • Optimizing amount of work by workers – Humans have limited throughput – Harder to scale than machines • Selecting the right crowds • Stopping rule August 12, 2012 197
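An epsilon-greedy sketch of the explore/exploit idea above: route each task to the worker pool with the best observed accuracy, but keep exploring other pools with probability epsilon. The pools and their qualities are simulated, not real crowd channels.

```python
import random

true_quality = {"pool_a": 0.6, "pool_b": 0.8, "pool_c": 0.7}   # unknown to the requester
stats = {p: {"correct": 0, "n": 0} for p in true_quality}

def choose_pool(epsilon=0.1):
    if random.random() < epsilon or all(s["n"] == 0 for s in stats.values()):
        return random.choice(list(stats))                       # explore
    return max(stats, key=lambda p: stats[p]["correct"] / max(stats[p]["n"], 1))

for _ in range(500):
    pool = choose_pool()
    correct = random.random() < true_quality[pool]   # simulated gold-question check
    stats[pool]["n"] += 1
    stats[pool]["correct"] += int(correct)

print({p: round(s["correct"] / s["n"], 2) for p, s in stats.items() if s["n"]})
```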
  • 198. BROADER CONSIDERATIONS: ETHICS, ECONOMICS, REGULATION August 12, 2012 198
  • 199. What about ethics? • Silberman, Irani, and Ross (2010) – “How should we… conceptualize the role of these people who we ask to power our computing?” – Power dynamics between parties • What are the consequences for a worker when your actions harm their reputation? – “Abstraction hides detail” • Fort, Adda, and Cohen (2011) – “…opportunities for our community to deliberately value ethics above cost savings.” August 12, 2012 199
• 201. Davis et al. (2010). The HPU. August 12, 2012 201
  • 202. HPU: “Abstraction hides detail” • Not just turning a mechanical crank August 12, 2012 202
  • 203. Micro-tasks & Task Decomposition • Small, simple tasks can be completed faster by reducing extraneous context and detail – e.g. “Can you name who is in this photo?” • Current workflow research investigates how to decompose complex tasks into simpler ones August 12, 2012 203
  • 204. Context & Informed Consent • What is the larger task I’m contributing to? • Who will benefit from it and how? August 12, 2012 204
• 205. What about regulation? • Wolfson & Lease (ASIS&T 2011) • As usual, technology is ahead of the law – employment law – patent inventorship – data security and the Federal Trade Commission – copyright ownership – securities regulation of crowdfunding • Take-away: don't panic, but be mindful – Understand risks of "just-in-time compliance" August 12, 2012 205
  • 206. Digital Dirty Jobs • NY Times: Policing the Web’s Lurid Precincts • Gawker: Facebook content moderation • CultureDigitally: The dirty job of keeping Facebook clean August 12, 2012 206
  • 207. Jeff Howe Vision vs. Reality? • Vision of empowering worker freedom: – work whenever you want for whomever you want • When $$$ is at stake, populations at risk may be compelled to perform work by others – Digital sweat shops? Digital slaves? – We really don’t know (and need to learn more…) – Traction? Human Trafficking at MSR Summit’12 August 12, 2012 207
  • 208. A DARKER SIDE TO CROWDSOURCING & HUMAN COMPUTATION August 12, 2012 208
  • 209. Putting the shoe on the other foot: Spam August 12, 2012 209
  • 210. What about trust? • Some reports of robot “workers” on MTurk – E.g. McCreadie et al. (2011) – Violates terms of service • Why not just use a captcha? August 12, 2012 210
  • 212. Requester Fraud on MTurk “Do not do any HITs that involve: filling in CAPTCHAs; secret shopping; test our web page; test zip code; free trial; click my link; surveys or quizzes (unless the requester is listed with a smiley in the Hall of Fame/Shame); anything that involves sending a text message; or basically anything that asks for any personal information at all—even your zip code. If you feel in your gut it’s not on the level, IT’S NOT. Why? Because they are scams...” August 12, 2012 212
  • 213. Defeating CAPTCHAs with crowds August 12, 2012 213
  • 214. Gaming the System: SEO, etc.
  • 216. Robert Sim, MSR Summit’12 August 12, 2012 216
  • 217. Conclusions • Crowdsourcing works and is here to stay • Fast turnaround, easy to experiment, cheap • Still have to design the experiments carefully! • Usability considerations • Worker quality • User feedback extremely useful August 12, 2012 217
• 218. Conclusions - II • Lots of opportunities to improve current platforms • Integration with current systems • While MTurk was first to market in the micro-task vertical, many other vendors are emerging with different affordances or value-added features • Many open research problems … August 12, 2012 218
  • 219. Conclusions – III • Important to know your limitations and be ready to collaborate • Lots of different skills and expertise required – Social/behavioral science – Human factors – Algorithms – Economics – Distributed systems – Statistics August 12, 2012 219
  • 221. Surveys • Ipeirotis, Panagiotis G., R. Chandrasekar, and P. Bennett. (2009). “A report on the human computation workshop (HComp).” ACM SIGKDD Explorations Newsletter 11(2). • Alex Quinn and Ben Bederson. Human Computation: A Survey and Taxonomy of a Growing Field. In Proceedings of CHI 2011. • Law and von Ahn (2011). Human Computation August 12, 2012 221
  • 222. 2013 Events Planned Research events • 1st year of HComp as AAAI conference • 2nd annual Collective Intelligence? Industrial Events • 4th CrowdConf (San Francisco, Fall) • 1st Crowdsourcing Week (Singapore, April) August 12, 2012 222
  • 223. TREC Crowdsourcing Track • Year 1 (2011) – horizontals – Task 1 (hci): collect crowd relevance judgments – Task 2 (stats): aggregate judgments – Organizers: Kazai & Lease – Sponsors: Amazon, CrowdFlower • Year 2 (2012) – content types – Task 1 (text): judge relevance – Task 2 (images): judge relevance – Organizers: Ipeirotis, Kazai, Lease, & Smucker – Sponsors: Amazon, CrowdFlower, MobileWorks August 12, 2012 223
• 224. 2012 Workshops & Conferences
• AAAI: Human Computation (HComp) (July 22-23)
• AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
• ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
• AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
• CHI: CrowdCamp (May 5-6)
• CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
• Collective Intelligence (April 18-20)
• CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)
• CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
• EC: Social Computing and User Generated Content Workshop (June 7)
• ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
• ICEC: Harnessing Collective Intelligence with Games (September)
• ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
• ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
• KDD: Workshop on Crowdsourcing and Data Mining (August 12)
• Multimedia: Crowdsourcing for Multimedia (Nov 2)
• SocialCom: Social Media for Human Computation (September 6)
• TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
• WWW: CrowdSearch: Crowdsourcing Web search (April 17)
August 12, 2012 224
  • 225. Journal Special Issues 2012 – Springer’s Information Retrieval (articles now online): Crowdsourcing for Information Retrieval – IEEE Internet Computing (articles now online): Crowdsourcing (Sept./Oct. 2012) – Hindawi’s Advances in Multimedia Journal: Multimedia Semantics Analysis via Crowdsourcing Geocontext August 12, 2012 225
  • 226. 2011 Workshops & Conferences • AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8) • ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2) • Crowdsourcing Technologies for Language and Cognition Studies (July 27) • CHI-CHC: Crowdsourcing and Human Computation (May 8) • CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”) • CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2) • Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13) • EC: Workshop on Social Computing and User Generated Content (June 5) • ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20) • Interspeech: Crowdsourcing for speech processing (August) • NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD) • SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28) • TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18) • UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18) • WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9) August 12, 2012 226
• 227. 2011 Tutorials and Keynotes
• By Omar Alonso and/or Matthew Lease
– CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
– CrowdConf: Crowdsourcing for Research and Engineering
– IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
– WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
– SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)
• AAAI: Human Computation: Core Research Questions and State of the Art – Edith Law and Luis von Ahn, August 7
• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and Conservation – Steve Kelling, October 10, ebird
• EC: Conducting Behavioral Research Using Amazon's Mechanical Turk – Winter Mason and Siddharth Suri, June 5
• HCIC: Quality Crowdsourcing for Human Computer Interaction Research – Ed Chi, June 14-18 (also see his "Crowdsourcing for HCI Research with Amazon Mechanical Turk")
• Multimedia: Frontiers in Multimedia Search – Alan Hanjalic and Martha Larson, Nov 28
• VLDB: Crowdsourcing Applications and Platforms – Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
• WWW: Managing Crowdsourced Human Computation – Panos Ipeirotis and Praveen Paritosh
August 12, 2012 227
  • 228. Thank You! Crowdsourcing news & information: ir.ischool.utexas.edu/crowd For further questions, contact us at: omar.alonso@microsoft.com ml@ischool.utexas.edu Cartoons by Mateo Burtch (buta@sonic.net) August 12, 2012 228
  • 229. Additional Literature Reviews • Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. A Survey of Crowdsourcing Systems. SocialCom 2011. • A. Doan, R. Ramakrishnan, A. Halevy. Crowdsourcing Systems on the World-Wide Web. Communications of the ACM, 2011. August 12, 2012 229
  • 230. More Books July 2010, kindle-only: “This book introduces you to the top crowdsourcing sites and outlines step by step with photos the exact process to get started as a requester on Amazon Mechanical Turk.“ August 12, 2012 230
• 231. Resources
A Few Blogs
• Behind Enemy Lines (P.G. Ipeirotis, NYU)
• Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
• CrowdFlower Blog
• http://experimentalturk.wordpress.com
• Jeff Howe
A Few Sites
• The Crowdsortium
• Crowdsourcing.org
• CrowdsourceBase (for workers)
• Daily Crowdsource
MTurk Forums and Resources
• Turker Nation: http://turkers.proboards.com
• http://www.turkalert.com (and its blog)
• Turkopticon: report/avoid shady requesters
• Amazon Forum for MTurk
August 12, 2012 231
• 232. Bibliography
• J. Barr and L. Cabrera. "AI gets a Brain", ACM Queue, May 2006.
• Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
• Bederson, B.B., Hu, C., & Resnik, P. Translation by Interactive Collaboration between Monolingual Users, Proceedings of Graphics Interface (GI 2010), 39-46.
• N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
• C. Callison-Burch. "Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk", EMNLP 2009.
• P. Dai, Mausam, and D. Weld. "Decision-Theoretic Control of Crowd-Sourced Workflows", AAAI, 2010.
• J. Davis et al. "The HPU", IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human in the Loop (ACVHL), June 2010.
• M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
• D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
• S. Hacker and L. von Ahn. "Matchin: Eliciting User Preferences with an Online Game", CHI 2009.
• J. Heer, M. Bostock. "Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design", CHI 2010.
• P. Heymann and H. Garcia-Molina. "Human Processing", Technical Report, Stanford Info Lab, 2010.
• J. Howe. "Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business". Crown Business, New York, 2008.
• P. Hsueh, P. Melville, V. Sindhwani. "Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria". NAACL HLT Workshop on Active Learning and NLP, 2009.
• B. Huberman, D. Romero, and F. Wu. "Crowdsourcing, attention and productivity". Journal of Information Science, 2009.
• P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
• P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
• P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)
August 12, 2012 232
• 233. Bibliography (2)
• A. Kittur, E. Chi, and B. Suh. "Crowdsourcing user studies with Mechanical Turk", SIGCHI 2008.
• Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011.
• Adriana Kovashka and Matthew Lease. "Human and Machine Detection of … Similarity in Art". CrowdConf 2010.
• K. Krippendorff. "Content Analysis", Sage Publications, 2003.
• G. Little, L. Chilton, M. Goldman, and R. Miller. "TurKit: Tools for Iterative Tasks on Mechanical Turk", HCOMP 2009.
• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence. 2009.
• W. Mason and D. Watts. "Financial Incentives and the 'Performance of Crowds'", HCOMP Workshop at KDD 2009.
• J. Nielsen. "Usability Engineering", Morgan-Kaufman, 1994.
• A. Quinn and B. Bederson. "A Taxonomy of Distributed Human Computation", Technical Report HCIL-2009-23, 2009.
• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. "Who are the Crowdworkers?: Shifting Demographics in Amazon Mechanical Turk". CHI 2010.
• F. Scheuren. "What is a Survey" (http://www.whatisasurvey.info) 2004.
• R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. "Cheap and Fast But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks". EMNLP-2008.
• V. Sheng, F. Provost, P. Ipeirotis. "Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers", KDD 2008.
• S. Weber. "The Success of Open Source", Harvard University Press, 2004.
• L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
• L. von Ahn and L. Dabbish. "Designing Games with a purpose". CACM, Vol. 51, No. 8, 2008.
August 12, 2012 233
• 234. Bibliography (3)
• Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and Clustering on Teachers. AAAI 2010.
• Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
• Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk. EMNLP 2011.
• C.Y. Lin. ROUGE: A package for automatic evaluation of summaries. Proceedings of the workshop on text summarization branches out (WAS), 2004.
• C. Marshall and F. Shipman. "The Ownership and Reuse of Visual Media", JCDL, 2011.
• Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
• Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
• S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. CVPR 2011.
• Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.
August 12, 2012 234
• 235. Recent Work
• Della Penna, N, and M D Reid. (2012). "Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold Standard." In Proceedings of Collective Intelligence. Arxiv preprint arXiv:1204.3511.
• Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). "ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking." 21st Annual Conference on the World Wide Web (WWW).
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). "A probabilistic framework to learn from multiple annotators with time-varying accuracy." In SIAM International Conference on Data Mining (SDM), 826-837.
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). "Efficiently learning the accuracy of labeling sources for selective sampling." In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 259-268.
• Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2):413–420.
• Ghosh, A, Satyen Kale, and Preston McAfee. (2012). "Who Moderates the Moderators? Crowdsourcing Abuse Detection in User-Generated Content." In Proceedings of the 12th ACM conference on Electronic commerce.
• Ho, C J, and J W Vaughan. (2012). "Online Task Assignment in Crowdsourcing Markets." In Twenty-Sixth AAAI Conference on Artificial Intelligence.
• Jung, Hyun Joon, and Matthew Lease. (2012). "Inferring Missing Relevance Judgments from Crowd Workers via Probabilistic Matrix Factorization." In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval.
• Kamar, E, S Hacker, and E Horvitz. (2012). "Combining Human and Machine Intelligence in Large-scale Crowdsourcing." In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
• Karger, D R, S Oh, and D Shah. (2011). "Budget-optimal task allocation for reliable crowdsourcing systems." Arxiv preprint arXiv:1110.3564.
• Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). "An Analysis of Human Factors and Label Accuracy in Crowdsourcing Relevance Judgments." Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
August 12, 2012 235
• 236. Recent Work (2)
• Lin, C.H., Mausam, and Weld, D.S. (2012). "Crowdsourcing Control: Moving Beyond Multiple Choice." In Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
• Liu, C, and Y M Wang. (2012). "TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings." In Proceedings of the 29th International Conference on Machine Learning (ICML).
• Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). "Crowdsourcing for Usability Testing." In Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
• Ramesh, A, A Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.
• Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). "Learning From Crowds." Journal of Machine Learning Research 11:1297-1322.
• Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). "Supervised learning from multiple experts: whom to trust when everyone lies a bit." In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 889-896.
• Raykar, Vikas C, and Shipeng Yu. (2012). "Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks." Journal of Machine Learning Research 13:491-518.
• Wauthier, Fabian L., and Michael I. Jordan. (2012). "Bayesian Bias Mitigation for Crowdsourcing." In Advances in Neural Information Processing Systems (NIPS).
• Weld, D.S., Mausam, and Dai, P. (2011). "Execution control for crowdsourcing." In Proceedings of the 24th ACM symposium adjunct on User interface software and technology (UIST).
• Weld, D.S., Mausam, and Dai, P. (2011). "Human Intelligence Needs Artificial Intelligence." In Proceedings of the 3rd Human Computation Workshop (HCOMP) at AAAI.
• Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). "The Multidimensional Wisdom of Crowds." In Advances in Neural Information Processing Systems (NIPS), 2424-2432.
• Welinder, Peter, and Pietro Perona. (2010). "Online crowdsourcing: rating annotators and obtaining cost-effective labels." In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
• Whitehill, J, P Ruvolo, T Wu, J Bergsma, and J Movellan. (2009). "Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise." In Advances in Neural Information Processing Systems (NIPS).
• Yan, Y, and R Rosales. (2011). "Active learning from crowds." In Proceedings of the 28th Annual International Conference on Machine Learning (ICML).
August 12, 2012 236
• 237. Crowdsourcing in IR: 2008-2010
2008
• O. Alonso, D. Rose, and B. Stewart. "Crowdsourcing for relevance evaluation", SIGIR Forum, Vol. 42, No. 2.
2009
• O. Alonso and S. Mizzaro. "Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment". SIGIR Workshop on the Future of IR Evaluation.
• P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
• G. Kazai, N. Milic-Frayling, and J. Costello. "Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments", SIGIR.
• G. Kazai and N. Milic-Frayling. "… Quality of Relevance Assessments Collected through Crowdsourcing". SIGIR Workshop on the Future of IR Evaluation.
• Law et al. "SearchWar". HCOMP.
• H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. "Improving Search Engines Using Human Computation Games", CIKM 2009.
2010
• SIGIR Workshop on Crowdsourcing for Search Evaluation.
• O. Alonso, R. Schenkel, and M. Theobald. "Crowdsourcing Assessments for XML Ranked Retrieval", ECIR.
• K. Berberich, S. Bedathur, O. Alonso, G. Weikum. "A Language Modeling Approach for Temporal Information Needs", ECIR.
• C. Grady and M. Lease. "Crowdsourcing Document Relevance Assessment with Mechanical Turk". NAACL HLT Workshop on … Amazon's Mechanical Turk.
• Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov. "Collecting High Quality Overlapping Labels at Low Cost". SIGIR.
• G. Kazai. "An Exploration of the Influence that Task Parameters Have on the Performance of Crowds". CrowdConf.
• G. Kazai. "… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books". Workshop on Very Large Digital Libraries (VLDL).
• Stephanie Nowak and Stefan Ruger. How Reliable are Annotations via Crowdsourcing? MIR.
• Jean-François Paiement, James G. Shanahan, and Remi Zajac. "Crowdsourcing Local Search Relevance". CrowdConf.
• Maria Stone and Omar Alonso. "A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation". CrowdConf.
• T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys pp. 77--90, 2010.
August 12, 2012 237
• 238. Crowdsourcing in IR: 2011
• WSDM Workshop on Crowdsourcing for Search and Data Mining.
• SIGIR Workshop on Crowdsourcing for Information Retrieval.
• 1st TREC Crowdsourcing Track.
• O. Alonso and R. Baeza-Yates. "Design and Implementation of Relevance Assessments using Crowdsourcing", ECIR 2011.
• Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran. "Repeatable and Reliable Search System Evaluation using Crowd-Sourcing". SIGIR 2011.
• Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, Kuo-Wei Chang. "Region-Based Landmark Discovery by Crowdsourcing Geo-Referenced Photos." SIGIR 2011.
• Hyun Joon Jung, Matthew Lease. "Improving Consensus Accuracy via Z-score and Weighted Voting". HCOMP 2011.
• G. Kasneci, J. Van Gael, D. Stern, and T. Graepel. CoBayes: Bayesian Knowledge Corroboration with Assessors of Unknown Areas of Expertise, WSDM 2011.
• Gabriella Kazai. "In Search of Quality in Crowdsourcing for Search Engine Evaluation", ECIR 2011.
• Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling. "Crowdsourcing for Book Search Evaluation: Impact of Quality on Comparative System Ranking." SIGIR 2011.
• Abhimanu Kumar, Matthew Lease. "Learning to Rank From a Noisy Crowd". SIGIR 2011.
• Edith Law, Paul N. Bennett, and Eric Horvitz. "The Effects of Choice in Routing Relevance Judgments". SIGIR 2011.
August 12, 2012 238