6. Mechanical Turk for Social Science Awesome
Sean Munson, Eytan Bakshy
"An API made of people!"
7. Overview
- Who are the Turkers?
- Tasks suitable for Mechanical Turk, and workarounds for tasks that are semi-suitable
- Tasks from Turkers' and requesters' points of view
- Examples: classifying links; reacting to collections of links
- Practicalities: tools; paying Turkers at UMich; human subjects
Slides will be available online.
12. 300-Turker survey from Panos Ipeirotis. Limited by self-selection issues (it reaches only people who chose to do that task, at that pay). By country: 76% US; 8% India; 3% UK; 2% Canada.
16. Ideal types of tasks
- Short duration
- Repetitive: the Turker learns once, repeats many times
- No particular expertise required
From the requester's perspective: human input is verifiable with less effort than doing the work yourself or paying an expert. E.g., for tasks that require people to write something, assess quality using multiple raters. But you can use Mechanical Turk in other ways too.
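A multi-rater quality check like the one above fits in a few lines. A sketch; the function name and the tie-breaking behavior are my own choices, not from the slides:

```python
from collections import Counter

def majority_label(labels):
    """Resolve several raters' labels for one item by majority vote.

    Returns the winning label and the fraction of raters who agreed.
    Ties break toward the label seen first (Counter preserves insertion
    order), which a real study might want to handle differently.
    """
    (label, count), = Counter(labels).most_common(1)
    return label, count / len(labels)
```

For example, `majority_label(["spam", "ham", "spam"])` returns `("spam", 2/3)`; items with low agreement can be flagged for more raters.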
17. The worker's loop: task listing (preview & select task) → complete task → get paid → automatically accept another task of this type, or go find a new task.
21. The requester's side, added to the worker's loop: create task type → load task instances (prepay). (Image: Flickr, Michelle Gibson)
27. Full lifecycle: the requester creates a task type and loads task instances (prepay); workers preview, select, and complete tasks; the requester approves or rejects the work; approved work gets paid; workers automatically accept another task of this type or go find a new one.
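The requester side of this lifecycle can be scripted against the MTurk API. A minimal sketch using the modern boto3 client (which postdates these slides); every parameter value below is an illustrative placeholder:

```python
def hit_params(title, reward_usd, n_assignments, question_xml):
    """Build keyword arguments for create_hit (pure, so it is testable
    without AWS credentials). Values are illustrative only."""
    return {
        "Title": title,
        "Description": title,               # a real task wants a fuller description
        "Reward": f"{reward_usd:.2f}",      # MTurk expects a string amount in USD
        "MaxAssignments": n_assignments,    # distinct workers per task instance
        "AssignmentDurationInSeconds": 600,
        "LifetimeInSeconds": 86400,
        "Question": question_xml,           # QuestionForm / ExternalQuestion XML
    }

def run_lifecycle(question_xml):
    import boto3  # deferred so hit_params stays usable without boto3 installed
    mturk = boto3.client("mturk")  # point endpoint_url at the sandbox to test
    hit = mturk.create_hit(**hit_params("Label a URL", 0.05, 3, question_xml))
    hit_id = hit["HIT"]["HITId"]
    # Later, once workers have submitted: approve (pay) or reject each assignment.
    for a in mturk.list_assignments_for_hit(HITId=hit_id)["Assignments"]:
        mturk.approve_assignment(AssignmentId=a["AssignmentId"])
```

Approval is what releases the prepaid funds to the worker, which is why the approve/reject step closes the loop.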
29. Large-scale study of diffusion and influence on Twitter. How does the spread of a URL over the Twitter network depend on its content? What proportion of "influential" users are mass media vs. individuals? Requires thousands of labels of URLs and users; needs to be fast and cheap.
34. Turkers as subjects: challenges
- Hard to check answer quality when you want opinions!
- Screening & treatment randomization
- mTurk is not optimized for one-time tasks
38. Adding a qualification step: workers take a qualification before they can select the task. Qualifications used here: require a 95% task approval rating; require US location; ask demographics and political preferences.
39. Requester side of qualifications: create a new qualification or use an existing one; evaluate qualification submissions and grant or reject them.
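Creating a qualification type can also be scripted. A sketch with boto3; the qualification name, description, and test duration are hypothetical placeholders:

```python
def qualification_spec(name, description, test_xml=None, duration_s=600):
    """Build keyword arguments for create_qualification_type (pure helper).

    With no test XML, workers request the qualification and you grant or
    reject by hand; with a test, they answer it before working. All values
    here are illustrative, not from the slides.
    """
    spec = {
        "Name": name,
        "Description": description,
        "QualificationTypeStatus": "Active",
    }
    if test_xml is not None:
        spec["Test"] = test_xml                 # QuestionForm XML
        spec["TestDurationInSeconds"] = duration_s
    return spec

def create_screening_qual(test_xml):
    import boto3  # deferred import; actually running this needs AWS credentials
    mturk = boto3.client("mturk")
    return mturk.create_qualification_type(
        **qualification_spec("Demographics screen",
                             "Short demographics survey", test_xml))
```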
40. Checking for validity. Couldn't ask for verifiable information (as Kittur and Chi suggest) about the collection without affecting how subjects look at the list. Did have demographic info from the qualification. Randomly selected a question to repeat; removed people for gender changes, aging backwards, or major changes in political preferences.
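The repeated-question consistency check is easy to automate. A sketch; the field names and the one-year age tolerance are my assumptions, not the study's actual rules:

```python
def consistent(first, repeat, max_age_gain=1):
    """Compare a worker's original qualification answers against answers
    to a randomly repeated question. Field names are hypothetical.
    Flags gender changes, aging backwards, implausible aging, and
    changes in political preference.
    """
    if repeat["gender"] != first["gender"]:
        return False
    age_change = repeat["age"] - first["age"]
    if age_change < 0 or age_change > max_age_gain:
        return False
    if repeat["politics"] != first["politics"]:
        return False
    return True
```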
41. Total cost: $382 for 485 collection ratings. Had to pay more (~$12/hr) because only one task was available at a time, plus the required (unpaid) qualification.
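The arithmetic behind the effective hourly rate is simple; the four-minutes-per-rating figure below is my illustration, not a number from the slides:

```python
def hourly_rate(pay_per_task_usd, seconds_per_task):
    """Effective hourly wage implied by a per-task payment."""
    return pay_per_task_usd * 3600 / seconds_per_task

# $382 / 485 ratings is about $0.79 per rating; if a rating takes
# roughly four minutes, that works out to about $11.85/hour.
```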
43. Tools
- Web interface: WYSIWYG editor, CSV upload of tasks. Many task templates to use as starting points. Very simple and fast to use, but limited in capability.
- Command-line tools: required to create custom qualifications or to use multiple quals. Much more flexibility. Input format is XML. Documentation is adequate; the overall experience is clunky.
- Other libraries (e.g. http://developer.amazonwebservices.com/connect/entry.jspa?externalID=827&categoryID=85)
- 3rd-party tools: almost as easy to use as Amazon's web interface and support nearly all features of the command-line tools, but they take a cut. CrowdFlower (from Dolores Labs): crowdflower.com. Smartsheet: smartsheet.com/product/smartsourcing
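For the web interface's CSV upload, the task file is easy to generate. A sketch using Python's csv module; the `url` column name is hypothetical and must match the `${url}` placeholder in your task template:

```python
import csv
import io

def tasks_csv(urls):
    """Serialize one task input per row for the MTurk bulk-upload CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url"])        # header row names the template variable
    for url in urls:
        writer.writerow([url])
    return buf.getvalue()
```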
44. Human subjects? Human-subjects status varies with design. Categorizing content: not human subjects. Asking for reactions to content: human subjects. Informed consent: my preference has been to argue for a waiver of informed consent. (Mechanical Turk's terms of service prohibit collecting identifiable information.) If you have a task where you feel informed consent is appropriate, you can present extended consent information in a qualification; this works best for repetitive tasks.
45. Subject payment. mTurk handles all payment, but associate your account with the University of Michigan employer ID number, in case any one person earns more than the IRS reporting limit from all Michigan mTurk studies. Stacy Callahan or I have more information.
48. Built-in quals exist for location and reputation. Requesters can assign people to dummy qualifications to allow them to take follow-up studies, and you can email them through mTurk. You can also exclude people this way to maintain a virgin sample.
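Assigning a dummy qualification to past participants can be scripted. A sketch with boto3; the qualification type ID and the notification choice are placeholders:

```python
def needs_tag(worker_ids, already_tagged):
    """Workers not yet holding the dummy qualification (pure, testable)."""
    return sorted(set(worker_ids) - set(already_tagged))

def tag_participants(qual_type_id, worker_ids, already_tagged=()):
    import boto3  # deferred so needs_tag works without boto3 installed
    mturk = boto3.client("mturk")
    for worker_id in needs_tag(worker_ids, already_tagged):
        mturk.associate_qualification_with_worker(
            QualificationTypeId=qual_type_id,
            WorkerId=worker_id,
            IntegerValue=1,          # presence alone marks prior participation
            SendNotification=False)  # don't email workers just for bookkeeping
```

Requiring this qualification invites prior participants to a follow-up; requiring its absence keeps the sample virgin.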
49. Some references & resources
General
- Dolores Labs blog: http://blog.doloreslabs.com/
- Turker Nation forums: http://turkers.proboards.com
- 5 study how-tos from Markus Jakobsson (PARC): http://blogs.parc.com/blog/2009/07/experimenting-on-mechanical-turk-5-how-tos/
- Turker demographics survey by Panos Ipeirotis: http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
- Turker demographics vs. Internet demographics: http://behind-the-enemy-lines.blogspot.com/2009/03/turker-demographics-vs-internet.html
- Why do people participate: http://behind-the-enemy-lines.blogspot.com/2008/03/why-people-participate-on-mechanical.html
- Why do people participate (more): http://www.floozyspeak.com/blog/archives/2008/08/valley_of_the_t.html
50. Some references & resources
Improving answer quality
- Aniket Kittur, Ed H. Chi, and Bongwon Suh (2008). "Crowdsourcing user studies with Mechanical Turk," CHI 2008.
Answer quality and dealing with bad answers
- Carpenter, Bob (2008). Hierarchical Bayesian Models of Categorical Data.
- Raykar et al. (2009). Supervised Learning from Multiple Experts: Whom to Trust when Everyone Lies a Bit, ICML.
- Worker quality & HIT difficulty: http://behind-the-enemy-lines.blogspot.com/2008/08/mechanical-turk-worker-quality-and-hit.html
- Also see the literature on scoring a test without an answer key
51. Some references & resources
Turker effort, skills, participation rate, and pay
- W. Mason, D. Watts (2009). Financial Incentives and the Performance of Crowds. KDD Workshop on Human Computation.
- Self-report on skills: http://behind-the-enemy-lines.blogspot.com/2009/01/how-good-are-you-turker.html
Human subjects
- Consent in qualification tests: http://behind-the-enemy-lines.blogspot.com/2009/08/get-consent-form-for-irb-on-mturk-using.html
- Discussion: http://behind-the-enemy-lines.blogspot.com/2009/01/mechanical-turk-human-subjects-and-irbs.html
Editor's Notes
Tasks can be sorted by price or number of HITs available, among other things. To increase participation, you generally want to appear higher on at least one of these lists.
For this study, we wanted Conservative Republicans and Liberal Democrats, not people with neutral views, Liberal Republicans, or Conservative Democrats.
Restricting who can participate.
If the qualification is not automatically scored, it introduces an even bigger delay in the process, and you'll lose workers. But scoring it yourself allows a lot more control, and lets you retain Turker answer data.
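Scoring submissions yourself, as this note suggests, is a small amount of code. A sketch; the answer format and the 80% pass mark are assumptions:

```python
def score_submission(answers, answer_key, pass_mark=0.8):
    """Grade one qualification submission against your own answer key.

    Both arguments map question ids to responses. Returns the fraction
    correct and whether to grant the qualification; keeping `answers`
    around is what lets you retain the Turker's data for later analysis.
    """
    correct = sum(1 for q, expected in answer_key.items()
                  if answers.get(q) == expected)
    fraction = correct / len(answer_key)
    return fraction, fraction >= pass_mark
```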