1. Did you mean crowdsourcing for recommender systems?
OMAR ALONSO
6-OCT-2014
CROWDREC 2014
2. Disclaimer
The views and opinions expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
3. Outline
A bit on human computation
Crowdsourcing in information retrieval
Opportunities for recommender systems
6. Human-based computation
Use humans as processors in a distributed system
Address problems that computers aren't good at
Games with a purpose
Examples
◦ESP game
◦CAPTCHA
◦reCAPTCHA
7. Some definitions
Human computation is a computation that is performed by a human
Human computation system is a system that organizes human efforts to carry out computation
Crowdsourcing is a tool that a human computation system can use to distribute tasks.
E. Law and L. von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
8. HC at the core of RecSys
“In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” –Resnick and Varian (CACM 1997)
S. Perugini, M. Gonçalves, and E. Fox. “Recommender Systems Research: A Connection-Centric Survey”. J. Intell. Inf. Syst. 23(2): 107-143, 2004.
9. {where to go on vacation}
MTurk: 50 answers, $1.80
Quora: 2 answers
Y! Answers: 2 answers
FB: 1 answer
Tons of results
Read title + snippet + URL
Explore a few pages in detail
10. {where to go on vacation}
Countries
Cities
12. The rise of crowdsourcing in IR
Crowdsourcing is hot
Lots of interest in the research community
◦Articles showing good results
◦Journals special issues (IR, IEEE Internet Computing, etc.)
◦Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, VLDB, RecSys, CHI, etc.)
◦HCOMP
◦CrowdConf
Large companies using crowdsourcing
Big data
Start-ups
Venture capital investment
13. Why is this interesting?
Easy to prototype and test new experiments
Cheap and fast
No need to set up infrastructure
Introduce experimentation early in the cycle
In the context of IR, implement and experiment as you go
For new ideas, this is very helpful
14. Caveats and clarifications
Trust and reliability
Wisdom of the crowd, revisited
Adjust expectations
Crowdsourcing is another data point for your analysis
Complementary to other experiments
15. Why now?
The Web
Use humans as processors in a distributed system
Address problems that computers aren't good at
Scale
Reach
16. Motivating example: relevance judging
Relevance of search results is difficult to judge
◦Highly subjective
◦Expensive to measure
Professional editors commonly used
Potential benefits of crowdsourcing
◦Scalability (time and cost)
◦Diversity of judgments
M. Lease and O. Alonso. “Crowdsourcing for search evaluation and social-algorithmic search”. ACM SIGIR 2012 Tutorial.
17. Crowdsourcing and relevance evaluation
For relevance, crowdsourcing combines two main approaches
◦Explicit judgments
◦Automated metrics
Other features
◦Large scale
◦Inexpensive
◦Diversity
18. Development framework
Incremental approach
Measure, evaluate, and adjust as you go
Suitable for repeatable tasks
O. Alonso. “Implementing crowdsourcing-based relevance experimentation: an industrial perspective”. Information Retrieval, 16(2), 2013.
19. Asking questions
Ask the right questions
Part art, part science
Instructions are key
Workers may not be IR experts, so don't assume they share your terminology
Show examples
Hire a technical writer
◦Engineer writes the specification
◦Writer communicates
N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
20. UX design
Time to apply all those usability concepts
Experiment should be self-contained.
Keep it short and simple. Brief and concise.
Be very clear with the relevance task.
Engage with the worker. Avoid boring stuff.
Document presentation & design
Need to grab attention
Always ask for feedback (open-ended question) in an input box.
Localization
21. Other design principles
Text alignment
Legibility
Reading level: complexity of words and sentences
Attractiveness (worker’s attention & enjoyment)
Multi-cultural / multi-lingual
Who is the audience (e.g. target worker community)
Special needs communities (e.g. simple color blindness)
Cognitive load: mental rigor needed to perform task
Exposure effect
22. When to assess work quality?
Beforehand (prior to main task activity)
◦How: “qualification tests” or similar mechanism
◦Purpose: screening, selection, recruiting, training
During
◦How: assess labels as worker produces them
◦Like random checks on a manufacturing line
◦Purpose: calibrate, reward/penalize, weight
After
◦How: compute accuracy metrics post-hoc
◦Purpose: filter, calibrate, weight, retain (HR)
23. How do we measure work quality?
Compare worker’s label vs.
◦Known (correct, trusted) label
◦Other workers’ labels
◦Model predictions of workers and labels
Verify worker’s label
◦Yourself
◦Tiered approach (e.g. Find-Fix-Verify)
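A minimal sketch of the tiered Find-Fix-Verify pattern, with one crowd pass flagging a problem, a second proposing fixes, and a third voting; the `crowd` object and its methods are hypothetical stand-ins for whatever task-posting interface a platform provides:

```python
# Find-Fix-Verify as three small crowd passes. `crowd` is an assumed
# interface for posting tasks and collecting answers, not a real API.
def find_fix_verify(text, crowd):
    span = crowd.find("Mark an error in this text", text)        # Find
    rewrites = crowd.fix("Rewrite the marked span", text, span)  # Fix
    best = crowd.verify("Vote for the best rewrite", rewrites)   # Verify
    return text.replace(span, best)  # span is assumed to be a substring
```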
24. Comparing to known answers
AKA: gold, honey pot, verifiable answer, trap
Assumes you have known answers
Cost vs. Benefit
◦Producing known answers (experts?)
◦% of work spent reproducing them
Finer points
◦What if workers recognize the honey pots?
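As a sketch, comparing workers against honey pots might look like this; the data layout and the 0.8 accuracy threshold are assumptions for illustration:

```python
from collections import defaultdict

# judgments: (worker_id, item_id, label) tuples; gold: item_id -> trusted label.
def honey_pot_accuracy(judgments, gold):
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:                      # score only the known answers
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    return {w: hits[w] / seen[w] for w in seen}

# Keep workers above an accuracy threshold (0.8 is an arbitrary choice).
def trusted_workers(judgments, gold, threshold=0.8):
    return {w for w, acc in honey_pot_accuracy(judgments, gold).items()
            if acc >= threshold}
```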
25. Comparing to other workers
AKA: consensus, plurality, redundant labeling
Well-known metrics for measuring agreement
Cost vs. Benefit: % of work that is redundant
Finer points
◦Is consensus “truth” or systematic bias of group?
◦What if no one really knows what they’re doing?
◦Low agreement across workers indicates the problem is with the task (or a specific example), not the workers
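A minimal consensus sketch: aggregate redundant labels by majority vote and flag ties for escalation (labels and tie policy are illustrative):

```python
from collections import Counter

def majority_label(labels):
    """Return the consensus label, or None on a tie
    (escalate to an expert or collect more judgments)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

majority_label(["rel", "rel", "not_rel"])  # -> "rel"
majority_label(["rel", "not_rel"])         # -> None (tie)
```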
26. Methods for measuring agreement
What to look for
◦Agreement, reliability, validity
Inter-agreement level
◦Agreement between judges
◦Agreement between judges and the gold set
Some statistics
◦Percentage agreement
◦Cohen’s kappa (2 raters)
◦Fleiss’ kappa (any number of raters)
◦Krippendorff’s alpha
With majority vote, what if 2 say relevant, 3 say not?
◦Use expert to break ties
◦Collect more judgments as needed to reduce uncertainty
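A sketch of computing these statistics with off-the-shelf libraries (scikit-learn's cohen_kappa_score and statsmodels' fleiss_kappa); the toy labels are made up:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa

# Two raters judging the same five items (toy data).
rater1 = ["rel", "rel", "not_rel", "rel", "not_rel"]
rater2 = ["rel", "not_rel", "not_rel", "rel", "not_rel"]

# Percentage agreement: fraction of items on which the raters match.
pct = np.mean([a == b for a, b in zip(rater1, rater2)])

# Cohen's kappa corrects percentage agreement for chance (2 raters).
kappa = cohen_kappa_score(rater1, rater2)

# Fleiss' kappa handles any number of raters; it takes an
# items x categories table of label counts.
table = np.array([[2, 0], [1, 1], [0, 2], [2, 0], [0, 2]])
print(pct, kappa, fleiss_kappa(table))
```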
27. Pause
Crowdsourcing works
◦Fast turnaround, easy to experiment, a few dollars to test
◦But: experiments must be designed carefully, and quality and platform limitations need attention
Crowdsourcing in production
◦Large scale data sets (millions of labels)
◦Continuous execution
◦Difficult to debug
Multiple contingent factors
How do you know the experiment is working?
Goal: framework for ensuring reliability on crowdsourcing tasks
O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.
28. Labeling tweets – an example of a task
Is this tweet interesting?
Subjective activity
Not focused on specific events
Findings
◦Difficult problem, low inter-rater agreement (Fleiss' k, Krippendorff's alpha)
◦Tested many designs, number of workers, platforms (MTurk and others)
Multiple contingent factors
◦Worker performance
◦Work
◦Task design
O. Alonso, C. Marshall, and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
29. Designs that include in-task CAPTCHA
Borrowed idea from reCAPTCHA -> use of control term
Adapt your labeling task
2 more questions as controls (sketched below)
◦1 algorithmic
◦1 semantic
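A sketch of how the in-task controls might gate the main label; the record fields are hypothetical:

```python
# Each record holds the worker's main label plus answers to two control
# questions: one algorithmic (machine-checkable, e.g. "does the tweet
# contain a URL?") and one semantic (requires actually reading the tweet).
def passed_controls(record):
    return (record["algo_answer"] == record["algo_expected"] and
            record["semantic_answer"] == record["semantic_expected"])

def usable_labels(records):
    # Discard the main label whenever either in-task control fails.
    return [r["label"] for r in records if passed_controls(r)]
```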
30. Production example #1
[Screenshot of the task design: the main question about the tweet, an in-task captcha, and the tweet de-branded]
Q1 (k = 0.91, alpha = 0.91)
Q2 (k = 0.771, alpha = 0.771)
Q3 (k = 0.033, alpha = 0.035)
31. Production example #2
[Screenshot of the task design: in-task captcha, tweet de-branded, and Q3 broken down by categories to get a better signal]
Q1 (k = 0.907, alpha = 0.907)
Q2 (k = 0.728, alpha = 0.728)
Q3 by category:
◦Worthless (alpha = 0.033)
◦Trivial (alpha = 0.043)
◦Funny (alpha = -0.016)
◦Makes me curious (alpha = 0.026)
◦Contains useful info (alpha = 0.048)
◦Important news (alpha = 0.207)
32. Findings from designs
No quality control issues
Eliminating workers who did a poor job on question #1 didn't affect inter-rater agreement for questions #2 and #3.
Interestingness is a fully subjective notion
We can still build a classifier that identifies tweets that are interesting to a majority of users
33. Careful with That Data, Eugene
In the area of big data and machine learning:
◦labels -> features -> predictive model -> optimization
Labeling/experimentation perceived as boring
Don’t rush labeling
◦Human and machine
Label quality is very important
◦Don’t outsource it
◦Own it end to end
◦Large scale
34. More on label quality
Data gathering is not a free lunch
You can’t outsource label acquisition and quality
Labels for the machine != labels for humans
Emphasis on algorithms, models/optimizations and mining from labels
Not so much on algorithms for ensuring high quality labels
Training sets
35. People are more than HPUs
Why is Facebook popular? People are social.
Information needs are contextually grounded in our social experiences and social networks
Our social networks also embody additional knowledge about us, our needs, and the world
We relate to recommendations
The social dimension complements computation
37. Humans in the loop
Computation loops that mix humans and machines
A kind of active learning
Double goal:
◦Human checking on the machine
◦Machine checking on humans
Example: classifiers for social data
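A minimal human-in-the-loop sketch, assuming a scikit-learn-style classifier and an ask_crowd callback for routing items to workers (both are assumptions, not a specific system):

```python
def label_with_humans(model, items, ask_crowd, threshold=0.9):
    """Machine labels what it is confident about; uncertain items are
    routed to the crowd, and the crowd's answers can feed retraining."""
    auto, human = [], []
    for x in items:
        probs = model.predict_proba([x])[0]
        if probs.max() >= threshold:
            auto.append((x, int(probs.argmax())))  # machine keeps it
        else:
            human.append((x, ask_crowd(x)))        # uncertain: ask the crowd
    return auto, human  # `human` pairs double as new training data
```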
39. What’s in a label?
Clicks, reviews, ratings, etc.
Better or novel systems if we focus more on label quality?
New ways of collecting data
Training sets
Evaluation & measurement
40. Routing
Expertise detection and routing
Social load balancing
When to switch between machines and humans
41. Conclusions
Crowdsourcing at scale works but requires a solid framework
Three aspects that need attention: workers, work and task design
Labeling social data is hard
Traditional IR approaches don’t seem to work for Twitter data
Label quality
Outlined areas where RecSys can benefit from crowdsourcing