My talk from Carnegie Mellon's HCII Seminar on April 24, 2013.
Abstract:
On some social media platforms, such as Twitter, YouTube, Pinterest, and Tumblr, much of the content generated by users is publicly accessible, and communication can easily be initiated between strangers. The communities that have risen up around these platforms, particularly on Twitter, can also be inclusive and supportive of interactions between strangers. The public and open nature of these communities creates an opportunity to build a new kind of crowdsourcing system that identifies individuals who may be good candidates to complete various tasks based on their published content. We explore the potential of such a system through several information collection tasks, examining the response rate and information quality that can be obtained. We also explore a means of leveraging users' previous social media content to predict their likelihood of response and optimize our system's collection behavior. At IBM Research - Almaden, we are now looking to extend these ideas to additional domains, including proactive and reactive customer support and precision marketing campaigns.
1. Engaging with Users on Public Social Media
Jeffrey Nichols
IBM Research – Almaden
jwnichols@us.ibm.com
2. IBM Research – Almaden
• 400+ research employees; 100+ students and post-docs
• Research in Computer Science, Storage Systems, Science and Technology, Services Science
• User-Focused Systems in CS
3. The Buzz of the Crowd
People are generating 1+ billion status updates every day.*
Topics covered in status updates are highly diverse:
• Weather, traffic, and other day-to-day annoyances
• Experiences with products
• Reaction to events
How can we leverage this buzz to do something useful?
* 1/2 billion updates every day on Twitter as of October 2012
4. Challenge: The Information Iceberg
Information revealed through status updates is just the tip of the iceberg; much useful information known to members of the social network sits below the surface. GOAL: leverage the information above the surface to reach the information below it.
5. Example #1: Learning more about customer incidents to improve service
• What happened?
• Was it something in particular about this store?
• Could other people have the same experience?
• How can we make things right?
This information could be used to improve the customer experience.
6. Example #2: Tracking crime to improve reporting and better allocate resources
• Where was it stolen?
• Was a report filed with police?
Over time, this information could suggest how to allocate officers or funds to different areas of the city.
7. Example #3: Tracking wait times at airport security checkpoints
• How long did it take to get through security?
This information could be used by the security agency (TSA) to identify problem spots and allocate officers. It could also be used by consumers to plan their air travel.
Note that identifying the people with this information is somewhat indirect: their updates don't say they went through security, only that they are at an airport, which makes it likely they know the wait time.
8. Uses for Engagement on Social Media
The ability to actively identify and engage with the right people at the right time on social media can empower an organization to:
• Collect just-in-time information from users
• Disseminate important information (broadcast or targeted)
• Motivate users to perform a task
• Seize timely business opportunities (e.g., cross- or up-selling)
11. Where might this be helpful?
• Questions that have spatial and/or temporal specificity (e.g., about an event)
• Questions for which there might be a diversity of opinion
• More?
12. Other Advantages
• Information is easier to extract from responses because the question is known
• Sample range can be controlled by asking questions of users with a variety of different profiles
• No waiting needed… questions can be asked in real time
• Potential answerers can be primed with the question before they have the answer
13. How feasible is this approach?
• Will people answer questions from strangers?
• Will use of an incentive increase responses?
• What is the quality of the answers?
14. Concrete Prototype: TSA Tracker
Crowdsourcing airport security wait times through Twitter
Step #1. Watch for people tweeting about being at an airport
Step #2. Ask nicely if they would share their wait time to help others
Step #3. Collect responses and share relevant data on a web site
Step #4. Say thank you!
Key Question: Will people respond to questions from strangers?
http://tsatracker.org/
@tsatracker, @tsatracking
15. Questions
From @tsatracker (includes incentive):
“If you went through security at <airport code>, can you reply with your wait time? Info will be used to help other travelers”
From @tsatracking (no incentive):
“If you went through security at <airport code>, can you reply with your wait time?”
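To make the protocol concrete, here is a minimal Python sketch of steps 1-2 combined with these question templates. It assumes tweets arrive as plain dicts with "user" and "text" fields; the airport list and the word-boundary matching rule are illustrative placeholders, not TSA Tracker's actual logic.

```python
# Minimal sketch of steps 1-2: watch for airport mentions and build the
# wait-time question. Tweets are assumed to be dicts with "text" and
# "user" fields; airport list and matching rule are illustrative only.
import re

AIRPORT_CODES = ["SFO", "SJC", "OAK", "LAX", "JFK"]  # hypothetical subset

INCENTIVE_Q = ("@{user} If you went through security at {code}, can you "
               "reply with your wait time? Info will be used to help "
               "other travelers")
PLAIN_Q = ("@{user} If you went through security at {code}, can you "
           "reply with your wait time?")

def find_airport(text):
    """Return the first airport code mentioned as a standalone word."""
    for code in AIRPORT_CODES:
        if re.search(r"\b" + code + r"\b", text, re.IGNORECASE):
            return code
    return None

def make_question(tweet, incentive=True):
    """Build the question tweet for one candidate user, or None."""
    code = find_airport(tweet["text"])
    if code is None:
        return None
    template = INCENTIVE_Q if incentive else PLAIN_Q
    return template.format(user=tweet["user"], code=code)

# Example: a tweet that would trigger the incentive question.
print(make_question({"user": "traveler1", "text": "Stuck in line at SFO"}))
```

In the real system, the no-incentive variant was sent from the separate @tsatracking account.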
16. Concrete Prototype: Product Reviews
Step 1. Identify owners of a product
Step 2. Ask a focused question about the product
• How is the image quality?
• Does it take good low light pictures?
• How quickly does it take a picture after pressing the shutter button?
• How durable is it?
• What accessories are must-haves?
• Etc.
Steps 3-4. Ask more questions if the user responds (sketched below)
Step 5. Visualize results as a structured product review (future work)
Key Questions:
Will people respond to questions in this different domain?
Will people respond to follow-up questions at the same rate?
Do responses contain useful & accurate information?
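A sketch of the follow-up loop in steps 2-4: track which questions each user has already been asked and pick another when a reply arrives. The question bank is taken from the slide; the per-user bookkeeping and the purely random choice are assumptions standing in for the study's semi-random selection (described on slide 52).

```python
# Hedged sketch of steps 2-4: keep a per-user record of asked questions
# and, when a reply arrives, pick the next unasked one. The selection
# here is plain random; the study's choice was "semi-random" and also
# considered tweet content, which is not modeled in this sketch.
import random

QUESTION_BANK = [
    "How is the image quality?",
    "Does it take good low light pictures?",
    "How quickly does it take a picture after pressing the shutter button?",
    "How durable is it?",
    "What accessories are must haves?",
]

asked = {}  # user handle -> set of questions already asked

def next_question(user):
    """Return a not-yet-asked question for this user, or None when done."""
    remaining = [q for q in QUESTION_BANK
                 if q not in asked.setdefault(user, set())]
    if not remaining:
        return None
    question = random.choice(remaining)
    asked[user].add(question)
    return question

print(next_question("owner42"))  # first question for a hypothetical owner
print(next_question("owner42"))  # a different follow-up on their reply
```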
17. Product Review Scenarios
Samsung Galaxy Tab 10.1
• Popular consumer electronics product at the time of the study (we didn't want to use the iPad)
• Compared to reviews from Amazon.com
L.A.-area Food Trucks
• Vibrant scene in which Twitter is a primary means of communication
• Food trucks usually identified in tweets by @handle
• Compared to reviews from Yelp.com
19. Quality Evaluation Methods
Human Coding
• Twitter responses & Traditional Reviews
• Relevance of response
• Information Types
Information Entropy
• Comparison between Twitter/Amazon, Twitter/Yelp
Mechanical Turk Questionnaire
• Usefulness, Objectiveness, Trustworthiness, Balance, Readability
21. Suspended!
• @tsatracking account (no-incentive condition) given a 1-week suspension after asking 150 questions
• Did not violate Twitter Terms of Use
• Exceeded threshold for blocks or messages marked as spam
• Neither of our other accounts was suspended
22. Results
Key Question: Will people respond to questions from strangers?
Answer:
• 42% response rate
• 44% of answers received in 30 mins
• No significant difference between any conditions (taking the suspension into account)
23. Follow-up Question Results
• Significant differences between all 4 questions (H=50.12, df=3, p < 0.0001, Kruskal-Wallis) and among just the 3 follow-ups (H=25.46, df=2, p < 0.0001, Kruskal-Wallis); see the sketch below
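For readers who want to reproduce this kind of test, here is a sketch with SciPy. The per-question arrays are fabricated placeholders, since the slide shows only the resulting statistics (H, df, p), not the underlying per-response data.

```python
# Hedged sketch of the slide's analysis shape: a Kruskal-Wallis H-test
# across the four questions, and across just the three follow-ups.
# The arrays below are placeholder scores, not the study's data.
from scipy.stats import kruskal

q1 = [5, 3, 4, 5, 2, 4]   # placeholder per-response values for question 1
q2 = [2, 1, 3, 2, 2, 1]
q3 = [4, 4, 5, 3, 4, 5]
q4 = [1, 2, 1, 1, 3, 2]

H_all, p_all = kruskal(q1, q2, q3, q4)   # all 4 questions (df = 3)
H_fup, p_fup = kruskal(q2, q3, q4)       # just the 3 follow-ups (df = 2)
print(f"all: H={H_all:.2f}, p={p_all:.4f}")
print(f"follow-ups: H={H_fup:.2f}, p={p_fup:.4f}")
```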
24. Qualitative Results
• @tsatracker account picked up 16 followers
• Many positive responses (“this will be great for travelers”)
• Only one slightly negative response (“this is creepy”), but that person also gave an answer
26. Information Entropy
The Twitter method is dependent on the questions asked:
• Despite trying to base our questions on the contents of Amazon reviews, those reviews still contained more information.
• Our food truck questions went beyond Yelp reviews.

                                   Tablet            Food Truck
                               Amazon  Twitter      Yelp  Twitter
All Information (bits)*          4.25     3.76      3.27     4.24
Information In Both Sets (bits)* 4.09     3.73      3.27     3.02

* Calculated using a shrinkage entropy estimator
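The slide doesn't name the specific shrinkage estimator. One common choice is the James-Stein shrinkage estimator of Hausser & Strimmer, sketched below; whether the study used exactly this variant is an assumption.

```python
# James-Stein shrinkage estimate of entropy (in bits) from counts of
# observed categories, per Hausser & Strimmer. Whether the study used
# this exact estimator is an assumption; the counts are placeholders.
import numpy as np

def shrinkage_entropy_bits(counts):
    """Entropy estimate that shrinks frequencies toward uniform."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = len(counts)
    theta_ml = counts / n               # maximum-likelihood frequencies
    target = np.full(p, 1.0 / p)        # shrinkage target: uniform
    denom = (n - 1) * np.sum((target - theta_ml) ** 2)
    lam = 1.0 if denom == 0 else (1.0 - np.sum(theta_ml ** 2)) / denom
    lam = min(1.0, max(0.0, lam))       # clip shrinkage intensity to [0, 1]
    theta = lam * target + (1.0 - lam) * theta_ml
    nz = theta > 0
    return -np.sum(theta[nz] * np.log2(theta[nz]))

# Example: counts of information types observed across a set of reviews.
print(round(shrinkage_entropy_bits([12, 7, 5, 3, 1]), 2))
```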
27. Mechanical Turk Evaluation

Tablet             Amazon  Twitter  Mann-Whitney U      p
Usefulness           3.19     2.64           868.5  0.006
Objectiveness        2.94     2.53           814.5  0.042
Trustworthiness      2.94     2.39           861.0  0.008
Balance              3.00     2.11           936.0  0.001
Readability          2.92     2.61           741.5  0.270

Food Truck           Yelp   Twitter  Mann-Whitney U      p
Usefulness           2.86     2.56           734.0  0.309
Objectiveness        2.17     2.08           672.0  0.783
Trustworthiness      2.58     2.14           800.5  0.071
Balance              2.47     1.72           921.0  0.002
Readability          2.89     2.11           896.0  0.004

Completion Times
• Tablet: 26.5 minutes for Amazon, 25.8 minutes for Twitter
• Food Truck: 19.9 minutes for Yelp, 16.8 minutes for Twitter

Explanation of Results
• Few concrete examples of experiences in Twitter answers
• Limited information about Twitter reviewers
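Each row in the tables above is a Mann-Whitney U comparison between the two sources on one rating dimension. A sketch of one such comparison with SciPy, using placeholder ratings rather than the study's actual Turk responses:

```python
# Hedged sketch of one row of the tables above: a two-sided Mann-Whitney
# U test between Amazon and Twitter usefulness ratings. The rating
# vectors are invented placeholders, not the study's data.
from scipy.stats import mannwhitneyu

amazon_usefulness = [4, 3, 3, 4, 2, 3, 4, 3]    # placeholder Turk ratings
twitter_usefulness = [3, 2, 3, 2, 3, 2, 3, 3]

U, p = mannwhitneyu(amazon_usefulness, twitter_usefulness,
                    alternative="two-sided")
print(f"U={U:.1f}, p={p:.3f}")
```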
28. Conclusions
• Response rates seem to be around 40-45%, independent of domain
• Providing an explanation or incentive does not seem to affect response
• Answer quality is fairly high, at 70-80%
• Quality seems to be tied to targeting accuracy; most “bad” answers come from people who didn't know the answer to our question
32. Problem
It's difficult to identify relevant tweets from keywords alone: keyword filters end up either underspecified or overspecified. Bridging the gap with regular expressions and rules can take hours or days of authoring by a human expert.
33. Use Crowd to Create Intelligent Filters
1. Collect a sample of relevant tweets (keyword filter)
2. Collect ground-truth filter results from the crowd on Mechanical Turk
3. Machine-learn a filter model using SPSS Modeler
4. Use the model to filter tweets in real time
5. Social media dashboard users can react faster and more accurately
Each filter requires a few hours and ~$35 to create (see the sketch below).
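A sketch of steps 2-4, with scikit-learn standing in for SPSS Modeler (a deliberate substitution for illustration, since Modeler is a GUI tool). The crowd-labeled tweets are invented placeholders.

```python
# Sketch of steps 2-4 with scikit-learn in place of SPSS Modeler.
# Crowd workers supply a relevant / not-relevant label per sampled
# tweet; a text classifier then filters the live stream.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 2 output: crowd-labeled ground truth (placeholder examples).
tweets = [
    "delta lost my bag again, worst airline",
    "flying delta to see my grandma next week!",
    "the delta of this function is zero",
    "delta customer service kept me on hold for an hour",
]
labels = [1, 1, 0, 1]  # 1 = relevant to Delta customer service

# Step 3: learn the filter model from the labeled sample.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(tweets, labels)

# Step 4: apply the model to new tweets in real time.
print(model.predict(["delta gate agent was super helpful"]))
```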
34. Evaluation
Scenarios
• Customer service for Delta Airlines & Hertz Rent-a-car
• Relevance filter + Opinion filter
Evaluation Questions
• Quality of Crowd-Labeled Ground Truth
• Effectiveness of Filter Algorithms
• Usefulness to Users in Filtering Tasks
38. Baseline: Human Judgement
How well can humans identify “willingness” and “readiness”?
Two surveys on CrowdFlower:
• Willingness: Asked each participant to predict if a displayed Twitter user would be willing to respond to a given question, assuming that the user has the ability to answer
• Readiness: Asked each participant to predict how soon (e.g., 1 hour, 1 day) the person would respond, assuming that s/he is willing to respond
100 participants for the first survey and 50 for the second
39. Willingness Result
• 29% correct when only a user's tweets were displayed
• 38% correct when the complete Twitter profile was displayed
• Selecting users for question asking is also difficult for the crowd
40. Readiness Result
• 58% correct, compared with the ground truth
• For example, if a participant predicted that person X would respond within an hour, but the response was not received in that time, the prediction was counted as incorrect
41. Features for Machine Selection
• Responsiveness (e.g., mean response time to other users' mentions)
• Profile (e.g., use of particular words in the profile description)
• Activity (e.g., number of tweets)
• Readiness (e.g., percentage of tweets occurring at each hour of the day)
• Personality
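A sketch of how a few of these feature families might be computed from a user's recent tweets. The record fields ("created_at", "in_reply_to") and the reply-ratio responsiveness proxy are assumptions for illustration, not the study's actual schema.

```python
# Hedged sketch of activity, readiness, responsiveness, and profile
# features for one Twitter user. Field names are hypothetical.
from collections import Counter
from datetime import datetime

def user_features(tweets, profile_text):
    """Compute a small feature dict from a user's tweets and bio."""
    hours = Counter(t["created_at"].hour for t in tweets)
    return {
        # Activity: sheer volume of tweeting.
        "num_tweets": len(tweets),
        # Readiness: fraction of tweets in each hour of the day.
        **{f"hour_{h}": hours[h] / len(tweets) for h in range(24)},
        # Responsiveness proxy: share of tweets that are replies.
        "reply_ratio": sum(1 for t in tweets if t["in_reply_to"]) / len(tweets),
        # Profile: presence of particular words in the bio.
        "bio_mentions_travel": int("travel" in profile_text.lower()),
    }

sample = [{"created_at": datetime(2013, 4, 1, 9), "in_reply_to": "someone"},
          {"created_at": datetime(2013, 4, 1, 18), "in_reply_to": None}]
print(user_features(sample, "Frequent flyer and travel blogger")["reply_ratio"])
```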
43. Feature Analysis
Significant Features
• For the TSA-tracker-1 dataset, we found 42 significant features (FDR 2.8%)
• For the Product dataset, we found 31 significant features (FDR 4.2%)
• For the TSA-tracker-2 dataset, we found 11 significant features (FDR 11.2%)
Top-4 Discriminative Features
• The top-4 features were found through extensive experiments.
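The slide reports a false discovery rate per dataset. One standard way to control FDR across many per-feature significance tests is Benjamini-Hochberg correction; whether the study used this exact procedure is an assumption, and the p-values below are placeholders.

```python
# Hedged sketch of FDR control over per-feature p-values using the
# Benjamini-Hochberg procedure via statsmodels. P-values are invented.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.004, 0.019, 0.03, 0.21, 0.47]  # placeholder p-values
significant, p_adjusted, _, _ = multipletests(pvals, alpha=0.05,
                                              method="fdr_bh")
print(list(zip(significant, p_adjusted.round(3))))
```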
45. Live Experiment
Method
• Used Twitter's Search API and a set of rules to find 500 users who mentioned an airport and 500 who mentioned the product
• Randomly asked 100 users for the security wait time
• Used our algorithm to identify 100 users for questioning from the remaining 400 users
47. Engagement Continuum (manual → assisted → automatic)
Manual: humans do all the work
• Keyword filtering
• Unstructured engagement
• Domain-independent analytics
Assisted: analytics streamline decisions (“press button to engage”), as in System U
• Scenario-based filtering
• Smart engagement recommendations (e.g., based on location inference)
• Customizable engagement scenarios
• Domain-specific analytics
Automatic: system-driven engagement
• Rule-based engagement
• Exception identification and notification
• Intelligent transition to human-driven engagement as desired
49. To wrap up…
• Interaction on social media enables a variety of applications
• Collecting information using this approach is feasible and produces quality information
• Targeting can be improved flexibly through crowd-assisted filtering
• Likely responders can be identified from their social media content
52. Samsung Galaxy Tab 10.1
Questions
• 2 iterations
• First-round questions based on CNET and Engadget editor reviews
• Second round modified based on the top 10 user reviews of tablets on Amazon.com (removed a clarification phrase, added some additional content questions)
Procedure
• Identified users from the real-time Twitter stream via keywords followed by manual human inspection
• Questions chosen semi-randomly based on the content of the tweet and the answers received so far
Round #2 Questions
54. Los Angeles Food Trucks
Questions
• Based on our own intuitions about what information would be interesting
Procedure
• Identified users from the real-time Twitter stream via food trucks' @handles followed by manual human inspection
• Asked questions for the 90 LA food trucks active at the time of the study
• Most traffic was concentrated around just three (Kogi Taco, Grilled Cheese, and GrillEmAll), and we report results only for those
56. Example: Real-time Viewer Insight
• Real-time collection of relevant users
• Historical social media builds a comprehensive user profile: rule-based facts (e.g., lives in Chicago, IL; loves Deception on NBC) and deep traits from psycholinguistic analysis
• Directed engagement to learn more: collect opinions about a new show, market a new product, etc.