1. Crisis Computing
Finding relevant and credible information on social
media during disasters
Big Data Analytics Conference
Delhi, India, December 2014
8. 8
Carlos Castillo – chato@acm.org
http://www.chato.cl/research/
An earthquake hits a Twitter user
• When an earthquake strikes, the first tweets are
posted 20-30 seconds later
• Damaging seismic waves travel at 3-5 km/s, while
network communications travel at near light speed over
fiber/copper, plus latency
• Beyond ~100 km from the epicenter, seismic waves may be
overtaken by tweets about them
http://xkcd.com/723/
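The ~100 km figure follows from simple arithmetic. Below is a minimal sketch, assuming that network propagation time does not depend on distance and is small next to the posting delay. The function name and the optional latency parameter are illustrative and not from the talk.

```python
def overtake_distance_km(wave_speed_km_s: float, tweet_delay_s: float,
                         network_latency_s: float = 0.0) -> float:
    """Distance from the epicenter beyond which a tweet arrives before the wave.

    The wave needs distance / speed seconds to arrive. The tweet needs the
    posting delay plus network latency (assumed independent of distance).
    """
    return wave_speed_km_s * (tweet_delay_s + network_latency_s)


# Range implied by the figures on the slide: 3-5 km/s and 20-30 s.
nearest = overtake_distance_km(3.0, 20.0)   # slow waves, fast first tweet
farthest = overtake_distance_km(5.0, 30.0)  # fast waves, slow first tweet
print(f"Tweets can overtake the waves beyond roughly {nearest:.0f}-{farthest:.0f} km")
```

Under these assumptions the slide's ~100 km sits inside the resulting 60-150 km range.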
10. Alexandra Olteanu, Sarah Vieweg and Carlos Castillo: What to Expect When the
Unexpected Happens: Social Media Communications Across Crises.
To appear in CSCW 2015.
Examples of crisis tweets (cont.)
12. Fertile grounds for applied research
✔ Problems of global significance
✔ Solved with labor-intensive methods
✔ Better solution provides a public good
✔ Large and noisy data sets available
✔ Engage volunteer communities
• Relevance to practitioners?
13. Current collaborators
• Patrick Meier (QCRI)
• Sarah Vieweg (QCRI)
• Muhammad Imran (QCRI)
• Irina Temnikova (QCRI)
• Alexandra Olteanu (EPFL)
• Aditi Gupta (IIIT Delhi)
• “P.K.” Kumaraguru (IIIT Delhi)
• Fernando Diaz (Microsoft)
15. Crisis maps from social media
Carlos Castillo, Fernando Diaz, and Hemant Purohit:
Leveraging Social Media and Web of Data to Assist Crisis Response Coordination
Tutorial at SDM, Philadelphia, PA, USA. April 2014.
Hemant Purohit, Carlos Castillo, Patrick Meier and Amit Sheth:
Crisis Mapping, Citizen Sensing and Social Media Analytics
Tutorial at ICWSM, May 2013.
20. Patrick Meier, Social Innovation Director @ QCRI – http://irevolution.net/
“What can speed humanitarian response to tsunami-ravaged coasts? Expose human rights atrocities? Launch helicopters to rescue earthquake victims? Outwit corrupt regimes? A map.”
28. Understanding Crisis Tweets
Alexandra Olteanu, Sarah Vieweg and Carlos Castillo: What to Expect When the
Unexpected Happens: Social Media Communications Across Crises.
To appear in CSCW 2015.
29. Types of Disaster
31. Filtering
• Is disaster-related? Yes / No
• Contributes to situational awareness? Yes / No
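The two filtering questions can be read as a cascade in which a tweet is kept only if every stage answers "yes". The sketch below is illustrative only. The keyword lists and function names are hypothetical placeholders, and a real system would more likely use trained classifiers for each stage.

```python
from typing import Callable, Iterable, List

Predicate = Callable[[str], bool]

# Hypothetical keyword lists, for illustration only.
DISASTER_TERMS = {"earthquake", "flood", "tornado", "hurricane", "evacuate"}
AWARENESS_TERMS = {"watch", "warning", "closed", "damaged", "injured", "shelter"}


def _words(text: str) -> set:
    """Lowercased words with surrounding punctuation and #/@ stripped."""
    return {w.strip(".,!?#@:").lower() for w in text.split()}


def is_disaster_related(text: str) -> bool:
    """Placeholder for the first question in the filter."""
    return bool(_words(text) & DISASTER_TERMS)


def contributes_to_awareness(text: str) -> bool:
    """Placeholder for the second question in the filter."""
    return bool(_words(text) & AWARENESS_TERMS)


def filter_tweets(tweets: Iterable[str], stages: List[Predicate]) -> List[str]:
    """Keep a tweet only if every stage says yes; stops at the first 'no'."""
    return [text for text in tweets if all(stage(text) for stage in stages)]


STAGES = [is_disaster_related, contributes_to_awareness]
```

Ordering the cheaper or more selective stage first keeps later stages from running on tweets that have already been rejected.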
32. Classification
Filtered tweets are classified along two dimensions:
• Information provided: Caution & Advice, Information Sources, Damage & Casualties, Donations, ...
• Information source: Gov, Eyewitness, Media, NGO, Outsider, ...
33. A large-scale study of crisis tweets
• Collect tweets from 26 disasters
• Classify according to:
– Informative / Not informative
– Information provided
– Information source
34. Advice on labeling
• Your instructions will never be correct the first
time you try
– e.g. personal / eyewitness
– Instructions must be re-written reactively
– Perform small-scale labeling first
• Instructions must be concrete and brief
– If you can't make them so, the task has to be divided
35. Information Provided in Crisis Tweets
N=26; Data available at http://crisislex.org/
36. What do people tweet about?
• Affected individuals
– 20% on average (min. 5%, max. 57%)
– most prevalent in human-induced, focalized & instantaneous events
• Sympathy and emotional support
– 20% on average (min. 3%, max. 52%)
– most prevalent in instantaneous events
• Other useful information
– 32% on average (min. 7%, max. 59%)
– least prevalent in diffused events
37. What do people tweet about? (cont.)
• Infrastructure and utilities
– 7% on average (min. 0%, max. 22%)
– most prevalent in diffused events, in particular floods
• Caution and advice
– 10% on average (min. 0%, max. 34%)
– least prevalent in instantaneous & human-induced events
• Donations and volunteering
– 10% on average (min. 0%, max. 44%)
– most prevalent in natural hazards
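Figures such as "10% on average (min. 0%, max. 34%)" summarize one category's share across the 26 events. The sketch below shows one way such a summary could be computed. It assumes an unweighted average over events (each event counts equally regardless of size), which the slides do not state explicitly. The function names and the event data in the usage note are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple


def category_shares(labels: List[str]) -> Dict[str, float]:
    """Fraction of one event's labeled tweets in each category.

    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.items()}


def summarize(events: Dict[str, List[str]], category: str) -> Tuple[float, float, float]:
    """Average, minimum and maximum share of `category` across events."""
    shares = [category_shares(labels).get(category, 0.0)
              for labels in events.values()]
    return mean(shares), min(shares), max(shares)
```

For example, with two made-up events where "caution" accounts for 1 of 4 and 2 of 4 labeled tweets, `summarize(events, "caution")` gives an average of 0.375, a minimum of 0.25 and a maximum of 0.5.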
40. Extracting information and matching
emergency-related resources
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier:
Extracting Information Nuggets from Disaster-Related Messages in Social Media
In ISCRAM. Baden-Baden, Germany, 2013. Best paper award.
Hemant Purohit, Amit Sheth, Carlos Castillo, Patrick Meier, Fernando Diaz:
Emergency-Relief Coordination on Social Media: Automatically Matching Resource Requests and Offers
First Monday 19 (1), January 2014
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier:
Practical Extraction of Disaster-Relevant Information from Social Media
In SWDM. Rio de Janeiro, Brazil, 2013
41. Information Extraction
[Diagram: classified tweets feed the extraction step]
@JimFreund: Apparently we have no choice. There is a tornado watch in effect tonight.
42. Extraction
• #hashtags, @user mentions, URLs, etc.
– Regular expressions
– Text library from Twitter
• Temporal expressions
– Part-of-speech tagger + heuristics
– Natty library
• Supervised learning
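The first bullet (regular expressions for hashtags, mentions and URLs) can be sketched as follows. The patterns are deliberately simplified assumptions and will misfire on edge cases; for instance, an e-mail address would yield a spurious mention. Twitter's own text library, mentioned on the slide, handles many more cases.

```python
import re
from typing import Dict, List

# Simplified patterns; illustrative only, not Twitter's official grammar.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Pull hashtags, user mentions and URLs out of a tweet."""
    urls = URL.findall(text)
    # Blank out URLs first so a '#fragment' inside a URL is not
    # mistaken for a hashtag.
    remainder = URL.sub(" ", text)
    return {
        "hashtags": HASHTAG.findall(remainder),
        "mentions": MENTION.findall(remainder),
        "urls": urls,
    }
```

Removing URLs before matching hashtags is a small design choice that avoids one common false positive without complicating the patterns themselves.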
43. Labels for extraction
• Type-dependent instruction
• Ask evaluators to copy-paste a word/phrase from
each tweet
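A copy-pasted phrase has to become per-token labels before a sequence model can learn from it. The talk does not say which encoding was used, so the sketch below assumes the common B/I/O scheme, whitespace tokenization and a generic label name `INFO`; all three are illustrative choices.

```python
from typing import List, Tuple


def bio_tags(tweet: str, phrase: str, label: str = "INFO") -> List[Tuple[str, str]]:
    """Tag the first occurrence of `phrase` in `tweet` with B-/I- labels.

    Tokens outside the phrase get "O". Matching is exact on
    whitespace-separated tokens, so punctuation attached to a word
    (e.g. "tonight.") prevents a match with "tonight".
    """
    tokens = tweet.split()
    target = phrase.split()
    tags = ["O"] * len(tokens)
    n = len(target)
    if n > 0:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == target:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
                break  # only the first occurrence is tagged
    return list(zip(tokens, tags))
```

Asking evaluators to copy-paste rather than retype keeps the phrase alignable with the tweet text, which is what makes this conversion possible.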
44. Learning: Conditional Random Fields
• Used extensively in NLP for part-of-speech tagging
and information extraction
• Representation of observations is important
(capitalization, position, etc.)
[Figure: graphical models of an HMM and a linear-chain CRF, showing hidden and observed variables]
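The point about representing observations can be made concrete with a per-token feature function. This is a sketch of the kind of features (capitalization, position, neighbors) a linear-chain CRF might consume. The exact feature set used in the work is not given in the slides, so every feature name here is an assumption.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe token i of a tweet for a sequence labeler."""
    token = tokens[i]
    features: Dict[str, object] = {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_hashtag": token.startswith("#"),
        "is_mention": token.startswith("@"),
        "position": i,
        "relative_position": i / len(tokens),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighboring words give the model local context.
    features["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<START>"
    features["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"
    return features
```

Unlike an HMM, a CRF can use arbitrary overlapping features of the whole observed sequence, which is why feature design of this kind matters.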
45. Tool
• CMU ARK Twitter NLP
– Tokenization
– Feature extraction
– CRF learning
• Very easy to use: simply replace the part-of-speech tags in the
training set with any other labels, and re-train
46. Output examples
RT @weatherchannel: .@NYGovCuomo orders closing of NYC bridges. Only
Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy
#NYC
Wow what a mess #Sandy has made. Be sure to check on the elderly and
homeless please! Thoughts and prayers to all affected
RT @twc_hurricane: Wind gusts over 60 mph are being reported at Central Park
and JFK airport in #NYC this hour. #Sandy
RT @mitchellreports: Red Cross tells us grateful for Romney donation but prefer
people send money or donate blood dont collect goods NOT best way to help
#Sandy
47. Extractor evaluation
Setting                                  | Rec | Prec
Train 2/3 Joplin, Test 1/3 Joplin        | 78% | 90%
Train 2/3 Sandy, Test 1/3 Sandy          | 41% | 79%
Train Joplin, Test Sandy                 | 11% | 78%
Train Joplin + 10% Sandy, Test 90% Sandy | 21% | 81%
• An extraction counts as correct for precision if it has one word
or more in common with what humans extracted
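The lenient matching rule can be sketched in a few lines. The precision definition follows the slide. The recall definition (share of human extractions that the extractor matched) is an assumption, since the slide does not spell it out, and the function names are illustrative.

```python
from typing import List, Optional, Tuple


def words(text: str) -> set:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,!?:;\"'()").lower() for w in text.split()} - {""}


def overlaps(predicted: str, gold: str) -> bool:
    """True if the two extractions share at least one word."""
    return bool(words(predicted) & words(gold))


def lenient_scores(pairs: List[Tuple[Optional[str], Optional[str]]]) -> Tuple[float, float]:
    """Compute (precision, recall) over per-tweet (predicted, gold) pairs.

    None means nothing was extracted on that side for that tweet.
    """
    hits = sum(1 for p, g in pairs if p and g and overlaps(p, g))
    n_predicted = sum(1 for p, _ in pairs if p)
    n_gold = sum(1 for _, g in pairs if g)
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_gold if n_gold else 0.0
    return precision, recall
```

Because a single shared word is enough, this measure is forgiving about span boundaries, which should be kept in mind when reading the precision column.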
48. Donations matching
• Identify and match requests/offers for donations
– Money, clothing, food, shelter, volunteers, blood
Average precision = 0.21 (0.16 if only text similarity is used)
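For readers unfamiliar with the metric, the sketch below computes average precision for one ranked list of candidate offers for a request. It uses the standard information-retrieval definition and normalizes by the number of relevant items found in the list; whether the reported 0.21 was computed exactly this way is not stated in the slides.

```python
from typing import List


def average_precision(ranked_relevance: List[bool]) -> float:
    """Average of precision@k taken at each rank k holding a relevant item.

    `ranked_relevance[i]` says whether the item at rank i+1 is a correct
    match. Assumes every relevant item appears somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

A ranking that puts its only correct match in second place scores 0.5, so a value of 0.21 indicates that correct matches tend to appear well below the top of the list.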
49. Crowdsourced stream processing systems
Muhammad Imran, Ioanna Lykourentzou and Carlos Castillo:
Engineering Crowdsourced Stream Processing Systems
http://arxiv.org/abs/1310.5463
51. Design objectives and principles
Design objective   | Example metric                     | Design principle: automatic components | Design principle: crowdsourced components
Low latency        | End-to-end time                    | Keep items moving                      | Trivial tasks
High throughput    | Output items per unit of time      | High-performance processing            | Task automation
Load adaptability  | Rate response function             | Load shedding, load queueing           | Task prioritization
Cost effectiveness | Cost vs. quality, throughput, etc. | N/A                                    | Task frugality
High quality       | Application-dependent              | Redundancy, aggregation and quality control (both component types)
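Load shedding, listed under load adaptability, can be illustrated with a bounded queue that refuses new items once it is full. This is a minimal sketch with hypothetical class and method names. Dropping the newest item is only one possible policy, and the cited paper may describe others.

```python
from collections import deque
from typing import Deque, Optional


class SheddingQueue:
    """Bounded FIFO queue that sheds (drops) arrivals when full."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._items: Deque[str] = deque()
        self._capacity = capacity
        self.shed = 0  # number of items dropped so far

    def put(self, item: str) -> bool:
        """Enqueue `item`; return False if it was shed instead."""
        if len(self._items) >= self._capacity:
            self.shed += 1
            return False
        self._items.append(item)
        return True

    def get(self) -> Optional[str]:
        """Dequeue the oldest item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None
```

Tracking the `shed` counter gives an operator a direct view of how much input is being lost during a burst, which ties back to the rate response metric in the table.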
52. Design patterns
● QA loop
● Task assignment
● Process/verify
● Supervised learning
● Crowdwork sub-task chaining
● Humans are not a bottleneck
● Humans review every output element
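One possible reading of the process/verify pattern is sketched below: an automatic step processes every item, and only low-confidence outputs are sent to people for verification. The confidence threshold, the type aliases and the function name are assumptions made for illustration and are not taken from the cited paper.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interfaces for the two components.
Classifier = Callable[[str], Tuple[str, float]]  # returns (label, confidence in [0, 1])
Verifier = Callable[[str, str], str]             # returns the verified label


def process_verify(items: Iterable[str], classify: Classifier,
                   verify: Verifier, threshold: float) -> List[Tuple[str, str, bool]]:
    """Return (item, final_label, was_reviewed) for every item.

    Items whose automatic confidence is below `threshold` are passed
    to the human verifier; the rest go straight through.
    """
    results = []
    for item in items:
        label, confidence = classify(item)
        needs_review = confidence < threshold
        if needs_review:
            label = verify(item, label)
        results.append((item, label, needs_review))
    return results
```

With a threshold of 0 no item is reviewed, so humans are never a bottleneck; with a threshold above 1 humans review every output element. Values in between trade crowd cost against quality.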
53. http://aidr.qcri.org/
54. Self-service for crisis-related classification
[Diagram components: Unstructured text reports; Categorized information; Automatic classifier; Model Builder; Crowdsourced ground-truth; Library of training data]
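The loop suggested by these components (crowd labels feed a model builder, which yields an automatic classifier) can be sketched with a tiny learner. The slides do not say which learning algorithm the system uses, so multinomial naive Bayes with add-one smoothing is swapped in here purely for illustration; the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

_PUNCT = ".,!?:;#@"


def tokenize(text: str) -> List[str]:
    """Lowercased words with surrounding punctuation stripped."""
    stripped = (w.strip(_PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


class ModelBuilder:
    """Accumulates crowd labels and classifies new text with naive Bayes."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocabulary: set = set()

    def add_label(self, text: str, label: str) -> None:
        """Record one crowdsourced (text, label) pair."""
        tokens = tokenize(text)
        self.label_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, text: str) -> str:
        """Return the most probable label under add-one smoothing."""
        if not self.label_counts:
            raise ValueError("no training labels yet")
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary) + 1  # extra slot for unseen words
        tokens = tokenize(text)
        best_label, best_score = "", -math.inf
        for label, n in self.label_counts.items():
            denominator = sum(self.word_counts[label].values()) + vocab_size
            score = math.log(n / total)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because labels can be added at any time, the same object can keep learning while a crisis unfolds, which matches the self-service idea of letting users define categories and supply examples.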
57. Credibility and verification
Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo and Patrick Meier:
TweetCred: A Real-time Web-based System for Credibility of Content on Twitter
In SocInfo 2014. Runner-up for best paper award.
Carlos Castillo, Marcelo Mendoza, Barbara Poblete:
Predicting Information Credibility in Time-Sensitive Social Media
In Internet Research, Vol. 23, Issue 5. October 2013.
A. Popoola, D. Krasnoshtan, A. Toth, V. Naroditskiy, C. Castillo, P. Meier and I. Rahwan:
Information Verification during Natural Disasters
Social Web and Disaster Management (SWDM) workshop, 2013.
62. Crowdsourced verification: Veri.ly
• Frame crowdwork correctly
• Not upvoting/downvoting a claim
• Instead, providing evidence for/against
@VeriDotLy — http://veri.ly/
65. Examples of evidence provided
66. Automatic credibility evaluation: TweetCred
• Real-time web-based service
• Used as a Chrome extension
• Annotates Twitter's timeline with credibility
scores
67. http://twitdigest.iiitd.edu.in/TweetCred/
68. Next steps
• Credibility facets
– Factually written
– Detailed
– Author on the ground
– ...
• Respond to searches about an event
72. Good projects in this space
• Computationally feasible
• Supported by data
• Useful
Temptation! Danger!
• Poorly planned projects :-(
• AI-complete problems
73. Some venues
• SWDM – Workshop on Social Web
for Disaster Management
– Deadline: January 24th
• ISCRAM – International Conference on Information Systems
for Crisis Response and Management
+ the usual suspects, depending on your area ;-)
74. Possibility of large impact by using computer science to support humanitarian work
= Applied computing at its best
75. Thank you!
Carlos Castillo · chato@acm.org
http://www.chato.cl/research/
With thanks to Patrick Meier for several slides