Social media is a valuable source of information for many domains, since users share their opinions and knowledge in (near) real time. Users typically use different words to refer to a particular event (e.g., a rain event). These words can later be used as keywords to filter social media messages about new occurrences of the event and thus reduce the number of unrelated messages. However, such words may have several meanings, in which case filtering on them does not reduce the number of unrelated messages. In this work, we conduct a case study to measure which rain- or flood-related keywords are less effective at reducing the number of unrelated messages. The results show that the noisy keywords vary over space, due to local language and culture, and over time, especially across different time scales.
Does keyword noise change over space and time? A case study of flood- and rain-related social media messages
1. Does keyword noise change over space and time? A case study of flood- and rain-related social media messages
Sidgley Camargo de Andrade (1), Lívia Castro Degrossi (2), Camilo Restrepo-Estrada (3), Alexandre C. B. Delbem (2), João Porto de Albuquerque (4)
December 2018
(1) Federal University of Technology - Paraná (UTFPR), Toledo, Brazil
(2) University of São Paulo (USP), São Carlos, Brazil
(3) University of Antioquia, Medellín, Colombia
(4) University of Warwick, Coventry, UK
Publication available at http://mtc-m16c.sid.inpe.br/col/sid.inpe.br/mtc-m16c/2018/12.27.18.33/doc/p11.pdf
GEOINFO 2018 – Brazilian Symposium on Geoinformatics, Campina Grande, Paraíba, Brazil
2. Big Data (Social Media) & Information Retrieval
• Volume
• Variety
• Veracity
• Velocity
• Noise and rumor
• Heterogeneity
• Uncertainty
• High processing (on-line)
Retrieving relevant and meaningful data is not a straightforward task.
* Paraphrasing an idea/analogy of Dr. João Porto de Albuquerque (Warwick/UK), who used the image of the Roman god Janus.
3. How to filter useful information?
Within the context of social media analytics:
• Social media users use a variety of terms to refer to an event that they observe.
• Keyword-based filtering has been widely employed to reduce the “noise”, i.e., messages that contain event-related keywords but are not actually related to the event, or that are duplicates.
• Noise usually arises when a keyword has more than one definition and/or meaning (e.g., “Santos” can refer to the coastal city or the soccer team).
• Variations, misspellings, and typos are inherent in social media messages (text content).
4. Research
• Question: Does keyword noise change over space and time?
• Methodology: A case study supported by an exploratory content analysis, measuring the signal and noise rates of rain- and flood-related keywords on a data sample from Twitter.
• Purpose: To identify which rain- or flood-related keywords are less effective at reducing the noise.
alagamento (flood), alagado (flooded), alagada (flooded), alagando (it’s flooding), alagou (flooded), alagar (to flood), chove (rain), chova (rain), chovia (was raining), chuva (rain), chuvarada (rain), chuvosa (rainy), chuvoso (rainy), chuvona (heavy rain), chuvinha (drizzle), chuvisco (drizzle), chuvendo (it’s raining), dilúvio (heavy rain), garoa (drizzle), inundação (flood), inundada (flooded), inundado (flooded), inundar (to flood), inundam (flood), inundou (flooded), temporal (storm), temporais (storms)
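A keyword-based filter over this list could be sketched as follows. This is only an illustration, not the filter used in the study: the accent stripping, the lowercase conversion, the whole-word matching, and the names `normalize` and `matched_keywords` are all assumptions made for the example.

```python
import re
import unicodedata

# Keywords from the slide above. Accents are stripped by normalize() below.
KEYWORDS = [
    "alagamento", "alagado", "alagada", "alagando", "alagou", "alagar",
    "chove", "chova", "chovia", "chuva", "chuvarada", "chuvosa", "chuvoso",
    "chuvona", "chuvinha", "chuvisco", "chuvendo", "dilúvio", "garoa",
    "inundação", "inundada", "inundado", "inundar", "inundam", "inundou",
    "temporal", "temporais",
]


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so 'Inundação' matches 'inundacao'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# One whole-word pattern over all normalized keywords (assumed matching rule).
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted({re.escape(normalize(k)) for k in KEYWORDS})) + r")\b"
)


def matched_keywords(message: str) -> list:
    """Return the normalized keywords found in a message, in order of appearance."""
    return _PATTERN.findall(normalize(message))


# A message passes the filter when at least one keyword matches.
print(matched_keywords("Chuva forte e inundação na Sé"))
print(matched_keywords("análise temporal dos dados"))
```

The second message illustrates the noise problem: “temporal” matches even though the text is about a time-related analysis rather than a storm, so the message passes the filter while being off-topic.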
5. Exploratory content analysis
• Study area: São Paulo city, Brazil.
• Number of tweets retrieved: 11,848,923, from 7 Nov. 2016 to 28 Feb. 2017.
• Geotagged tweets: 891,367 (7.52%).
• Keyword-filtered tweets: 5,408.
• On-topic/off-topic tweets: 3,964 and 1,444, respectively (Krippendorff’s alpha coefficient of 0.72, 5 raters).
6. Signal and Noise
To aggregate the signal and noise at different time scales across the districts:
• $\mathrm{Signal}_{st} = \dfrac{\text{nr. on-topic tweets}}{\text{nr. filtered tweets}}$
• $\mathrm{Noise}_{st} = \dfrac{\text{nr. off-topic tweets}}{\text{nr. filtered tweets}}$
where $s$ and $t$ correspond to the district and the time scale, respectively.
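The two ratios can be computed per district and time bucket as in the sketch below. The input layout (tuples of district, timestamp, and on-topic label), the function name `signal_noise`, the `time_key` parameter, and the sample records are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from datetime import datetime


def signal_noise(tweets, time_key):
    """Compute (signal, noise) per (district, time bucket).

    tweets: iterable of (district, timestamp, on_topic) tuples describing
            keyword-filtered tweets that were labeled on-topic or off-topic.
    time_key: function mapping a datetime to a time-scale bucket
              (e.g. the date for a daily scale).
    """
    counts = defaultdict(lambda: [0, 0])  # [on-topic count, off-topic count]
    for district, timestamp, on_topic in tweets:
        counts[(district, time_key(timestamp))][0 if on_topic else 1] += 1

    rates = {}
    for key, (on, off) in counts.items():
        filtered = on + off  # every filtered tweet is either on- or off-topic
        rates[key] = (on / filtered, off / filtered)
    return rates


# Made-up records, aggregated on a daily time scale.
sample = [
    ("Sé", datetime(2017, 1, 5, 14, 0), True),
    ("Sé", datetime(2017, 1, 5, 18, 30), True),
    ("Sé", datetime(2017, 1, 5, 20, 15), False),
    ("Itaquera", datetime(2017, 1, 6, 9, 0), False),
]
for key, (signal, noise) in signal_noise(sample, lambda ts: ts.date()).items():
    print(key, round(signal, 3), round(noise, 3))
```

Switching `time_key` (for example to `lambda ts: (ts.year, ts.month)`) changes the time scale without touching the counting logic. Applied to the whole sample reported earlier (3,964 on-topic out of 5,408 filtered tweets), the same ratios give a signal of about 0.73 and a noise of about 0.27.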
8. Signal and noise of the keywords (Sé and Barra Funda districts)
The signal and noise were measured as the fractions of on-topic and off-topic tweets, respectively, among all the tweets posted within the district (relative frequency) and then rescaled to [-1, 1].
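The slides do not state how the rescaling to [-1, 1] was done. One common option is a linear min-max rescaling, sketched below purely as an assumption; the function name `rescale_symmetric` and the handling of constant input are choices made for this example, not the authors' procedure.

```python
def rescale_symmetric(values):
    """Linearly map values so that min -> -1 and max -> 1 (assumed min-max rule)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale; map them to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Made-up relative frequencies for illustration.
print(rescale_symmetric([0.0, 0.5, 1.0]))
```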
10. Signal and noise of the keywords (Itaquera and Cidade Dutra districts)
The signal and noise were measured as the fractions of on-topic and off-topic tweets, respectively, among all the tweets posted within the district (relative frequency) and then rescaled to [-1, 1].
11. Discussions & Conclusions
• Keywords should be selected with caution, since they are sensitive to time and space.
• At first sight, all predefined keywords had the potential to filter rain- and flood-related messages; however, our analysis demonstrated that some keywords are noisy and may introduce false-positive messages.
• Local issues should be taken into account, such as language/culture and, especially, atypical events.
12. References
IBGE (2010). Censo Demográfico 2010. Brazilian Institute of Geography and Statistics, Rio de Janeiro.
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3):411–433.
Rzeszewski, M. (2018). Geosocial capta in geographical research – a critical analysis. Cartography and Geographic Information Science, 45(1):18–30.
Vieweg, S., Hughes, A. L., Starbird, K., and Palen, L. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1079–1088, New York, NY, USA. ACM.
13. Thank you very much!
Sidgley Camargo de Andrade
Federal University of Technology – Paraná – UTFPR-Toledo
sidgleyandrade@utfpr.edu.br
http://pessoal.utfpr.edu.br/sidgleyandrade/
http://www.agora.icmc.usp.br/
Acknowledgements/Funding