The growth of social media has populated the Web with valuable user generated content that can be exploited for many purposes, such as explaining or predicting real world outcomes through opinion mining. In this context, natural language processing techniques are a key technology for analysing user generated content. Such content is characterised by casual language, with short texts, misspellings, and set phrases, among other characteristics that challenge content analysis. This paper shows the differences in the language used in heterogeneous social media sources by analysing the distribution of the part-of-speech categories extracted from a morphological analysis of a sample of texts published in such sources. In addition, we evaluate the performance of three natural language processing techniques (language identification, sentiment analysis, and topic identification), showing the differences in accuracy when applying such techniques to different types of user generated content.
Comparing user generated content published in different social media sources
1. Comparing user generated content published in different social media sources
Óscar Muñoz-García, Carlos Navarro
@NLP can u tag #user_generated_content ?! via lrec-conf.org
26 May 2012
2. Introduction
The growth of social media has populated the Web with valuable UGC that can be exploited for many interesting purposes
E.g. explaining or predicting real world outcomes through opinion mining
Advertising companies use social media content for market research
By mining users’ interests for focusing advertisement actions
By obtaining the opinion of customers about brands
NLP lets us automate social media content analysis
However, UGC differs in text quality depending on the content source (e.g., Blogs vs. Twitter)
Such differences challenge existing NLP techniques
3. Introduction
We show the differences in the language used in UGC across social media sources
By analysing the distribution of PoS categories on different sources
We evaluate the performance of three NLP techniques
Language Identification
Sentiment Analysis
Topic Identification
Social media sources analysed
Blogs (e.g., Wordpress and Blogger posts)
Forums
Microblogs (e.g., Twitter)
Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
Review Sites (e.g., Ciao and Dooyoo)
Audio-visual content publishing sites (e.g., Youtube and Vimeo)
News publishing sites (i.e., mainstream media)
Other sites
4. Comparing user generated content published in different social media sources
Distribution of PoS categories
5. Distribution of PoS categories
Content analysed
Corpora with 10,000 posts extracted from heterogeneous SM sources
• written in Spanish
• related to the telecommunications domain
The distribution has been obtained by using an automatic tagger
Tools used:
• PoS tagging: TreeTagger [Schmid, 1994] with a Spanish parameterisation
• Annotation pipeline: GATE [Cunningham et al., 2011]
Categories identified
Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
Secondary: common noun, proper noun, negation adverb, personal pronoun, …
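The distribution of PoS categories described above can be sketched as a simple count over tagged tokens. The snippet below is a minimal illustration, not the pipeline used in the study: it assumes the tagger output is already available as (token, category) pairs, and the sample post and its tags are invented for the example.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return the share of each PoS category as a percentage of all tokens."""
    counts = Counter(tag for _token, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}


# Hypothetical tagger output for a short Spanish post (tags are illustrative).
sample = [
    ("mi", "determiner"),
    ("móvil", "noun"),
    ("no", "adverb"),
    ("tiene", "verb"),
    ("cobertura", "noun"),
]
distribution = pos_distribution(sample)
```

Running the same function over the posts of each source yields one percentage column per source, which is the kind of figure compared in the following slides.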
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield. Department of Computer Science, April.
6. Distribution of PoS categories
Microblogs: determiners and prepositions are used to a lesser extent
Limitation of length (140 characters)
Posts need to be written more concisely → grammatical categories that carry little meaning tend to be used less
Social
News Blogs Video Reviews Microblogs Forums Other
networks
Nouns 31% 30% 29% 23% 34% 22% 27% 33%
Adjectives 9% 8% 6% 8% 9% 7% 8% 6%
Adverbs 2% 3% 3% 5% 4% 4% 4% 3%
Determiners 11% 10% 8% 8% 6% 8% 9% 7%
Conjunctions 6% 8% 7% 10% 6% 10% 9% 7%
Pronouns 2% 3% 5% 6% 5% 6% 4% 4%
Prepositions 15% 15% 12% 13% 8% 12% 13% 11%
Punctuation marks 11% 8% 13% 9% 8% 9% 10% 11%
Verbs 12% 14% 17% 18% 19% 21% 16% 16%
Other particles 1% 1% 1% 1% 1% 1% 1% 1%
7. Distribution of PoS categories
News and blogs present similar distributions
Because of similar writing styles
No limitations on the size of posts
8. Distribution of PoS categories
Nouns
Common and proper nouns present similar distributions for all sources
PoS tagger fails when proper nouns are written in lower case
• Especially in Forums and Reviews, where discussions about specific products arise
• Solution: use gazetteers
Improves entity detection
Domain dependent
Foreign words are less used in news than in other sources because of the style rules of Spanish mainstream media
• Avoid foreign words, as far as possible, whenever a Spanish word exists
Adjectives
Adjectives of quantity are the most used (47%) in all the channels
• Cardinals (30%) more used than ordinals (2%)
Multiplicative, partitive and indefinite quantity adjectives are used more frequently in forums and review sites:
• Due to quantitative evaluations and comparison of products
9. Distribution of PoS categories
Adverbs
There is a correlation between the distribution of adverbs of negation and the size of the posts
• More used in channels with shorter texts
• Detection of negations is essential when performing sentiment analysis
Conjunctions
The distribution of coordinating conjunctions is higher in News and Blogs
• More used in channels with longer texts
• Coordinating conjunctions are used to identify opinion chunks as if they were punctuation marks
Pronouns
The distribution of personal pronouns is higher in Microblogs, Reviews, Forums and audio-visual content publishing sites
• Due to conversations between users vs. narrative style of News and Blogs
• Pronouns make it difficult to identify entities within opinions
Entities not explicitly mentioned
10. Distribution of PoS categories
Punctuation marks
Full stop less used in news
• Sentences are longer than in other sources
Comma less used in Microblogs and Audio-visual content sites
Ellipses are more used in Microblogs
• To denote unfinished sentences
• Automatically truncated messages
Secondary punctuation marks less used in Microblogs
• Difficulty of entering these characters on mobile terminals
• Content length limitation
Verbs
More used in Microblogs and Forums
• Intentions and actions are expressed more often
Past tenses less used in Microblogs
• Immediate experiences
Infinitive more used in Microblogs
11. Comparing user generated content published in different social media sources
Performance of language identification
12. Performance of Language Identification
Content analysed
3,368 tweets
2,768 posts extracted from other social media sources (not Twitter)
Written in Spanish, Portuguese and English
Technique used
Implementation of an existing text categorization algorithm
• Analysis of the frequency of n-grams of characters within documents [Cavnar and Trenkle, 1994]
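The character n-gram approach can be sketched as follows. This is a minimal illustration in the spirit of Cavnar and Trenkle's rank-order ("out-of-place") comparison, not the implementation evaluated here: the n-gram range (1 to 3), the profile size, the penalty for unseen n-grams and the one-sentence training texts are all assumptions chosen to keep the example small.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically to keep ranks deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, missing_penalty=300):
    """Sum of rank differences; n-grams unseen in the language get a fixed penalty."""
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += missing_penalty
    return distance


def identify_language(text, lang_profiles):
    """Return the language whose profile is closest to the document profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training texts (invented); real profiles need far more text per language.
TRAINING = {
    "spanish": "el teléfono no funciona y la batería no dura nada",
    "portuguese": "o telefone não funciona e a bateria não dura nada",
    "english": "the phone is not working and the battery does not last",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
guess = identify_language("the battery is not working", PROFILES)
```

Short texts such as tweets produce small, noisy document profiles, which is one plausible reason why this family of methods loses accuracy on Twitter content.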
Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161-175).
13. Performance of Language Identification
Language identification method
14. Performance of Language Identification
Evaluation Results
Overall accuracy
• Twitter: 93.02%
• Other sources: 96.76%
Kappa
• Twitter: 0.844
• Other sources: 0.916
Normalizing tweets does not improve performance
Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
1. Delete references to users at the beginning of the tweet
2. Delete “RT @user:” sequences
3. Delete hash tags found at the end of the tweet
4. Delete “#” at the beginning of hash tags
5. Delete URLs
6. Delete “…” followed by a URL
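The six normalization steps listed above can be sketched with regular expressions. This is a rough illustration rather than the original implementation: the patterns and the order in which the steps are applied are assumptions (for instance, the "…" plus URL case is handled before plain URLs so that the ellipsis is not left behind).

```python
import re

URL = r"https?://\S+"


def normalize_tweet(text):
    """Apply the six syntactic normalization steps to a single tweet."""
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)          # 2. "RT @user:" sequences
    text = re.sub(r"^(?:@\w+[\s,:]*)+", "", text)         # 1. leading user references
    text = re.sub(r"(?:…|\.\.\.)\s*" + URL, "", text)     # 6. "…" followed by a URL
    text = re.sub(URL, "", text)                          # 5. remaining URLs
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)          # 3. hash tags at the end
    text = re.sub(r"#(\w+)", r"\1", text)                 # 4. "#" of inline hash tags
    return re.sub(r"\s+", " ", text).strip()              # tidy leftover whitespace


raw = "RT @user: @ana el #movil no funciona… http://t.co/abc #fail #telco"
clean = normalize_tweet(raw)
```

For the invented tweet above, the retweet marker, the leading mention, the truncation mark with its URL and the trailing hash tags are removed, while the inline hash tag is kept as a plain word.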
Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
15. Comparing user generated content published in different social media sources
Performance of sentiment analysis
16. Performance of Sentiment Analysis
Content analysed
1,859 tweets and 1,847 posts extracted from other social media sources (not Twitter) written in Spanish
Technique used
Matching of linguistic expressions based on a Lexicon
• Each expression is a sequence of pairs (lemma, PoS)
E.g. “Your brand is cool!” matches with {(Σ,Noun),(‘be’,Verb),(‘cool’,Adjective)}
Kinds of expressions
• For detecting subjectivity (20 expressions)
Usually include specific verbs
• For detecting sentiment of opinions (1,480 expressions)
Negative expressions add a value in {-2,-1} to overall sentiment
Positive expressions add a value in {1,2} to overall sentiment
• For reversing sentiment (22 expressions)
Include negations
Multiply detected sentiment by (-1)
• For augmenting or reducing sentiment (32 expressions)
Usually include adverbs
Multiply detected sentiment by 1.5 or 0.75
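The scoring scheme above can be sketched in a few lines. This is a simplified illustration, not the evaluated system: it matches single lemmas instead of (lemma, PoS) sequences, the lexicon entries are invented, and it assumes that reversing and augmenting/reducing expressions apply to the next sentiment expression that follows them, which the slides do not specify.

```python
# Hypothetical mini-lexicon; the real lexicon has 1,480 sentiment expressions.
SENTIMENT = {"cool": 2, "good": 1, "bad": -1, "awful": -2}
REVERSERS = {"not", "never"}                 # multiply sentiment by -1
MODIFIERS = {"very": 1.5, "slightly": 0.75}  # augment or reduce sentiment


def score(lemmas):
    """Sum the sentiment values of a lemmatised post, applying pending modifiers."""
    total = 0.0
    factor = 1.0  # product of the modifiers seen since the last sentiment word
    for lemma in lemmas:
        if lemma in REVERSERS:
            factor *= -1
        elif lemma in MODIFIERS:
            factor *= MODIFIERS[lemma]
        elif lemma in SENTIMENT:
            total += factor * SENTIMENT[lemma]
            factor = 1.0
    return total


positive = score(["your", "brand", "be", "cool"])
negated = score(["it", "be", "not", "very", "good"])
```

Here the first post scores 2.0, while "not very good" scores 1 × 1.5 × (-1) = -1.5. A missed negation flips the sign of the result, which is why the distribution of negation adverbs matters for this technique.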
17. Performance of Sentiment Analysis
Evaluation Results
Overall accuracy
• Twitter: 66.92%
• Other sources: 80.17%
Kappa
• Twitter: 0.198
• Other sources: 0.31
Normalizing tweets does not improve performance
Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
1. Delete references to users at the beginning of the tweet
2. Delete “RT @user:” sequences
3. Delete hash tags found at the end of the tweet
4. Delete “#” at the beginning of hash tags
5. Delete URLs
6. Delete “…” followed by a URL
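The gap between accuracy and kappa in the results above is easier to read with the computation at hand. The sketch below assumes the reported kappa is Cohen's kappa between the gold labels and the predicted labels (the slides do not name the variant); the label lists are invented.

```python
from collections import Counter


def accuracy_and_kappa(gold, predicted):
    """Return (accuracy, Cohen's kappa) for two equally long label sequences."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(gold_counts[label] * pred_counts[label]
                   for label in gold_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
accuracy, kappa = accuracy_and_kappa(gold, predicted)
```

In this toy case the accuracy is 0.75 but kappa is only 0.5, because half of the agreement would be expected by chance. The same effect explains how an accuracy of 66.92% can coexist with a kappa as low as 0.198 when one class dominates.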
18. Comparing user generated content published in different social media sources
Performance of topic identification
19. Performance of topic identification
Description of the method [Muñoz-García et al., 2011]
Input
PoS Filtering
• “torino”, “art”, “media”, “user”, “cloud”
Topic Recognition
• http://dbpedia.org/resource/Turin
• http://dbpedia.org/resource/Art
• http://dbpedia.org/resource/User_(computing)
Language Filtering
• “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages 81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH.
20. Performance of topic identification
PoS filtering example
Input
• But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
PoS filtering
• Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
21. Performance of topic identification
Topic Recognition (Sem4Tags [García-Silva et al, 2010])
PoS filtering
• Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
Context Selection
• Blackberry, {phone, hardware, trackball, mouse}
• Computer, {hardware, mouse, problem, desk}
• …
Disambiguation
• http://dbpedia.org/resource/BlackBerry
• http://dbpedia.org/resource/Computer
Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int. Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October.
22. Performance of topic identification
Context Selection
For each keyword, a set of up to 4 related keywords that will help to disambiguate its meaning
4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia and Mena, 2009]
Keyword Relatedness Keyword Relatedness
phone 0.347 hardware 0.347
trackball 0.311 mouse 0.311
computer 0.288 desk 0.287
problem 0.246 rubber ball 0.246
grime 0.190 gunk 0.168
Active context selection for blackberry keyword
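Given relatedness scores such as those in the table above, selecting the active context reduces to keeping the top-ranked keywords. The sketch below takes the scores as given (computing them from web co-occurrence is out of scope here); the function name is illustrative.

```python
def select_active_context(relatedness, max_size=4):
    """Keep the max_size keywords most related to the target keyword."""
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _score in ranked[:max_size]]


# Relatedness of each post keyword to "blackberry", as listed in the table.
blackberry_relatedness = {
    "phone": 0.347, "hardware": 0.347,
    "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287,
    "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
active_context = select_active_context(blackberry_relatedness)
```

The result is the context {phone, hardware, trackball, mouse} shown for Blackberry in the topic recognition example.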
A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46.
Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09.
23. Performance of topic identification
Disambiguation Criteria
OPTION 1: Most frequent sense for the ambiguous word
• Determined by Wikipedia editors (the first link in a disambiguation page)
OPTION 2: Vector space model
1. A vector is created containing the keyword and its context
2. A vector containing the top N terms is created for each candidate sense using TF-IDF (Term Frequency and Inverse Document Frequency)
3. The cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword
DBpedia resource                               Definition                                 Similarity
BlackBerry                                     Is a line of mobile e-mail and smartphone  0.224
Blackberry                                     is an edible fruit                         0.15
BlackBerry_(song)                              is a song by the Black Crowes              0.0
BlackBerry_Township,_Itasca_County,_Minnesota  Is a township in … Itasca County           0.0
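The vector space option can be sketched as follows. This is only an illustration of the idea: the term lists for each candidate sense are invented stand-ins for the top N terms of each DBpedia resource, the smoothed TF-IDF weighting is one of several common variants, and the resulting scores are not meant to reproduce the similarities in the table.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one TF-IDF vector per document (dict: name -> list of terms)."""
    n_docs = len(documents)
    doc_freq = Counter(term for terms in documents.values() for term in set(terms))
    vectors = {}
    for name, terms in documents.items():
        term_freq = Counter(terms)
        vectors[name] = {
            term: (count / len(terms)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(keyword, context, senses):
    """Return the best-matching sense and the similarity of every candidate."""
    query = {term: 1.0 for term in [keyword] + context}  # binary query vector
    scores = {name: cosine(query, vec)
              for name, vec in tfidf_vectors(senses).items()}
    return max(scores, key=scores.get), scores


# Hypothetical term lists per candidate sense (not taken from DBpedia).
SENSES = {
    "BlackBerry": ["mobile", "email", "smartphone", "phone", "trackball"],
    "Blackberry": ["edible", "fruit", "plant"],
    "BlackBerry_(song)": ["song", "black", "crowes"],
}
best, scores = disambiguate(
    "blackberry", ["phone", "hardware", "trackball", "mouse"], SENSES)
```

Only the smartphone sense shares terms with the context, so it is the one selected; senses with no overlapping terms get a similarity of 0.0, as in the last rows of the table.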
24. Performance of topic identification
Evaluation settings
Evaluated a random sample of 1,816 posts (18.16%)
47 human evaluators
Each post and its identified topics were shown to 3 different evaluators
Evaluation options:
1. The topic is not related to the post
2. The topic is somehow related to the post
3. The topic is closely related to the post
4. The evaluator does not have enough information to make a decision
Fleiss’ kappa test
• Strength of agreement for 2 evaluators = 0.826 (very good)
• Strength of agreement for 3 evaluators = 0.493 (moderate)
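The agreement figures above come from Fleiss' kappa, which can be computed from a table of rating counts. The sketch below uses the standard formula; the small ratings table is invented, with one row per post and one column per evaluation option (4 options, 3 evaluators per post).

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j (every row sums to the same total)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among the raters of each single item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1 - expected)


# Hypothetical judgements: 4 posts, 3 evaluators each, 4 evaluation options.
ratings = [
    [3, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 3, 0, 0],
    [0, 1, 2, 0],
]
agreement = fleiss_kappa(ratings)
```

For this table the observed agreement is 2/3 and the chance agreement is 0.375, giving a kappa of 7/15 (about 0.47), close to the "moderate" band reported for 3 evaluators.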
25. Performance of topic identification
Evaluation Results
Precision depends on the channel
• From 59.19% for social networks
More misspellings
More common nouns
• To 88.89% for review sites
Concrete products and brands
Proper nouns tend to have a Wikipedia entry
The best context selection criterion also depends on the channel
• Active context selection is better for microblogs and review sites
• Considering all the post keywords as context is better for blogs
• No context selection is better for the rest of the cases (almost all the channels)
Naïve default sense selection is effective
27. Conclusions
We have found differences among social media sources for every experiment executed
The distribution of PoS categories varies across sources
• Since PoS tagging is a preliminary step for many NLP techniques, the performance of such techniques may be affected
E.g. Using nouns as context for performing term disambiguation.
More nouns → More context
E.g. Adjectives and adverbs for performing sentiment analysis
Language identification is less accurate for content extracted from Twitter
Sentiment analysis is less accurate for content extracted from Twitter
Precision of topic identification also depends on the source
• With respect to context selection, no single technique performs better for all the sources