Comparing user generated content published in different social media sources

Óscar Muñoz-García, Carlos Navarro

@NLP can u tag #user_generated_content ?! via lrec-conf.org

26 May 2012

Introduction

• The growth of social media has populated the Web with valuable user generated content (UGC) that can be exploited for many purposes
  - E.g., explaining or predicting real-world outcomes through opinion mining

• Advertising companies use social media content for market research
  - By mining users' interests to focus advertising actions
  - By obtaining customers' opinions about brands

• NLP lets us automate social media content analysis
  - However, the text quality of UGC differs across content sources (e.g., blogs vs. Twitter)
  - Such differences challenge existing NLP techniques


• We show how the language used in UGC differs across social media sources
  - By analysing the distribution of part-of-speech (PoS) categories in each source
• We evaluate the performance of three NLP techniques
  - Language identification
  - Sentiment analysis
  - Topic identification
• Social media sources analysed
  - Blogs (e.g., WordPress and Blogger posts)
  - Forums
  - Microblogs (e.g., Twitter)
  - Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
  - Review sites (e.g., Ciao and Dooyoo)
  - Audio-visual content publishing sites (e.g., YouTube and Vimeo)
  - News publishing sites (i.e., mainstream media)
  - Other sites



Distribution of PoS categories


• Content analysed
  - A corpus of 10,000 posts extracted from heterogeneous social media sources
    - Written in Spanish
    - Related to the telecommunications domain
• The distribution was obtained with an automatic tagger
  - Tools used:
    - PoS tagging: TreeTagger [Schmid, 1994] with a Spanish parameterisation
    - Annotation pipeline: GATE [Cunningham et al., 2011]

• Categories identified
  - Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
  - Secondary: common noun, proper noun, negation adverb, personal pronoun, …
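
A distribution of this kind can be obtained by counting the main category of every tagged token per source. The following is a minimal sketch, assuming the tagger output has already been reduced to (source, category) pairs; the function name `pos_distribution` and the toy input are illustrative and do not come from the study.

```python
from collections import Counter, defaultdict


def pos_distribution(tagged_tokens):
    """Return {source: {category: share}} from (source, category) pairs."""
    counts = defaultdict(Counter)
    for source, category in tagged_tokens:
        counts[source][category] += 1
    return {
        source: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for source, cats in counts.items()
    }


# Hypothetical toy input: one (source, main category) pair per token.
sample = [
    ("microblogs", "noun"), ("microblogs", "verb"),
    ("microblogs", "noun"), ("microblogs", "adverb"),
    ("news", "noun"), ("news", "determiner"),
]
distribution = pos_distribution(sample)
```

Shares are computed per source, so each source's values sum to 1 regardless of how many tokens it contributes.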

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield, Department of Computer Science, April.


• Microblogs: determiners and prepositions are used to a lesser extent
  - Length is limited (140 characters)
  - Posts need to be written more concisely → grammatical categories that carry little meaning tend to be used less

                    News  Blogs  Video  Reviews  Microblogs  Forums  Social networks  Other
Nouns                31%    30%    29%      23%         34%     22%              27%    33%
Adjectives            9%     8%     6%       8%          9%      7%               8%     6%
Adverbs               2%     3%     3%       5%          4%      4%               4%     3%
Determiners          11%    10%     8%       8%          6%      8%               9%     7%
Conjunctions          6%     8%     7%      10%          6%     10%               9%     7%
Pronouns              2%     3%     5%       6%          5%      6%               4%     4%
Prepositions         15%    15%    12%      13%          8%     12%              13%    11%
Punctuation marks    11%     8%    13%       9%          8%      9%              10%    11%
Verbs                12%    14%    17%      18%         19%     21%              16%    16%
Other particles       1%     1%     1%       1%          1%      1%               1%     1%



• News and blogs present similar distributions
  - Because of similar writing styles
  - No limitations on the size of posts




• Nouns
  - Common and proper nouns present similar distributions across all sources
  - The PoS tagger fails when proper nouns are written in lower case
    - Especially in forums and review sites, where specific products are discussed
    - Solution: use gazetteers
      - Improves entity detection
      - Domain dependent
  - Foreign words are used less in news than in other sources because of the style rules of Spanish mainstream media
    - Avoid foreign words, as far as possible, whenever a Spanish word exists
• Adjectives
  - Adjectives of quantity are the most used (47%) in all the channels
    - Cardinals (30%) are used more than ordinals (2%)
  - Multiplicative, partitive and indefinite quantity adjectives are used more frequently in forums and review sites
    - Due to quantitative evaluations and comparisons of products



• Adverbs
  - The proportion of adverbs of negation correlates with the size of the posts
    - More used in channels with shorter texts
    - Detecting negations is essential when performing sentiment analysis
• Conjunctions
  - The proportion of coordinating conjunctions is higher in news and blogs
    - More used in channels with longer texts
    - Coordinating conjunctions are used to delimit opinion chunks, as if they were punctuation marks
• Pronouns
  - The proportion of personal pronouns is higher in microblogs, reviews, forums and audio-visual content publishing sites
    - Due to conversations between users vs. the narrative style of news and blogs
    - Pronouns make it difficult to identify entities within opinions
      - Entities are not explicitly mentioned



• Punctuation marks
  - Full stops are used less in news
    - Sentences are longer than in other sources
  - Commas are used less in microblogs and audio-visual content sites
  - Ellipses are used more in microblogs
    - To denote unfinished sentences
    - Automatically truncated messages
  - Secondary punctuation marks are used less in microblogs
    - Difficulty of entering these characters on mobile terminals
    - Content length limitation
• Verbs
  - Used more in microblogs and forums
    - Intentions and actions are expressed more often
  - Past tenses are used less in microblogs
    - Immediate experiences
  - Infinitives are used more in microblogs



Performance of language identification

• 3,368 tweets
• 2,768 posts extracted from other social media sources (not Twitter)
• Written in Spanish, Portuguese and English

• Technique used
  - Implementation of an existing text categorisation algorithm
    - Analysis of the frequency of character n-grams within documents [Cavnar and Trenkle, 1994]

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175.



• Language identification method
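
The character n-gram approach of Cavnar and Trenkle can be sketched roughly as follows: each language is represented by a ranked list of its most frequent n-grams, and a document is assigned to the language whose ranking is closest under an "out-of-place" distance. This is only an illustration under stated assumptions, not the implementation evaluated here: the function names, the profile size of 300, the n-gram lengths 1 to 5 and the one-sentence training texts are all placeholders.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Rank the most frequent character n-grams (1..max_n) of a text."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify_language(text, lang_profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Tiny hypothetical training texts; real profiles need far more data.
training = {
    "es": "el teléfono no funciona y la batería se agota muy rápido",
    "en": "the phone does not work and the battery runs out very fast",
    "pt": "o telefone não funciona e a bateria acaba muito rápido",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
guess = identify_language("the battery does not work", profiles)
```

Short, noisy posts yield small document profiles, which is one plausible reason why this kind of method has less evidence to work with on tweets than on longer texts.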



• Evaluation results
  - Overall accuracy
    - Twitter: 93.02%
    - Other sources: 96.76%
  - Kappa
    - Twitter: 0.844
    - Other sources: 0.916
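
The slides do not state which kappa statistic was used. Assuming it is Cohen's kappa between the gold labels and the labels predicted by the classifier, it can be computed as in the sketch below; the label sequences are hypothetical and are not data from the evaluation.

```python
from collections import Counter


def cohen_kappa(gold, predicted):
    """Agreement between two label sequences, corrected for chance."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq, pred_freq = Counter(gold), Counter(predicted)
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels, not data from the evaluation.
gold = ["es", "es", "en", "pt", "en", "es"]
predicted = ["es", "pt", "en", "pt", "en", "es"]
kappa = cohen_kappa(gold, predicted)
```

Unlike raw accuracy, kappa discounts the agreement that would be expected by chance given how often each label occurs.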

• Normalizing tweets does not improve performance
  - Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
    1. Delete references to users at the beginning of the tweet
    2. Delete “RT @user:” sequences
    3. Delete hash tags found at the end of the tweet
    4. Delete “#” at the beginning of hash tags
    5. Delete URLs
    6. Delete “…” followed by a URL

Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
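
The six deletion rules listed above can be approximated with regular expressions. The sketch below is an assumption about how they might be applied, not the authors' code: the patterns, the function name `normalize_tweet` and the order of application are illustrative (rule 6 runs before rule 5 so that the URL is still present when the ellipsis is matched).

```python
import re


def normalize_tweet(text):
    """Apply the six deletion rules to a tweet (illustrative patterns)."""
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)             # rule 2: "RT @user:"
    text = re.sub(r"^(?:@\w+\s*)+", "", text)                # rule 1: leading @user references
    text = re.sub(r"(?:…|\.{3})\s*https?://\S+", "", text)   # rule 6: "…" followed by a URL
    text = re.sub(r"https?://\S+", "", text)                 # rule 5: remaining URLs
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)             # rule 3: hash tags at the end
    text = re.sub(r"#(\w+)", r"\1", text)                    # rule 4: strip "#" from hash tags
    return " ".join(text.split())                            # collapse leftover whitespace


# Hypothetical tweet used only to exercise the rules.
example = "RT @ana: @bob my #phone died again… http://t.co/x #fail"
cleaned = normalize_tweet(example)
```

Hash tags inside the sentence keep their word, since it usually carries content, while trailing hash tags are dropped entirely.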


Performance of sentiment analysis

• 1,859 tweets and 1,847 posts extracted from other social media sources (not Twitter), written in Spanish
• Technique used
  - Matching of linguistic expressions based on a lexicon
    - Each expression is a sequence of (lemma, PoS) pairs
      - E.g., “Your brand is cool!” matches {(Σ, Noun), (‘be’, Verb), (‘cool’, Adjective)}
  - Kinds of expressions
    - For detecting subjectivity (20 expressions)
      - Usually include specific verbs
    - For detecting the sentiment of opinions (1,480 expressions)
      - Negative expressions add a value in {-2, -1} to the overall sentiment
      - Positive expressions add a value in {1, 2} to the overall sentiment
    - For reversing sentiment (22 expressions)
      - Include negations
      - Multiply the detected sentiment by -1
    - For augmenting or reducing sentiment (32 expressions)
      - Usually include adverbs
      - Multiply the detected sentiment by 1.5 or 0.75
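
The scoring arithmetic described above can be illustrated with a much simplified sketch. It assumes single-lemma lookups instead of (lemma, PoS) sequences, and it assumes that a reverser or modifier applies to the next sentiment expression only; the slides do not specify the scope of modifiers, and the lexicon entries below are hypothetical.

```python
# Hypothetical single-lemma lexicon; the real one holds (lemma, PoS) sequences.
SENTIMENT = {"cool": 1, "great": 2, "bad": -1, "awful": -2}
REVERSERS = {"not", "never"}
MODIFIERS = {"very": 1.5, "slightly": 0.75}


def sentiment_score(lemmas):
    """Sum expression values, applying pending modifiers to the next match."""
    total, factor = 0.0, 1.0
    for lemma in lemmas:
        if lemma in REVERSERS:
            factor *= -1
        elif lemma in MODIFIERS:
            factor *= MODIFIERS[lemma]
        elif lemma in SENTIMENT:
            total += factor * SENTIMENT[lemma]
            factor = 1.0  # assumed scope: modifiers affect one expression only
    return total


score = sentiment_score(["brand", "be", "not", "very", "cool"])
```

Here "not" flips the sign and "very" scales the value, so "cool" (+1) contributes -1.5 to the overall sentiment.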


• Overall accuracy
  - Twitter: 66.92%
  - Other sources: 80.17%
• Kappa
  - Twitter: 0.198
  - Other sources: 0.31

• Normalizing tweets does not improve performance
  - Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]



Performance of topic identification

• Description of the method [Muñoz-García et al., 2011]

  Input
  PoS filtering
    • “torino”, “art”, “media”, “user”, “cloud”
  Topic recognition
    • http://dbpedia.org/resource/Turin
    • http://dbpedia.org/resource/Art
    • http://dbpedia.org/resource/User_(computing)
  Language filtering
    • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...

Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages 81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH.



• PoS filtering example

  Input
    • But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
  PoS filtering
    • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk



• Topic recognition (Sem4Tags [García-Silva et al., 2010])

  PoS filtering
    • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  Context selection
    • Blackberry, {phone, hardware, trackball, mouse}
    • Computer, {hardware, mouse, problem, desk}
    • …
  Disambiguation
    • http://dbpedia.org/resource/BlackBerry
    • http://dbpedia.org/resource/Computer

Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int. Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October.



• Context selection
  - For each keyword, a set of up to 4 related keywords that help to disambiguate its meaning
    - 4 is the number of words above which the context adds no further resolving power to disambiguation [Kaplan, 1955]
  - We compute semantic relatedness (active context) from the co-occurrence of words in web pages [Gracia and Mena, 2009]
Keyword      Relatedness    Keyword      Relatedness
phone        0.347          hardware     0.347
trackball    0.311          mouse        0.311
computer     0.288          desk         0.287
problem      0.246          rubber ball  0.246
grime        0.190          gunk         0.168

Active context selection for the “blackberry” keyword
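
Given relatedness scores such as those in the table above, selecting the active context amounts to keeping the four highest-scoring keywords. A minimal sketch follows; the function name `select_context` is illustrative, and computing the scores themselves from web co-occurrence is outside its scope.

```python
def select_context(relatedness, size=4):
    """Keep the `size` keywords most related to the target keyword."""
    ranked = sorted(relatedness, key=relatedness.get, reverse=True)
    return ranked[:size]


# Relatedness to "blackberry", as listed in the table above.
blackberry = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
context = select_context(blackberry)
```

The result contains phone, hardware, trackball and mouse, which is the context shown for Blackberry in the topic recognition example.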
A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39–46.

Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09,



• Disambiguation criteria
  - OPTION 1: Most frequent sense of the ambiguous word
    - Determined by Wikipedia editors (the first link in a disambiguation page)
  - OPTION 2: Vector space model
    1. A vector is built containing the keyword and its context
    2. For each candidate sense, a vector containing its top N terms is created using TF-IDF (term frequency and inverse document frequency)
    3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

DBpedia resource                                 Definition                                  Similarity
BlackBerry                                       Is a line of mobile e-mail and smartphone   0.224
Blackberry                                       is an edible fruit                          0.15
BlackBerry_(song)                                is a song by the Black Crowes               0.0
BlackBerry_Township,_Itasca_County,_Minnesota    Is a township in … Itasca County            0.0
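
The vector space option can be illustrated with the sketch below. It is not the Sem4Tags implementation: the sense descriptions are short hypothetical stand-ins for the DBpedia texts, the idf weighting uses a smoothed variant (log(1 + N/df)), and the scores it produces are not meant to reproduce the similarities in the table above.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(sense_texts, top_n=10):
    """Build a top-N TF-IDF vector for each candidate sense description."""
    tokens = {s: text.lower().split() for s, text in sense_texts.items()}
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    n = len(tokens)
    vectors = {}
    for sense, toks in tokens.items():
        weights = {
            t: c / len(toks) * math.log(1 + n / doc_freq[t])
            for t, c in Counter(toks).items()
        }
        top = sorted(weights, key=weights.get, reverse=True)[:top_n]
        vectors[sense] = {t: weights[t] for t in top}
    return vectors


# Hypothetical sense descriptions standing in for DBpedia abstracts.
senses = {
    "BlackBerry": "line of mobile phone and smartphone hardware with trackball",
    "Blackberry": "edible fruit produced by many plant species",
}
# Keyword plus its selected context, as a term-count vector.
context = Counter(["blackberry", "phone", "hardware", "trackball", "mouse"])
vectors = tfidf_vectors(senses)
best = max(vectors, key=lambda s: cosine(context, vectors[s]))
```

The device sense shares context terms with the keyword vector while the fruit sense shares none, so the device sense is selected.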



• Evaluation settings
  - Evaluated a random sample of 1,816 posts (18.16%)
  - 47 human evaluators
  - Each post and the topics identified were shown to 3 different evaluators
  - Evaluation options:
    1. The topic is not related to the post
    2. The topic is somehow related to the post
    3. The topic is closely related to the post
    4. The evaluator does not have enough information to make a decision
  - Fleiss’ kappa test
    - Strength of agreement for 2 evaluators = 0.826 (very good)
    - Strength of agreement for 3 evaluators = 0.493 (moderate)
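
Fleiss' kappa measures agreement among a fixed number of raters per item. The sketch below shows the standard computation on a small hypothetical matrix of counts (posts by answer options); the numbers are invented for illustration and are unrelated to the evaluation data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumed equal for every item
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_items) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts: 4 posts, 3 evaluators, 4 answer options.
ratings = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 1, 2, 0],
    [0, 0, 3, 0],
]
kappa = fleiss_kappa(ratings)
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.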




• Precision depends on the channel
  - From 59.19% for social networks
    - More misspellings
    - More common nouns
  - To 88.89% for review sites
    - Concrete products and brands
    - Proper nouns tend to have a Wikipedia entry
• The best context selection criterion also depends on the channel
  - Active context selection is better for microblogs and review sites
  - Considering all the post keywords as context is better for blogs
  - Using no context selection is better in the remaining cases (almost all the channels)
• Naïve default sense selection is effective



Conclusions


• We found differences among social media sources in every experiment executed
• The distribution of PoS categories varies across sources
  - Since PoS tagging is a preliminary step for many NLP techniques, the performance of such techniques may be affected
    - E.g., using nouns as context for term disambiguation: more nouns → more context
    - E.g., using adjectives and adverbs for sentiment analysis
• Language identification is less accurate for content extracted from Twitter
• Sentiment analysis is less accurate for content extracted from Twitter
• The precision of topic identification also depends on the source
  - With respect to context selection, no single technique performs best for all the sources


Thank you!
oscar.munoz@havasmedia.com
