Comparing user generated content
published in different social media sources
Óscar Muñoz-García, Carlos Navarro

@NLP can u tag #user_generated_content ?! via lrec-conf.org

26 May 2012
Introduction




 The growth of social media has populated the Web with valuable
      UGC that can be exploited for many interesting purposes
             E.g. explaining or predicting real world outcomes through opinion
              mining

 Advertising companies use social media content for market research
    By mining users’ interests for focusing advertisement actions
    By obtaining the opinion of customers about brands


 NLP lets us automate social media content analysis
    However, the text quality of UGC differs depending on the content
     source (e.g., Blogs vs. Twitter)
    Such differences challenge existing NLP techniques



Introduction



     We show how the language used in UGC differs across social media sources
        By analysing the distribution of PoS categories in different sources
     We evaluate the performance of three NLP techniques
        Language Identification
        Sentiment Analysis
        Topic Identification
     Social media sources analysed
             Blogs (e.g., Wordpress and Blogger posts)
             Forums
             Microblogs (e.g., Twitter)
             Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
             Review Sites (e.g., Ciao and Dooyoo)
             Audio-visual content publishing sites (e.g., Youtube and Vimeo)
             News publishing sites (i.e., mainstream media)
             Other sites



Distribution of PoS categories




 Content analysed
   A corpus of 10,000 posts extracted from heterogeneous social media sources
      l written in Spanish
      l related to the telecommunications domain
 The distribution has been obtained by using an automatic tagger
   Tools used:
      l  PoS tagging:
                            TreeTagger [Schmid, 1994] with a Spanish parameterisation
                l   Annotation pipeline:
                            GATE [Cunningham et al., 2011]

 Categories identified
   Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
   Secondary: common noun, proper noun, negation adverb, personal pronoun, …

       Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in
       Language Processing, Manchester, UK.

       Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield, Department of
       Computer Science, April.
Distribution of PoS categories


      Microblogs: determiners and prepositions are used to a lesser extent
        Limitation of length (140 characters)
        Posts need to be written more concisely → grammatical categories that carry
          little meaning tend to be used less

                        News   Blogs   Video   Reviews   Microblogs   Forums   Other   Social networks
  Nouns                  31%     30%     29%       23%          34%      22%     27%               33%
  Adjectives              9%      8%      6%        8%           9%       7%      8%                6%
  Adverbs                 2%      3%      3%        5%           4%       4%      4%                3%
  Determiners            11%     10%      8%        8%           6%       8%      9%                7%
  Conjunctions            6%      8%      7%       10%           6%      10%      9%                7%
  Pronouns                2%      3%      5%        6%           5%       6%      4%                4%
  Prepositions           15%     15%     12%       13%           8%      12%     13%               11%
  Punctuation marks      11%      8%     13%        9%           8%       9%     10%               11%
  Verbs                  12%     14%     17%       18%          19%      21%     16%               16%
  Other particles         1%      1%      1%        1%           1%       1%      1%                1%
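
A distribution like the one in the table can be derived from tagger output by counting tags. The following is a minimal sketch, assuming the tagger yields (token, category) pairs; the function name and the sample tokens are hypothetical and do not come from the TreeTagger or GATE APIs.

```python
from collections import Counter

def pos_distribution(tagged_tokens):
    """Return the share of each PoS category as a percentage of all tokens."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}

# Hypothetical tagger output: (token, coarse PoS category) pairs.
sample = [
    ("el", "determiner"), ("móvil", "noun"), ("funciona", "verb"),
    ("muy", "adverb"), ("bien", "adverb"),
]
dist = pos_distribution(sample)
```

Running this once per source and placing the results side by side gives one column of the table per source.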

Distribution of PoS categories


      News and blogs present similar distributions
        Because of similar writing styles
        No limitations on the size of posts



Distribution of PoS categories




 Nouns
   Common and proper nouns present similar distributions for all sources
   PoS tagger fails when proper nouns are written in lower case
                l   Especially in Forums and Reviews, where discussions about specific products arise
                l   Solution: use gazetteers
                            Improves entity detection
                            Domain dependent
            Foreign words are used less in news than in other sources because of the style rules
             of Spanish mainstream media
                l   Avoid foreign words, as far as possible, whenever a Spanish word exists
 Adjectives
   Adjectives of quantity are the most used (47%) in all the channels
                l   Cardinals (30%) are used more than ordinals (2%)
            Multiplicative, partitive and indefinite quantity adjectives are used more frequently
             in forums and review sites:
                l   Due to quantitative evaluations and comparisons of products


Distribution of PoS categories




 Adverbs
   The distribution of adverbs of negation correlates with the size of
    the posts
                l   More used in channels with shorter texts
                l   Detection of negations is essential when performing sentiment analysis
 Conjunctions
   The proportion of coordinating conjunctions is higher in News and Blogs
                l   More used in channels with longer texts
                l   Coordinating conjunctions are used to identify opinion chunks, as if they were punctuation
                    marks
 Pronouns
   The proportion of personal pronouns is higher in Microblogs, Reviews, Forums
     and audio-visual content publishing sites
                l   Due to conversations between users vs. the narrative style of News and Blogs
                l   Pronouns make it difficult to identify entities within opinions
                            Entities are not explicitly mentioned



Distribution of PoS categories




 Punctuation marks
   Full stop less used in news
                l   Sentences are longer than in other sources
        Comma less used on Microblogs and Audio-visual content sites
        Ellipses are more used in Microblogs
                l   To denote unfinished sentences
                l   Automatically truncated messages
            Secondary punctuation marks less used in Microblogs
                l   Difficulty of typing these characters on mobile devices
                l   Content length limitation
 Verbs
   More used in Microblogs and Forums
                l   Intentions and actions are expressed more often
            Past tenses less used in Microblogs
                l   Immediate experiences
            Infinitive more used in Microblogs

Performance of Language Identification




 Content analysed
            3,368 tweets
            2,768 posts extracted from other social media sources (not
             Twitter)
            Written in Spanish, Portuguese and English


 Technique used
            Implementation of an existing text categorization algorithm
                l   Analysis of the frequency of n-grams of characters within documents
                    [Cavnar and Trenkle, 1994]

       Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis
       and Information Retrieval (pp. 161-175).
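
The character n-gram approach can be sketched as follows. This is a minimal illustration in the spirit of Cavnar and Trenkle's rank-order ("out-of-place") comparison, not the implementation evaluated in these slides; the function names, the parameter values and the toy training sentences are assumptions made for the example.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Map the most frequent character n-grams (n = 1..max_n) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )

def identify_language(text, lang_profiles):
    """Pick the language whose profile is closest to the document profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training texts (hypothetical); real profiles need far more data.
training = {
    "es": "el servicio de internet no funciona y la cobertura del móvil es muy mala",
    "en": "the internet service is not working and the mobile coverage is very bad",
    "pt": "o serviço de internet não funciona e a cobertura do celular é muito ruim",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
```

A new post is classified with `identify_language(post, profiles)`. Short texts such as tweets produce sparse profiles, which is one plausible reason for the lower accuracy reported for Twitter.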



Performance of Language Identification


 Language identification method




Performance of Language Identification


 Evaluation Results
   Overall accuracy
                l    Twitter: 93.02%
                l    Other sources: 96.76%
            Kappa
                l    Twitter: 0.844
                l    Other sources: 0.916
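
The two measures can be computed from gold and predicted labels as sketched below. The slides do not say which kappa statistic was used; this sketch assumes Cohen's kappa between the classifier output and the gold standard, and the labels are made up for illustration.

```python
from collections import Counter

def accuracy_and_kappa(gold, predicted):
    """Return (accuracy, Cohen's kappa) for two equally long label sequences."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    # Undefined when expected == 1 (a single label on both sides).
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical gold standard and classifier output.
gold = ["es", "es", "en", "pt", "en", "es"]
pred = ["es", "pt", "en", "pt", "en", "es"]
acc, kappa = accuracy_and_kappa(gold, pred)
```

Kappa discounts the agreement that would be expected by chance, which is why it can be much lower than accuracy when one label dominates.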



 Normalizing tweets does not improve performance
   Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
                1.    Delete references to users at the beginning of the tweet
                2.    Delete “RT @user:” sequences
                3.    Delete hash tags found at the end of the tweet
                4.    Delete “#” at the beginning of hash tags
                5.    Delete URLs
                6.    Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural
       Language Processing (ICON-2010).
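
The six normalization steps can be approximated with regular expressions. This is a sketch under assumptions: the patterns and the order in which the steps are applied (URLs first, so that trailing hash tags are really at the end) are choices made here, not taken from the cited paper.

```python
import re

URL = r"https?://\S+"

def normalize_tweet(text):
    """Apply the six syntactic normalization steps to a tweet."""
    text = re.sub(r"(?:…|\.\.\.)\s*" + URL, " ", text)  # step 6: "…" followed by a URL
    text = re.sub(URL, " ", text)                        # step 5: remaining URLs
    text = re.sub(r"\bRT\s+@\w+:?", " ", text)           # step 2: "RT @user:" sequences
    text = re.sub(r"^\s*(?:@\w+[\s,:]*)+", "", text)     # step 1: leading user references
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)         # step 3: hash tags at the end
    text = re.sub(r"#(?=\w)", "", text)                  # step 4: "#" of in-text hash tags
    return " ".join(text.split())

example = "RT @ana: @luis the new #iphone is great… http://t.co/abc #apple #fail"
cleaned = normalize_tweet(example)
```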
Performance of Sentiment Analysis

     Content analysed
       1,859 tweets and 1,847 posts extracted from other social media sources (not
         Twitter) written in Spanish
     Technique used
       Matching of linguistic expressions based on a Lexicon
           l  Each expression is a sequence of pairs (lemma, PoS)
                            E.g. “Your brand is cool!” matches with {(Σ,Noun),(‘be’,Verb), (‘cool’,Adjective)}
            Kinds of expressions
               l For detecting subjectivity (20 expressions)
                            These usually include specific verbs
                l   For detecting the sentiment of opinions (1,480 expressions)
                            Negative expressions add a value in {-2,-1} to the overall sentiment
                            Positive expressions add a value in {1,2} to the overall sentiment
                l   For reversing sentiment (22 expressions)
                            These include negations
                            Multiply the detected sentiment by -1
                l   For augmenting or reducing sentiment (32 expressions)
                            These usually include adverbs
                            Multiply the detected sentiment by 1.5 or 0.75
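
The scoring scheme above can be sketched as follows. The lexicon entries are hypothetical, and applying reversers and modifiers to the score of the whole post is a simplifying assumption; the slides do not describe the scope over which they act.

```python
ANY = None  # plays the role of Σ: any lemma with the given PoS

def count_matches(tokens, pattern):
    """Count contiguous occurrences of a (lemma, PoS) pattern in a tagged post."""
    n = len(pattern)
    return sum(
        all(pos == t_pos and (lemma is ANY or lemma == t_lemma)
            for (lemma, pos), (t_lemma, t_pos) in zip(pattern, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

def sentiment(tokens, polarity, reversers, modifiers):
    """Sum polarity values, then apply modifier factors and sign reversals."""
    score = sum(value * count_matches(tokens, p) for p, value in polarity.items())
    for pattern, factor in modifiers.items():
        score *= factor ** count_matches(tokens, pattern)
    for pattern in reversers:
        score *= (-1) ** count_matches(tokens, pattern)
    return score

# Hypothetical lexicon entries; each expression is a tuple of (lemma, PoS) pairs.
polarity = {
    ((ANY, "Noun"), ("be", "Verb"), ("cool", "Adjective")): 2,
    (("bad", "Adjective"),): -1,
}
reversers = [(("not", "Adverb"),)]
modifiers = {(("very", "Adverb"),): 1.5, (("slightly", "Adverb"),): 0.75}

# "Your brand is cool!" after lemmatization and PoS tagging.
post_a = [("your", "Determiner"), ("brand", "Noun"), ("be", "Verb"), ("cool", "Adjective")]
# "The service is not very bad"
post_b = [("the", "Determiner"), ("service", "Noun"), ("be", "Verb"),
          ("not", "Adverb"), ("very", "Adverb"), ("bad", "Adjective")]
score_a = sentiment(post_a, polarity, reversers, modifiers)
score_b = sentiment(post_b, polarity, reversers, modifiers)
```

Because every pattern depends on correct lemmas and PoS tags, tagging errors on noisy text propagate directly into the sentiment score.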
Performance of Sentiment Analysis


 Evaluation Results
   Overall accuracy
                l    Twitter: 66.92%
                l    Other sources: 80.17%
            Kappa
                l    Twitter: 0.198
                l    Other sources: 0.31


 Normalizing tweets does not improve performance
   Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
                1.    Delete references to users at the beginning of the tweet
                2.    Delete “RT @user:” sequences
                3.    Delete hash tags found at the end of the tweet
                4.    Delete “#” at the beginning of hash tags
                5.    Delete URLs
                6.    Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural
       Language Processing (ICON-2010).



Performance of topic identification


     Description of the method [Muñoz-García et al., 2011]

                1.    Input
                2.    PoS Filtering
                            • “torino”, “art”, “media”, “user”, “cloud”
                3.    Topic Recognition
                            • http://dbpedia.org/resource/Turin
                            • http://dbpedia.org/resource/Art
                            • http://dbpedia.org/resource/User_(computing)
                4.    Language Filtering
                            • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...

             Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social
             Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages
             81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH.


Performance of topic identification




 PoS filtering example

   Input
        • But a hardware problem is more likely, especially if you use the phone a lot while eating. The
          Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can
          plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.

   PoS filtering
        • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk,
          rubber ball, gunk



Performance of topic identification


     Topic Recognition (Sem4Tags [García-Silva et al, 2010])

                1.    PoS filtering
                            • Blackberry, phone, trackball, computer, problem, grime, hardware,
                              mouse, desk, rubber ball, gunk
                2.    Context Selection
                            • Blackberry, {phone, hardware, trackball, mouse}
                            • Computer, {hardware, mouse, problem, desk}
                            • …
                3.    Disambiguation
                            • http://dbpedia.org/resource/BlackBerry
                            • http://dbpedia.org/resource/Computer




           Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int.
           Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October


Performance of topic identification


 Context Selection
        For each keyword, a set of up to 4 related keywords that help to
         disambiguate its meaning
        4 is the number of words above which the context does not add more resolving
         power to disambiguation [Kaplan, 1955]
        We compute semantic relatedness (active context) taking into account the
         co-occurrence of words in web pages [Gracia and Mena, 2009]
                                     Keyword                 Relatedness      Keyword       Relatedness
                                      phone                     0.347         hardware         0.347
                                      trackball                 0.311          mouse           0.311
                                     computer                   0.288            desk          0.287
                                      problem                   0.246         rubber ball      0.246
                                        grime                   0.190           gunk           0.168


                  Active context selection for blackberry keyword
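
Given relatedness scores, selecting the active context reduces to keeping the top-ranked keywords. The sketch below takes the scores from the table as given; it does not reproduce the web co-occurrence measure, and the function name is hypothetical.

```python
def select_context(relatedness, max_size=4):
    """Keep the max_size keywords most related to the target keyword."""
    ranked = sorted(relatedness.items(), key=lambda kv: kv[1], reverse=True)
    return [keyword for keyword, _ in ranked[:max_size]]

# Relatedness of each post keyword to "blackberry", as listed in the table.
blackberry = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
context = select_context(blackberry)
```

The result is the context {phone, hardware, trackball, mouse} shown for Blackberry in the context selection step.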
      A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46

      Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In
      Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09,

Performance of topic identification




  Disambiguation Criteria
              OPTION 1: Most frequent sense for the ambiguous word
                 l        Determined by Wikipedia editors (the first link in a disambiguation page)
              OPTION 2: Vector space model
                     1.   A vector is built from the keyword and its context
                     2.   For each candidate sense, a vector containing its top N terms is built using
                          TF-IDF (Term Frequency and Inverse Document Frequency)
                     3.   Cosine similarity determines which sense vector is most similar to
                          the vector associated with the keyword

  DBpedia resource                                 Definition                                    Similarity
  BlackBerry                                       is a line of mobile e-mail and smartphone     0.224
  Blackberry                                       is an edible fruit                            0.15
  BlackBerry_(song)                                is a song by the Black Crowes                 0.0
  BlackBerry_Township,_Itasca_County,_Minnesota    is a township in … Itasca County              0.0
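
Option 2 can be sketched with a small, self-contained TF-IDF and cosine similarity implementation. The sense descriptions, the smoothed IDF variant and the function names are assumptions made for illustration; the scores it produces will not match the table.

```python
import math
from collections import Counter

def tfidf_vectors(sense_docs, top_n=10):
    """Build a TF-IDF vector with the top_n terms of each candidate sense."""
    n_docs = len(sense_docs)
    df = Counter(term for terms in sense_docs.values() for term in set(terms))
    vectors = {}
    for sense, terms in sense_docs.items():
        tf = Counter(terms)
        # Smoothed IDF so that terms present in every document keep some weight.
        weights = {t: (c / len(terms)) * math.log(1 + n_docs / df[t]) for t, c in tf.items()}
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        vectors[sense] = dict(top)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def disambiguate(keyword, context, sense_docs):
    """Return the sense whose vector is most similar to keyword plus context."""
    query = {term: 1.0 for term in [keyword] + context}
    vectors = tfidf_vectors(sense_docs)
    return max(vectors, key=lambda sense: cosine(query, vectors[sense]))

# Hypothetical, heavily shortened descriptions of three candidate senses.
senses = {
    "BlackBerry": ["blackberry", "mobile", "phone", "smartphone", "hardware", "trackball", "email"],
    "Blackberry": ["blackberry", "edible", "fruit", "plant", "berry"],
    "BlackBerry_(song)": ["blackberry", "song", "band", "album"],
}
best = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"], senses)
```

With this toy data the device sense wins because it shares three context terms with the query, while the other senses share only the keyword itself.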



Performance of topic identification




 Evaluation settings
    Evaluated a random sample of 1,816 posts (18.16%)
    47 human evaluators
    Each post and its identified topics were shown to 3 different evaluators
    Evaluation options:
                 1.     The topic is not related to the post
                 2.     The topic is somehow related to the post
                 3.     The topic is closely related to the post
                 4.     The evaluator does not have enough information to make a decision
               Fleiss’ kappa test
                 l      Strength of agreement for 2 evaluators = 0.826 (very good)
                 l      Strength of agreement for 3 evaluators = 0.493 (moderate)
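
Fleiss' kappa can be computed from a table of per-item category counts, as in the sketch below. The rating matrix is invented for illustration and is unrelated to the 1,816 evaluated posts.

```python
def fleiss_kappa(ratings):
    """ratings: one row per item with the number of raters choosing each category.

    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: 4 posts, 3 evaluators, 3 categories
# (not related, somehow related, closely related).
ratings = [
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 3],
]
kappa = fleiss_kappa(ratings)
```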




Performance of topic identification




 Evaluation Results




             Precision depends on the channel
                 l    From 59.19% for social networks
                              More misspellings
                              More common nouns
                 l    To 88.89% for review sites
                              Concrete products and brands
                              Proper nouns tend to have a Wikipedia entry
             The best context selection criterion also depends on the channel
                 l    Active context selection is better for microblogs and review sites
                 l    Considering all the post keywords as context is better for blogs
                 l    No context selection is better for the rest of the cases (almost all the channels)
                              Naïve default sense selection is effective

Conclusions




 We have found differences among social media sources in every
      experiment executed
             The distribution of PoS categories varies across sources
                 l    Since PoS tagging is a preliminary step for many NLP techniques, the
                      performance of such techniques may be affected
                              E.g. using nouns as context for performing term disambiguation
                                      More nouns → more context
                              E.g. adjectives and adverbs for performing sentiment analysis
          Language identification is less accurate for content extracted from
           Twitter
          Sentiment analysis is less accurate for content extracted from Twitter
          Precision of topic identification also depends on the source
                 l    With respect to context selection, no single technique performs
                      best for all the sources



Thank you!
 oscar.munoz@havasmedia.com

Mais conteúdo relacionado

Semelhante a Comparing user generated content published in different social media sources

(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...
(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...
(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...Carolyn Hank
 
Social CRM becoming a reality
Social CRM becoming a reality Social CRM becoming a reality
Social CRM becoming a reality BisnodeInteract
 
2010 Social Networking Report
2010 Social Networking Report2010 Social Networking Report
2010 Social Networking ReportTom Blefko
 
Lee Rainie - The new impact of libraries
Lee Rainie - The new impact of librariesLee Rainie - The new impact of libraries
Lee Rainie - The new impact of librariesnvbonline
 
Seattle Interactive Conference - Social and Seach
Seattle Interactive Conference - Social and SeachSeattle Interactive Conference - Social and Seach
Seattle Interactive Conference - Social and SeachMicrosoft
 
Increasing Social Media ROI Using Gladwell's Tipping Point Framework
Increasing Social Media ROI Using Gladwell's Tipping Point FrameworkIncreasing Social Media ROI Using Gladwell's Tipping Point Framework
Increasing Social Media ROI Using Gladwell's Tipping Point FrameworkColleen Carrington
 
Social Media 2009
Social Media 2009Social Media 2009
Social Media 2009frozenfrogs
 
Converseon Measuring ROI of Sm Bulldog Reporter062909
Converseon Measuring ROI of Sm Bulldog Reporter062909Converseon Measuring ROI of Sm Bulldog Reporter062909
Converseon Measuring ROI of Sm Bulldog Reporter062909Jeni Putalavage-Ross
 
The Social Web. Why Brands Must Listen, Measure and Act v2.0
The Social Web. Why Brands Must Listen, Measure and Act v2.0The Social Web. Why Brands Must Listen, Measure and Act v2.0
The Social Web. Why Brands Must Listen, Measure and Act v2.0Visible Technologies
 
Transforming Public Engagement
Transforming Public EngagementTransforming Public Engagement
Transforming Public EngagementCraig Thomler
 
Mobile devcon metrics of the mobile web
Mobile devcon   metrics of the mobile webMobile devcon   metrics of the mobile web
Mobile devcon metrics of the mobile webAvenga Germany GmbH
 
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...Elizabeth Lupfer
 
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...Carolyn Hank
 
Leveraging an international infrastructure: Case studies from the Encyclopeda...
Leveraging an international infrastructure: Case studies from the Encyclopeda...Leveraging an international infrastructure: Case studies from the Encyclopeda...
Leveraging an international infrastructure: Case studies from the Encyclopeda...Cyndy Parr
 

Semelhante a Comparing user generated content published in different social media sources (20)

Social Media Strategy Roadmap
Social Media Strategy RoadmapSocial Media Strategy Roadmap
Social Media Strategy Roadmap
 
(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...
(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...
(Nov 2011) Blogademia Today, Tomorrow? Scholar Bloggers' Preservation Percept...
 
The Rise of E-Reading
The Rise of E-ReadingThe Rise of E-Reading
The Rise of E-Reading
 
Social CRM becoming a reality
Social CRM becoming a reality Social CRM becoming a reality
Social CRM becoming a reality
 
2010 Social Networking Report
2010 Social Networking Report2010 Social Networking Report
2010 Social Networking Report
 
Lee Rainie - The new impact of libraries
Lee Rainie - The new impact of librariesLee Rainie - The new impact of libraries
Lee Rainie - The new impact of libraries
 
Seattle Interactive Conference - Social and Seach
Seattle Interactive Conference - Social and SeachSeattle Interactive Conference - Social and Seach
Seattle Interactive Conference - Social and Seach
 
Increasing Social Media ROI Using Gladwell's Tipping Point Framework
Increasing Social Media ROI Using Gladwell's Tipping Point FrameworkIncreasing Social Media ROI Using Gladwell's Tipping Point Framework
Increasing Social Media ROI Using Gladwell's Tipping Point Framework
 
THE NEXT LEVEL OF ENGAGEMENT: ADVANCED SOCIAL MEDIA STRATEGIES FOR NONPROFITS
THE NEXT LEVEL OF ENGAGEMENT: ADVANCED SOCIAL MEDIA STRATEGIES FOR NONPROFITSTHE NEXT LEVEL OF ENGAGEMENT: ADVANCED SOCIAL MEDIA STRATEGIES FOR NONPROFITS
THE NEXT LEVEL OF ENGAGEMENT: ADVANCED SOCIAL MEDIA STRATEGIES FOR NONPROFITS
 
Social Media 2009
Social Media 2009Social Media 2009
Social Media 2009
 
Libraries Transformed: Research on the changing role of libraries
Libraries Transformed:Research on the changing role of librariesLibraries Transformed:Research on the changing role of libraries
Libraries Transformed: Research on the changing role of libraries
 
Converseon Measuring ROI of Sm Bulldog Reporter062909
Converseon Measuring ROI of Sm Bulldog Reporter062909Converseon Measuring ROI of Sm Bulldog Reporter062909
Converseon Measuring ROI of Sm Bulldog Reporter062909
 
The Social Web. Why Brands Must Listen, Measure and Act v2.0
The Social Web. Why Brands Must Listen, Measure and Act v2.0The Social Web. Why Brands Must Listen, Measure and Act v2.0
The Social Web. Why Brands Must Listen, Measure and Act v2.0
 
The changing world of libraries
The changing world of librariesThe changing world of libraries
The changing world of libraries
 
Transforming Public Engagement
Transforming Public EngagementTransforming Public Engagement
Transforming Public Engagement
 
Mobile devcon metrics of the mobile web
Mobile devcon   metrics of the mobile webMobile devcon   metrics of the mobile web
Mobile devcon metrics of the mobile web
 
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...
Still Setting the Pace in Social Media: The First Longitudinal Study of Usage...
 
Normalizing twitter
Normalizing twitterNormalizing twitter
Normalizing twitter
 
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...
(Sept 2011) Considerations for Preserving Blogademia: Scholar Bloggers’ Perce...
 
Leveraging an international infrastructure: Case studies from the Encyclopeda...
Leveraging an international infrastructure: Case studies from the Encyclopeda...Leveraging an international infrastructure: Case studies from the Encyclopeda...
Leveraging an international infrastructure: Case studies from the Encyclopeda...
 

Comparing user generated content published in different social media sources

  • 1. Comparing user generated content published in different social media sources
       Óscar Muñoz-García, Carlos Navarro
       @NLP can u tag #user_generated_content ?! via lrec-conf.org
       26 May 2012
  • 2. Introduction
       - The growth of social media has populated the Web with valuable user-generated content (UGC) that can be exploited for many purposes
         - E.g. explaining or predicting real-world outcomes through opinion mining
       - Advertising companies use social media content for market research
         - By mining users' interests to focus advertising actions
         - By obtaining customers' opinions about brands
       - NLP lets us automate social media content analysis
         - However, the text quality of UGC differs depending on the content source (e.g., Blogs vs. Twitter)
         - Such differences challenge existing NLP techniques
  • 3. Introduction
       - We show how the language used in UGC differs across social media sources
         - By analysing the distribution of PoS categories in different sources
       - We evaluate the performance of three NLP techniques
         - Language Identification
         - Sentiment Analysis
         - Topic Identification
       - Social media sources analysed
         - Blogs (e.g., Wordpress and Blogger posts)
         - Forums
         - Microblogs (e.g., Twitter)
         - Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
         - Review sites (e.g., Ciao and Dooyoo)
         - Audio-visual content publishing sites (e.g., Youtube and Vimeo)
         - News publishing sites (i.e., mainstream media)
         - Other sites
  • 4. Distribution of PoS categories
  • 5. Distribution of PoS categories
       - Content analysed
         - Corpora with 10,000 posts extracted from heterogeneous social media sources
           - written in Spanish
           - related to the telecommunications domain
       - The distribution has been obtained by using an automatic tagger
         - Tools used:
           - PoS tagging: TreeTagger [Schmid, 1994] with a Spanish parameterisation
           - Annotation pipeline: GATE [Cunningham et al., 2011]
       - Categories identified
         - Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
         - Secondary: common noun, proper noun, negation adverb, personal pronoun, …
       References
         Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
         Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield, Department of Computer Science, April.
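A minimal sketch of how a per-source PoS distribution like the one in these slides could be computed once the posts have been tagged. It assumes the tagger output is already available as (token, category) pairs; it does not call TreeTagger or GATE, and the function name and sample tokens are hypothetical.

```python
from collections import Counter

def pos_distribution(tagged_tokens):
    """Return {category: share of tokens} for a list of (token, category) pairs."""
    counts = Counter(category for _, category in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}

# Hypothetical tagger output for one short Spanish post
sample = [("mi", "determiner"), ("móvil", "noun"), ("no", "adverb"), ("funciona", "verb")]
distribution = pos_distribution(sample)
```

Running the same function over the tokens of each source separately would give one column of shares per source.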
  • 6. Distribution of PoS categories
       - Microblogs: determiners and prepositions are used to a lesser extent
         - Length limitation (140 characters)
         - Posts need to be written more concisely → grammatical categories that carry little meaning tend to be used less

                            News  Blogs  Video  Reviews  Microblogs  Forums  Other  Social networks
       Nouns                 31%    30%    29%      23%         34%     22%    27%              33%
       Adjectives             9%     8%     6%       8%          9%      7%     8%               6%
       Adverbs                2%     3%     3%       5%          4%      4%     4%               3%
       Determiners           11%    10%     8%       8%          6%      8%     9%               7%
       Conjunctions           6%     8%     7%      10%          6%     10%     9%               7%
       Pronouns               2%     3%     5%       6%          5%      6%     4%               4%
       Prepositions          15%    15%    12%      13%          8%     12%    13%              11%
       Punctuation marks     11%     8%    13%       9%          8%      9%    10%              11%
       Verbs                 12%    14%    17%      18%         19%     21%    16%              16%
       Other particles        1%     1%     1%       1%          1%      1%     1%               1%
  • 7. Distribution of PoS categories
       - News and blogs present similar distributions
         - Because of similar writing styles
         - No limitations on the size of posts
  • 8. Distribution of PoS categories
       - Nouns
         - Common and proper nouns present similar distributions for all sources
         - The PoS tagger fails when proper nouns are written in lower case
           - Especially in Forums and Reviews, where specific products are discussed
           - Solution: use gazetteers
             - Improves entity detection
             - Domain dependent
         - Foreign words are used less in news than in other sources because of the style rules of Spanish mainstream media
           - Avoid foreign words, as far as possible, whenever a Spanish word exists
       - Adjectives
         - Adjectives of quantity are the most used (47%) in all the channels
           - Cardinals (30%) are used more than ordinals (2%)
         - Multiplicative, partitive and indefinite quantity adjectives are used more frequently in forums and review sites
           - Due to quantitative evaluations and comparisons of products
  • 9. Distribution of PoS categories
       - Adverbs
         - The distribution of adverbs of negation correlates with the size of the posts
           - More used in channels with shorter texts
           - Detecting negations is essential when performing sentiment analysis
       - Conjunctions
         - The share of coordinating conjunctions is higher in News and Blogs
           - More used in channels with longer texts
           - Coordinating conjunctions are used to identify opinion chunks, as if they were punctuation marks
       - Pronouns
         - The share of personal pronouns is higher in Microblogs, Reviews, Forums and audio-visual content publishing sites
           - Due to conversations between users vs. the narrative style of News and Blogs
           - Pronouns make it difficult to identify entities within opinions
             - Entities are not explicitly mentioned
  • 10. Distribution of PoS categories
       - Punctuation marks
         - Full stops are used less in news
           - Sentences are longer than in other sources
         - Commas are used less in Microblogs and audio-visual content sites
         - Ellipses are used more in Microblogs
           - To denote unfinished sentences
           - Automatically truncated messages
         - Secondary punctuation marks are used less in Microblogs
           - Difficulty of entering these characters on mobile terminals
           - Content length limitation
       - Verbs
         - More used in Microblogs and Forums
           - Intentions and actions are expressed more often
         - Past tenses are used less in Microblogs
           - Immediate experiences
         - Infinitives are used more in Microblogs
  • 11. Performance of language identification
  • 12. Performance of Language Identification
       - Content analysed
         - 3,368 tweets
         - 2,768 posts extracted from other social media sources (not Twitter)
         - Written in Spanish, Portuguese and English
       - Technique used
         - Implementation of an existing text categorization algorithm
           - Analysis of the frequency of character n-grams within documents [Cavnar and Trenkle, 1994]
       References
         Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161-175).
  • 13. Performance of Language Identification
       - Language identification method
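The character n-gram approach of [Cavnar and Trenkle, 1994] can be sketched roughly as follows: each language and each document is reduced to a ranked list of its most frequent character n-grams, and the document is assigned to the language whose ranking is closest under an "out-of-place" distance. This is a toy illustration, not the implementation evaluated in the slides; the function names, the `max_n` and `top_k` defaults and the one-sentence training texts are all assumptions.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..max_n) to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify_language(text, lang_profiles):
    """Return the language whose profile has the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))

# Hypothetical one-sentence training texts; a real profile needs far more data
training = {
    "es": "el teléfono no funciona y la batería se descarga muy rápido cuando llamo",
    "en": "the phone does not work and the battery drains very fast when I call",
    "pt": "o telefone não funciona e a bateria descarrega muito rápido quando ligo",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
```

With profiles built from larger corpora, `identify_language(post, profiles)` would be called once per post. Short texts such as tweets yield sparse document profiles, which is one plausible reason for the lower accuracy on Twitter.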
  • 14. Performance of Language Identification
       - Evaluation Results
         - Overall accuracy
           - Twitter: 93.02%
           - Other sources: 96.76%
         - Kappa
           - Twitter: 0.844
           - Other sources: 0.916
       - Normalizing tweets does not improve performance
         - Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
           1. Delete references to users at the beginning of the tweet
           2. Delete “RT @user:” sequences
           3. Delete hash tags found at the end of the tweet
           4. Delete “#” at the beginning of hash tags
           5. Delete URLs
           6. Delete “…” followed by a URL
       References
         Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
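The six normalization steps listed above can be sketched with regular expressions. This is an illustrative approximation, not the code used in the evaluation: the patterns, the function name and the order in which the steps are applied (retweet markers first, "…" plus URL before plain URLs) are assumptions.

```python
import re

def normalize_tweet(text):
    """Apply the six syntactic normalization steps to a tweet (order is an assumption)."""
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)           # step 2: "RT @user:" sequences
    text = re.sub(r"^(?:@\w+[:,]?\s*)+", "", text)         # step 1: leading user references
    text = re.sub(r"(?:…|\.{3})\s*https?://\S+", "", text)  # step 6: "…" followed by a URL
    text = re.sub(r"https?://\S+", "", text)               # step 5: remaining URLs
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)           # step 3: hash tags at the end
    text = re.sub(r"#(\w+)", r"\1", text)                  # step 4: "#" of inline hash tags
    return " ".join(text.split())                          # tidy leftover whitespace

cleaned = normalize_tweet("RT @bob: @alice my #phone is great… http://t.co/abc #win")
```

Removing the URL before the trailing hash tags matters: a hash tag that precedes a final URL only becomes "trailing" once the URL is gone.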
  • 15. Performance of sentiment analysis
  • 16. Performance of Sentiment Analysis
       - Content analysed
         - 1,859 tweets and 1,847 posts extracted from other social media sources (not Twitter), written in Spanish
       - Technique used
         - Matching of linguistic expressions based on a lexicon
           - Each expression is a sequence of pairs (lemma, PoS)
             - E.g. “Your brand is cool!” matches {(Σ,Noun),(‘be’,Verb),(‘cool’,Adjective)}
         - Kinds of expressions
           - For detecting subjectivity (20 expressions)
             - Usually include specific verbs
           - For detecting the sentiment of opinions (1,480 expressions)
             - Negative expressions add a value in {-2,-1} to the overall sentiment
             - Positive expressions add a value in {1,2} to the overall sentiment
           - For reversing sentiment (22 expressions)
             - Include negations
             - Multiply the detected sentiment by (-1)
           - For augmenting or reducing sentiment (32 expressions)
             - Usually include adverbs
             - Multiply the detected sentiment by 1.5 or 0.75
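The scoring scheme described above (polarity values in {-2,-1,1,2}, reversal by -1, scaling by 1.5 or 0.75) can be sketched as follows. This is a deliberate simplification: it matches single lemmas rather than sequences of (lemma, PoS) pairs, it applies a reverser or modifier to the next polarity word found, and the three mini-lexicons are hypothetical stand-ins for the real ones.

```python
# Hypothetical mini-lexicons (the slides report 1,480 polarity expressions,
# 22 reversing expressions and 32 augmenting/reducing expressions)
POLARITY = {"cool": 1, "great": 2, "slow": -1, "awful": -2}
REVERSERS = {"not", "never"}
MODIFIERS = {"very": 1.5, "slightly": 0.75}

def sentiment_score(lemmas):
    """Sum the polarity of matched lemmas, applying pending reversers and modifiers."""
    total = 0.0
    factor = 1.0  # multiplier accumulated from reversers/modifiers seen so far
    for lemma in lemmas:
        if lemma in REVERSERS:
            factor *= -1
        elif lemma in MODIFIERS:
            factor *= MODIFIERS[lemma]
        elif lemma in POLARITY:
            total += factor * POLARITY[lemma]
            factor = 1.0  # a multiplier only affects the next polarity word
    return total

score = sentiment_score(["not", "very", "cool"])
```

A positive total would be read as a positive opinion and a negative total as a negative one. Matching full (lemma, PoS) sequences, as the slides describe, would additionally require PoS-tagged and lemmatised input.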
  • 17. Performance of Sentiment Analysis
       - Evaluation Results
         - Overall accuracy
           - Twitter: 66.92%
           - Other sources: 80.17%
         - Kappa
           - Twitter: 0.198
           - Other sources: 0.31
       - Normalizing tweets does not improve performance
         - Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
           1. Delete references to users at the beginning of the tweet
           2. Delete “RT @user:” sequences
           3. Delete hash tags found at the end of the tweet
           4. Delete “#” at the beginning of hash tags
           5. Delete URLs
           6. Delete “…” followed by a URL
       References
         Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
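Both evaluations report overall accuracy alongside a kappa value. The slides do not say which kappa statistic was used; the sketch below assumes Cohen's kappa between gold labels and predicted labels, which corrects accuracy for the agreement expected by chance. Function names and sample labels are hypothetical.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def cohen_kappa(gold, predicted):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement).

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four posts
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(gold, predicted)
```

Under this reading, a high accuracy paired with a low kappa (as in the sentiment results) would suggest that much of the agreement is attributable to a skewed label distribution.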
  • 18. Performance of topic identification
  • 19. Performance of topic identification
       - Description of the method [Muñoz-García et al., 2011]
         - Input
         - PoS Filtering
           - “torino”, “art”, “media”, “user”, “cloud”
         - Topic Recognition
           - http://dbpedia.org/resource/Turin
           - http://dbpedia.org/resource/Art
           - http://dbpedia.org/resource/User_(computing)
         - Language Filtering
           - “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
       References
         Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages 81–86, Torino, Italy. Eurescom, the European Institute for Research and Strategic Studies in Telecommunications GmbH.
  • 20. Performance of topic identification
       - PoS filtering example
         - Input
           - But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
         - PoS filtering
           - Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  • 21. Performance of topic identification
       - Topic Recognition (Sem4Tags [García-Silva et al., 2010])
         - PoS filtering
           - Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
         - Context Selection
           - Blackberry, {phone, hardware, trackball, mouse}
           - Computer, {hardware, mouse, problem, desk}
           - …
         - Disambiguation
           - http://dbpedia.org/resource/BlackBerry
           - http://dbpedia.org/resource/Computer
       References
         Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int. Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October.
  • 22. Performance of topic identification
       - Context Selection
         - For each keyword, a set of up to 4 related keywords that will help to disambiguate its meaning
           - 4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
         - We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia and Mena, 2009]

       Active context selection for the "blackberry" keyword

       Keyword       Relatedness
       phone         0.347
       hardware      0.347
       trackball     0.311
       mouse         0.311
       computer      0.288
       desk          0.287
       problem       0.246
       rubber ball   0.246
       grime         0.190
       gunk          0.168

       References
         A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46.
         Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09.
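Given relatedness scores such as those in the table above, selecting the active context reduces to keeping the top-scoring keywords, up to 4. The sketch below takes the scores as given and does not compute relatedness from web co-occurrence; the function name is hypothetical.

```python
def active_context(relatedness, size=4):
    """Return up to `size` keywords with the highest relatedness to the target keyword."""
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:size]]

# Relatedness to "blackberry", copied from the slide
blackberry_relatedness = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
context = active_context(blackberry_relatedness)
```

The selected keywords match the context shown for Blackberry in the topic recognition example: phone, hardware, trackball and mouse.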
  • 23. Performance of topic identification
       - Disambiguation Criteria
         - OPTION 1: Most frequent sense for the ambiguous word
           - Determined by Wikipedia editors (the first link in a disambiguation page)
         - OPTION 2: Vector space model
           1. A vector is created containing the keyword and its context
           2. A vector containing the top N terms is created for each candidate sense using TF-IDF (Term Frequency and Inverse Document Frequency)
           3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

       DBpedia resource                                 Definition                                    Similarity
       BlackBerry                                       Is a line of mobile e-mail and smartphone     0.224
       Blackberry                                       is an edible fruit                            0.15
       BlackBerry_(song)                                is a song by the Black Crowes                 0.0
       BlackBerry_Township,_Itasca_County,_Minnesota    Is a township in … Itasca County              0.0
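The vector space option can be sketched as follows: TF-IDF vectors are built for each candidate sense, a simple vector is built from the keyword and its context, and the sense with the highest cosine similarity wins. This is an illustration under stated assumptions, not the Sem4Tags implementation: the smoothed IDF formula, the omission of the "top N terms" cut-off, the function names and the short sense descriptions are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: {name: non-empty list of terms} -> {name: {term: TF-IDF weight}}."""
    n_docs = len(documents)
    doc_freq = Counter(term for terms in documents.values() for term in set(terms))
    vectors = {}
    for name, terms in documents.items():
        tf = Counter(terms)
        vectors[name] = {
            # Smoothed IDF (an assumption) so that shared terms keep a non-zero weight
            term: (count / len(terms)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(keyword, context, senses):
    """Pick the candidate sense most similar to the keyword-plus-context vector."""
    sense_vectors = tfidf_vectors(senses)
    query = dict.fromkeys([keyword] + context, 1.0)
    return max(sense_vectors, key=lambda name: cosine(query, sense_vectors[name]))

# Hypothetical term lists standing in for the text of each candidate sense
senses = {
    "BlackBerry": "line of mobile email and smartphone devices phone hardware trackball".split(),
    "Blackberry": "edible fruit produced by many species".split(),
    "BlackBerry_(song)": "song by the black crowes".split(),
}
best = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"], senses)
```

In this toy data only the smartphone sense shares terms with the context, so it is the one selected; with real sense texts the similarities would be graded, as in the table above.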
  • 24. Performance of topic identification
       - Evaluation settings
         - Evaluated a random sample of 1,816 posts (18.16%)
         - 47 human evaluators
         - Each post and its identified topics were shown to 3 different evaluators
         - Evaluation options:
           1. The topic is not related to the post
           2. The topic is somehow related to the post
           3. The topic is closely related to the post
           4. The evaluator does not have enough information to make a decision
         - Fleiss’ kappa test
           - Strength of agreement for 2 evaluators = 0.826 (very good)
           - Strength of agreement for 3 evaluators = 0.493 (moderate)
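For reference, Fleiss' kappa for a set of items each rated by the same number of evaluators can be computed as below. This is a generic sketch of the statistic, not the authors' evaluation script; the input layout (one row per item, one count per category) and the sample ratings are assumptions.

```python
def fleiss_kappa(ratings):
    """ratings: one row per item; row[j] = number of evaluators choosing category j.

    Every row must sum to the same number of evaluators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_categories)]
    # Per-item agreement: fraction of evaluator pairs that agree
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical ratings: 2 items, 3 evaluators, 2 categories
kappa = fleiss_kappa([[3, 0], [1, 2]])
```

For the setting in the slides, each row would hold four counts (one per evaluation option) summing to 3.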
  • 25. Performance of topic identification
       - Evaluation Results
         - Precision depends on the channel
           - From 59.19% for social networks
             - More misspellings
             - More common nouns
           - To 88.89% for review sites
             - Concrete products and brands
             - Proper nouns tend to have a Wikipedia entry
         - The best context selection criterion also depends on the channel
           - Active context selection is better for microblogs and review sites
           - Considering all the post keywords as context is better for blogs
           - No context selection is better in the remaining cases (almost all the channels)
         - Naïve default sense selection is effective
  • 26. Conclusions
  • 27. Conclusions
       - We have found differences among social media sources in every experiment executed
         - The distribution of PoS categories varies across sources
           - Since PoS tagging is a prior step for many NLP techniques, the performance of such techniques may be affected
             - E.g. using nouns as context for performing term disambiguation: more nouns → more context
             - E.g. using adjectives and adverbs for performing sentiment analysis
         - Language identification is less accurate for content extracted from Twitter
         - Sentiment analysis is less accurate for content extracted from Twitter
         - Precision of topic identification also depends on the source
           - With respect to context selection, no single technique performs better for all the sources