                 BBC SoundIndex
     Pulse of the Online Music Populace
Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media", 2010.
The Vision
http://www.almaden.ibm.com/cs/projects/iis/sound/

What is 'really' hot?
  Netizens do not always buy their music, let alone buy it in a CD store.
  Traditional sales figures are a poor indicator of music popularity.

BBC: Are online music communities good proxies for popular music listings?
IBM: Well, let's build and find out!

BBC SoundIndex - "A pioneering project to tap into the online buzz surrounding artists and songs, by leveraging several popular online sources"

“one chart for everyone” is so old!
"Multimodal Social Intelligence in a Real-Time Dashboard System", VLDB Journal 2010 Special Issue: Data Management and Mining for Social Networks and Social Media.

Inputs: structured artist/track metadata, user metadata, and unstructured attention metadata.

right data source, right crowd, timeliness of data..?
                  Album/Track identification
                  Sentiment Identification
                  Spam and off-topic comments

                      UIMA Analytics Environment




Extracted concepts into explorable data structures



What are 18-year-olds in London listening to?




Validating crowd-sourced preferences
  right data source, right crowd, timeliness of data..?
Imagine doing this for a local business!




  Pulse of the ‘foodie’ populace!
  Where are 20-somethings going? Why?




                            SoundIndex Architecture
Fig. 2 SoundIndex architecture. Data sources are ingested and if necessary transformed into structured data using the MusicBrainz
RDF and data miners. The resulting structured data is stored in a database and periodically extracted to update the front end.


Is the radio still the most popular medium for music? In          of “plays” such as YouTube videos and LastFM tracks,
2007 less than half of all teenagers purchased a CD6 and          and purely structured data such as the number of sales
with the advent of portable MP3 players, fewer people             from iTunes. We then clean this data, using spam re-
are listening to the radio.                                       moval, off-topic detection, and performed analytics to
    With the rise of new ways in which communities                spot songs, artists and sentiment expressions. The cleaned
                                                                  and annotated data is combined, adjudicating for dif-
UIMA Annotators
 MySpace user comments (~ Twitter)
       Named Entity Recognition
       Sentiment Recognition
       Spam Elimination
             I heart your new song Umbrella..
madge..ur pics on celebration concert with jesus r awesome!


 Challenges, intuitions, findings, results..
Recognizing Named Entities
  Cultural entities, informal text, context-poor utterances, restricted to the music sense..
  Problem definition: semantic annotation of album/track names (using MusicBrainz) [at ISWC '09]

              "Ohh these sour times... rock!"
NER candidates (Michael Jackson albums): Got to Be There, Ben, Music and Me, Forever Michael, Off the Wall, Thriller, Bad, Dangerous

Spot and Disambiguate Paradigm
  Generate candidate entities
    "Thriller was my most fav MJ album"
    "this is bad news, ill miss you MJ"
  Disambiguate spots/mentions in context
  Disambiguation-intensive approach
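This spot-and-disambiguate loop is easy to sketch. Below is a minimal illustration in Python, with a hypothetical four-entry candidate list and a crude domain-cue check standing in for the disambiguation step (the actual system uses an SVM over NL features, covered later):

```python
import re

# Hypothetical scoped candidate list: a handful of one artist's works.
CANDIDATES = ["Thriller", "Bad", "Off the Wall", "Dangerous"]

def spot(comment):
    """Naive spotter: flag every case-insensitive match of a candidate name."""
    return [(name, m.start(), m.end())
            for name in CANDIDATES
            for m in re.finditer(re.escape(name), comment, re.IGNORECASE)]

def disambiguate(comment, spots):
    """Stand-in context check: keep spots only when a music-domain cue
    appears in the comment (the real system uses an SVM over NL features)."""
    domain_cues = {"album", "song", "track", "single", "concert"}
    has_cue = any(cue in comment.lower() for cue in domain_cues)
    return [s for s in spots if has_cue]

for c in ["Thriller was my most fav MJ album",
          "this is bad news, ill miss you MJ"]:
    print(c, "->", disambiguate(c, spot(c)))
```

On these two comments, "Thriller" survives (the comment mentions "album") while the "bad news" false positive is rejected.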
Challenge 1: Multiple Senses, Same Domain
  60 songs with Merry Christmas
  3600 songs with Yesterday
  195 releases of American Pie
  31 artists covering American Pie

  "Happy 25th! Loved your song Smile .."
Intuition: Scoped Graphs
  Scoped relationship graphs using cues from the content, webpage title, url..
  Reduce potential entity spot size
    Generate candidate entities
    Disambiguate in context

                "This new Merry Christmas tune.. SO GOOD!"
                Which 'Merry Christmas'? 'So Good' is also a song!
What Content Cues to Exploit?
  "I heart your new album Habits"
  "Happy 25th lilly, love ur song smile"
  "Congrats on your rising star award"..

  Releases that are not new; artists who are at least 25; new careers..

Experimenting with Restrictions
  Career Length
  Album Release
  No. of Albums
  ...
  Specific Artist
Gold Truth Dataset
1800+ spots in MySpace user comments from
artist pages
Keep your SMILE on!
  good spot, bad spot, inconclusive spot?

4-way annotator agreements across spots
  Madonna 90% agreement
  Rihanna 84% agreement
  Lily Allen 53% agreement
Sample Restrictions, Spot Precision
  3 artists, 1800+ spots
  Experimenting with Restrictions: Career Length, Album Release, No. of Albums, ..., Specific Artist

From all of MusicBrainz (281890 artists, 6220519 tracks) to the tracks of one artist

[Chart: precision of the spotter vs. percent of the taxonomy used by the spotter, both on log scales from 0.0001% to 100%. Precision ranges from roughly 0.001% against the entire MusicBrainz taxonomy up to near 100% when restricted to the tracks of only one artist, with intermediate restrictions such as "artists whose first album was in the past few years" and "all artists who released an album in the past few years".]

Restrictions on the taxonomy based on artist characteristics closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution.

Choosing which constraints to implement is simple - pick the easiest first.
Scoped Entity Lists: Madonna's tracks

  User comments are on MySpace artist pages
    Restriction: artist name
    Assumption: no other artist/work mention
  Naive spotter has the advantage of spotting all possible mentions (modulo spelling errors)
    Generates several false positives

                        "this is bad news, ill miss you MJ"
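For illustration, a scoped list like "Madonna's tracks" can be pulled from MusicBrainz. A sketch using the community musicbrainzngs client (an assumption for the example; the SoundIndex pipeline itself consumed the MusicBrainz RDF dump, not this web-service wrapper):

```python
import musicbrainzngs

# MusicBrainz requires a user-agent string for web-service calls.
musicbrainzngs.set_useragent("scoped-spotter-demo", "0.1", "demo@example.com")

def scoped_track_list(artist_name, limit=100):
    """Return a set of track titles scoped to a single artist."""
    result = musicbrainzngs.search_recordings(artist=artist_name, limit=limit)
    return {rec["title"] for rec in result["recording-list"]}

# madonna_tracks = scoped_track_list("Madonna")
# The naive spotter then matches comments only against this scoped list.
```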
Challenge 2: Disambiguating Spots

    Got your new album Smile. Loved it!
    Keep your SMILE on!


    Separating valid and invalid mentions of music named entities.
Intuition: Using Natural Language Cues
                   Got your new album Smile. Loved it!
Syntactic features                                                Notation-S
  POS tag of s (+)                                                s.POS
  POS tag of one token before s                                   s.POSb
  POS tag of one token after s                                    s.POSa
  Typed dependency between s and sentiment word (*)               s.POS-TDsent
  Typed dependency between s and domain-specific term (*)         s.POS-TDdom
  Boolean typed dependency between s and sentiment (*)            s.B-TDsent
  Boolean typed dependency between s and domain-specific term (*) s.B-TDdom
Word-level features                                               Notation-W
  Capitalization of spot s (+)                                    s.allCaps
  Capitalization of first letter of s (+)                         s.firstCaps
  s in quotes (+)                                                 s.inQuotes
Domain-specific features                                          Notation-D
  Sentiment expression in the same sentence as s                  s.Ssent
  Sentiment expression elsewhere in the comment                   s.Csent
  Domain-related term in the same sentence as s                   s.Sdom
  Domain-related term elsewhere in the comment                    s.Cdom

(+) Refers to basic features; others are advanced features.
(*) These features apply only to one-word-long spots.

Table 6. Features used by the SVM learner: generic syntactic, spot-level, domain features
Feature notes:
  1. Sentiment expressions: slang sentiment gazetteer mined using Urban Dictionary
  2. Domain-specific terms: music, album, concert..

Typed dependencies: We also captured the typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features. These were obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features. This allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus.

Table 7. Typed dependencies example
  Valid spot: "Got your new album Smile. Simply loved it!"
    Encoding: nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved.
  Invalid spot: "Keep your smile on. You'll do great!"
    Encoding: no typed dependency between smile and great.
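For a concrete picture of this check, here is a sketch with spaCy as a stand-in parser (the paper used the Stanford parser, and dependency label conventions differ slightly):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def typed_dependency(text, spot, cue_lemma):
    """Return the dependency label linking the spot to a sentiment/domain
    cue (e.g. nsubj(loved, Smile)), or None if no relation is found --
    the None case backs the boolean s.B-TD* features."""
    doc = nlp(text)
    for tok in doc:
        if tok.text.lower() != spot.lower():
            continue
        if tok.head.lemma_ == cue_lemma:      # cue word governs the spot
            return tok.dep_
        for child in tok.children:            # spot governs the cue word
            if child.lemma_ == cue_lemma:
                return child.dep_
    return None

print(typed_dependency("Got your new album Smile. Simply loved it!", "Smile", "love"))
print(typed_dependency("Keep your smile on. You'll do great!", "smile", "great"))  # None expected
```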
Binary Classification: valid music mention or not?
    "Keep ur SMILE on!"
    "Got your new album Smile. Loved it!"

 SVM Binary Classifiers
 Training Set:
   Positive examples (+1): 550 valid spots
   Negative examples (-1): 550 invalid spots

 Test Set:
   120 valid spots, 458 invalid spots
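A minimal sketch of such a classifier with scikit-learn, using a toy two-feature subset of Table 6 in place of the full feature vectors and the 550/550 training spots:

```python
from sklearn.svm import SVC

def featurize(spot):
    """Tiny subset of Table 6: s.allCaps and s.firstCaps."""
    return [int(spot.isupper()), int(spot[:1].isupper())]

X = [featurize("Smile"),   # valid mention: "Got your new album Smile. Loved it!"
     featurize("SMILE")]   # invalid mention: "Keep ur SMILE on!"
y = [+1, -1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([featurize("Smile")]))  # expect [1] on this toy data
```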
Efficacy of Features

[Chart: feature settings arranged from precision-intensive to recall-intensive, labeled valid%-invalid% (percent of valid spots identified - percent of invalid spots eliminated): 42-91, 78-50, 90-35, ...]

  90-35: identified 90% of valid spots, eliminated 35% of invalid spots
  Feature combinations were the most stable and best performing
  Gazetteer-matched domain words and sentiment expressions proved to be useful

PR tradeoffs: choosing feature combinations depending on the end application
How did we do overall?

[Chart: precision and recall over classifier accuracy splits (valid%-invalid%), from the naive spotter through settings such as 90-75, 78-50, ... down to 42-91, with separate precision curves for Lily Allen, Rihanna, and Madonna plus a recall curve for all three.]

  Step 1: Spot with naive spotter, restricted knowledge base
    Madonna's track spots: 23% precision
  Step 2: Disambiguate using NL features (SVM classifier)
    42-91 is the "All features" setting; Madonna's track spots: ~60% precision
Lessons Learned..
  Using domain knowledge, especially for cultural entities, is non-trivial
  Computationally intensive NLP is prohibitive at scale
  Two-stage approach: NL learners over dictionary-based naive spotters
    allows more time-intensive NL analytics to run on less than the full set of input data
    gives control over precision and recall of the final result
Sentiment Expressions
    NER allows us to track and
    trend online mentions
    Popularity assessment: sentiment
    associated with the target entity
    Exploratory Project: what kinds of
    sentiments, distribution of positive and
    negative expressions..
Observations, Challenges
Negations, sarcasm, and refutations are rare
Short sentences: OK to overlook target attachment
Target demographic: Teenagers!
  Slang expressions: Wicked, tight, “the shit”,
  tropical, bad..
What are common positive and negative
expressions this demographic uses?
      Urbandictionary.com to the rescue!
Urbandictionary.com
  Slang dictionary!
  Related tags: bad appears with terrible and good!
  Glossary
Mining a Slang Sentiment Dictionary
  Map frequently used sentiment expressions to their orientations


   Seed positive and negative oriented dictionary
   entries
       good, terrible..
   Step 1: For each seed, query UD, get related tags
       Good -> awesome, sweet, fun, bad, rad..
       Terrible -> bad, horrible, awful, shit..
Mining a Slang Sentiment Dictionary

  Good -> awesome, sweet, fun, bad, rad
  Step 2: Calculate the semantic orientation of each related tag [Turney]

    SO(rad) = PMI_UD(rad, "good") - PMI_UD(rad, "terrible")

  PMI over the Web vs. UD:
    SO_web(rock) = -0.112    SO_UD(rock) = 0.513
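A sketch of this Turney-style computation over hypothetical co-occurrence counts (stand-ins for the Urban Dictionary query results):

```python
import math

def pmi(joint, p_x, p_y):
    """Pointwise mutual information from probabilities."""
    return math.log2(joint / (p_x * p_y))

def semantic_orientation(term, counts, total):
    """SO(term) = PMI(term, 'good') - PMI(term, 'terrible').
    `counts` maps terms and (term, seed) pairs to frequencies."""
    def prob(key):
        return counts[key] / total
    return (pmi(prob((term, "good")), prob(term), prob("good"))
            - pmi(prob((term, "terrible")), prob(term), prob("terrible")))

# Hypothetical co-occurrence table:
counts = {"rad": 50, "good": 400, "terrible": 120,
          ("rad", "good"): 30, ("rad", "terrible"): 2}
print(semantic_orientation("rad", counts, total=10_000))  # > 0 => positive slang
```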
Mining a Slang Sentiment Dictionary

  Step 3: Record orientation; add to dictionary
    {good, rad}
    {terrible}

  Continue for other related tags and all new
  entries ..
  Mined dictionary: 300+ entries (manually
  verified for noise)
Sentiment Expressions in Text
  "Your new album is wicked"
  Shallow NL parse: Your/PRP$ new/JJ album/NN is/VBZ wicked/JJ
    verbs, adjectives [Hatzivassiloglou 97]

  Look up the word in the mined dictionary, record its orientation

Why don't we just spot using the dictionary?!
  fuuunn! (coverage issues)
  "super man is in town"; "i heard amy at tropical cafe yday.."
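The shallow-parse-then-lookup step sketched in Python, with NLTK's POS tagger as a stand-in shallow parser (an assumption; the original toolchain is not specified here):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

MINED_DICT = {"wicked": "+", "tight": "+", "terrible": "-"}  # toy mined dictionary

def sentiment_spots(comment):
    """Keep adjectives/verbs whose surface form is in the mined dictionary."""
    tagged = nltk.pos_tag(nltk.word_tokenize(comment))
    return [(word, MINED_DICT[word.lower()])
            for word, tag in tagged
            if tag.startswith(("JJ", "VB")) and word.lower() in MINED_DICT]

print(sentiment_spots("Your new album is wicked"))  # [('wicked', '+')]
```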
Dictionary Coverage Issues

  Resort to corpus (transliterations)
    Co-occurrence in corpus (tight : top-scoring dictionary entries)
    "tight-awesome": 456, "tight-sweet": 136, "tight-hot": 429
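A sketch of this fallback over a toy corpus (the real counts above came from the MySpace comment corpus):

```python
from collections import Counter

# Score an out-of-dictionary term by its corpus co-occurrence with
# top-scoring dictionary entries (hypothetical three-comment corpus).
corpus = ["this show was tight and awesome",
          "tight set, sweet vocals",
          "that was tight, so hot"]
TOP_ENTRIES = {"awesome", "sweet", "hot"}  # known positive entries

cooc = Counter()
for comment in corpus:
    words = set(comment.replace(",", "").split())
    if "tight" in words:
        for w in words & TOP_ENTRIES:
            cooc[("tight", w)] += 1
print(cooc)  # tight co-occurs with positive entries -> likely positive
```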
Miscellaneous
  Presence of an entity improves confidence in the identified sentiment
  Short sentences, high coherence
    Associate sentiment with all entities

  If no entity is spotted, associate the sentiment with the artist (whose page we are on)
    "you are awesome!" on Madonna's page
Evaluations: Mined Dictionary + Identifying Orientation

  Annotator Type                     Precision   Recall
  Positive Sentiment                   0.81       0.9
  Negative Sentiment                   0.5        1.0
  PS excluding transliterations        0.84       0.67

Table 3. Annotator accuracy, illustrating the importance of using transliterations in such corpora.

  Negative sentiments: slang orientations out of context!
    "you were terrible last night at SNL but I still <3 you!"
  Precision excluding corpus transliterations: incorrect NL parses, selectivity

These results indicate that the syntax and semantics of sentiment expression in informal text are difficult to determine.
Crucial Finding: forget about the sentiment annotator! Just use entity mention volumes!

[Chart: sentiment expression frequencies in 60k comments over 26 weeks; negative expressions make up < 4% of the total.]
Spam, Off-topic, Promotions
Special type of spam: unrelated to artists’ work
  Paul McCartney’s divorce; Rihanna’s Abuse; Madge
  and Jesus

Self-Promotions
  “check out my new cool sounding tracks..”
  music domain, similar keywords, harder to tell apart

Standard Spam
  “Buy cheap cellphones here..”
Observations

  60k comments, 26 weeks
  Common 4-grams pulled out several 'spammy' phrases
  Phrases -> spam, non-spam buckets

  SPAM: 80% have 0 sentiments
    CHECK US OUT!!! ADD US!!!
    PLZ ADD ME!
    IF YOU LIKE THESE GUYS ADD US!!!

  NON-SPAM: 50% have at least 3 sentiments
    Your music is really bangin!
    You're a genius! Keep droppin bombs!
    u doin it up 4 real. i really love the album. keep doin wat u do best. u r so bad!
    hey just hittin you up showin love to one of chi-town's own. MADD LOVE.

  Fig. 8 Examples of sentiment in spam and non-spam comments
  The spam annotator should be aware of other annotator results!
Spam Elimination
Aggregate function
  Phrases indicative of spam (4-way
  annotator agreements)
  Rules over previous annotator results
     if a spam phrase, artist/track name and a
    positive sentiment were spotted, the
    comment is probably not spam.
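A sketch of this aggregate rule (hypothetical phrase list and inputs; the real rules run over UIMA annotator results):

```python
SPAM_PHRASES = {"check us out", "add me", "add us"}  # from 4-gram analysis

def is_spam(comment, entities, sentiments):
    """Rule: a spam phrase alone suggests spam, but a spotted artist/track
    name plus a positive sentiment overrides it (probably a real fan)."""
    text = comment.lower()
    has_spam_phrase = any(p in text for p in SPAM_PHRASES)
    if has_spam_phrase and entities and any(s == "+" for s in sentiments):
        return False   # spam phrase, but on-topic and positive: keep it
    return has_spam_phrase

print(is_spam("CHECK US OUT!!! ADD US!!!", entities=[], sentiments=[]))            # True
print(is_spam("add me! like umbrella, ull love this song", ["Umbrella"], ["+"]))   # False
```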
Performance

  Annotator Type   Precision   Recall
  Spam               0.76       0.8
  Non-Spam           0.83       0.88

  Table 4: Spam annotator performance

  Directly proportional to previous annotator results: incorrect entity and sentiment spots propagate
Some self-promotions are just clever!
  "like umbrella, ull love this song. . ."
  When a comment did not have a spam pattern and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment as related to music and classified it as non-spam.
Do not Confuse Activity with Popularity!

  Artist popularity ranking changed dramatically after excluding spam!
  This noise could significantly impact the ordering of artists if it is not accounted for.

  % Spam, 60k comments over 26 weeks
    Gorillaz      54%    Placebo          39%
    Coldplay      42%    Amy Winehouse    38%
    Lily Allen    40%    Lady Sovereign   37%
    Keane         40%    Joss Stone       36%
Workflow

Unstructured chatter to
 explorable hypercube
Hypercube, List Projection

Exploring dimensions of popularity
Data Hypercube (DB2) from
  structured data in posts (user demographics, location,
  age, gender, timestamp)
  unstructured data (entity, sentiment, spam flag)

Intersection of dimensions => non-spam
comment counts
artist), and from the measurements generated by the above
        annotator methods. ThisRecall
              Annotator Type Precision     annotates eachwhich is then used to sorta a
                                                                 comment with the
                   Spam          0.76     0.8                  gregate and analyze the hyperc
        series of tags from unstructured and structured data. The on
                Non-Spam         0.83     0.88                 dimensional data operations
        resulting tuple is then placed into a startially custom in which for
                                                                 schema popular lists
     Hypercube to One-Dimensional List
           Table 4: Spam annotator performance
                                         sort the artists, tracks, etc. to to musical bill
                                                               addition
                   which is then used to relevance with regardsWe can ag-
        the primary measure is a
                                                                           the traditional
                                                               can slice and project (margina
        topics. didgregate equivalent the defining a function. asof multi- hot in
                 This is andspam pattern, hypercube usinglistsvariety “What is
                                analyze
                                         of and the first        a such
    the comment       dimensional data operations on it to derive males?” and “Who are th
                     not have a                                       old what are essen-
    annotator spotted incorrect tracks, the spam annotator in-
                      tially custom popular lists for particularSan Francisco?” They transla
                                                                       musical topics. In
    terpreted the commentGender, Location, T ime, Artist, ...) → M
            M : (Age, to be relatedtraditional billboard “Top Artist” lists, we(1)
 nce                                    the to music and classified
                      addition toincluded more clever promo-
    it as non-spam. Other cases
                                                                      operations:

    tional comments can included the actual artist tracks, gen-
                      that slice have stored the aggregation ofthe cube for
                                   and project (marginalize dimensions) X
        In ourRoll-ups as “What is hot(e.g. New York 1 (X) : forM (A year G = M
                   case,very limited spam content. in ‘like
                            we                                                 occurrences
                      lists such
    uine sentiments and                                               L City        19 = 19,
nd umbrella ull love this song. . . ’). As is evident, theintersecting dimension from
    theof non-spammales?” and at theare the most popular T,...
          first        old comment “Who amount of                              artists values
nnotator in- available at hand in addition to translatethis way makes it easy to
        of the WhatFrancisco?” They grammatically 19 year X
                      San is hot in New York for
    information hypercube. Storing the data to following mathematical
    poor sentences necessitates more sophisticated techniques
nd classified
                                                                                   old
                      operations:
        examine rankings over various time intervals,(X) :              L2 weight variousL =
                                                                                          M (A, G,
ever promo- males?
    for spam identification.
                                                                                T,A,G,...
       Given the amount of effort involved etc. Once (and if) a total ordering
                                                the obvious question
 tracks,dimensions differently, goal of using comment
         gen-                    X
    arises – why filter spam? For our end                                where,
                         (X) : on the list, = 19, Gspam is key. “N ewY orkCity  , might
                      Lis fixed this intermediateLdata staging stepT, X, ...)
. (e.g. approach positions
         ‘like
    counts to lead to 1                 M (A filtering = M, =
 e amount corroborated by the volume of spam and non-spam
    This is of
        be eliminated.           T,...
    content observed over a period of 26 weeks for 8 artists; see
ammatically                                                                                (2)
                                                                      X = Name of the artist
                                        X
    Table 5. The chart indicates that some artists had at least       T = Timestamp
        4.4 Projecting to a list
  techniques
                        L2 (X) :               M (A, G, L = “SanF rancisco ,of the commenter
    half as many spam as non-spam comments on their page.             A = Age X, ...) (3)
    This level of noise would significantly impact the ordering
                                 T,A,G,...                           G = Gender
 us questionif it were not accounted seeking to generate
            Ultimately we are for.
    of artists                                                   a   “one dimensional”
                                                                     L = Location
ng comment             where,
artist), and from the measurements generated by the above
        annotator methods. ThisRecall
              Annotator Type Precision     annotates eachwhich is then used to sorta a
Hypercube to One-Dimensional List

Annotator Type    Precision    Recall
Spam              0.76         0.8
Non-Spam          0.83         0.88

Table 4: Spam annotator performance

When a comment did not have a spam pattern and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment to be related to music and classified it as non-spam. Other cases included more clever promotional comments that contained actual artist tracks, genuine sentiments, and very limited spam content (e.g. 'like umbrella, ull love this song. . . '). As is evident, the limited information available at hand, combined with grammatically poor sentences, necessitates more sophisticated techniques for spam identification.

Given the amount of effort involved, the obvious question arises: why filter spam at all? For our end goal of turning comment counts into positions on the list, filtering spam is key. This is corroborated by the volume of spam and non-spam content observed over a period of 26 weeks for 8 artists; see Table 5. The chart indicates that some artists had at least half as many spam as non-spam comments on their page. This level of noise would significantly impact the ordering of artists if it were not accounted for.

4.4 Projecting to a list

Ultimately we are seeking to generate a "one dimensional" list, which is then used to sort the artists, tracks, etc. by relevance with regard to particular musical topics. Each comment is annotated with a series of tags from unstructured and structured data, and the resulting tuple is placed into a star schema in which the primary measure is the count of occurrences at the intersecting dimension values. We can aggregate and analyze the hypercube using a variety of multi-dimensional data operations on it to derive what are essentially custom popular lists. In addition to the traditional billboard "Top Artist" lists, we can slice and project (marginalize dimensions) to produce roll-up lists such as "What is hot in New York for 19 year old males?" and "Who are the most popular artists in San Francisco?" These translate to the following mathematical operations:

    M : (Age, Gender, Location, Time, Artist, ...) -> N                              (1)

    L_1(X) = \sum_{T, ...} M(A = 19, G = male, L = "New York City", T, X, ...)       (2)

    L_2(X) = \sum_{T, A, G, ...} M(A, G, L = "San Francisco", X, ...)                (3)

where
    X = Name of the artist
    T = Timestamp
    A = Age of the commenter
    G = Gender
    L = Location

Storing the data this way makes it easy to examine rankings over various time intervals, weight various dimensions differently, etc. Once (and if) a total ordering is fixed, this intermediate data staging step might be eliminated.
The Word on the Street

Billboard's Top 50 Singles chart during the week of Sept 22-28, 2007, versus our generated list from the (unique) Top 45 MySpace artist pages: crawl, annotate, build the hypercube, project lists.

As described in Section 8, the structured metadata (artist name, timestamp, etc.) and annotation results (spam/non-spam, sentiment, etc.) were loaded into the hypercube; the data represented by each cell of the cube is the count of comments for a given artist, and the dimensionality of the cube depends on which variables we track.

Billboard.com      MySpace Analysis
Soulja Boy         T.I.
Kanye West         Soulja Boy
Timbaland          Fall Out Boy
Fergie             Rihanna
J. Holiday         Keyshia Cole
50 Cent            Avril Lavigne
Keyshia Cole       Timbaland
Nickelback         Pink
Pink               50 Cent
Colbie Caillat     Alicia Keys

Table 8: Billboard's Top Artists vs. our generated list (showing Top 10)
The Word on the Street

* Top artists appear in both lists; there are several overlaps (see the comparison sketch below)
* Artists with a long history/body of work vs. 'up and coming' artists
* Predictive power of MySpace: Billboard the next week looked a lot like MySpace this week
* Teenagers are big music influencers [MediaMark2004]

Billboard.com      MySpace Analysis
Soulja Boy         T.I.
Kanye West         Soulja Boy
Timbaland          Fall Out Boy
Fergie             Rihanna
J. Holiday         Keyshia Cole
50 Cent            Avril Lavigne
Keyshia Cole       Timbaland
Nickelback         Pink
Pink               50 Cent
Colbie Caillat     Alicia Keys

Table 8: Billboard's Top Artists vs. our generated list (showing Top 10)
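To put "several overlaps" on a quantitative footing, here is a small sketch comparing the two Table 8 lists using set overlap and Spearman's footrule over the shared artists. These are standard rank-comparison measures chosen for illustration, not necessarily the ones the project itself used.

```python
billboard = ["Soulja Boy", "Kanye West", "Timbaland", "Fergie", "J. Holiday",
             "50 Cent", "Keyshia Cole", "Nickelback", "Pink", "Colbie Caillat"]
myspace   = ["T.I.", "Soulja Boy", "Fall Out Boy", "Rihanna", "Keyshia Cole",
             "Avril Lavigne", "Timbaland", "Pink", "50 Cent", "Alicia Keys"]

# Overlap: how many artists appear in both top-10 lists.
shared = set(billboard) & set(myspace)
print(len(shared), "artists in both lists:", sorted(shared))

# Spearman's footrule on the shared artists: total rank displacement.
# Smaller values mean the two charts order the common artists similarly.
footrule = sum(abs(billboard.index(a) - myspace.index(a)) for a in shared)
print("rank displacement over shared artists:", footrule)
```

On these two lists, 5 of 10 artists are shared, which is the kind of partial agreement one would expect if the MySpace signal leads the sales-driven chart by roughly a week.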
Casual Preference Poll Results

"Which list more accurately reflects the artists that were more popular last week?"

* 75 participants
* Overall 2:1 preference for the MySpace list
* Younger age groups (8-15 yrs): 6:1 preference for the MySpace list

38% of total comments were spam
61% of total comments had positive sentiments
4% of total comments had negative sentiments
35% of total comments had no identifiable sentiments

Table 7: Annotation Statistics

Challenging traditional polling methods!
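For completeness, a hedged sketch of how Table 7's percentages can be tallied from annotated comments; the record layout below is hypothetical (the real annotations came out of the UIMA pipeline). Note that the categories overlap, e.g. a spam comment can also carry no sentiment, which is why the figures need not sum to 100%.

```python
# Hypothetical annotated comments; field names are illustrative only.
comments = [
    {"spam": True,  "sentiments": []},
    {"spam": False, "sentiments": ["positive"]},
    {"spam": False, "sentiments": ["positive", "negative"]},
    {"spam": False, "sentiments": []},
]

n = len(comments)
stats = {
    "were spam":                      sum(c["spam"] for c in comments),
    "had positive sentiments":        sum("positive" in c["sentiments"] for c in comments),
    "had negative sentiments":        sum("negative" in c["sentiments"] for c in comments),
    "had no identifiable sentiments": sum(not c["sentiments"] for c in comments),
}
for label, count in stats.items():
    print(f"{100 * count / n:.0f}% of comments {label}")
```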
Lessons Learned, Miles to go

* Informal (teen-authored) text is not your average blog content.
* Quality checking is not about day-to-day precision at the spot/comment level, but operates at the system level: are we missing a source or artist, is the crawl behaving, and how do we adjudicate across multiple data sources?
* Leveraging social intelligence (SI) for business intelligence (BI) is difficult: validation is key, along with continuous fine-tuning with domain experts.
