2. Generally speaking, sentiment analysis aims to determine the attitude
of a speaker or a writer with respect to some topic or the overall
contextual polarity of a document
~ Wikipedia[1]
Levels[2] at which sentiments can be expressed:
Phrase
Sentence
Paragraph
Document
About a Subject
3. User’s Opinions
Bob: It's a great movie (Positive sentiment)
Alice: Nah!! I didn't like it at all (Negative sentiment)
Bob: I am not so sure about the movie. You may like it,
or maybe not! (Neutral!! Confused!!)
6. Understanding public opinion on products, movies etc.
Ex: There is 67% negative opinion on the color of
Amazon’s new version of Kindle.
Using this knowledge to
Make predictions in market trends, results of election
polls etc.
Make decisions !
Ex: Changing the color in subsequent versions
Personalization!
Ex: Recommending products depending on what your
friends feel.
7. Binary
Positive
Negative
Ordinal values
Ex: rating from 1 to 5
Complex polarity
Detect the source, target and attitude
Ex: Obama offers comfort after Colorado shooting.
Source: Obama, Target: People, Attitude: comfort
8. NLP
Use of semantics to understand the language
Uses lexicons, dictionaries, ontologies
Ex: I feel great today. (Understands that user’s feeling is
great)
Machine Learning
Don’t have to understand the meaning.
Uses classifiers such as Naïve Bayes, SVM, Max Ent
etc.
Ex: I feel great today (Doesn’t have to understand what the
user is feeling; the fact that the word “great” appears in the
positive set is enough to classify the sentence as
positive)
9. Apple iPod Review
Alice: The Apple iPod is a great music player. It’s better
than any other product I have bought
Great – Positive
Better – Positive
Total Positives = 2
Total Negatives =0
Net Score = 2-0=2
Hence the review is Positive
10. Apple iPod Review
Alice: The Apple iPod is not bad at all. You can buy it.
Not – Negative
Bad – Negative
Total Positives = 0
Total Negatives =2
Net Score = 0-2=-2
Hence the review is Negative
Note: This can be handled by a preprocessing stage, such as
converting “not bad” to “good”, but such negation-handling
preprocessing for NLP is complex.
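The word-counting approach on these two slides can be sketched as follows. The tiny word lists are made-up illustrations, not a real sentiment lexicon, and the sketch reproduces the “not bad” failure the slide describes:

```python
# Naive lexicon-based scoring: count positive and negative words and take
# the difference. Note that "not bad" is (wrongly) scored as two negatives.
POSITIVE = {"great", "better", "good", "awesome"}  # toy lexicon (illustrative)
NEGATIVE = {"not", "bad", "worst", "stupid"}

def net_score(review: str) -> int:
    words = review.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg  # > 0 => positive, < 0 => negative

print(net_score("apple ipod is a great music player better than any other"))  # 2
print(net_score("apple ipod is not bad at all you can buy it"))               # -2
```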
11. Requires a good classifier
Requires a training set for each class.
In our case:
2 classes, Positive and Negative
Require pre-classified training set for both these
classes.
12. Training data for Movie Domain
Positive class
Sleepy Hollow is an awesome movie. Every one should
watch it.
Christopher Nolan is such a great director that he can
convert any script into a blockbuster.
Great actors, great direction and a great movie.
Negative class
Nothing can make this movie better. It can win the
stupidest movie of the year award, if there is such a thing.
13. Advantages
Don’t have to create a sentiment lexicon (great is
80% positive, bad is 75% negative etc…)
Categorization of proper nouns as well
(Ex: Cameron Diaz)
Generic and can be applied for various domains
Language independent models
(Ex: J'aime le film "Amélie")
Disadvantage:
Should have large sets of training data
14. [Pipeline diagram]
Data collection (Yelp, CityGrid) → Data pre-processing →
Preparing training set → Train classifier →
Preparing test set → Test classifier
15. City Grid Media
CityGrid Media is an online media company that
connects web and mobile publishers with local
businesses by linking them through CityGrid
Provides:
RESTful API
Ratings (0–10)
Reviews
Domain:
Restaurant
16. Tokenization
Case Conversion
Word conversion to full forms (“Don’t” to “do not”,
“I’ll” to “I will”)
Removal of punctuations
Stop word filter using Lucene
Length filter – to remove words with fewer than 3
characters
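The pre-processing steps above can be sketched in plain Python. The original uses Lucene’s stop-word filter; the stop-word set and contraction table below are small illustrative stand-ins:

```python
import re

# Illustrative stand-ins for Lucene's stopword list and a contraction table.
STOPWORDS = {"the", "and", "for", "you"}
CONTRACTIONS = {"don't": "do not", "i'll": "i will"}

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()                       # tokenize + case conversion
    tokens = [CONTRACTIONS.get(t, t) for t in tokens]   # expand contractions
    tokens = " ".join(tokens).split()
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]  # remove punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop word filter
    return [t for t in tokens if len(t) >= 3]           # length filter

print(preprocess("Don't go there, I'll wait for you!"))
# ['not', 'there', 'will', 'wait']
```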
17. Reviews with ratings > 8 - Positive Class
Reviews with ratings < 3 - Negative Class
Training
Positive reviews – 20,000
Negative reviews – 20,000
Using the same number of reviews for each class to avoid bias
Test Set
Positive reviews – 1,000
Negative reviews – 1,000
18. Tokenization
Splitting the sentences into words.
Vectorization
A vector for each review in the vector space model
Training and Test Sets
Store the files corresponding to Training and Test
sets on HDFS
Train the classifier
./bin/mahout trainclassifier -i /restaurants/bayes-train-input -o /restaurants/bayes-model -type bayes -ng 1 -source hdfs
19. Unigram
Considers only one token
Ex: It is a good movie.
{It, is, a, good, movie}
Bigram
Considers two consecutive tokens
Ex: It is not a bad movie
{It is, is not, not a, a bad, bad movie}
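Unigram and bigram extraction is the same sliding-window operation with different window sizes; a minimal sketch:

```python
# Generate n-grams over whitespace tokens: a window of n consecutive tokens
# slides over the sentence, one position at a time.
def ngrams(sentence: str, n: int) -> list[str]:
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("It is a good movie", 1))
# ['It', 'is', 'a', 'good', 'movie']
print(ngrams("It is not a bad movie", 2))
# ['It is', 'is not', 'not a', 'a bad', 'bad movie']
```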
20. Reviews for sea food restaurants
This restaurant makes good crab dishes. Crab is a kind of
sea food isn't it?
This is a good sea food restaurant.
Nay!! don't go there if you want sea food. Try going to
Marina or some other restaurant.
Reviews for breakfast
The English breakfast is very good in this restaurant.
Crepes are yummy.
Eww! I hate sea food. I can survive the entire day on my
breakfast
21. Considering the case of Unigram
Word frequency in each class
Word        Sea food   Breakfast
seafood         3           1
crabs           1           0
breakfast       0           1
crepes          0           1
Compute prior probabilities according to this table
22. Which place should I go to order crepes? The seafood or
the breakfast place?
Naïve Bayes formula
p(c|w) = p(w|c) p(c) / p(w)
Solution
Crepes (the important word extracted from the query, all other
words being unimportant) – classify on it
Probability
For sea food = (0 × (4/7)) / (1/7) = 0
For breakfast = ((1/3) × (3/7)) / (1/7) = 1
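The computation on this slide can be checked in code, using exact fractions so the arithmetic matches the slide’s values. The counts come from the table on the previous slide:

```python
from fractions import Fraction as F

# Word counts per class, from the frequency table on the previous slide.
counts = {
    "sea food":  {"seafood": 3, "crabs": 1, "breakfast": 0, "crepes": 0},
    "breakfast": {"seafood": 1, "crabs": 0, "breakfast": 1, "crepes": 1},
}
total = sum(sum(c.values()) for c in counts.values())  # 7 words overall

def posterior(word: str, cls: str) -> F:
    class_total = sum(counts[cls].values())
    p_w_given_c = F(counts[cls][word], class_total)       # p(w|c)
    p_c = F(class_total, total)                           # p(c)
    p_w = F(sum(counts[c][word] for c in counts), total)  # p(w)
    return p_w_given_c * p_c / p_w                        # Bayes: p(c|w)

print(posterior("crepes", "sea food"))   # 0
print(posterior("crepes", "breakfast"))  # 1
```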
26. Precision= True positives / (True Positives + False
Positives)
Recall = True Positives / (True Positives + False
Negatives)
F-score = 2*P*R/(P+R)
The results show that the bi-gram model does better
than the unigram model
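These three formulas translate directly into code; the confusion counts below are hypothetical, just to exercise them:

```python
# Standard evaluation metrics from confusion-matrix counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r = precision(90, 10), recall(90, 30)
print(p, r, f_score(p, r))  # 0.9, 0.75, and their harmonic mean (~0.818)
```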
27. Dark Knight Rises is a good movie
Dark Knight Rises is an awesome movie
Both are positive,
but the second expresses stronger positive sentiment
NLP is better than machine learning here:
machine learning cannot understand the semantics
Need for a lexicon
Also to differentiate between
I like the food
The food is awesome and it’s worth every penny of your money. The
staff is very friendly and we received a very warm welcome.
(Twitter restricts tweets to 140 characters, while many review sites let users
write as much as they like. This intensity calculation is useful in such cases.)
28. Intensity Models
Review Level Intensity
The Intensity calculated according to the number/type of
senti-words in the review
Corpus Level Intensity for the review.
The Intensity of the review with respect to the entire
corpus of reviews. This depends on the corpus distribution
29. Uniform weightage Model
Positive emotion word is given a positive score of 1 and
negative emotion word is given a negative score of 1
Net Score = ∑Positive Score – ∑Negative Score.
Using Lexicon
Weighted Net Score =∑ Weighted Positive Score – ∑
Weighted Negative Score.
The intensity values are obtained from Sentiwordnet [5].
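The two scoring schemes can be sketched side by side. The weights below are SentiWordNet-style intensities, but the actual values are illustrative, not taken from SentiWordNet:

```python
# Signed intensity per word: positive weight => positive emotion word,
# negative weight => negative emotion word. Values are illustrative.
WEIGHTS = {"great": 0.8, "good": 0.6, "bad": -0.75, "awful": -0.9}

def uniform_net_score(words):
    # Uniform weightage: every senti-word counts as +1 or -1.
    return sum(1 for w in words if WEIGHTS.get(w, 0) > 0) - \
           sum(1 for w in words if WEIGHTS.get(w, 0) < 0)

def weighted_net_score(words):
    # Weighted: sum of positive intensities minus sum of negative intensities.
    pos = sum(WEIGHTS[w] for w in words if WEIGHTS.get(w, 0) > 0)
    neg = sum(-WEIGHTS[w] for w in words if WEIGHTS.get(w, 0) < 0)
    return pos - neg

review = ["great", "food", "bad", "service"]
print(uniform_net_score(review))   # 0 (1 positive - 1 negative)
print(weighted_net_score(review))  # ~0.05 (0.8 - 0.75)
```

The weighted model separates reviews that the uniform model scores identically, which is the point of using a lexicon here.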
30. Applying Gaussian Distribution over entire corpus
of reviews.
Note: The raw frequencies do not follow a Gaussian distribution, but the log
frequencies do.
31. Positive Reviews
Average Positive Words/Review: 4.1
Average Negative Words/Review: 1.1
Negative Reviews
Average Positive Words/Review: 1.7
Average Negative Words/Review: 4.2
Note: We use the property of the Gaussian distribution that a 1-sigma
deviation from the mean covers 68% of the density, and a 2-sigma
deviation covers 95% of the density.
32. Corpus Level intensities
The more the number of positive senti-words in a review, the
more is its positive intensity. Similarly, the more the number of
negative senti-words in a review, the more is its negative
intensity
33. Total Intensity = (Review Level Intensity + Corpus Level
Intensity)/2
I like the food
Sentiments : (food)
Score = (100 + 1)/2 = 50.5
The food is awesome and it’s worth every penny of your
money. The staff is very friendly and we received a very
warm welcome.
Sentiments : (Awesome, worth, friendly, warm)
Score = (100 + 80)/2 = 90
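The combination is a plain average of the two intensities. A one-line sketch (the slide does not spell out which of the example values is review-level and which is corpus-level, so the function simply averages its two arguments):

```python
def total_intensity(a: float, b: float) -> float:
    # Average of the review-level and corpus-level intensities (0-100 scale).
    return (a + b) / 2

print(total_intensity(100, 1))   # 50.5
print(total_intensity(100, 80))  # 90.0
```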
34. Aspects [6] are the features which define a product/item, etc.
Samsung Galaxy Prevail Android Smartphone (Boost Mobile)
--Amazon
Features of Smart Phone:
Design
Size
Speed
Sound
Music Player
Camera/cam
Battery
35. Aspects can be extracted with the help of a POS
Tagger
Stanford POS Tagger [7] :
This restaurant has good ambiance
Parse Tree
(ROOT (S (NP (DT This) (NN restaurant))
(VP (VBZ has)
(NP (JJ good) (NN ambiance)))))
NP – Noun Phrase, JJ – Adjective, NN – Noun
36. Extracting Adjective-Noun pairs from reviews (for the previous
product):
This would enable us to identify the aspects and their
corresponding sentiments
Reviews
Attractive design & compact size
Good speed, not the slowest nor the fastest
Clear sound for phone calls & decent music player
Fixed focus low res cam (2MP) no LED
Battery, this is an issue with all smart phones
Aspects – {Design (attractive), Size(compact), Speed(Good),
Sound(clear), Music Player(decent), Cam(low resolution),
Battery(negative) }
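The pair extraction itself is simple once tokens are POS-tagged. The original uses the Stanford tagger; to stay self-contained, this sketch takes an already-tagged token list (Penn Treebank tags, hand-written here for illustration):

```python
# Extract (noun, adjective) aspect pairs: an adjective (JJ) immediately
# followed by a noun (NN/NNS/...) yields an aspect with its sentiment word.
def aspect_pairs(tagged):
    pairs = []
    for i in range(len(tagged) - 1):
        (word, tag), (nxt, nxt_tag) = tagged[i], tagged[i + 1]
        if tag == "JJ" and nxt_tag.startswith("NN"):
            pairs.append((nxt, word))
    return pairs

tagged = [("Attractive", "JJ"), ("design", "NN"), ("&", "CC"),
          ("compact", "JJ"), ("size", "NN")]
print(aspect_pairs(tagged))  # [('design', 'Attractive'), ('size', 'compact')]
```

A real pipeline would also catch predicative patterns like “the sound is clear”, which this adjacent-pair rule misses.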
37. Used the Stanford POS tagger to extract Adjective-Noun
pairs from the corpus of all the restaurant reviews
Restaurant Domain
I – 2548
We- 1342
They- 955
It- 911
Food- 347
Services- 291
Place- 248
Foods- 229
Service- 210
experiences- 131
Waitress- 122 … pizza-51
Problem: Apart from restaurant aspects/features such as Food,
Place, and Service, there is a high number of pronouns. These pronouns
can refer to anything
The high frequency counts of pronouns show that we
need to dereference them and extract the corresponding
nouns
This restaurant has good ambiance, but it is not as good as
described by my friends
Replace all the “it”s in this sentence with “ambiance”,
and “This” with “restaurant”.
Note: The Stanford NLP toolkit has a coreference-resolution API for this
39. Is-A Relationship
Another problem faced:
sentiments attached to sub-categories rather than the main
categories.
Ex: The pizza in this restaurant is good.
Good is attached to Pizza
Pizza is a type of Food
Hence all the sentiments about Pizza should be pointed to
food
These relationships are provided by a graph
database (entity relationships) called Freebase
40. Algorithm
Use POS tagger to extract nouns attached to
adjectives
Dereference the personal pronouns
Remove the existing pronouns
Use freebase dump to find IS-A relation
Merge frequencies of plural and singular words and
use singulars
Find the adjectives associated with the nouns. This
would give an indication of the sentiment
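The plural/singular merge step of the algorithm can be sketched as below. The strip-trailing-“s” rule is a naive stand-in; a real system would use a lemmatizer:

```python
from collections import Counter

# Merge plural counts into their singular forms: if stripping a trailing "s"
# yields a word that also occurs in the corpus, fold the counts together.
def merge_plurals(freqs: Counter) -> Counter:
    merged = Counter()
    for word, count in freqs.items():
        singular = word[:-1] if word.endswith("s") and word[:-1] in freqs else word
        merged[singular] += count
    return merged

freqs = Counter({"food": 347, "foods": 229, "service": 210, "services": 291})
print(merge_plurals(freqs))  # Counter({'food': 576, 'service': 501})
```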
41. Restaurant- 816
Food- 719
Service- 613
experience- 219
Waitress – 122 (We still have to establish a relationship between
waitress and service; this needs an ontology for each domain, or WordNet can be
used to find the distance between waitress and service)
Review – 91
Drink - 64
42. [1] http://en.wikipedia.org/wiki/Sentiment_analysis
[2] R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar, “Structured models
for fine-to-coarse sentiment analysis,” Proceedings of the Association for
Computational Linguistics (ACL), pp. 432–439, Prague, Czech Republic, June 2007.
[3] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in
phrase-level sentiment analysis,” in Proceedings of the Human Language Technology
Conference / Conference on Empirical Methods in Natural Language Processing
(HLT/EMNLP 2005), pp. 347–354, Vancouver, Canada, 2005.
[4] https://cwiki.apache.org/MAHOUT/naivebayes.html
[5] http://sentiwordnet.isti.cnr.it/search.php?q=greatest
[6] http://sentic.net/sentire/2011/ott.pdf
[7] http://nlp.stanford.edu:8080/parser/index.jsp
[8] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification
using machine learning techniques,” in Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 79–86, 2002.