2. Generally speaking, sentiment analysis aims to determine the attitude
of a speaker or a writer with respect to some topic or the overall
contextual polarity of a document
~ Wikipedia[1]
Levels[2] at which sentiments can be expressed:
Phrase
Sentence
Paragraph
Document
About a Subject
3. User’s Opinions
Bob: It's a great movie (Positive sentiment)
Alice: Nah!! I didn't like it at all (Negative sentiment)
Bob: I am not so sure about the movie. You may like it,
or maybe not! (Neutral!! Confused!!)
6. Understanding public opinion on products, movies etc.
Ex: There is 67% negative opinion on the color of
Amazon’s new version of Kindle.
Using this knowledge to
Make predictions in market trends, results of election
polls etc.
Make decisions !
Ex: Changing the color in subsequent versions
Personalization!
Ex: Recommending products depending on what your
friends feel.
7. Binary
Positive
Negative
Ordinal values
Ex: rating from 1 to 5
Complex polarity
Detect the source, target and attitude
Ex: Obama offers comfort after Colorado shooting.
Source: Obama, Target: People, Attitude: comfort
8. NLP
Use of semantics to understand the language
Uses lexicons, dictionaries, ontologies
Ex: I feel great today. (Understands that user’s feeling is
great)
Machine Learning
Don’t have to understand the meaning.
Uses classifiers such as Naïve Bayes, SVM, Max Ent
etc.
Ex: I feel great today (Doesn’t have to understand what the
user is feeling; the fact that the word “great” appears in the
positive set is enough to classify the sentence as
positive)
9. Apple iPod Review
Alice: The Apple iPod is a great music player. It’s better
than any other product I have bought
Great – Positive
Better – Positive
Total Positives = 2
Total Negatives =0
Net Score = 2-0=2
Hence the review is Positive
10. Apple iPod Review
Alice: The Apple iPod is not bad at all. You can buy it.
Not – Negative
Bad – Negative
Total Positives = 0
Total Negatives =2
Net Score = 0-2=-2
Hence the review is Negative
Note: This can be handled by a preprocessing stage, such as
converting “not bad” to “good”, but such negation-handling
preprocessing for NLP is complex.
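The word-counting approach on these two slides can be sketched as follows. The tiny word lists are made-up illustrations, not a real sentiment lexicon, and the sketch reproduces the “not bad” failure the slide describes:

```python
# Naive lexicon-based scoring: count positive and negative words and take
# the difference. Note that "not bad" is (wrongly) scored as two negatives.
POSITIVE = {"great", "better", "good", "awesome"}  # toy lexicon (illustrative)
NEGATIVE = {"not", "bad", "worst", "stupid"}

def net_score(review: str) -> int:
    words = review.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg  # > 0 => positive, < 0 => negative

print(net_score("apple ipod is a great music player better than any other"))  # 2
print(net_score("apple ipod is not bad at all you can buy it"))               # -2
```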
11. Requires a good classifier
Requires a training set for each class.
In our case:
2 classes, Positive and Negative
Require pre-classified training set for both these
classes.
12. Training data for Movie Domain
Positive class
Sleepy Hollow is an awesome movie. Every one should
watch it.
Christopher Nolan is such a great director that he can
convert any script into a blockbuster.
Great actors, great direction and a great movie.
Negative class
Nothing can make this movie better. It can win the
stupidest movie of the year award, if there is such a thing.
13. Advantages
Don’t have to create a sentiment lexicon (great is
80% positive, bad is 75% negative etc…)
Categorization of proper nouns as well
(Ex: Cameron Diaz)
Generic and can be applied for various domains
Language independent models
(Ex: J'aime le film "Amélie")
Disadvantage:
Should have large sets of training data
14. [Pipeline diagram]
Data collection (Yelp, CityGrid) → Data pre-processing →
Preparing training set → Train classifier →
Preparing test set → Test classifier
15. City Grid Media
CityGrid Media is an online media company that
connects web and mobile publishers with local
businesses by linking them through CityGrid
Provides:
RESTful API
Ratings (0–10)
Reviews
Domain:
Restaurant
16. Tokenization
Case Conversion
Word conversion to full forms (“Don’t” to “do not”,
“I’ll” to “I will”)
Removal of punctuations
Stop word filter using Lucene
Length filter – to remove words with fewer than 3
characters
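The pre-processing steps above can be sketched in plain Python. The original uses Lucene’s stop-word filter; the stop-word set and contraction table below are small illustrative stand-ins:

```python
import re

# Illustrative stand-ins for Lucene's stopword list and a contraction table.
STOPWORDS = {"the", "and", "for", "you"}
CONTRACTIONS = {"don't": "do not", "i'll": "i will"}

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()                       # tokenize + case conversion
    tokens = [CONTRACTIONS.get(t, t) for t in tokens]   # expand contractions
    tokens = " ".join(tokens).split()
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]  # remove punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop word filter
    return [t for t in tokens if len(t) >= 3]           # length filter

print(preprocess("Don't go there, I'll wait for you!"))
# ['not', 'there', 'will', 'wait']
```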
17. Reviews with ratings > 8 - Positive Class
Reviews with ratings < 3 - Negative Class
Training
Positive reviews – 20,000
Negative reviews – 20,000
Using the same number of reviews for each class to avoid bias
Test Set
Positive reviews – 1,000
Negative reviews – 1,000
18. Tokenization
Splitting the sentences into words.
Vectorization
A vector for each review in the vector space model
Training and Test Sets
Store the files corresponding to Training and Test
sets on HDFS
Train the classifier
./bin/mahout trainclassifier -i /restaurants/bayes-train-input -o /restaurants/bayes-model -type bayes -ng 1 -source hdfs
19. Unigram
Considers only one token
Ex: It is a good movie.
{It, is, a, good, movie}
Bigram
Considers two consecutive tokens
Ex: It is not a bad movie
{It is, is not, not a, a bad, bad movie}
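Unigram and bigram extraction is the same sliding-window operation with different window sizes; a minimal sketch:

```python
# Generate n-grams over whitespace tokens: a window of n consecutive tokens
# slides over the sentence, one position at a time.
def ngrams(sentence: str, n: int) -> list[str]:
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("It is a good movie", 1))
# ['It', 'is', 'a', 'good', 'movie']
print(ngrams("It is not a bad movie", 2))
# ['It is', 'is not', 'not a', 'a bad', 'bad movie']
```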
20. Reviews for sea food restaurants
This restaurant makes good crab dishes. Crab is a kind of
sea food isn't it?
This is a good sea food restaurant.
Nay!! don't go there if you want sea food. Try going to
Marina or some other restaurant.
Reviews for breakfast
The English breakfast is very good in this restaurant.
Crepes are yummy.
Eww! I hate sea food. I can survive the entire day on my
breakfast
21. Considering the case of Unigram
Word frequency in each class
Word        Sea food   Breakfast
seafood         3           1
crabs           1           0
breakfast       0           1
crepes          0           1
Compute prior probabilities according to this table
22. Which place should I go to order crepes? The seafood or
the breakfast place?
Naïve Bayes formula
p(c|w) = p(w|c) p(c) / p(w)
Solution
Crepes (the important word extracted from the query, all other
words being unimportant) – classify on it
Probability
For sea food = (0 × (4/7)) / (1/7) = 0
For breakfast = ((1/3) × (3/7)) / (1/7) = 1
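The computation on this slide can be checked in code, using exact fractions so the arithmetic matches the slide’s values. The counts come from the table on the previous slide:

```python
from fractions import Fraction as F

# Word counts per class, from the frequency table on the previous slide.
counts = {
    "sea food":  {"seafood": 3, "crabs": 1, "breakfast": 0, "crepes": 0},
    "breakfast": {"seafood": 1, "crabs": 0, "breakfast": 1, "crepes": 1},
}
total = sum(sum(c.values()) for c in counts.values())  # 7 words overall

def posterior(word: str, cls: str) -> F:
    class_total = sum(counts[cls].values())
    p_w_given_c = F(counts[cls][word], class_total)       # p(w|c)
    p_c = F(class_total, total)                           # p(c)
    p_w = F(sum(counts[c][word] for c in counts), total)  # p(w)
    return p_w_given_c * p_c / p_w                        # Bayes: p(c|w)

print(posterior("crepes", "sea food"))   # 0
print(posterior("crepes", "breakfast"))  # 1
```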
26. Precision= True positives / (True Positives + False
Positives)
Recall = True Positives / (True Positives + False
Negatives)
F-score = 2*P*R/(P+R)
The results show that the bi-gram model does better
than the unigram model
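These three formulas translate directly into code; the confusion counts below are hypothetical, just to exercise them:

```python
# Standard evaluation metrics from confusion-matrix counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r = precision(90, 10), recall(90, 30)
print(p, r, f_score(p, r))  # 0.9, 0.75, and their harmonic mean (~0.818)
```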
27. Dark Knight Rises is a good movie
Dark Knight Rises is an awesome movie
Both are positive,
but the second expresses stronger positive sentiment
NLP is better than machine learning here:
machine learning cannot understand the semantics
Need for a lexicon
Also to differentiate between
I like the food
The food is awesome and it’s worth every penny of your money. The
staff is very friendly and we received a very warm welcome.
(Twitter restricts tweets to 140 characters, while many review sites let users
write as much as they like. This intensity calculation is useful in such cases.)
28. Intensity Models
Review Level Intensity
The Intensity calculated according to the number/type of
senti-words in the review
Corpus Level Intensity for the review.
The Intensity of the review with respect to the entire
corpus of reviews. This depends on the corpus distribution
29. Uniform weightage Model
Positive emotion word is given a positive score of 1 and
negative emotion word is given a negative score of 1
Net Score = ∑Positive Score – ∑Negative Score.
Using Lexicon
Weighted Net Score =∑ Weighted Positive Score – ∑
Weighted Negative Score.
The intensity values are obtained from Sentiwordnet [5].
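The two scoring schemes can be sketched side by side. The weights below are SentiWordNet-style intensities, but the actual values are illustrative, not taken from SentiWordNet:

```python
# Signed intensity per word: positive weight => positive emotion word,
# negative weight => negative emotion word. Values are illustrative.
WEIGHTS = {"great": 0.8, "good": 0.6, "bad": -0.75, "awful": -0.9}

def uniform_net_score(words):
    # Uniform weightage: every senti-word counts as +1 or -1.
    return sum(1 for w in words if WEIGHTS.get(w, 0) > 0) - \
           sum(1 for w in words if WEIGHTS.get(w, 0) < 0)

def weighted_net_score(words):
    # Weighted: sum of positive intensities minus sum of negative intensities.
    pos = sum(WEIGHTS[w] for w in words if WEIGHTS.get(w, 0) > 0)
    neg = sum(-WEIGHTS[w] for w in words if WEIGHTS.get(w, 0) < 0)
    return pos - neg

review = ["great", "food", "bad", "service"]
print(uniform_net_score(review))   # 0 (1 positive - 1 negative)
print(weighted_net_score(review))  # ~0.05 (0.8 - 0.75)
```

The weighted model separates reviews that the uniform model scores identically, which is the point of using a lexicon here.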
30. Applying Gaussian Distribution over entire corpus
of reviews.
Note: The raw frequencies do not follow a Gaussian distribution, but the log
frequencies do.
31. Positive Reviews
Average Positive Words/Review: 4.1
Average Negative Words/Review: 1.1
Negative Reviews
Average Positive Words/Review: 1.7
Average Negative Words/Review: 4.2
Note: We use the property of the Gaussian distribution that a 1-sigma
deviation from the mean covers 68% of the density, and a 2-sigma
deviation covers 95% of the density.
32. Corpus Level intensities
The more the number of positive senti-words in a review, the
more is its positive intensity. Similarly, the more the number of
negative senti-words in a review, the more is its negative
intensity
33. Total Intensity = (Review Level Intensity + Corpus Level
Intensity)/2
I like the food
Sentiments : (food)
Score = (100 + 1)/2 = 50.5
The food is awesome and it’s worth every penny of your
money. The staff is very friendly and we received a very
warm welcome.
Sentiments : (Awesome, worth, friendly, warm)
Score = (100 + 80)/2 = 90
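The combination is a plain average of the two intensities. A one-line sketch (the slide does not spell out which of the example values is review-level and which is corpus-level, so the function simply averages its two arguments):

```python
def total_intensity(a: float, b: float) -> float:
    # Average of the review-level and corpus-level intensities (0-100 scale).
    return (a + b) / 2

print(total_intensity(100, 1))   # 50.5
print(total_intensity(100, 80))  # 90.0
```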
34. Aspects [6] are the features which define a product/item, etc.
Samsung Galaxy Prevail Android Smartphone (Boost Mobile)
--Amazon
Features of Smart Phone:
Design
Size
Speed
Sound
Music Player
Camera/cam
Battery
35. Aspects can be extracted with the help of a POS
Tagger
Stanford POS Tagger [7] :
This restaurant has good ambiance
Parse Tree
(ROOT (S (NP (DT This) (NN restaurant))
(VP (VBZ has)
(NP (JJ good) (NN ambiance)))))
NP – Noun Phrase, JJ – Adjective, NN – Noun
36. Extracting Adjective-Noun pairs from reviews (for the previous
product):
This would enable us to identify the aspects and their
corresponding sentiments
Reviews
Attractive design & compact size
Good speed, not the slowest nor the fastest
Clear sound for phone calls & decent music player
Fixed focus low res cam (2MP) no LED
Battery, this is an issue with all smart phones
Aspects – {Design (attractive), Size(compact), Speed(Good),
Sound(clear), Music Player(decent), Cam(low resolution),
Battery(negative) }
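The pair extraction itself is simple once tokens are POS-tagged. The original uses the Stanford tagger; to stay self-contained, this sketch takes an already-tagged token list (Penn Treebank tags, hand-written here for illustration):

```python
# Extract (noun, adjective) aspect pairs: an adjective (JJ) immediately
# followed by a noun (NN/NNS/...) yields an aspect with its sentiment word.
def aspect_pairs(tagged):
    pairs = []
    for i in range(len(tagged) - 1):
        (word, tag), (nxt, nxt_tag) = tagged[i], tagged[i + 1]
        if tag == "JJ" and nxt_tag.startswith("NN"):
            pairs.append((nxt, word))
    return pairs

tagged = [("Attractive", "JJ"), ("design", "NN"), ("&", "CC"),
          ("compact", "JJ"), ("size", "NN")]
print(aspect_pairs(tagged))  # [('design', 'Attractive'), ('size', 'compact')]
```

A real pipeline would also catch predicative patterns like “the sound is clear”, which this adjacent-pair rule misses.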
37. Used the Stanford POS tagger to extract Adjective-Noun
pairs from the corpus of all the restaurant reviews
Restaurant Domain
I – 2548
We- 1342
They- 955
It- 911
Food- 347
Services- 291
Place- 248
Foods- 229
Service- 210
experiences- 131
Waitress- 122 … pizza-51
Problem: Apart from restaurant aspects/features such as Food,
Place, and Service, there is a high number of pronouns. These pronouns
can refer to anything
The high frequency counts of pronouns show that we
need to dereference them and extract the corresponding
nouns
This restaurant has good ambiance, but it is not as good as
described by my friends
Replace all the “it”s in this sentence with “ambiance”,
and “This” with “restaurant”.
Note: The Stanford NLP toolkit has a coreference-resolution API for this
39. Is-A Relationship
Another problem faced:
sentiments attached to sub-categories rather than the main
categories.
Ex: The pizza in this restaurant is good.
Good is attached to Pizza
Pizza is a type of Food
Hence all the sentiments about Pizza should be pointed to
food
These relationships are provided by a graph
database (entity relationships) called Freebase
40. Algorithm
Use POS tagger to extract nouns attached to
adjectives
Dereference the personal pronouns
Remove the existing pronouns
Use freebase dump to find IS-A relation
Merge frequencies of plural and singular words and
use singulars
Find the adjectives associated with the nouns. This
would give an indication of the sentiment
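The plural/singular merge step of the algorithm can be sketched as below. The strip-trailing-“s” rule is a naive stand-in; a real system would use a lemmatizer:

```python
from collections import Counter

# Merge plural counts into their singular forms: if stripping a trailing "s"
# yields a word that also occurs in the corpus, fold the counts together.
def merge_plurals(freqs: Counter) -> Counter:
    merged = Counter()
    for word, count in freqs.items():
        singular = word[:-1] if word.endswith("s") and word[:-1] in freqs else word
        merged[singular] += count
    return merged

freqs = Counter({"food": 347, "foods": 229, "service": 210, "services": 291})
print(merge_plurals(freqs))  # Counter({'food': 576, 'service': 501})
```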
41. Restaurant- 816
Food- 719
Service- 613
experience- 219
Waitress – 122 (We still have to establish a relationship between
waitress and service; this needs an ontology for each domain, or WordNet can be
used to find the distance between waitress and service)
Review – 91
Drink - 64
42. [1] http://en.wikipedia.org/wiki/Sentiment_analysis
[2] R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar, “Structured models
for fine-to-coarse sentiment analysis,” Proceedings of the Association for
Computational Linguistics (ACL), pp. 432–439, Prague, Czech Republic, June 2007.
[3] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in
phrase-level sentiment analysis,” in Proceedings of the Human Language Technology
Conference / Conference on Empirical Methods in Natural Language Processing
(HLT/EMNLP 2005), pp. 347–354, Vancouver, Canada, 2005.
[4] https://cwiki.apache.org/MAHOUT/naivebayes.html
[5] http://sentiwordnet.isti.cnr.it/search.php?q=greatest
[6] http://sentic.net/sentire/2011/ott.pdf
[7] http://nlp.stanford.edu:8080/parser/index.jsp
[8] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification
using machine learning techniques,” in Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 79–86, 2002.