A review of Natural Language Processing tasks, with examples of why NLP is so hard, followed by a detailed look at text categorization and, in particular, sentiment analysis. A few common approaches to predicting sentiment are discussed, going further into the statistical machine learning algorithms behind them.
2. AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● Engineering approach
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
3. Natural Language Processing
● Enables interaction between computers and
humans through natural languages
or “The branch of information science that deals
with natural language information”
● Natural language understanding - enabling
computers to derive meaning from human input
● Natural language generation
(Not neuro-linguistic programming, though some magic still applies)
4. NLP is everywhere
Google translate
Google ads
Google search
Siri / Question Answering
Chat bots
Spam generation / spam filtering
Gene and protein detection
Surveillance / marketing
5. Text Classification
● Automatically assign a piece of text to one or
more classes.
● History: Guess the author based on text
specifics and author style
1901: “One author prefers 'em' as short for
'them' – let's use this as a feature!”
1970s: Who wrote “The Federalist Papers”?
6. Text Classification
● Spam or not spam
● News analysis: politics, sports, business
● Google ads verticals
26 root categories, 2200 subcategories
● Terrorist or not
Yes, they read your Facebook, and yes, they know...
7. Also Text Classification
● Detect truth / lie / sarcasm / joke
● Determine medical condition from hospital
records, patient description
● Guess stock prices
● “How will this press release affect company shares price”
● Sentiment analysis
8. Sentiment Analysis
● Determining writer's attitude
● Overall document: positive / negative / neutral
“We totally enjoyed our stay there!”
● Towards a target:
“Battery sucks, bends really well though”
● Detecting emotions: sad, happy, angry, excited
● Scales:
● Number of stars / -10 to +10 / percentage
● Subjective vs Objective
9. Classification for engineers
● Why bother with AI, keep it simple:
IF text contains " em "
AND NOT text contains " them "
author is X
ELSE author is Y
● But what if...
10. Classification for engineers
● If author X decided to use “them” once?
Let's try a list of words that only author X uses
IF text contains a word from listX
author is X
ELSE try other rules
Find all the features !!!
11. Classification for engineers
● Build a super smart system of if-else
statements to classify each document correctly
● Solving the problem algorithmically
● An “expert system”
● Still used in practice for many applications
● Twitter “sentiment analysis” only rule: if text contains :) or :(
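The rule-based approach above can be sketched in a few lines. This is a minimal illustration of the single-emoticon rule mentioned on the slide; the function name is ours, not from the talk:

```python
# Minimal rule-based ("expert system") sentiment classifier:
# the only rule is the emoticon check from the slide.
def rule_based_sentiment(text):
    """Classify text by emoticons: :) -> positive, :( -> negative."""
    if ":)" in text:
        return "positive"
    if ":(" in text:
        return "negative"
    return "neutral"
```

It fails on everything else ("loved it!!" is neutral here), which is exactly the "but what if..." problem from the previous slides.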
12. When to do engineering
● For very narrow tasks
● Determine if text is a url or email address
● For a very specific domain
● “If text contains a name of any US president, it's a legislation”
● To create a proof-of-concept
● Twitter “sentiment analysis” only rule: if text contains :) or :(
● When it's hard to get enough data (explained later)
13. AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
14. Supervised learning - Regression
“In statistics, regression
analysis is a statistical process
for estimating the relationships
among variables.”
● Create a hypothesis function based on the blue dots
● When a new X appears, calculate Y
The graph: X values are features, Y values are target values.
15. Linear Regression Example
● Let X be temperature
● Let Y be chance of rain
Create a function that predicts chance of rain, given temperature
(In reality X is a vector with many feature values)
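A hedged sketch of the slide's example: fitting a line to (temperature, chance-of-rain) points with ordinary least squares. The data values are made up for illustration:

```python
# Ordinary least squares for one feature: hypothesis y = a*x + b.
def fit_line(xs, ys):
    """Return slope and intercept minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy training data (the "blue dots"): temperature -> chance of rain.
temps = [10, 15, 20, 25, 30]
rain = [0.8, 0.6, 0.5, 0.3, 0.1]
a, b = fit_line(temps, rain)

def predict(x):
    """When a new X appears, calculate Y."""
    return a * x + b
```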
19. Supervised Learning -
Classification
“identifying to which of a set of
categories a new observation
belongs, on the basis of a training
set of data containing observations
(or instances) whose category
membership is known.”
Given a set of training instances, predict a discrete
class label for new ones.
The graph: x1 and x2 are features, dot color is the target class.
20. Classification Example
● Let X1 be temperature
● Let X2 be humidity
Create a function that predicts rain or no rain.
(In reality X is a vector with many feature values)
21. 2D Example
● Let X be humidity
● Let Y = 0 for no rain
● Let Y = 1 for rain
Linear hypothesis function doesn't really make sense now.
Logistic function can approximate better.
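A small sketch of the point above: the logistic (sigmoid) function squashes any input into (0, 1), so its output can be read as a probability, unlike a line. The weights below are made-up values; in practice they are learned from training data:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_rain(humidity, w=0.2, b=-14.0):
    """P(rain) given humidity %; w and b are illustrative, not learned."""
    return sigmoid(w * humidity + b)
```

High humidity pushes the output toward 1 (rain), low humidity toward 0 (no rain), with a smooth transition in between instead of a line that runs off past 0 and 1.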
23. AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
24. Agenda Explained
● Until now:
● What is text classification
● What is supervised learning (classification)
● Up next:
● How to apply supervised learning to text?
25. Statistical Sentiment Analysis
● Document: A piece of text
● Corpus: Set of documents
● Target: Y, positive/negative, emotion, percentage
● Training corpus: Set of documents for which we know Y
● What is X?
● How to convert a document to a (real-valued) vector?
● Building training corpus
● Find “enough” data
26. Defining Features
● Each word: one-hot vector
● I = [0, 0, 0, 1, 0, 0, 0, …, 0]
● like = [1, 0, 0, 0, 0, 0, 0, …, 0]
● cookies = [0, 0, 0, 0, 0, 0, 1, …, 0]
● Number of dimensions = size of vocabulary
● Document: bag of words
● Order of words is lost
● Count of words can be added
● Term frequency / inverse document frequency
"I like cookies" = [1, 0, 0, 1, 0, 0, 1, …, 0]
27. Feature Engineering
● Ngrams (as one-hot)
● I, like, cookies - unigrams
● “I like” = [0, 0, 0, 0, 1, 0, …, 0] - bigrams
● “I like cookies” - trigrams
● Character n-grams:
● li, ik, ke, lik, ike
● Dictionaries:
● Great value for sentiment analysis
● Very good for domain specific text
If document contains any of:
{love, like, good, cool}
add this one: [0, 0, 1, 0, …, 0]
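The n-gram features above are easy to extract. A sketch, with whitespace tokenization as a simplification:

```python
# Word n-grams: unigrams (n=1), bigrams (n=2), trigrams (n=3).
def word_ngrams(text, n):
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character n-grams, e.g. "like" -> li, ik, ke (n=2) or lik, ike (n=3).
def char_ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Each extracted n-gram then gets its own one-hot dimension, exactly like single words; dictionary features work the same way, with one dimension per dictionary.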
28. Feature Engineering
● Simple features
● Document Length
● Emoticons
● elooongated words
● ALL-CAPS
● Stopwords
● Through other classification methods:
● Parts of speech
● Negation contexts “I don't like cookies”
● Named Entities
● Approximate dimensions of X: 100k – 10m
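One common way to realize the negation-context feature above is negation marking: words following a negation get a suffix, so "like" and "like_NEG" become distinct features. This sketch is a simplified version (real implementations also stop marking at punctuation); the negation word list is illustrative:

```python
# Mark words that appear after a negation with a _NEG suffix.
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't"}

def mark_negation(text):
    out, negated = [], False
    for word in text.split():
        if word.lower() in NEGATIONS:
            negated = True
            out.append(word)
        elif negated:
            out.append(word + "_NEG")
        else:
            out.append(word)
    return " ".join(out)
```

"I don't like cookies" becomes "I don't like_NEG cookies_NEG", so the classifier no longer sees the positive feature "like".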
29. Work Process
● Assemble training corpus
● Separate test corpus
● Invent new features
● Generate model (supervised learning)
● Test performance
● Repeat
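The first two steps above amount to a random hold-out split. A minimal sketch (any supervised learner and evaluation metric can then be plugged into the train/test loop):

```python
import random

# Hold out a fraction of the corpus for testing; the model must
# never see the test documents during training.
def split_corpus(documents, test_fraction=0.2, seed=42):
    docs = documents[:]                 # don't mutate the caller's list
    random.Random(seed).shuffle(docs)   # fixed seed -> reproducible split
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]
```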
30. Tips & Tricks
● Performance is usually measured as
● precision / recall / accuracy / F-measure
● Simple Machine Learning with tons of features
● Even a linear classifier works
● Marketing
● Everyone uses a different corpus (so accuracies can't be compared)
● Showing only what you're sure about
● Generalizing: “overall, 70% of your customers like you”
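The measures named above, computed from raw counts of true/false positives and negatives:

```python
# tp/fp/tn/fn = true/false positives/negatives for one class.
def precision(tp, fp):
    """Of everything predicted positive, how much really was?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything really positive, how much did we find?"""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)
```

Reporting only precision on confident predictions (the "showing only what you're sure about" trick above) can make a system look far better than its recall would suggest.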
31. AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
32. A.I. - Why is it not working?
“Algorithmically solvable: A decision problem that can be
solved by an algorithm that halts on all inputs in a finite
number of steps.”
“Unsolvable problem: A problem that cannot be solved for
all cases by any algorithm whatsoever.”
● Artificial Intelligence: Develop intelligent systems, deal with
real world problems. It works... kind of...
- “Siri, will you marry me?”
- “My End User License Agreement does not cover marriage.
My apologies”
33. Challenges
● Annotation Guidelines
● Inter-annotator agreement
● SemEval
● Sentiment analysis corpus (~14k tweets)
● For 40% of tweets annotators didn't agree
"I don't know half of you half as well as I should like; and I like less
than half of you half as well as you deserve.”
Bilbo Baggins
34. Still not convinced?
● Context issues
● Narrowing the domain helps
● “beer is cool”, “soup is cool”
● “No babies yet!” - condoms / fertility drugs
● “Obama goes full Bush on Syria”
● User generated content SUCKS!
● “Polynesian sauce from chik fila a be so bomb”
● Common sense
“I tried the banana slicer and found it unacceptable. […] the
slicer is curved from left to right. All of my bananas are bent
the other way.”
35. AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
36. Word representations
● One-hot is sparse and meaningless
● N-dimensional vector for each word
● “Ubuntu” close to “Debian”
● “king” to “queen” = “man” to “woman”
● Based solely on word co-occurrence
n = 50 to 1000
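The "king is to queen as man is to woman" property above can be demonstrated with toy 3-dimensional vectors (made up so the arithmetic works out; real embeddings have 50 to 1000 dimensions learned from co-occurrence statistics):

```python
import math

# Invented toy embeddings; real ones come from word2vec, GloVe, etc.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# The analogy test: king - man + woman should land near queen.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
```

Nearest-neighbor lookups under cosine similarity are also how "Ubuntu close to Debian" is measured.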
37. Deep Learning
● Artificial Neural Networks
● Input - word embeddings
● Output – target class
● Complex layer structure
● No feature engineering
38. Tools
● NLTK – NLP in python
● GATE – NLP in java + GUI
● Stanford CoreNLP – NLP in java + deep neural networks
● AlchemyAPI – commercial API for NLP (free demo)
● MetaMind – enterprise sentiment analysis and computer vision (deep
neural networks)
● WolframAlpha – Smart question answering (knows maths)