Slides from the Sentiment Analysis in Python workshop, held as part of the Data Science Week at Waseda, January 2019. The accompanying Jupyter Notebook code can be found at http://www.robfahey.co.uk/blog/social-media-data-workshop-waseda/
2. SENTIMENT ANALYSIS
Also called “Tone Analysis” (Grimmer & Stewart 2013)
or “Opinion Mining” (Dave, Lawrence & Pennock 2003).
Whatever you call it, the question it aims to answer is
always the same:
“How does it make you feel?”
3. THE OBJECTIVE
• In the Internet age, humans create and publish billions of
pieces of content (text, movies, images etc.) every single day.
• Many of those data express a sentiment about a subject of some kind.
• By selecting data related to a subject (a person, a country, a
brand, etc.), we can measure public sentiment in a very detailed
way.
• We can even see how sentiment changes minute-by-minute, or
day-by-day – giving us unprecedented insights into political
trends, marketing campaigns or financial market movements.
4. THE CHALLENGE
• Sentiment Analysis is easy for humans, but hard for computers.
• Humans: can process complex texts, images or videos with an
understanding of cultural and social contexts, allowing us to
quickly and naturally judge the sentiment or emotion being
expressed.
• Computers: can count things really, really fast.
• Sentiment Analysis methodologies all try to overcome the
weaknesses of computers (no context, no understanding) by
using their strengths (counting very fast!).
5. TWO APPROACHES
UNSUPERVISED METHODS (No Training Data Required)
• Dictionary / Lexicon Methods
• Word Embeddings
SUPERVISED METHODS (Requires Training Data)
• Classification Algorithms
• Aggregate Algorithms
6. HOW A MACHINE LEARNS
• To carry out “Machine Learning”, the machine needs something
to learn from.
• In dictionary approaches, you teach the computer a lexicon –
a set of words that are associated with different sentiments.
• This approach can be improved (or at least complicated) by using
techniques like word embeddings, which try to estimate the sentiment of
unknown words by seeing how frequently they occur in proximity to
known words;
• Or by trying to consider the grammatical context in which a word
appears.
Example lexicon entries: “great” = +1, “awful” = -1
7. HOW A MACHINE LEARNS (2)
• In supervised approaches, the computer instead learns from a
set of sample data which you have categorized by hand, using
human coding.
• There are lots of different algorithms and approaches for supervised
learning, but they all have this in common – you need to create training
data first.
• The algorithms try to learn the patterns which are associated with each
sentiment.
“This movie was terrible - why would Brad Pitt agree to star
in this rubbish? It’s not like he needs the money.”
Negative
“Just had a great time at the cinema, what a fantastic movie!
I don’t want to ruin the ending but it’s a crazy surprise. Well
worth the money.”
Positive
8. PREPARING YOUR DATA:
WORD SEGMENTATION
• The first challenge is how to divide sentences in your data into
words.
• In English or other European languages, this is fairly easy –
These / languages / have / spaces / between / the / words.
• But it’s not quite that simple – a process called stemming is often
used to reduce every word to its simplest form by stripping
plural endings, tense endings etc.
• Otherwise the computer won’t know that “dog” and “dogs”, or “go” and
“going”, express the same concept!
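As a rough illustration of splitting on spaces and then stemming, here is a minimal sketch. The three suffix rules are a toy assumption made up for this example; real projects would normally use an established stemmer (the Porter stemmer is a common choice) rather than anything this crude.

```python
import re

# Toy suffix rules, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(word):
    """Strip the first matching suffix, keeping at least two letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("These languages have spaces between the words")
stems = [crude_stem(w) for w in ["dog", "dogs", "go", "going"]]
print(tokens)
print(stems)
```

With these rules "dogs" and "going" collapse onto "dog" and "go", but plenty of real words would be mangled, which is why proper stemmers carry far more careful rule sets.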
9. PREPARING YOUR DATA:
WORD SEGMENTATION (IN OTHER
LANGUAGES)
• In other languages like Japanese, word segmentation is more
challenging.
• 日本語の文書は言葉と言葉の間にスペースがないから、形態素解析をし
ないといけない。 (“Japanese documents have no spaces between words,
so you have to do morphological analysis.”) Where do the words begin
and end in that sentence?
• Thankfully there is software to help with this process in many
languages.
• Japanese: MeCab, ChaSen, Janome (Python package)
• Chinese (and Arabic): Stanford Word Segmenter
• Korean: Open-Korean-Text (looks good, but I haven’t tried it)
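To get a feel for what a segmenter has to do, here is a toy greedy "longest match" sketch in plain Python. This is not how MeCab, ChaSen or Janome work internally (they rely on large dictionaries plus statistical models); the tiny VOCAB set is a hypothetical stand-in invented for this example.

```python
# Hypothetical mini-vocabulary; real tools ship far larger dictionaries.
VOCAB = {"日本語", "日本", "の", "文書", "は", "言葉", "と",
         "間", "に", "スペース", "が", "ない"}

def segment(text, vocab=VOCAB):
    """Greedy longest-match segmentation, scanning left to right."""
    max_len = max(len(word) for word in vocab)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if piece in vocab or size == 1:
                words.append(piece)
                i += size
                break
    return words

words = segment("日本語の文書は言葉と言葉の間にスペースがない")
print(" / ".join(words))
```

Greedy matching picks "日本語" over "日本" because it tries longer candidates first; it breaks down quickly on ambiguous text, which is one reason to use the dedicated tools listed above.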
10. DICTIONARY APPROACHES
• To use a dictionary approach, you need to start by acquiring a
dictionary (or “lexicon”) which you’ll use to calculate sentiment.
• There are many of these available for the English language and
other major languages. In minority languages, however, these
resources might not be available – or might be of very dubious
quality.
• Your dictionary needs to be appropriate to your text. Using a
dictionary full of Twitter slang on newspaper texts will yield
bad results – and vice versa.
11. A SIMPLE EXAMPLE
“Just had a great time at the cinema, what a
fantastic movie! I don’t want to ruin the
ending but it’s a crazy surprise. Well worth
the money.”
“This movie was terrible - why would Brad
Pitt agree to star in this rubbish? It’s not like
he needs the money.”
12. A SIMPLE EXAMPLE…?
This movie has a fantastic cast, an
interesting concept and amazing special
effects – but the end result is utterly
boring.
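A dictionary method applied to the three example reviews from these slides might be sketched as follows. The LEXICON here is a hypothetical handful of words chosen for the example, not a published sentiment dictionary, and the scoring rule (add up +1 and -1 values) is the simplest one possible.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {
    "great": 1, "fantastic": 1, "interesting": 1, "amazing": 1, "worth": 1,
    "terrible": -1, "rubbish": -1, "awful": -1, "boring": -1,
}

def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon values of every word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

positive = ("Just had a great time at the cinema, what a fantastic movie! "
            "I don't want to ruin the ending but it's a crazy surprise. "
            "Well worth the money.")
negative = ("This movie was terrible - why would Brad Pitt agree to star "
            "in this rubbish? It's not like he needs the money.")
mixed = ("This movie has a fantastic cast, an interesting concept and "
         "amazing special effects - but the end result is utterly boring.")

scores = [sentiment_score(text) for text in (positive, negative, mixed)]
print(scores)  # [3, -2, 2]
```

The first two reviews come out as expected, but the third scores +2 even though a human reader would call it negative: three positive words outvote the single "boring" that carries the real verdict.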
14. THE BAG OF WORDS
• You may have noticed something about the examples we
looked at – the order of the words doesn’t matter.
• This is actually true of (almost) every
sentiment analysis approach (and text
mining approaches in general).
• It’s counter-intuitive, but computers are much
better at treating texts as a “bag of words”
than they are at understanding grammar,
word order etc.
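A quick way to see what "order doesn't matter" means in practice is the small sketch below, which uses Python's built-in Counter. The two sentences are made-up examples.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word, discarding word order."""
    return Counter(text.lower().split())

original = "the plot was good but the acting was bad"
shuffled = "the plot was bad but the acting was good"

same_bag = bag_of_words(original) == bag_of_words(shuffled)
print(same_bag)  # True
```

The two sentences say opposite things about the plot and the acting, yet they produce exactly the same bag. That is the price paid for the simplicity of this representation.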
15. VECTOR REPRESENTATIONS
• Often, after dividing the sentence into words, we represent it
using a vector of word frequencies. An entire corpus of
documents can be represented in a single matrix: the term-
document matrix (TDM).
Document                  I  Like  To  Eat  Sushi  You  Burgers  She  Doesn’t
I like to eat sushi       1  1     1   1    1      0    0        0    0
You like to eat burgers   0  1     1   1    0      1    1        0    0
She doesn’t like sushi    0  1     0   0    1      0    0        1    1
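The matrix above can be rebuilt in a few lines of plain Python. This is a minimal sketch that lowercases each sentence and splits on spaces; libraries provide much faster, sparse versions of the same idea.

```python
documents = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]

def build_matrix(docs):
    """Return (vocabulary, rows) with one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:  # keep first-seen order
                vocabulary.append(token)
    rows = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows

vocabulary, matrix = build_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one word, so adding documents adds rows while adding new vocabulary adds columns.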
16. FEATURE SELECTION
• A term-document matrix could easily get VERY big –
overwhelming a computer’s memory and taking a very long
time to process. We often need to focus somehow on the most
relevant terms in the vocabulary. How?
• Stopwords: Very commonly used words are of little value in
distinguishing documents, so we can remove them.
• Document Frequency: Ignoring words which appear in too many or too
few documents allows us to focus only on words useful to our research.
• TF-IDF: Less useful for short documents (e.g. Twitter), but “Term
Frequency / Inverse Document Frequency” weights each word by how often
it appears in a document relative to how many documents contain it,
highlighting words that are especially good at distinguishing texts
from one another.
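The three ideas above can be combined in one short sketch. The stopword list, the min_df / max_df parameter names and the plain log(N / df) weighting are assumptions made for this example; libraries typically use larger stopword lists and smoothed variants of the formula.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "was", "is", "it"}  # tiny illustrative list

def tf_idf(docs, min_df=1, max_df=1.0):
    """Return one {term: weight} dict per document.

    Terms are dropped if they are stopwords, appear in fewer than
    `min_df` documents, or appear in more than `max_df` (a fraction)
    of all documents.
    """
    tokenized = [[w for w in d.lower().split() if w not in STOPWORDS]
                 for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    kept = {t for t, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df}
    weights = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in kept)
        total = sum(counts.values()) or 1  # avoid dividing by zero
        weights.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return weights

docs = ["the movie was great", "the movie was awful", "the popcorn was great"]
weights = tf_idf(docs)
print(weights[1])
```

In the second document "awful" gets a higher weight than "movie", because "awful" appears in only one of the three texts while "movie" appears in two.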
17. CLASSIFICATION ALGORITHMS
• Classification algorithms are the most commonly used tool in
machine learning – not just in text mining, but also in fields
like voice recognition, computer vision or predicting behaviour.
• They are essentially tools for pattern recognition – you show
them a number of labelled examples of vector representations
(in our case, term-document matrices) and they try to find the
patterns which maximise the probability of a vector belonging
to a certain label.
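To make "learning patterns from labelled examples" concrete, here is a from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is a teaching toy written for this example, not the workshop's notebook code, and the four training sentences are hypothetical hand-labelled data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_data = [  # hypothetical hand-labelled examples
    ("what a fantastic movie", "positive"),
    ("great cast and a great ending", "positive"),
    ("this movie was terrible", "negative"),
    ("utterly boring rubbish", "negative"),
]
model = train(training_data)
label = predict("a fantastic ending", model)
print(label)  # positive
```

Four training sentences are far too few for real work; the point is only that the classifier needs nothing beyond word counts per label, which is exactly what a term-document matrix provides.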
18. CHOOSING AN ALGORITHM
• There are many kinds of classification algorithm – from simple
statistical methods like Naïve Bayes, to evolutions of
regression-based approaches like Support Vector Machines, to
science-fiction-sounding approaches like Random Forest (which
constructs a “forest” of “decision trees” and lets them vote
on the classification) and Neural Networks (which were designed to
emulate the decision-making behaviour of neurons in the human
brain).
• How do you pick the best one for your research?
• Simple answer: try them all and see what works best. Luckily,
20. AGGREGATE ALGORITHMS
• There is one final group of sentiment analysis approaches
which has been gaining in popularity in recent years.
• Aggregate algorithms are similar to classification algorithms in
many ways (they need training data and function on pattern
recognition), but different in one crucial way – they do not
classify individual documents, but instead aim to give an
accurate measurement of the distribution of classes in the
overall corpus.
21. AGGREGATE ALGORITHMS (2)
• This has some serious advantages! Aggregate algorithms tend
to be able to give accurate results with a much smaller amount
of training data, for example.
• Aggregate algorithms are also really good at handling data with
a lot of “off-topic” texts.
• Classification algorithms have a statistical problem with this data – when
the “off-topic” category is very common, there is a bias towards
misclassifying a lot of texts as off-topic.
• But… You can’t see classifications for individual texts, so
they’re not appropriate for every kind of research.
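The slides do not name a specific aggregate algorithm, so the sketch below is only a toy illustration of the general idea: estimate the share of each class in a corpus directly, without labelling any individual text. It assumes two classes and assumes that each word's frequency in the unlabelled corpus is a mixture of its frequencies in the two labelled training sets; all function names and data are made up for this example.

```python
def share_containing(word, docs):
    """Fraction of documents that contain the word at least once."""
    return sum(word in doc.lower().split() for doc in docs) / len(docs)

def estimate_positive_share(pos_train, neg_train, target, words):
    """Least-squares estimate of the positive share of `target`.

    Assumes each word's rate in the target corpus is a mixture:
    rate = p * rate_in_positive + (1 - p) * rate_in_negative.
    """
    numerator = denominator = 0.0
    for word in words:
        pos_rate = share_containing(word, pos_train)
        neg_rate = share_containing(word, neg_train)
        gap = pos_rate - neg_rate
        numerator += (share_containing(word, target) - neg_rate) * gap
        denominator += gap * gap
    if denominator == 0:
        raise ValueError("words do not distinguish the two classes")
    # Clip to a valid proportion.
    return min(1.0, max(0.0, numerator / denominator))

# Hypothetical hand-labelled training texts.
pos_train = ["great movie", "great fun", "fun movie", "great cast"]
neg_train = ["boring movie", "boring plot", "awful plot", "boring cast"]
# Unlabelled corpus built so that two thirds of it is positive.
target = pos_train * 2 + neg_train

share = estimate_positive_share(pos_train, neg_train, target,
                                words=["great", "boring"])
print(round(share, 3))  # 0.667
```

Notice that the function never assigns a label to any single text in the target corpus. It returns only the overall proportion, which is both the strength and the limitation described above.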
23. PITFALLS AND WARNINGS
• Clean your Data! Data accessed from the internet often includes
a lot of texts you didn’t actually mean to analyse – check
carefully to make sure your data isn’t full of bots reposting
garbage, or posts about a totally different topic.
• Read your Data! Don’t just take the results of any algorithm to
be accurate – even if it agrees with your hypothesis. At some
point you’re going to need to dive in and read samples of the
data you’ve collected, to confirm that you’re really observing
24. WRAPPING UP
• This workshop can really only introduce a few of the most
commonly used approaches in sentiment analysis. This is a
rapidly changing field and new algorithms and approaches are
being developed all the time.
• There are some approaches which require a lot more technical
skill than the ones we looked at today – for example, creating
your own sentiment dictionary and analyser that’s perfectly
appropriate for your corpus of texts is possible, but difficult
unless you’re a skilled programmer.
• The approaches we looked at today are very mainstream and
commonly used in a lot of academic studies – I hope they’ll be