Workshop:
Sentiment Analysis with
Python
Rob Fahey robfahey@fuji.waseda.jp @robfahey
Data Science Week at Waseda, January 2019
Sentiment Analysis
Also called “Tone Analysis” (Grimmer & Stewart 2013)
or “Opinion Mining” (Dave, Lawrence & Pennock 2003)
Whatever you call it, the question it aims to answer is
always the same:
“How does it make you feel?”
THE OBJECTIVE
• In the Internet age, humans create and publish billions of
pieces of content (text, movies, images etc.) every single day.
• Many of those data express a sentiment about a subject of some kind.
• By selecting data related to a subject (a person, a country, a
brand, etc.), we can measure public sentiment in a very detailed
way.
• We can even see how sentiment changes minute-by-minute, or
day-by-day – giving us unprecedented insights into political
trends, marketing campaigns or financial market movements.
THE CHALLENGE
• Sentiment Analysis is easy for humans, but hard for computers.
• Humans: can process complex texts, images or videos with an
understanding of cultural and social contexts, allowing us to
quickly and naturally judge the sentiment or emotion being
expressed.
• Computers: can count things really, really fast.
• Sentiment Analysis methodologies all try to overcome the
weaknesses of computers (no context, no understanding) by
using their strengths (counting very fast!).
TWO APPROACHES
UNSUPERVISED METHODS (No Training Data Required)
• Dictionary / Lexicon Methods
• Word Embeddings
SUPERVISED METHODS (Requires Training Data)
• Classification Algorithms
• Aggregate Algorithms
HOW A MACHINE LEARNS
• To carry out “Machine Learning”, the machine needs something
to learn from.
• In dictionary approaches, you teach the computer a lexicon –
a set of words that are associated with different sentiments.
• This approach can be improved (or at least complicated) by using
techniques like word embeddings, which try to estimate the sentiment of
unknown words by seeing how frequently they occur in proximity to
known words;
• Or by trying to consider the grammatical context in which a word
appears.
great +1
awful -1
HOW A MACHINE LEARNS (2)
• In supervised approaches, the computer instead learns from a
set of sample data which you have categorized by hand, using
human coding.
• There are lots of different algorithms and approaches for supervised
learning, but they all have this in common – you need to create training
data first.
• The algorithms try to learn the patterns which are associated with each
sentiment.
“This movie was terrible - why would Brad Pitt agree to star
in this rubbish? It’s not like he needs the money.”
Negative
“Just had a great time at the cinema, what a fantastic movie!
I don’t want to ruin the ending but it’s a crazy surprise. Well
worth the money.”
Positive
PREPARING YOUR DATA:
WORD SEGMENTATION
• The first challenge is how to divide sentences in your data into
words.
• In English or other European languages, this is fairly easy –
These / languages / have / spaces / between / the / words.
• It’s not quite that simple – a process called stemming is often
used to change every word back to its most simple form by
removing plurals, tenses etc.
• Otherwise the computer won’t know that “dog” and “dogs”, or “go” and
“going”, express the same concept!
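The two steps on this slide can be sketched in a few lines of Python. This is a toy illustration only: `tokenize` and `naive_stem` are hypothetical helper names, and the suffix stripper is far cruder than a real stemming algorithm (such as Porter's), which you would normally take from an NLP library.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(word):
    """Toy suffix stripper for illustration; NOT a real stemming algorithm."""
    for suffix in ("ing", "s"):
        # Only strip the suffix if at least two letters remain as the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("These languages have spaces between the words.")
stems = [naive_stem(w) for w in ["dog", "dogs", "go", "going"]]

print(tokens)
print(stems)
```

After stemming, “dog”/“dogs” and “go”/“going” collapse to the same token, so they are counted as one term in later steps.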
PREPARING YOUR DATA:
WORD SEGMENTATION (IN OTHER
LANGUAGES)
• In other languages like Japanese, word segmentation is more
challenging.
• 日本語の文書は言葉と言葉の間にスペースがないから、形態素解析をしないといけない。
(“Japanese documents have no spaces between words, so you have to
perform morphological analysis.”) Where do the words begin and end
in that sentence?
• Thankfully there is software to help with this process in many
languages.
• Japanese: MeCab, ChaSen, Janome (Python package)
• Chinese (and Arabic): Stanford Word Segmenter
• Korean: Open-Korean-Text (looks good, but I haven’t tried it)
DICTIONARY APPROACHES
• To use a dictionary approach, you need to start by acquiring a
dictionary (or “lexicon”) which you’ll use to calculate sentiment.
• There are many of these available for the English language and
other major languages. In minority languages, however, these
resources might not be available – or might be of very dubious
quality.
• Your dictionary needs to be appropriate to your text. Using a
dictionary full of Twitter slang on newspaper texts will yield
bad results – and vice versa.
A SIMPLE EXAMPLE
“Just had a great time at the cinema, what a
fantastic movie! I don’t want to ruin the
ending but it’s a crazy surprise. Well worth
the money.”
“This movie was terrible - why would Brad
Pitt agree to star in this rubbish? It’s not like
he needs the money.”
A SIMPLE EXAMPLE…?
This movie has a fantastic cast, an
interesting concept and amazing special
effects – but the end result is utterly
boring.
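A minimal sketch of how a dictionary approach would score the three example reviews above. The lexicon here is a hypothetical handful of words made up for illustration (it is not a published sentiment dictionary), and `lexicon_score` is an assumed helper name.

```python
import re

# Hypothetical mini-lexicon for illustration: +1 = positive, -1 = negative.
LEXICON = {
    "great": 1, "fantastic": 1, "amazing": 1, "interesting": 1, "worth": 1,
    "terrible": -1, "rubbish": -1, "awful": -1, "boring": -1,
}

def lexicon_score(text):
    """Sum the lexicon values of every word in the text (unknown words = 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

positive = ("Just had a great time at the cinema, what a fantastic movie! "
            "I don't want to ruin the ending but it's a crazy surprise. "
            "Well worth the money.")
negative = ("This movie was terrible - why would Brad Pitt agree to star "
            "in this rubbish? It's not like he needs the money.")
mixed = ("This movie has a fantastic cast, an interesting concept and "
         "amazing special effects - but the end result is utterly boring.")

for review in (positive, negative, mixed):
    print(lexicon_score(review))
```

With this lexicon the first two reviews score clearly positive and clearly negative, but the third also comes out positive: three positive words outvote the single “boring”, even though a human reads the review as negative. Simple word counting cannot see that “but” reverses the verdict.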
DICTIONARY APPROACHES
PLEASE OPEN JUPYTER LAB!
THE BAG OF WORDS
• You may have noticed something about the examples we
looked at – the order of the words doesn’t matter.
• This is actually true of (almost) every
sentiment analysis approach (and text
mining approaches in general).
• It’s counter-intuitive, but computers are much
better at treating texts as a “bag of words”
than they are at understanding grammar,
word order etc.
VECTOR REPRESENTATIONS
• Often, after dividing the sentence into words, we represent it
using a vector of word frequencies. An entire corpus of
documents can be represented in a single matrix: the
term-document matrix (TDM).
Document                   I   Like   To   Eat   Sushi   You   Burgers   She   Doesn’t
I like to eat sushi        1   1      1    1     1       0     0         0     0
You like to eat burgers    0   1      1    1     0       1     1         0     0
She doesn’t like sushi     0   1      0    0     1       0     0         1     1
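The matrix above can be built with a few lines of plain Python. This is a sketch for illustration (the variable names are my own); in practice a library vectorizer would usually do this job, and its tokenization rules may differ from the simple regular expression assumed here.

```python
import re

docs = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]

def tokenize(text):
    """Lowercase and split into words, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())

# Build the vocabulary in order of first appearance across the corpus.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per vocabulary term (word counts).
matrix = [[tokenize(doc).count(term) for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the “bag of words” for one sentence: word order is gone and only the counts remain.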
FEATURE SELECTION
• A term-document matrix could easily get VERY big –
overwhelming a computer’s memory and taking a very long
time to process. We often need to focus somehow on the most
relevant terms in the vocabulary. How?
• Stopwords: Very commonly used words are of little value in
distinguishing documents, so we can remove them.
• Document Frequency: Ignoring words which appear in too many or too
few documents allows us to focus only on words useful to our research.
• TF-IDF: Less useful for short documents (e.g. Twitter), but “Term
Frequency / Inverse Document Frequency” points out words that are
especially good at distinguishing differences between texts.
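As a rough sketch of the TF-IDF idea, the snippet below weights each word by its count in a document multiplied by log(N / document frequency). This is one common textbook variant, assumed here for illustration; libraries often use smoothed or normalised versions that give different numbers. The three example sentences are reused from the vector representation slide.

```python
import math
import re
from collections import Counter

docs = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Weight = term count in this document * log(N / document frequency)."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenized]

for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 2) for t, v in w.items()})
```

“like” appears in every document, so its weight is zero: it tells us nothing about which document we are looking at. “burgers” appears in only one document and gets the highest weight.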
CLASSIFICATION ALGORITHMS
• Classification algorithms are the most commonly used tool in
machine learning – not just in text mining, but also in fields
like voice recognition, computer vision or predicting behaviour.
• They are essentially tools for pattern recognition – you show
them a number of labelled examples of vector representations
(in our case, term-document matrices) and they try to find the
patterns which maximise the probability of a vector belonging
to a certain label.
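To make “finding the patterns which maximise the probability of a label” concrete, here is a minimal multinomial Naïve Bayes classifier written from scratch. It is a sketch under stated assumptions, not the workshop's notebook code: the four training sentences are invented, the function names are my own, and add-one (Laplace) smoothing is assumed so that unseen word/label pairs do not produce a zero probability.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labelled_docs):
    """Count how often each word occurs under each hand-coded label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for this text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for unseen word/label pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training data, standing in for human-coded examples.
training = [
    ("what a fantastic movie, great fun", "positive"),
    ("great cast and a fantastic ending", "positive"),
    ("terrible movie, utter rubbish", "negative"),
    ("boring and terrible, awful ending", "negative"),
]
model = train(training)

print(predict("a great movie", model))
print(predict("terrible and boring", model))
```

Note that the classifier never sees a sentiment dictionary: it learns which words go with which label purely from the labelled examples.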
CHOOSING AN ALGORITHM
• There are many kinds of classification algorithm – from simple
statistical methods like Naïve Bayes, to evolutions of
regression-based approaches like Support Vector Machines, to
science-fiction sounding approaches like Random Forest (which
constructs a “forest” of “decision trees” and uses them to vote
on the classification) and Neural Networks (which were designed to
emulate the decision-making behaviour of neurons in the human
brain).
• How do you pick the best one for your research?
• Simple answer: try them all and see what works best. Luckily,
CLASSIFICATION APPROACHES
PLEASE GO BACK TO JUPYTER LAB!
AGGREGATE ALGORITHMS
• There is one final group of sentiment analysis approaches
which has been gaining in popularity in recent years.
• Aggregate algorithms are similar to classification algorithms in
many ways (they need training data and function on pattern
recognition), but different in one crucial way – they do not
classify individual documents, but instead aim to give an
accurate measurement of the distribution of classes in the
overall corpus.
AGGREGATE ALGORITHMS
• This has some serious advantages! Aggregate algorithms tend
to be able to give accurate results with a much smaller amount
of training data, for example.
• Aggregate algorithms are also really good at handling data with
a lot of “off-topic” texts.
• Classification algorithms have a statistical problem with this data – when
the “off-topic” category is very common, there is a bias towards
misclassifying a lot of texts as off-topic.
• But… You can’t see classifications for individual texts, so
they’re not appropriate for every kind of research.
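The slides do not name a specific aggregate algorithm, so the sketch below swaps in one simple, well-known technique purely to illustrate the idea of estimating class proportions rather than individual labels: “adjusted classify and count”. It assumes you have measured a classifier's true-positive rate (`tpr`) and false-positive rate (`fpr`) on hand-coded data; all the numbers are hypothetical.

```python
def adjusted_count(observed_positive_rate, tpr, fpr):
    """Estimate the true share of the positive class in a corpus.

    If p is the true share, a classifier flags p*tpr + (1-p)*fpr of the
    corpus as positive. Solving for p gives (observed - fpr) / (tpr - fpr).
    """
    if tpr <= fpr:
        raise ValueError("classifier must beat chance (tpr > fpr)")
    estimate = (observed_positive_rate - fpr) / (tpr - fpr)
    # Clip to a valid proportion, since noisy inputs can overshoot.
    return min(1.0, max(0.0, estimate))

# Hypothetical corpus: only 10% of texts are truly on-topic and positive.
true_share = 0.10
tpr, fpr = 0.80, 0.10  # assumed error rates, measured on hand-coded data

# What naive "classify everything and count" would report:
observed = true_share * tpr + (1 - true_share) * fpr

print(round(observed, 2))                            # naive estimate
print(round(adjusted_count(observed, tpr, fpr), 2))  # corrected estimate
```

When one category dominates the corpus, the classifier's errors on that category swamp the counts for the rare one, so naive counting is biased. Correcting at the corpus level recovers the true proportion, at the cost of not knowing which individual texts belong to which class.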
AGGREGATE APPROACHES
PLEASE GO BACK TO JUPYTER LAB!
PITFALLS AND WARNINGS
• Clean your Data! Data accessed from the internet often includes
a lot of texts you didn’t actually mean to analyse – check
carefully to make sure your data isn’t full of bots reposting
garbage, or posts about a totally different topic.
• Read your Data! Don’t just take the results of any algorithm to
be accurate – even if it agrees with your hypothesis. At some
point you’re going to need to dive in and read samples of the
data you’ve collected, to confirm that you’re really observing
WRAPPING UP
• This workshop can really only introduce a few of the most
commonly used approaches in sentiment analysis. This is a
rapidly changing field and new algorithms and approaches are
being developed all the time.
• There are some approaches which require a lot more technical
skill than the ones we looked at today – for example, creating
your own sentiment dictionary and analyser that’s perfectly
appropriate for your corpus of texts is possible, but difficult
unless you’re a skilled programmer.
• The approaches we looked at today are very mainstream and
commonly used in a lot of academic studies – I hope they’ll be
THANK YOU!
• Questions, ideas or feedback?
• Email: robfahey@fuji.waseda.jp
• Twitter: @robfahey
• Website: robfahey.co.uk

Sentiment Analysis in Python - Waseda Data Science Week 2019
