This project presents a general framework for sentiment analysis of Twitter data that analyzes typical public reactions to health and well-being topics on Twitter. The framework is developed in Python and is based on part-of-speech (POS) tagged bigrams. Tweets mentioning common health issues are collected using NodeXL, a free and open-source network analysis tool. The extracted unstructured Twitter data is preprocessed, and a representative feature vector is generated for each tweet. A probabilistic classifier such as Naïve Bayes is then trained to determine the polarity and polarity score of each tweet.
The system produces three major outputs: automatic classification of a given tweet, an analysis of the general public attitude in a given set of tweets, and the top stories from that set. It also contains a module that tracks the most popular words or phrases in the feed related to a specific topic.
General Framework for Sentiment Analysis of Twitter Data, with Special Attention towards Improving Health Awareness - Final Year Research Project
1. General Framework for Sentiment Analysis of Twitter Data with Special Attention Towards Improving Health Awareness
B. J. Gunasekara
Supervisor - Dr R. D. Nawarathna
2. Introduction
Social networking encourages users to express their ideas & views on their day-to-day lifestyle
Department of Stat. & Comp. Sc., Faculty of
Science, University of Peradeniya
3. Social Media Analytics
• The practice of gathering data from web
resources like blogs and social media and
analyzing that data
• Applications
Big Data Analysis
Survey & Marketing
Decision Making
4. Twitter
“To give everyone the
power to create and
share ideas and
information instantly,
without barriers”
5. 288 Million Monthly Active Users
500 Million Tweets Sent Per Day
152,000+ Tweets by Healthcare
professionals per Day
6. Tell your story with
140 characters
Textual content
User mentions
Hashtags
URLs
Location
Content of a tweet
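The parts of a tweet listed above can be separated with a few regular expressions. The following is only a minimal sketch: the patterns and the `split_tweet` helper are assumptions made for illustration, not the preprocessing used in the project, and real tweets may need more careful handling.

```python
import re

# Hypothetical patterns chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")

def split_tweet(text):
    """Separate a raw tweet into URLs, user mentions, hashtags and plain text."""
    urls = URL_RE.findall(text)
    without_urls = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(without_urls)
    hashtags = HASHTAG_RE.findall(without_urls)
    # Drop mentions entirely; keep hashtag words but remove the '#' sign.
    plain = MENTION_RE.sub(" ", without_urls).replace("#", " ")
    plain = " ".join(plain.split())
    return {"text": plain, "mentions": mentions,
            "hashtags": hashtags, "urls": urls}

# The URL below is a placeholder, not part of any collected tweet.
parts = split_tweet(
    "find your fab again with @Baldlybeautiful #chemo http://example.com/x"
)
```

Keeping the hashtag word while dropping the `#` sign is one possible choice; it lets topic words such as "chemo" still contribute to the feature vector.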
7. Most tweets carry little informational value,
but a collection of tweets can provide
valuable insight into a population
8. One voice can make a difference…
But a million can change the world!
#LetDoctorsBeDoctors #ChildhoodCancer
#BreastCancer
#digitalhealth
#ObesityCareWeek
#Parkinsons #Lymphoma
#Migraine
9. Importance of Improving Health
Literacy
• Maintain personal health & wellbeing
• Save on your medical costs
• Avoid Misinterpretations
chemo isn't so nice. Bad dreams
I really am surprised at how bad the side-effects are from
#chemo this time. It's taken me by surprise a bit. Not good.
hospitals are the worst!! hate the medicine like
smell lingering in the air why did my life become
so bad hate #chemo ahhh
Don't let chemotherapy take away your 'you‘ !!!
find your fab again with @Baldlybeautiful
My dads experimental chemo has officially stopped
his tumors from growing for an entire year now
10. Natural Language Processing
• NLP is the field concerned with linguistic
interaction between humans and computers.
• Main Tasks –
Information Extraction
Semantic Parsing
Text To 3D Scene Generation
Sentiment And Social Meaning
Machine Translation
Dialog And Speech Processing
Automatic Summarization
Text Segmentation
11. Sentiment Analysis (Opinion Mining)
• Sentiment analysis is the extraction of
subjective information from a document using
NLP, text analysis and computational linguistics.
• Basic Tasks
Polarity classification
Subjectivity/objectivity identification
Feature/aspect-based sentiment analysis
12. Related Work
• Language feature analysis
• Special frameworks
Autoregressive Moving Average (ARMA)
Latent Dirichlet Allocation (LDA)
Ailment Topic Aspect Model (ATAM)
• Derivations from existing models
BioCaster Ontology,
an extant knowledge model of laymen’s terms
13. Problem Statement
• Perform a sentiment analysis aimed at
improving health awareness,
by analyzing the typical public reaction to
common illnesses and treatments in the Twitter
community.
14. Methodology
• The proposed method is based on POS Tagged
Bigrams with Naïve Bayes Classifier
16. Feature Extraction
• Tweet: “200 lives were lost, coz of this massive dengue outbreak”
• Unigrams: ['lives', 'lost', 'coz', 'massive', 'dengue', 'outbreak']
• Bigrams: ['lives_lost', 'lost_coz', 'coz_massive', 'massive_dengue', 'dengue_outbreak']
• POS tagging: [('lives', 'NNS'), ('were', 'VBD'), ('lost', 'VBN'), ('coz', 'NN'), ('of', 'IN'), ('this', 'DT'), ('massive', 'JJ'), ('dengue', 'NN'), ('outbreak', 'NN')]
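The unigram and bigram lists on this slide can be reproduced with a short sketch. The stopword set below is an assumption picked so that the example tweet gives the same tokens as the slide; the project's actual stopword list and tokenizer are not given.

```python
import re

# Assumed, deliberately tiny stopword list; the original list is not given.
STOPWORDS = {"were", "of", "this", "the", "a", "an", "is", "are", "was"}

def unigrams(tweet):
    """Lowercase the tweet, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]

tweet = "200 lives were lost, coz of this massive dengue outbreak"
uni = unigrams(tweet)
bi = bigrams(uni)
```

Here the bigrams are built from the filtered unigrams rather than from the raw token stream, which is what the slide's lists suggest (for example `lives_lost` skips "were").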
17. Bigram vs. Unigram
• The frequency distribution of bigrams in a
string is used for simple statistical analysis of
text.
• Unlike a unigram, a bigram pairs each word with
its neighbour (increased long-tail specificity).
• The classifier therefore has more context for predicting
the label than when relying on single words.
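A bigram frequency distribution of the kind mentioned above can be sketched with `collections.Counter`. The token lists are made-up toy data for illustration, not tweets from the collected datasets, and `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter

def bigram_counts(token_lists):
    """Count adjacent word pairs across a collection of tokenized tweets."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy, invented token lists used only to show the mechanics.
docs = [
    ["last", "chemo", "today"],
    ["last", "chemo", "done"],
    ["breast", "cancer", "awareness"],
]
counts = bigram_counts(docs)
top = counts.most_common(1)
```

Ranking the counts with `most_common` is also one plausible way to surface popular phrases in a feed.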
18. POS Tagging
• The process of labeling each word with its part of
speech, based on both its definition
and its context.
• Mainly nouns & adjectives were considered.
• Adjectives modify nouns and so add
meaning to them.
Penn Treebank
Brown Corpus
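As a rough sketch of keeping only nouns and adjectives, the tagged tokens from the feature-extraction slide can be filtered by Penn Treebank tag prefix (`NN*` for nouns, `JJ*` for adjectives). How the project combined this filter with bigram generation is not stated, so building bigrams from the surviving words is an assumption, and both helper names are hypothetical.

```python
# Tagged tokens copied from the feature-extraction slide (Penn Treebank tags).
tagged = [('lives', 'NNS'), ('were', 'VBD'), ('lost', 'VBN'), ('coz', 'NN'),
          ('of', 'IN'), ('this', 'DT'), ('massive', 'JJ'), ('dengue', 'NN'),
          ('outbreak', 'NN')]

def keep_nouns_and_adjectives(tagged_tokens):
    """Keep words whose tag starts with NN (noun) or JJ (adjective)."""
    return [w for w, tag in tagged_tokens if tag.startswith(("NN", "JJ"))]

def pos_filtered_bigrams(tagged_tokens):
    """Assumed scheme: form bigrams from the words that survive the filter."""
    words = keep_nouns_and_adjectives(tagged_tokens)
    return [a + "_" + b for a, b in zip(words, words[1:])]

kept = keep_nouns_and_adjectives(tagged)
features = pos_filtered_bigrams(tagged)
```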
19. Naïve Bayes classifier
• Based on Bayes' theorem
• It assumes that, given the class, each attribute
is conditionally independent of all other attributes.
• Well suited to categorical data: the probabilities are
easy to estimate as ratios of counts.
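The independence assumption and the "ratios of counts" idea can be illustrated with a from-scratch multinomial Naïve Bayes. This is only a sketch under stated assumptions: the class name `TinyNaiveBayes`, the add-one (Laplace) smoothing and the toy training features are all introduced here for illustration, and the project itself relied on NLTK's classifiers rather than this code.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def fit(self, feature_lists, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_lists, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, feats):
        """Log prior plus summed log likelihoods, one score per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in feats:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((counts[f] + 1) / denom)
            scores[label] = score
        return scores

    def predict_proba(self, feats):
        """Normalize the class scores into a polarity score per class."""
        scores = self.log_scores(feats)
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}

    def predict(self, feats):
        probs = self.predict_proba(feats)
        return max(probs, key=probs.get)

# Toy, invented training features; not taken from the labeled datasets.
train = [
    (["bad_dreams", "side_effects"], "negative"),
    (["hate_chemo", "bad_smell"], "negative"),
    (["tumors_stopped", "good_news"], "positive"),
]
model = TinyNaiveBayes().fit([f for f, _ in train], [l for _, l in train])
label = model.predict(["hate_chemo"])
probs = model.predict_proba(["hate_chemo"])
```

The normalized probabilities returned by `predict_proba` are one way to read a "polarity score" alongside the predicted polarity.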
20. System Implementation
• Python 3.4
operator - Functional interface to built-in operators.
itertools - Functions creating iterators for efficient looping.
re - Searching within and changing text using formal
patterns.
• NLTK
probability - Classes for representing and processing
probabilistic information
classify - Classifiers
metrics - Testing & validation
• Matplotlib & Pylab
• Tkinter
21. Experimental Setup
• Specific health topics, illnesses and treatments
were selected using WebMD and Mayo Clinic
• Tweets related to those issues were collected
using NodeXL tool.
• Data was collected over a period of time so that
the sample is not dominated by unusual
outliers.
• Training sets
– the datasets were distributed among groups of 10
people each, and the label of a tweet was
assigned according to the tag chosen by
the majority.
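The majority-based labeling described above could look roughly like the sketch below. Returning `None` on a tie or for an empty vote list is an assumption; the slides do not say how ties were resolved, and `majority_label` is a hypothetical helper.

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent tag among annotators, or None on a tie."""
    ranked = Counter(votes).most_common(2)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumed tie policy: leave the tweet unlabeled
    return ranked[0][0]

# Ten hypothetical annotator tags for a single tweet.
votes = ["negative"] * 6 + ["positive"] * 4
label = majority_label(votes)
```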
22. • Both Naïve Bayes and Maximum Entropy
classifiers were used.
• Experiments were carried out for
different combinations of unigrams and bigrams,
with and without part-of-speech (POS) tagging.
• The performance was evaluated with
cross-validation.
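A k-fold split for cross-validation can be sketched as follows. The number of folds used in the experiments is not stated, so `k = 5` here is an assumption, and the sketch does not shuffle the data (a real run would normally shuffle first).

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal, contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size

# Assumed fold count; every item lands in exactly one test fold.
folds = list(k_fold_indices(10, 5))
```

Each classifier would then be trained on `train_idx` and scored on `test_idx`, with the scores averaged over the folds.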
23. Datasets
Name | Content (keywords) | # | From | To | Classified | Polarity Ratio (Negative:Positive)
Dengue | Dengue | 472 | 27/04/2015 20:29 | 1/7/2015 15:14 | Yes | 323:149
H1N1 | H1N1, Influenza | 548 | 24/06/15 1:45 | 30/06/15 14:57 | Yes | 314:234
Chemo-I | Chemotherapy | 170 | 12/10/15 7:12 | 22/10/15 14:37 | Yes | 72:98
Chemo-II | Chemotherapy | 734 | 12/10/2015 12:04 | 22/10/15 14:37 | No | -
24. Experiment 1: Dengue Dataset
Dengue, Dengue Vaccine
Metric | Naïve Bayes Unigrams | Naïve Bayes Bigrams | Naïve Bayes POS-Tagged Bigrams | MaxEnt Unigrams | MaxEnt Bigrams | MaxEnt POS-Tagged Bigrams
Accuracy | 72.52 | 75.50 | 81.82 | 68.68 | 70.32 | 76.06
Weighted Precision | 74.26 | 74.40 | 81.69 | 72.42 | 65.91 | 61.28
Weighted Recall | 70.70 | 73.77 | 82.26 | 67.30 | 70.82 | 57.70
Weighted F-measure | 70.90 | 71.00 | 79.84 | 67.55 | 60.57 | 58.72
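One common reading of "weighted" precision, recall and F-measure is the support-weighted average of the per-class scores. The sketch below follows that reading, which is an assumption: the slides do not define the weighting, and the labels are toy values rather than classifier output from the experiments.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F-measure over all classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f_measure = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / n
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n / total  # share of true examples belonging to this class
        precision += weight * p_c
        recall += weight * r_c
        f_measure += weight * f_c
    return precision, recall, f_measure

# Toy, invented labels used only to exercise the function.
y_true = ["neg", "neg", "neg", "pos", "pos"]
y_pred = ["neg", "neg", "pos", "pos", "neg"]
p, r, f = weighted_scores(y_true, y_pred)
```

Under this definition the weighted recall always equals plain accuracy. The reported tables show the two differing, so the project may have used a different weighting or averaged over folds.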
25. Accuracy
[Bar chart: accuracy of the Naïve Bayes and Maximum Entropy classifiers with unigrams, bigrams and POS-tagged bigrams; vertical axis from 60.00 to 85.00]
26. Weighted F-measure
[Bar chart: weighted F-measure of the Naïve Bayes and Maximum Entropy classifiers with unigrams, bigrams and POS-tagged bigrams; vertical axis from 0.00 to 90.00]
27. Experiment 2: H1N1 Dataset
H1N1, Influenza
Naïve Bayes
Metric | Unigrams | Bigrams | POS-Tagged Bigrams
Accuracy | 67.43 | 70.59 | 76.04
Weighted Precision | 67.52 | 70.62 | 76.09
Weighted Recall | 67.95 | 70.44 | 76.05
Weighted F-measure | 65.69 | 70.08 | 75.78
28. Experiment 3: Chemo-I Dataset
Chemotherapy
Naïve Bayes
Metric | Unigrams | Bigrams | POS-Tagged Bigrams
Accuracy | 75.88 | 76.47 | 78.24
Weighted Precision | 78.23 | 78.66 | 79.96
Weighted Recall | 75.10 | 75.60 | 77.16
Weighted F-measure | 75.69 | 76.25 | 77.93
29. Polarity Checker : Dataset Analysis
30. Polarity Checker : Top Stories
1
Positive: #Dengue News: Scientists identify the skin immune cells targeted by the dengue virus
Negative: United Nations News Centre - At least 3,000 suspected Dengue fever cases reported in Yemen – UN health agency:
2
Positive: Co-ordination meet of BBMP Health and edu. dept. regarding control and prevention of Dengue and Chikungunya fever spread by Mosquito bite. (1/5)
Negative: #MyiTimes Country faces largest dengue epidemic ever - KUALA LUMPUR: The country is probably facing the largest dengue problem
3
Positive: Well that's a 1st! Malaysia Dept of Health officials doing house to house calls looking for dengue hot spots!! Clean bill of health here!
Negative: #Dengue News: Country faces largest dengue epidemic ever - Free Malaysia Today
4
Positive: @PascalBarollier Fantastic! Thanks for helping our tribe put a face to dengue global leaders won't forget.
Negative: Country faces largest dengue epidemic ever: The number of deaths has doubled this year compared to the same period…
5
Positive: @DengueInfo Thank you for helping us get the word out on Dengue Tribe! To help put a face to dengue, join here
Negative: #Yemen Yemen: At least 3,000 suspected Dengue fever cases reported in Yemen – UN health agency says
31. Polarity Checker : Text Analysis
33. Buzzmeter : Unigram vs. Bigram
34. Buzzmeter : Unigram vs. Bigram
• Chemo radiation
• Breast cancer
• Last chemo
• Cancer awareness
35. Conclusion
• This research presents a sentiment analysis framework with
special attention towards improving health
awareness. It can:
automatically classify a given tweet
generate the general attitude from a given set of
tweets, with top stories
track the most commonly used words/phrases in
health-related tweets
• POS-tagged bigrams using nouns + adjectives
with the Naïve Bayes method produced the
best overall performance.
36. Future Recommendations
• Real-time analysis of Twitter data
• Web plug-ins
• Mobile apps
• Identifying the spreading pattern of a disease,
along with threatened areas & age groups
• Health alerts/warnings system