Twitter Sentiment Analysis using Various
Classification Algorithms
Abstract
Twitter is an online news and social networking service where users anywhere in the world
post and interact with messages. Twitter posts are generally short (up to 140 characters) and
are generated continuously by the public, which makes them well suited for opinion mining.
Twitter messages can be classified as expressing positive or negative sentiment with respect
to a term-based query. Past studies of sentiment classification are not conclusive about which
features and supervised classification algorithms are best for designing an accurate and
efficient sentiment classification system. We propose to combine several feature extraction
techniques, such as emoticons, exclamation and question mark symbols, a word gazetteer, and
unigrams, to design a more accurate sentiment classification system.
Keywords
Twitter; Sentiment Analysis; Opinion Mining; Natural Language Processing
Introduction
Human decision making is strongly influenced by the assessments and judgements of others.
Before making a purchase, customers tend to gather as much information as possible about the
product they want to buy. Investors analyse a company's popularity among its customers to
predict its stock market movement before investing their money in its shares.
With the advent of social media, gathering data for evaluation has become easier and
less time consuming. Platforms such as Twitter, Facebook and LinkedIn serve as repositories
of useful data in the form of reviews, likes, comments, etc.
Opinions are linked to almost all human activities because they have a key impact on our
decision making. We often seek others' opinions before taking a decision. In the real world,
organizations and business entities always want to know public opinion about their services
and products. Consumers, in turn, seek the opinions of existing users of a product or service
before deciding to purchase or subscribe. Public opinion about political candidates can be
analysed to forecast the results of an election. In the past, organizations, governments and
business entities conducted surveys and opinion polls on focus groups to obtain citizens'
opinions and sentiments [1].
Twitter is a social networking web application with a microblogging feature and a large,
constantly growing user base. The application thus provides a rich data set in the form of
messages, usually short status updates from Twitter users, that must be expressed in no more
than 140 characters. Millions of short messages and status updates on hundreds of different
topics are generated on Twitter each day. Extracting information from these short texts has
become immensely useful for sorting and ranking the popularity of topics mentioned within the
updates. Twitter has emerged as one of the most popular platforms for expressing sentiments
and thoughts on the Internet. It is therefore useful to mine and analyse Twitter data for
information about major trending topics in the media and other spaces.
Methodology
Twitter sentiment analysis approaches are generally divided into three major categories:
1. Machine Learning Approach
2. Lexicon Based Approach
3. Hybrid Approach
The Machine Learning (ML) Approach uses linguistic features and applies well-known
machine learning algorithms.
The Lexicon Based Approach is driven by an opinion lexicon, which is a collection of
pre-compiled opinion terms. It is divided into two main approaches:
a) Dictionary based approach
b) Corpus Based approach
The Hybrid Approach combines the above two approaches.
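The dictionary-based variant of the lexicon approach can be illustrated with a minimal sketch. The tiny lexicon, the function names and the punctuation handling below are assumptions made for illustration; a real opinion lexicon would contain thousands of pre-compiled terms.

```python
# Hypothetical mini-lexicon: +1 for positive opinion terms, -1 for negative ones.
OPINION_LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "awful": -1, "hate": -1,
}


def lexicon_score(tweet: str) -> int:
    """Sum the polarity of every lexicon term found in the tweet."""
    tokens = tweet.lower().split()
    # Strip trailing/leading punctuation so "great!" still matches "great".
    return sum(OPINION_LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)


def lexicon_label(tweet: str) -> str:
    """Map the summed polarity to a sentiment label."""
    score = lexicon_score(tweet)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `lexicon_label("Awful service, I hate it")` sums two negative terms and returns `"negative"`, while a tweet with no lexicon terms is labelled `"neutral"`.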
To increase the performance and efficiency of the sentiment classification system, a
combination of well-known feature extraction methods is considered. The proposed method
compares six supervised classification algorithms:
a) Naïve Bayes Algorithm
b) Bayes Net Algorithm
c) Discriminative Multinomial Naïve Bayes (DMNB) Algorithm
d) Sequential Minimal Optimization (SMO) Algorithm
e) Hyperpipes Algorithm
f) Random Forest Algorithm
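The combination of feature types that feeds these classifiers (emoticons, exclamation and question marks, a word gazetteer, and unigrams) might be computed along the following lines. This is a sketch only: the emoticon sets, the gazetteer entries and the feature names are hypothetical, since the actual lists used in the study are not given here.

```python
import re
from collections import Counter

# Hypothetical emoticon sets and gazetteer, for illustration only.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
GAZETTEER = {"good": "pos", "love": "pos", "bad": "neg", "hate": "neg"}


def extract_features(tweet: str) -> dict:
    """Combine emoticon, punctuation, gazetteer and unigram features."""
    tokens = tweet.split()
    # Words start with a letter and may contain one inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", tweet.lower())
    features = {
        "pos_emoticons": sum(tok in POSITIVE_EMOTICONS for tok in tokens),
        "neg_emoticons": sum(tok in NEGATIVE_EMOTICONS for tok in tokens),
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        "gazetteer_pos": sum(GAZETTEER.get(w) == "pos" for w in words),
        "gazetteer_neg": sum(GAZETTEER.get(w) == "neg" for w in words),
    }
    # Unigram features: one count per distinct word.
    for word, count in Counter(words).items():
        features["unigram=" + word] = count
    return features
```

Calling `extract_features("I love it :) !!")` yields one positive emoticon, two exclamation marks, one positive gazetteer hit and a unigram count for each of the three words.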
1) Naïve Bayes (NB): This algorithm is a simple probabilistic classifier that counts the
frequencies and combinations of values in the data set under consideration and computes a
set of probabilities from them. It is based on Bayes' theorem and assumes that all attributes
are independent of one another given the value of the class variable.
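The counting behind Naïve Bayes can be sketched for categorical attributes as follows. The function names are hypothetical, and the Laplace smoothing (`alpha`) is an added assumption so that unseen values do not produce zero probabilities; it is not described in the text above.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    distinct_values = defaultdict(set)   # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values


def predict_naive_bayes(model, row, alpha=1.0):
    """Return the class with the highest log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        log_p = math.log(n_c / total)  # class prior
        for i, value in enumerate(row):
            count = value_counts.get((i, label), Counter())[value]
            # Laplace-smoothed conditional probability (assumption).
            denom = n_c + alpha * len(distinct_values[i])
            log_p += math.log((count + alpha) / denom)
        scores[label] = log_p
    return max(scores, key=scores.get)
```

The independence assumption appears in the inner loop: each attribute contributes its own factor to the class score, with no interaction terms between attributes.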
2) Bayes Net (BN): Bayesian networks are network-based models mainly used for representing
and analysing problems that involve uncertainty. A Bayesian network learns the causal
relationships and uses them to implement incremental learning. To perform classification,
the input nodes are first set with the evidence, and the output nodes are then queried
using standard Bayesian network inference.
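Setting evidence on input nodes and querying an output node can be illustrated with inference by enumeration on a toy network. Everything in this sketch is hypothetical: the three-node structure (Sentiment with two children, Emoticon and Exclaim), the probability values and the function names are chosen only to show the mechanics.

```python
from itertools import product

# Hypothetical network: Sentiment -> Emoticon, Sentiment -> Exclaim.
# All probabilities below are illustrative, not learned from data.
P_S = {"pos": 0.5, "neg": 0.5}
P_E_GIVEN_S = {"pos": {True: 0.6, False: 0.4}, "neg": {True: 0.1, False: 0.9}}
P_X_GIVEN_S = {"pos": {True: 0.5, False: 0.5}, "neg": {True: 0.3, False: 0.7}}


def joint(s, e, x):
    """Joint probability factorised along the network edges."""
    return P_S[s] * P_E_GIVEN_S[s][e] * P_X_GIVEN_S[s][x]


def query_sentiment(evidence):
    """Posterior over Sentiment given evidence on 'emoticon' and/or 'exclaim'."""
    unnormalised = {}
    for s in P_S:
        total = 0.0
        for e, x in product([True, False], repeat=2):
            assignment = {"emoticon": e, "exclaim": x}
            # Keep only assignments consistent with the observed evidence;
            # unobserved nodes are summed out.
            if all(assignment[k] == v for k, v in evidence.items()):
                total += joint(s, e, x)
        unnormalised[s] = total
    z = sum(unnormalised.values())
    return {s: p / z for s, p in unnormalised.items()}
```

With `query_sentiment({"emoticon": True})` the unobserved Exclaim node is marginalised out, so the posterior depends only on the prior and the emoticon evidence.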
3) Discriminative Multinomial Naïve Bayes (DMNB): Multinomial Naïve Bayes is a well-known
and widely used classifier for document classification and has been shown to yield
satisfactory performance. Discriminative Multinomial Naïve Bayes (DMNB) treats a document
as a bag of words. For each class c, the training data is used to estimate P(w|c), the
probability of observing the word w given the class. The estimate is obtained from the
training documents of that class by calculating each word's relative frequency of
occurrence. The classifier also needs the prior probability P(c), which is straightforward
to estimate. If n_wd denotes the number of times the word w occurs in document d, then the
probability of class c for a test document d is computed from P(c), P(w|c) and n_wd using
Bayes' rule.
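In the standard multinomial Naïve Bayes formulation, which uses the same quantities P(c), P(w|c) and n_wd, this probability can be written as below. It is given here as a sketch under the assumption that the standard model applies; the normaliser P(d) sums the numerator over all classes.

```latex
P(c \mid d) = \frac{P(c)\,\prod_{w \in d} P(w \mid c)^{\,n_{wd}}}{P(d)},
\qquad
P(d) = \sum_{c'} P(c')\,\prod_{w \in d} P(w \mid c')^{\,n_{wd}}
```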
4) SMO: Sequential Minimal Optimization (SMO) is generally used to train Support Vector
Machine (SVM) classifiers. SMO includes several optimizations designed primarily to speed
up training on large datasets, and it is designed to converge even under degenerate
conditions. It works by breaking the problem into a series of smallest-possible
sub-problems, which are solved analytically.
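The analytical step can be made concrete. In Platt's formulation of SMO, each sub-problem optimises two Lagrange multipliers at a time in closed form. The symbols below are not defined in this paper and are introduced for illustration: labels y_i in {-1, +1}, kernel K, prediction errors E_i = f(x_i) - y_i, and box constraint C.

```latex
\eta = K(x_1, x_1) + K(x_2, x_2) - 2\,K(x_1, x_2), \qquad
\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}

\alpha_2^{\text{clipped}} = \min\!\bigl(H,\ \max(L,\ \alpha_2^{\text{new}})\bigr), \qquad
\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\text{clipped}}\bigr)

(L, H) =
\begin{cases}
\bigl(\max(0,\ \alpha_2 - \alpha_1),\ \min(C,\ C + \alpha_2 - \alpha_1)\bigr) & \text{if } y_1 \neq y_2 \\
\bigl(\max(0,\ \alpha_1 + \alpha_2 - C),\ \min(C,\ \alpha_1 + \alpha_2)\bigr) & \text{if } y_1 = y_2
\end{cases}
```

Because each update has a closed form, no numerical quadratic programming solver is needed inside the training loop.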
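5) Hyperpipes: Hyperpipes is a technique that creates a “hyperpipe” for each class of a data
set. A hyperpipe records the bounds of the attribute values observed for the training
instances of its class, and a new instance is assigned to the class whose hyperpipe contains
it most fully. It can work extremely fast and effectively.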
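A minimal sketch of the hyperpipe idea for numeric attributes is shown below. It assumes that a hyperpipe stores per-attribute minimum and maximum values for its class and that prediction picks the class whose bounds contain the largest share of the instance's attributes; the function names are hypothetical and this is not Weka's implementation.

```python
def train_hyperpipes(rows, labels):
    """Record per-class (min, max) bounds for every numeric attribute."""
    pipes = {}
    for row, label in zip(rows, labels):
        if label not in pipes:
            pipes[label] = [(v, v) for v in row]
        else:
            pipes[label] = [
                (min(lo, v), max(hi, v))
                for (lo, hi), v in zip(pipes[label], row)
            ]
    return pipes


def predict_hyperpipes(pipes, row):
    """Pick the class whose bounds contain the largest share of attributes."""
    def containment(bounds):
        inside = sum(lo <= v <= hi for (lo, hi), v in zip(bounds, row))
        return inside / len(row)

    return max(pipes, key=lambda label: containment(pipes[label]))
```

Training is a single pass that only updates bounds, which is why the method is so fast.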
6) Random Forest: This algorithm builds many decision trees for classification. To classify
a new object, its input vector is passed down each tree in the forest. Each tree produces a
classification, that is, it votes for a class. The forest chooses the class with the most
votes across all trees. It also runs efficiently on large datasets.
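The voting step can be sketched in a few lines. The example stands in hand-written rules for trained trees; `forest_predict` and the three rule functions are hypothetical, and building the trees themselves (bootstrap sampling, random feature selection) is left out.

```python
from collections import Counter


def forest_predict(trees, x):
    """Let every tree vote and return the class with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-in "trees": simple rules over a feature dictionary (illustration only).
def tree_emoticon(x):
    return "positive" if x["pos_emoticons"] > 0 else "other"


def tree_exclaim(x):
    return "positive" if x["exclamations"] > 1 else "other"


def tree_negative(x):
    return "negative" if x["neg_emoticons"] > 0 else "other"


example = {"pos_emoticons": 1, "exclamations": 2, "neg_emoticons": 0}
prediction = forest_predict([tree_emoticon, tree_exclaim, tree_negative], example)
```

Here two of the three stand-in trees vote `"positive"` and one votes `"other"`, so the forest returns `"positive"`.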
Results Obtained
The six selected classification algorithms were run in the Weka tool on features extracted
from the Sanders Twitter dataset. The system was built and tested using 10-fold
cross-validation. The empirical results are presented in Tables 1-9.
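For readers unfamiliar with 10-fold cross-validation, the splitting scheme can be sketched as follows. This is a plain illustration, not Weka's procedure: the function name and seed are hypothetical, and the folds are not stratified by class, which matters for an imbalanced dataset.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A classifier would be trained on each `train` list and evaluated on the matching `test` list, with the ten sets of scores averaged at the end.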
False Positive Rate (FPR), True Positive Rate (TPR), Precision (P), Recall (R), F-score (F),
and Receiver Operating Characteristic (ROC) values are shown in the following tables.
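Most of these measures follow directly from the confusion counts of one class against the rest. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the result tables. ROC values require ranked scores and are not covered.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute TPR (recall), FPR, precision and F-score from confusion counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0        # true positive rate = recall
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + tpr:
        f_score = 2 * precision * tpr / (precision + tpr)
    else:
        f_score = 0.0
    return {"TPR": tpr, "FPR": fpr, "P": precision, "R": tpr, "F": f_score}


# Illustrative counts only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

Recall and TPR are the same quantity, which is why both keys carry the same value.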
Table 1: Naïve Bayes Result
Table 2: Bayes Net Results
Table 3: Discriminative Multinomial Naïve Bayes (DMNB) Results
Performance and Results Comparison
Based on the simulation results, Naïve Bayes performs worst of the six algorithms considered
in this study. In general, precision and recall scores are low for the Positive and Negative
classes. This is due to the large number of instances in the class ‘other’ compared with the
positive and negative classes; the Sanders dataset is highly imbalanced. Overall, the two
most balanced and best-performing algorithms are DMNB and SMO, with overall F-scores of
0.769 and 0.75 respectively.
Fig 1: Precision Comparison
Fig 2: Recall Comparison
References
[1] Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. "Sentiment analysis algorithms and
applications: A survey." Ain Shams Engineering Journal 5.4 (2014): 1093-1113.
[2] Liu, Bing. "Sentiment analysis and opinion mining." Synthesis lectures on human language
technologies 5.1 (2012): 1-167.
[3] Agarwal, Apoorv, et al. "Sentiment analysis of twitter data." Proceedings of the workshop
on languages in social media. Association for Computational Linguistics, 2011.
[4] Imran, Muhammad, et al. "Processing social media messages in mass emergency: A
survey." ACM Computing Surveys (CSUR) 47.4 (2015): 67.
[5] Feldman, Ronen. "Techniques and applications for sentiment analysis." Communications
of the ACM 56.4 (2013): 82-89.
[6] Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and
Trends in Information Retrieval 2.1-2 (2008): 1-135.
[7] Cambria, Erik, et al. "New avenues in opinion mining and sentiment analysis." IEEE
Intelligent Systems 28.2 (2013): 15-21.
[8] Witten, Ian H., and Eibe Frank. Data Mining: Practical machine learning tools and
techniques. Morgan Kaufmann, 2005.
[9] Bifet, Albert, and Eibe Frank. "Sentiment knowledge discovery in twitter streaming data."
International Conference on Discovery Science. Springer Berlin Heidelberg, 2010.
[10] Saif, Hassan, Yulan He, and Harith Alani. "Semantic sentiment analysis of twitter."
International Semantic Web Conference. Springer Berlin Heidelberg, 2012.