SlideShare uma empresa Scribd logo
1 de 4
Baixar para ler offline
Analysis of Textual Data Classification with a Reddit Comments Dataset
Adam Babs
Abstract
Sentiment analysis became an area of intense
research as it may be found very useful in
various areas and may provide valuable infor-
mation. Text classification is the problem of
labeling natural language texts into a set of
predefined classes.
The main objective of this project is to cat-
egorize comments from the American social
news aggregation, web content rating, and
discussion website – Reddit.
Various methods have been applied to develop
the training algorithms and improve their ac-
curacy, I experimented with different classi-
fication algorithms, ensemble methods, nu-
merous text preprocessing approaches, vari-
ous feature extraction methods. My experi-
ments demonstrate how combining the afore-
mentioned methods allowed me to train my fi-
nal model with accuracy of 57.455%, which is
approximately 5.5 percentage points increase
in comparison to where I began.
1 Introduction
Reddit community does not have rules specifying how
comments in particular categories should be com-
posed and what type of vocabulary or structures they
should comprise of. It is what makes the task of
labeling comments a complex problem - successful
predictions mostly depend on the words used in given
categories and words used in various categories may
be very similar.
My main goal was to achieve the highest possible ac-
curacy of predictors on the Reddit comments. While
developing the model which achieved the highest pre-
diction performance I experimented with numerous
combinations of feature extractions and preprocess-
ing methods which I applied to learn algorithms. I
applied some basic text data preprocessing meth-
ods such as lemmatizing, stemming, removing; stop-
words, punctuation, numbers, characters outside of
the English alphabet, converting the text to lower-
case. At this point, I could observe a relatively high
increase in the accuracy of algorithms.
Subsequently, I experimented with feature extraction
and compared the accuracy of algorithms using the
Count Vectorization and computing Term Frequency
— Inverse Document Frequency. I tried training mod-
els including and excluding different features based
on their frequency of occurrence and using different
numbers of features that were included in the learning
process.
Logistic Regression, Stochastic Gradient Descent
Classifier, Support Vector Classifier, Decision Trees,
Multinomial Na¨ıve Bayes and Bernoulli Na¨ıve Bayes
algorithms were used to perform comments classifi-
cation. To increase the prediction performance of
classifiers, I applied the Grid Search and Random
Search methods to find hyperparameters which pro-
vide the highest accuracy.
Ultimately, the highest accuracy was achieved as a re-
sult of using the Ensemble Methods of stacking three
algorithms and using hard voting. These three al-
gorithms were: Multinomial Na¨ıve Bayes, Stochastic
Gradient Descent Classifier and Logistic Regression.
Diagram 1. Workflow I applied to solve the labelling
problem
Source: Google Developers Machine Learning Guide
2 Related work
As stated before document classification became an
area of intense research and found its’ application in
numerous fields. Tagging the content posted on the
internet, analyzing customer reviews, spam filtering
or extracting information from textual data for mar-
keting purposes are only a few of them. Many scien-
tific papers indicate how crucial and useful sentiment
analysis became and authors discuss how significant
in text labeling problems, are appropriate feature ex-
1
tractions and accurate data preprocessing.
In recent years, document classification has become
important due to the advent of large amounts of data
in digital form. For several decades now, document
classification in the form of text classification systems
have been widely implemented in numerous applica-
tions (Dino Isa et al., 2008).
The representation of a problem has a strong impact
on the generalization accuracy of a learning system.
For the problem of text categorization of a document,
which is typically a string of characters, has to be
transformed into a representation, which is suitable
for the learning algorithm and the classification task
(Li-Ping Jing et al., 2003).
Isabelle Guyon and Andre Elisseeff discuss other
profits of appropriate feature selection and how it
influences the accuracy of prediction algorithms.
There are many potential benefits of variable
and feature selection: facilitating data visualiza-
tion and data understanding, reducing the mea-
surement and storage requirements, reducing train-
ing and utilization times, defying the curse of
dimensionality to improve prediction performance
(Isabelle Guyon and Andre Elisseeff, 2003).
Daniel Morariu, Radu Cretulescu, Lucian Vintan dis-
cuss how implementing an ensemble method based on
the Support Vector Machine techniques and Na¨ıve
Bayes Theory may improve the prediction perfor-
mance of predictors. They also describe the stacking
method as a strategy that can improve the accuracy
on the text classification problem. We combine mul-
tiple classifiers hoping that the classification accuracy
can be improved without a significant increase in re-
sponse time. Instead of building only one highly accu-
rate specialized classifier with much time and effort,
we build and combine several simpler classifiers. A
usually approach is to build individual classifiers and
later combine their judgments to make the final deci-
sion (D. Morariu, R. Cretulescu, L. Vintan, 2010).
3 Dataset and setup
The data used in the project is a file made up of Red-
dit comments that had been posted on the platform.
The train dataset on which I conducted experiments
and trained models comprises of 69 999 comments and
their labels. The test dataset on which I tested the
model and reached the accuracy of 57.455% in the
competition comprises of 29 999 samples without la-
bels.
Initially, I trained models before preprocessing the
textual data and gathered the results. Subsequently,
I started with preprocessing and experimented with
various methods and different setups of lemmatizing,
stemming, deleting words containing less than three
characters, removing digits, numbers and all charac-
ters that are not in the English alphabet. I removed
all hyperlinks, stopwords, punctuation and turned ev-
ery character into a lower case.
Various combinations resulted in different accuracies,
ultimately the approach that gave us the highest ac-
curacy was: lemmatizing, turning every character
to lower case, removing hyperlinks, stopwords, and
punctuation. After applying the approach that gave
us the highest accuracy, I deleted one sample (number
58011) - it contained an empty string.
Category Number of samples
baseball 3500
canada 3500
gameofthrones 3500
wow 3500
Music 3500
anime 3500
europe 3500
worldnews 3500
GlobalOffensive 3500
hockey 3500
nfl 3500
AskReddit 3500
nba 3500
soccer 3500
funny 3500
trees 3500
Overwatch 3500
conspiracy 3500
leagueoflegends 3500
movies 3499
Table 1. Categories and number of comments in
each category
4 Proposed approach
Initially, I trained models without preprocessing the
data, I just applied the Tf-Idf weighting and Count
Vectorization. Tables below present how these two
techniques influence the classifying accuracy.
2
Classifier Accuracy
Multinomial Na¨ıve Bayes 55.7265%
Logistic Regression 54.3222%
Stochastic Gradient Descent 55.72507%
Support Vector Classifier 55.31507%
Table 2. Accuracy before data preprocessing, Tf-Idf
Weighting
Classifier Accuracy
Multinomial Na¨ıve Bayes 53.07789%
Logistic Regression 52.50789%
Stochastic Gradient Descent 45.39635%
Support Vector Classifier 49.02784%
Table 3. Accuracy before data preprocessing, Count
Vectorization
The difference in prediction performance is relatively
high, thus I chose to use Tf-Idf weighting in further ex-
periments. This indicates that rare words contribute
more weights to the models I implemented, which is
reasonable as comments in the same section refer to
the mutual topics and similar vocabulary is used.
As I conducted the experiments with different pre-
processing methods I found out that lemmatization,
turning everything character to lower case and remov-
ing: punctuation, hyperlinks, stop words and remov-
ing samples that were empty after preprocessing (sam-
ple number 58011), yields the best predictions, hence
I decided to use this approach while constructing the
final model.
Classifier Accuracy
Multinomial Na¨ıve Bayes 56.07222%
Logistic Regression 54.64506%
Stochastic Gradient Descent 55.08364%
Support Vector Classifier 54.74792%
Table 4. Accuracy after data preprocessing, Tf-Idf
Weighting
Classifier Accuracy
Multinomial Na¨ıve Bayes 54.44792%
Logistic Regression 52.38788%
Stochastic Gradient Descent 50.78357%
Support Vector Classifier 49.28641%
Table 5. Accuracy after data preprocessing, Count
Vectorization
Contrary to my expectations, data preprocessing in-
creased the prediction performance of only two mod-
els - Naive Bayes Algorithm and Logistic Regression
while decreasing the accuracy of two others. This was
a valuable finding and it helped me improve the final
accuracy of my model.
To train my final model I decided to use the Stack-
ing Ensemble Method. Results show that stacking
has been proven to be consistently effective over all
domains, working better than majority voting and
that using the opinion summary can improve the per-
formance further (Ying Su et al., 2012). I combined
the performance of the Multinomial Na¨ıve Bayes,
Stochastic Gradient Descent Classifier and Logistic
Regression.
To improve the prediction performance of the model,
I tuned hyperparameters and experimented with
changing the voting method and added the Support
Vector Classifier. Ultimately, the majority voting
method resulted in the highest accuracy and adding
another base model, the Support Vector Classifier did
not increase the successful predictions rate.
As I discovered earlier, the Stochastic Gradient De-
scent yields a higher accuracy on a dataset before
preprocessing, hence I used this finding in modeling
the stacked classifier.
Another method that I experimented with was imple-
menting the Random Forest Classifier. To optimize
its’ performance I tuned hyperparameters using the
Grid Search method, although the achieved accuracy
of the algorithm was not satisfying.
5 Results
When constructing the final model with the highest
accuracy, I built three individual classifiers and com-
bined their performance. I stacked Multinomial Naive
Bayes, Support Vector Classification and Stochastic
Gradient Descent and merged their judgments to
make the final decision based on hard voting. Base
models in the final classifier were fitted on different
datasets. Stochastic Gradient Descent and the final
model were fitted on the text before preprocessing,
Multinomial Naive Bayes and Logistic Regression
were fitted on preprocessed comments. The algo-
rithm used for Kaggle submission was fitted on the
preprocessed training. It resulted in a performance
presented below:
Stacked Classifier Performance:
Average accuracy from 5-fold cross validation on the
training dataset: 57.4679%
3
Accuracy on Kaggle (preprocessed data): 57.455%
Accuracy on Kaggle (data before preprocessing):
53.677%
I applied various techniques, tried numerous models
and different combinations of stacking these models.
My final model had the highest accuracy that I man-
aged to achieve.
I experimented with:
• stemming,
• n-grams,
• removing all strings of length less than two,
• removing all digits and numbers,
• tuning max df and min df hyperparameters1
• tuning the max features hyperparameter2
1
Maximum and minimum document frequency across
the corpus, which ignore terms that have a document fre-
quency higher or lower than the given threshold,
2
Maximum features hyperparameter which builds a
vocabulary that only consider the top given features or-
dered by term frequency across the corpus.
All of the above techniques failed to improve the pre-
diction accuracy of implemented classifiers.
To develop the Random Forest Classifier performance,
I used the grid search method to find the best parame-
ters. To expand the search, I manipulated the number
of features included in the training process, however,
I did not notice any improvements.
Parameters tuned to develop the classifier:
• number of estimators
• maximum depth of the tree
• minimum number of samples required to split
an internal node
• minimum number of samples required to be at
a leaf node
Random Forest Classifier Performance:
Accuracy: 44.29285%
Best parameters: max depth= 30, min samples leaf=1,
min samples split=10, n estimators= 1200
6 Discussion and Conclusion
Labeling textual data coming from twenty different
categories, where the majority of samples comprise
of similar language written using ordinary, everyday
language is a demanding task and even tiny improve-
ments in prediction performance result in a consid-
erable increase in the number of correctly classified
samples when working on big datasets.
Considering the fact that I started with an accuracy
oscillating between 45% and 52% (depending on the
model used) and improved the final accuracy of the
predictor to around 55% only as a result of feature
extraction and text preprocessing, I managed to cor-
rectly classify a few thousand more comments. *When
predicting the labels on a dataset containing 100 000 sam-
ples. It proves that even a small increase in accuracy
is meaningful.
The main conclusion drawn from this task is that
experimenting with preprocessing, feature extraction,
learning various models and tuning their hyperparam-
eters significantly contribute to the improvement of
predicting performance and makes room for continu-
ous development. Despite the fact that I found some
of the discovered results rather unexpected, they were
valuable and I built my final model upon them. I also
observed how effective combining the performance of
several algorithms may be.
References
[1] Isabelle Guyon and Andre Elisseeff. An Introduc-
tion to Variable and Feature Selection, 2003.
[2] Dino Isa, Lam H. Lee, V.P. Kallimani and R.
RajKumar Text Document Preprocessing with the
Bayes Formula for Classification Using the Sup-
port Vector Machine, 2008.
[3] Google Developers Machine Learning Guide on
Text Classification – workflow diagram
https://developers.google.com/machine
-learning/guides/text-classification
[4] D. Morariu, R. Cretulescu and L. Vintan. Improv-
ing a SVM Meta-classifier for Text Documents by
using Naive Bayes, 2003.
[5] Ying Su, Yong Zhang, Donghong Ji, Yibing Wang
and Hongmiao Wu. Ensemble Learning for Senti-
ment Classification, 2012.
[6] i-Ping Jing, Hou-Kuan Huang and Hong-Bo Shi.
Improved feature selection approach TFIDF in text
mining, 2002.
4

Mais conteúdo relacionado

Mais procurados

Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...IJECEIAES
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...Institute of Contemporary Sciences
 
Lecture Defense June 2011
Lecture Defense June 2011Lecture Defense June 2011
Lecture Defense June 2011Louis Ventura
 
Temporal Learning and Sequence Modeling for a Job Recommender System
Temporal Learning and Sequence Modeling for a Job Recommender SystemTemporal Learning and Sequence Modeling for a Job Recommender System
Temporal Learning and Sequence Modeling for a Job Recommender SystemAnoop Kumar
 
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
 
Strategies for Metabolomics Data Analysis
Strategies for Metabolomics Data AnalysisStrategies for Metabolomics Data Analysis
Strategies for Metabolomics Data AnalysisDmitry Grapov
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...csandit
 
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...AIRCC Publishing Corporation
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Ontologies mining using association rules
Ontologies mining using association rulesOntologies mining using association rules
Ontologies mining using association rulesChemseddine Berbague
 
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...IRJET Journal
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...IJERA Editor
 

Mais procurados (17)

Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
 
Lecture Defense June 2011
Lecture Defense June 2011Lecture Defense June 2011
Lecture Defense June 2011
 
E1802023741
E1802023741E1802023741
E1802023741
 
Temporal Learning and Sequence Modeling for a Job Recommender System
Temporal Learning and Sequence Modeling for a Job Recommender SystemTemporal Learning and Sequence Modeling for a Job Recommender System
Temporal Learning and Sequence Modeling for a Job Recommender System
 
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
 
Strategies for Metabolomics Data Analysis
Strategies for Metabolomics Data AnalysisStrategies for Metabolomics Data Analysis
Strategies for Metabolomics Data Analysis
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
 
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
Ontologies mining using association rules
Ontologies mining using association rulesOntologies mining using association rules
Ontologies mining using association rules
 
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...
IRJET- An Extensive Study of Sentiment Analysis Techniques and its Progressio...
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...
 
Ch06
Ch06Ch06
Ch06
 

Semelhante a Analysis of Textual Data Classification with a Reddit Comments Dataset

Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...Geetika Gautam
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learningcsandit
 
An approach for improved students’ performance prediction using homogeneous ...
An approach for improved students’ performance prediction  using homogeneous ...An approach for improved students’ performance prediction  using homogeneous ...
An approach for improved students’ performance prediction using homogeneous ...IJECEIAES
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertainShakas Technologies
 
Paper id 28201441
Paper id 28201441Paper id 28201441
Paper id 28201441IJRAT
 
Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objectsnexgentechnology
 
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTSNexgen Technology
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertainnexgentech15
 
Opinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationOpinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationIJECEIAES
 
A Formal Machine Learning or Multi Objective Decision Making System for Deter...
A Formal Machine Learning or Multi Objective Decision Making System for Deter...A Formal Machine Learning or Multi Objective Decision Making System for Deter...
A Formal Machine Learning or Multi Objective Decision Making System for Deter...Editor IJCATR
 
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...ankarao14
 
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...ankarao14
 
Relevance feature discovery for text mining
Relevance feature discovery for text miningRelevance feature discovery for text mining
Relevance feature discovery for text miningredpel dot com
 

Semelhante a Analysis of Textual Data Classification with a Reddit Comments Dataset (20)

Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
 
Recommender system
Recommender systemRecommender system
Recommender system
 
J017256674
J017256674J017256674
J017256674
 
ashu ppt final.pptx
ashu ppt final.pptxashu ppt final.pptx
ashu ppt final.pptx
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
 
20120140505011
2012014050501120120140505011
20120140505011
 
SecondaryStructurePredictionReport
SecondaryStructurePredictionReportSecondaryStructurePredictionReport
SecondaryStructurePredictionReport
 
An approach for improved students’ performance prediction using homogeneous ...
An approach for improved students’ performance prediction  using homogeneous ...An approach for improved students’ performance prediction  using homogeneous ...
An approach for improved students’ performance prediction using homogeneous ...
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
 
Paper id 28201441
Paper id 28201441Paper id 28201441
Paper id 28201441
 
Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objects
 
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
 
Opinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationOpinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classication
 
Research proposal
Research proposalResearch proposal
Research proposal
 
A Formal Machine Learning or Multi Objective Decision Making System for Deter...
A Formal Machine Learning or Multi Objective Decision Making System for Deter...A Formal Machine Learning or Multi Objective Decision Making System for Deter...
A Formal Machine Learning or Multi Objective Decision Making System for Deter...
 
Cw36587594
Cw36587594Cw36587594
Cw36587594
 
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
 
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...Innovating Multi-Class Text Classification:Transforming Models with propmtify...
Innovating Multi-Class Text Classification:Transforming Models with propmtify...
 
Relevance feature discovery for text mining
Relevance feature discovery for text miningRelevance feature discovery for text mining
Relevance feature discovery for text mining
 

Último

Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsNurulAfiqah307317
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 

Último (20)

Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 

Analysis of Textual Data Classification with a Reddit Comments Dataset

  • 1. Analysis of Textual Data Classification with a Reddit Comments Dataset Adam Babs Abstract Sentiment analysis became an area of intense research as it may be found very useful in various areas and may provide valuable infor- mation. Text classification is the problem of labeling natural language texts into a set of predefined classes. The main objective of this project is to cat- egorize comments from the American social news aggregation, web content rating, and discussion website – Reddit. Various methods have been applied to develop the training algorithms and improve their ac- curacy, I experimented with different classi- fication algorithms, ensemble methods, nu- merous text preprocessing approaches, vari- ous feature extraction methods. My experi- ments demonstrate how combining the afore- mentioned methods allowed me to train my fi- nal model with accuracy of 57.455%, which is approximately 5.5 percentage points increase in comparison to where I began. 1 Introduction Reddit community does not have rules specifying how comments in particular categories should be com- posed and what type of vocabulary or structures they should comprise of. It is what makes the task of labeling comments a complex problem - successful predictions mostly depend on the words used in given categories and words used in various categories may be very similar. My main goal was to achieve the highest possible ac- curacy of predictors on the Reddit comments. While developing the model which achieved the highest pre- diction performance I experimented with numerous combinations of feature extractions and preprocess- ing methods which I applied to learn algorithms. I applied some basic text data preprocessing meth- ods such as lemmatizing, stemming, removing; stop- words, punctuation, numbers, characters outside of the English alphabet, converting the text to lower- case. At this point, I could observe a relatively high increase in the accuracy of algorithms. Subsequently, I experimented with feature extraction and compared the accuracy of algorithms using the Count Vectorization and computing Term Frequency — Inverse Document Frequency. I tried training mod- els including and excluding different features based on their frequency of occurrence and using different numbers of features that were included in the learning process. Logistic Regression, Stochastic Gradient Descent Classifier, Support Vector Classifier, Decision Trees, Multinomial Na¨ıve Bayes and Bernoulli Na¨ıve Bayes algorithms were used to perform comments classifi- cation. To increase the prediction performance of classifiers, I applied the Grid Search and Random Search methods to find hyperparameters which pro- vide the highest accuracy. Ultimately, the highest accuracy was achieved as a re- sult of using the Ensemble Methods of stacking three algorithms and using hard voting. These three al- gorithms were: Multinomial Na¨ıve Bayes, Stochastic Gradient Descent Classifier and Logistic Regression. Diagram 1. Workflow I applied to solve the labelling problem Source: Google Developers Machine Learning Guide 2 Related work As stated before document classification became an area of intense research and found its’ application in numerous fields. Tagging the content posted on the internet, analyzing customer reviews, spam filtering or extracting information from textual data for mar- keting purposes are only a few of them. Many scien- tific papers indicate how crucial and useful sentiment analysis became and authors discuss how significant in text labeling problems, are appropriate feature ex- 1
  • 2. tractions and accurate data preprocessing. In recent years, document classification has become important due to the advent of large amounts of data in digital form. For several decades now, document classification in the form of text classification systems have been widely implemented in numerous applica- tions (Dino Isa et al., 2008). The representation of a problem has a strong impact on the generalization accuracy of a learning system. For the problem of text categorization of a document, which is typically a string of characters, has to be transformed into a representation, which is suitable for the learning algorithm and the classification task (Li-Ping Jing et al., 2003). Isabelle Guyon and Andre Elisseeff discuss other profits of appropriate feature selection and how it influences the accuracy of prediction algorithms. There are many potential benefits of variable and feature selection: facilitating data visualiza- tion and data understanding, reducing the mea- surement and storage requirements, reducing train- ing and utilization times, defying the curse of dimensionality to improve prediction performance (Isabelle Guyon and Andre Elisseeff, 2003). Daniel Morariu, Radu Cretulescu, Lucian Vintan dis- cuss how implementing an ensemble method based on the Support Vector Machine techniques and Na¨ıve Bayes Theory may improve the prediction perfor- mance of predictors. They also describe the stacking method as a strategy that can improve the accuracy on the text classification problem. We combine mul- tiple classifiers hoping that the classification accuracy can be improved without a significant increase in re- sponse time. Instead of building only one highly accu- rate specialized classifier with much time and effort, we build and combine several simpler classifiers. A usually approach is to build individual classifiers and later combine their judgments to make the final deci- sion (D. Morariu, R. Cretulescu, L. Vintan, 2010). 3 Dataset and setup The data used in the project is a file made up of Red- dit comments that had been posted on the platform. The train dataset on which I conducted experiments and trained models comprises of 69 999 comments and their labels. The test dataset on which I tested the model and reached the accuracy of 57.455% in the competition comprises of 29 999 samples without la- bels. Initially, I trained models before preprocessing the textual data and gathered the results. Subsequently, I started with preprocessing and experimented with various methods and different setups of lemmatizing, stemming, deleting words containing less than three characters, removing digits, numbers and all charac- ters that are not in the English alphabet. I removed all hyperlinks, stopwords, punctuation and turned ev- ery character into a lower case. Various combinations resulted in different accuracies, ultimately the approach that gave us the highest ac- curacy was: lemmatizing, turning every character to lower case, removing hyperlinks, stopwords, and punctuation. After applying the approach that gave us the highest accuracy, I deleted one sample (number 58011) - it contained an empty string. Category Number of samples baseball 3500 canada 3500 gameofthrones 3500 wow 3500 Music 3500 anime 3500 europe 3500 worldnews 3500 GlobalOffensive 3500 hockey 3500 nfl 3500 AskReddit 3500 nba 3500 soccer 3500 funny 3500 trees 3500 Overwatch 3500 conspiracy 3500 leagueoflegends 3500 movies 3499 Table 1. Categories and number of comments in each category 4 Proposed approach Initially, I trained models without preprocessing the data, I just applied the Tf-Idf weighting and Count Vectorization. Tables below present how these two techniques influence the classifying accuracy. 2
  • 3. Classifier Accuracy Multinomial Na¨ıve Bayes 55.7265% Logistic Regression 54.3222% Stochastic Gradient Descent 55.72507% Support Vector Classifier 55.31507% Table 2. Accuracy before data preprocessing, Tf-Idf Weighting Classifier Accuracy Multinomial Na¨ıve Bayes 53.07789% Logistic Regression 52.50789% Stochastic Gradient Descent 45.39635% Support Vector Classifier 49.02784% Table 3. Accuracy before data preprocessing, Count Vectorization The difference in prediction performance is relatively high, thus I chose to use Tf-Idf weighting in further ex- periments. This indicates that rare words contribute more weights to the models I implemented, which is reasonable as comments in the same section refer to the mutual topics and similar vocabulary is used. As I conducted the experiments with different pre- processing methods I found out that lemmatization, turning everything character to lower case and remov- ing: punctuation, hyperlinks, stop words and remov- ing samples that were empty after preprocessing (sam- ple number 58011), yields the best predictions, hence I decided to use this approach while constructing the final model. Classifier Accuracy Multinomial Na¨ıve Bayes 56.07222% Logistic Regression 54.64506% Stochastic Gradient Descent 55.08364% Support Vector Classifier 54.74792% Table 4. Accuracy after data preprocessing, Tf-Idf Weighting Classifier Accuracy Multinomial Na¨ıve Bayes 54.44792% Logistic Regression 52.38788% Stochastic Gradient Descent 50.78357% Support Vector Classifier 49.28641% Table 5. Accuracy after data preprocessing, Count Vectorization Contrary to my expectations, data preprocessing in- creased the prediction performance of only two mod- els - Naive Bayes Algorithm and Logistic Regression while decreasing the accuracy of two others. This was a valuable finding and it helped me improve the final accuracy of my model. To train my final model I decided to use the Stack- ing Ensemble Method. Results show that stacking has been proven to be consistently effective over all domains, working better than majority voting and that using the opinion summary can improve the per- formance further (Ying Su et al., 2012). I combined the performance of the Multinomial Na¨ıve Bayes, Stochastic Gradient Descent Classifier and Logistic Regression. To improve the prediction performance of the model, I tuned hyperparameters and experimented with changing the voting method and added the Support Vector Classifier. Ultimately, the majority voting method resulted in the highest accuracy and adding another base model, the Support Vector Classifier did not increase the successful predictions rate. As I discovered earlier, the Stochastic Gradient De- scent yields a higher accuracy on a dataset before preprocessing, hence I used this finding in modeling the stacked classifier. Another method that I experimented with was imple- menting the Random Forest Classifier. To optimize its’ performance I tuned hyperparameters using the Grid Search method, although the achieved accuracy of the algorithm was not satisfying. 5 Results When constructing the final model with the highest accuracy, I built three individual classifiers and com- bined their performance. I stacked Multinomial Naive Bayes, Support Vector Classification and Stochastic Gradient Descent and merged their judgments to make the final decision based on hard voting. Base models in the final classifier were fitted on different datasets. Stochastic Gradient Descent and the final model were fitted on the text before preprocessing, Multinomial Naive Bayes and Logistic Regression were fitted on preprocessed comments. The algo- rithm used for Kaggle submission was fitted on the preprocessed training. It resulted in a performance presented below: Stacked Classifier Performance: Average accuracy from 5-fold cross validation on the training dataset: 57.4679% 3
  • 4. Accuracy on Kaggle (preprocessed data): 57.455% Accuracy on Kaggle (data before preprocessing): 53.677% I applied various techniques, tried numerous models and different combinations of stacking these models. My final model had the highest accuracy that I man- aged to achieve. I experimented with: • stemming, • n-grams, • removing all strings of length less than two, • removing all digits and numbers, • tuning max df and min df hyperparameters1 • tuning the max features hyperparameter2 1 Maximum and minimum document frequency across the corpus, which ignore terms that have a document fre- quency higher or lower than the given threshold, 2 Maximum features hyperparameter which builds a vocabulary that only consider the top given features or- dered by term frequency across the corpus. All of the above techniques failed to improve the pre- diction accuracy of implemented classifiers. To develop the Random Forest Classifier performance, I used the grid search method to find the best parame- ters. To expand the search, I manipulated the number of features included in the training process, however, I did not notice any improvements. Parameters tuned to develop the classifier: • number of estimators • maximum depth of the tree • minimum number of samples required to split an internal node • minimum number of samples required to be at a leaf node Random Forest Classifier Performance: Accuracy: 44.29285% Best parameters: max depth= 30, min samples leaf=1, min samples split=10, n estimators= 1200 6 Discussion and Conclusion Labeling textual data coming from twenty different categories, where the majority of samples comprise of similar language written using ordinary, everyday language is a demanding task and even tiny improve- ments in prediction performance result in a consid- erable increase in the number of correctly classified samples when working on big datasets. Considering the fact that I started with an accuracy oscillating between 45% and 52% (depending on the model used) and improved the final accuracy of the predictor to around 55% only as a result of feature extraction and text preprocessing, I managed to cor- rectly classify a few thousand more comments. *When predicting the labels on a dataset containing 100 000 sam- ples. It proves that even a small increase in accuracy is meaningful. The main conclusion drawn from this task is that experimenting with preprocessing, feature extraction, learning various models and tuning their hyperparam- eters significantly contribute to the improvement of predicting performance and makes room for continu- ous development. Despite the fact that I found some of the discovered results rather unexpected, they were valuable and I built my final model upon them. I also observed how effective combining the performance of several algorithms may be. References [1] Isabelle Guyon and Andre Elisseeff. An Introduc- tion to Variable and Feature Selection, 2003. [2] Dino Isa, Lam H. Lee, V.P. Kallimani and R. RajKumar Text Document Preprocessing with the Bayes Formula for Classification Using the Sup- port Vector Machine, 2008. [3] Google Developers Machine Learning Guide on Text Classification – workflow diagram https://developers.google.com/machine -learning/guides/text-classification [4] D. Morariu, R. Cretulescu and L. Vintan. Improv- ing a SVM Meta-classifier for Text Documents by using Naive Bayes, 2003. [5] Ying Su, Yong Zhang, Donghong Ji, Yibing Wang and Hongmiao Wu. Ensemble Learning for Senti- ment Classification, 2012. [6] i-Ping Jing, Hou-Kuan Huang and Hong-Bo Shi. Improved feature selection approach TFIDF in text mining, 2002. 4