Using R for Classification of Large Social Network Data
Furqan Nasir
Department of Information Technology
Government College University
Faisalabad, Pakistan
furqan.nasir10@gmail.com

Shahbaz Nazeer
Department of Information Technology
Government College University
Faisalabad, Pakistan
shahbaz_nazir1@yahoo.com

Javed Ferzund
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
jferzund@ciitsahiwal.edu.pk

Usman Zulfiqar
Department of Computer Science
University of Agriculture
Faisalabad, Pakistan
usmanz257@yahoo.com

Shahzad Ahmed
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
shahzadahmad@ciitsahiwal.edu.pk

M. Usman Ali
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
usman.sani1439@ciitsahiwal.edu.pk

Fahad Atta
Riphah Institute of Computing and Applied Sciences (RICAS)
Riphah International University
Lahore, Pakistan
fahadatta1@gmail.com
Abstract—The advent of social networks has changed research in computer science. Massive volumes of data are now produced by Twitter, Facebook, email, and the IoT (Internet of Things), so the storage and analysis of these data have become a great challenge for researchers. Traditional frameworks struggle to process data at this scale. R is an open-source programming framework developed for the analysis of large data, and it can yield better accuracy. It also offers the R programming language for implementation. This paper presents a study on the use of R for the classification of large social network data. The Naïve Bayes algorithm is used to classify a large Twitter dataset. The experiment shows that such data can be classified adequately with the R framework, with promising results.
Keywords- Machine Learning, R, Naïve Bayes, Classification, term frequency-inverse document frequency, Twitter
I. INTRODUCTION
With advances in techniques and technologies, massive amounts of data are generated at small, medium, and large scale in every field, including electronics, finance, marketing, bioinformatics, and computer science. Social networks play an important part in generating this data. Every second, Twitter, Facebook, Skype, Messenger, Gmail, and WhatsApp produce huge amounts of data, with volumes ranging from terabytes to petabytes. Thousands of tweets are generated every second. In a single day, Twitter and Facebook produce about 7 terabytes and 10 terabytes of data, respectively. Twitter users tweet almost 277,000 times every minute of the day, and email users send 204,000,000 messages every minute. Similarly, 72 hours of new video are uploaded to YouTube every minute. Uncertain data is also growing as enterprise data is joined by data from sensors and devices.
Big Data refers to data so large and complex that it is difficult to store, handle, and manage. The 4 V’s of Big Data are volume, variety, velocity, and veracity. Big Data processing consists of several phases: acquisition, extraction, integration, analysis, interpretation, and decision. In the acquisition phase, a large dataset is selected and then filtered by creating a data dictionary of its characteristics. In the extraction phase, the data is transformed by replacing variables and normalized to remove redundant records. Cleaning is then performed to improve the quality of the data and to handle its errors. In the integration phase, a combined view of the data is built by merging data from different sources. The data is then standardized for sharing across the enterprise
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
64 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
and its consistency is ensured by mapping. In the analysis phase, DM (Data Mining) is performed to extract hidden information from the data, and ML (Machine Learning) algorithms are applied for better analysis results. The data is visualized in the form of graphs and plots. In the interpretation phase, knowledge about the domain of the data is identified and patterns are found using different tools; these patterns support better analysis. In the decision phase, the focus is on improving the whole process through managerial techniques. All of this Big Data needs to be classified efficiently to improve accuracy, performance, and scalability.
To remove these bottlenecks, R can be used. It is a powerful statistical tool for representing large datasets as bar charts, graphs, and other charts, and it is also used for time-series analysis. Its import and export functions help it integrate with other programming languages. It plays a significant role in Machine Learning and Data Mining. The R tool provides the R programming language for implementation, and this language is used to create data frames for large datasets.
Over time, many ML (Machine Learning) algorithms and techniques have become available, such as k Nearest Neighbor, Support Vector Machine, Random Forest, Decision Tree, and Naïve Bayes. These techniques and classifiers can be implemented in the R language using the R framework. Several statistical techniques and tests are used for feature selection, which identifies the relevant features in a large dataset. These techniques include PCA (Principal Component Analysis), Factor Analysis, and tf-idf (term frequency-inverse document frequency).
In this paper, the statistical technique tf-idf (term frequency-inverse document frequency) is used to prepare large Twitter data for binary classification. After feature selection, the ML (Machine Learning) Naïve Bayes algorithm is used in the R framework to classify the data into two classes.
The objectives of this paper are:
To present the Machine Learning Naïve Bayes classifier for the analysis of large social data (Twitter)
To classify data with the R framework
The rest of the paper is organized as follows: Section II describes related work in this field. Section III explains the methodology for feature selection and classification on the R framework. Section IV presents the experimental results and discusses the analysis and classification of data using R. Section V concludes the paper and outlines further research on classification with ML algorithms in R for social network datasets.
II. RELATED WORK
Ahmed et al. [1] described a range of data formats (models) for different types of data storage, covering storage models suited to different ML (Machine Learning) techniques and algorithms for BI (Bioinformatics) data at large scale.
Ali et al. [2] surveyed Machine Learning algorithms for the classification and clustering of large data. Their paper describes libraries such as Mahout and MLlib that help when implementing ML algorithms, and it compares the NB (Naïve Bayes), SVM (Support Vector Machine), LR (Logistic Regression), kNN (k Nearest Neighbor), Linear Regression, and RF (Random Forest) algorithms.
Rehman et al. [3] examined implementation languages for various BI (Bioinformatics) tools. They highlight the importance of the Scala language for BI datasets, describe the platforms supported by different languages, and conclude that the proposed language is better than the traditional languages.
Sarwar et al. [4] described many Big Data tools for large BI (Bioinformatics) data. The alignment viewers covered are Base-by-Base, CINEMA, MEGA, PFAAT, Strap, UGENE, JSAV, DSASO, and FLAK. The database search tools covered are HMMER, KLAST, Parasail, BLAST, SAM, FASTA, and SWIPE.
Ali et al. [5] applied the PCA (Principal Component Analysis) and Factor Analysis statistical tests to extract relevant features. Their paper also describes the genomics dataset.
Mitchell [6] described a Machine Learning Random Forest algorithm implemented in R for large datasets. For microarray data, parallel generation of the trees is very useful, and objects are sent using the Message Passing Interface. The paper gives a brief comparison with earlier work. The experiment used a microarray dataset with 23292 genes. With the Random Forest algorithm, efficiency and speed were achieved in the R statistical tool. The target class is predicted by constructing multiple trees, and both categorical and continuous features are classified by the proposed implementation.
Prajwala [7] compared the Decision Tree and Random Forest Machine Learning algorithms using the R tool. The comparison is based on the execution time of the two classifiers: the Decision Tree algorithm took 0.17 and the Random Forest algorithm took 7.68. The error rate of Random Forest was lower than that of the Decision Tree with 256 data samples in the Rattle implementation. Data Mining is the process of selecting meaningful attributes, and the target class is predicted using classification. The experiment was designed to predict weather conditions. Rattle is a Graphical User Interface for data analysis in R.
Karatzoglou et al. [8] explained the functions of the Support Vector Machine classifier as implemented in the R language. Several packages are covered, including kernlab, svmpath, klaR, and e1071. The iris, musk, breast cancer, and DNA datasets were used in the experiments. The four SVM packages are compared in terms of accuracy, execution time, and testing error rate. Many packages are available for implementing Machine Learning algorithms in R.
The proposed packages are used for the classification and regression of large datasets in R.
Qian [9] presented the PivotalR package for Machine Learning on large datasets. The package is added to R to help with Big Data processing tasks. It provides a Graphical User Interface built on table operations, the MADlib library, and abstraction layers over the Hadoop MapReduce framework. Null values are handled by the proposed package.
The package works with PostgreSQL and Greenplum databases. For parallel implementations of Machine Learning algorithms, it provides an R wrapper for the MADlib library, which is useful for structured and unstructured data.
Jurka et al. [10] designed the RTextTools package for Machine Learning classification. Many algorithms and techniques, such as Random Forest, Support Vector Machine, tree, boosting, SLDA, glmnet, and nnet, are used to train different models. The dataset is classified using the trained models, and accuracy is determined for every technique, with measures such as precision and recall representing classification accuracy. Cross-validation is applied to every classifier. At least 4 GB of memory is required for installation. Input is given in CSV or text file format. A matrix is constructed in R, and sparse terms are removed. Training and testing sizes are set with the help of a container.
Bischl et al. [11] presented the mlr package for Machine Learning in the R language. The paper describes several uses of the package, such as training and testing models, constructing improved models, defining learning tasks, and making predictions for different tasks. It offers a wrapper mechanism for the steps before and after prediction and testing. The package provides parallelization and resampling, which are the main problems in existing packages. It is very useful for implementing many Machine Learning algorithms in R for classification, clustering, and prediction, and it performs well for attribute selection and for binary and multi-class classification. The package is based on an object-oriented framework.
III. METHODOLOGY
A. Data Collection
For the experiment, a Twitter dataset collected through the Twitter API is used. The whole dataset has 3183 rows and 6 columns (attributes).
B. Feature Selection
The R tool was used to select relevant features from the dataset. For this purpose, the tf-idf (term frequency-inverse document frequency) technique was applied to the large Twitter data. It is a statistical technique that describes the importance of an attribute (term) within a collection of documents. The term frequency measures how important a term is within a single document, while the inverse document frequency reduces the weight of terms that occur in most documents and so increases the relative weight of the more meaningful ones.
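As an illustration of the weighting described above, the following minimal sketch computes tf-idf in Python rather than R; it is not the authors' code. It assumes that tf is the relative frequency of a term in a document and that idf is log(N/df), which is only one common variant. The function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy tweets, already lowercased and split on whitespace.
docs = [
    ["big", "data", "in", "r"],
    ["naive", "bayes", "in", "r"],
    ["big", "tweets", "in", "python"],
]
weights = tf_idf(docs)
```

Under these assumptions, a term that occurs in every document (here "in") receives weight 0, while a term found in only one document (here "data") receives a larger weight than one found in two (here "big").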
C. Naïve Bayes
In this study, the Naïve Bayes algorithm is used to classify large Twitter data. The Naïve Bayes classifier is a scalable method whose number of parameters grows linearly with the number of features. It is a supervised learning model that classifies instances, described by their attributes (features), into binary or multiple classes. The model is first trained on labelled data. It is based on a probability model in which each instance is represented as a vector of attribute values and a probability is assigned to each attribute given the class. The NB (Naïve Bayes) classifier combines this probability model with a decision rule. Several event models are available, depending on how the attribute probabilities are estimated, such as Gaussian, Multinomial, and Bernoulli. Text classification is easily performed with this classifier, and it predicts classes faster than many other classification algorithms and techniques. The steps of the Naïve Bayes algorithm are shown in Figure 1.
Figure 1: Naive Bayes Algorithm
In the NB (Naïve Bayes) model, the class probabilities and the conditional probabilities are stored in a file. Once these probabilities have been computed, predictions are made using the NB model. The classifier can be implemented in many programming languages, such as C++, Java, R, and Python. The R language offers many built-in libraries and methods for the analysis of large datasets. The main applications of this algorithm include semantic analysis, recommendation systems, and spam filtering.
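The training and prediction steps described above can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: it assumes the Bernoulli event model over 0/1 features with Laplace smoothing (alpha = 1), and the function names and toy data are hypothetical.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature probabilities from 0/1 rows."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed estimate of P(feature = 1 | class c).
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for one 0/1 row."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: columns are feature indicators, labels are 0/1.
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [0, 0, 1, 1]
model = train_bernoulli_nb(X, y)
```

The stored model holds exactly the two kinds of quantities mentioned in the text, the class probabilities and the conditional probabilities; calling predict(model, [1, 0, 0]) then applies the decision rule by picking the class with the largest posterior.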
D. Platform
For the experiments, R is the framework used to analyze large data in terms of statistics and reporting. It supports modular programming, and procedures are integrated through the R programming language. It offers both data storage and data analysis. Its main data structures include arrays, lists, and matrices. The R language supports many data types, such as factors, data frames, logical and complex values, lists, and arrays. Vectors combine elements into one object, lists can contain elements of different types, and data frames store data in columns that may differ in type. Operators of several kinds are available, including arithmetic, comparison, logical, and miscellaneous operators.
In the R language, decision-making statements help to design the logic. These are the if statement for single selection, the if...else statement for double selection, and the switch statement for multiple selection. Looping constructs such as repeat, while, and for loops are also available. R provides both user-defined and built-in functions for analysis and other operations.
Figure 2: Internal Structure of R Framework in the Perspective of Machine Learning
The R framework supports many ML (Machine Learning) and DM (Data Mining) tasks. In the Machine Learning domain, classification, regression (prediction), and rule mining are widely performed using R. Many R packages are used for different kinds of analysis, such as RODBC, gmodels, class, tm, wordcloud, e1071, C50, rpart, neuralnet, and kernlab. For selecting relevant attributes from a large dataset, statistical tests such as Principal Component Analysis and Factor Analysis are used in R. Different data storage formats can also be handled using the R language.
Many ML (Machine Learning) classification techniques, such as k Nearest Neighbor, Support Vector Machine, Naïve Bayes, Decision Tree, Regression Tree, and Random Forest, are implemented in the R programming language for the prediction and analysis of large data from social networks such as Twitter. These techniques use several R libraries and methods for the graphical representation of their results.
For the experiment, the Naïve Bayes classifier is used to classify large Twitter data in the R framework. The internal structure of the R framework from the perspective of Machine Learning algorithms is given in Figure 2.
E. Data Loading and Analysis
First, the Twitter data for the experiment is obtained. To do this, a new application is created in the Twitter account and then authenticated so that it can access the information available about tweets. This information is made available through the API (Application Programming Interface) once the application is authorized. In R, the “twitteR” package is installed. On Windows, the “cacert.pem” file is downloaded and saved in the C: directory. The R objects are created with the consumer secret key, which is generated automatically when the application is created. The key information is passed to the “OAuthFactory” function together with the access, request, and authentication URLs (Uniform Resource Locators). On success, the “registerTwitterOAuth” function returns TRUE. After these steps, tweets are searched with the “searchTwitter” function, using parameters such as the hashtag (#bigdata), the number of tweets (n=1500), and the cacert.pem file. The type of the returned object is then checked with the class( ) function, which shows that it is a list of objects. The list is then converted to a data frame with the do.call( ) function. Finally, the data is saved as a CSV (Comma Separated Values) file at a valid path.
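The last two steps, turning a list of tweet objects into tabular rows and saving them as CSV, can be sketched without any network access. The Python snippet below is only an illustration of that conversion, not the twitteR workflow itself; the field names screen_name and text and the two records are hypothetical stand-ins for search results.

```python
import csv
import io

# Hypothetical records standing in for tweets returned by a search.
tweets = [
    {"screen_name": "user_a", "text": "Learning #bigdata with R"},
    {"screen_name": "user_b", "text": "#bigdata needs scalable tools"},
]

def to_csv(records):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def from_csv(text):
    """Read CSV text back into a list of dicts, one per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

csv_text = to_csv(tweets)
rows = from_csv(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer would be pointed at a file on disk.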
After the Twitter dataset is loaded into R, the whole dataset is preprocessed using the tf-idf (term frequency-inverse document frequency) technique. After preprocessing, the NB (Naïve Bayes) algorithm is used to classify the data based on the relevant features (attributes).
For classification, half of the tweets carry the hashtag “#prolife” and the other half carry the hashtag “#prochoice”. The hashtag “#prochoice” is represented by 1 and the hashtag “#prolife” by 0. The list “abortion_tweets” holds the tweets. A vector “hash” is then created by repetition so that each tweet is paired with the label of its hashtag. After these steps, a “for” loop breaks each sentence into words by splitting on the whitespace character “ ”.
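The labelling and splitting steps can be sketched as follows. This Python snippet is illustrative only; the placeholder tweet texts are hypothetical, and it assumes, as the text states, that the #prochoice tweets come first and make up exactly half of the list.

```python
# Hypothetical placeholder tweets: first half #prochoice, second half #prolife.
abortion_tweets = [
    "example tweet one #prochoice",
    "example tweet two #prochoice",
    "example tweet three #prolife",
    "example tweet four #prolife",
]

half = len(abortion_tweets) // 2
# Label vector built by repetition: 1 = #prochoice, 0 = #prolife.
hash_labels = [1] * half + [0] * half

# Break every tweet into words by splitting on whitespace.
tokens = [tweet.split() for tweet in abortion_tweets]
```

Building the labels by repetition only works because the two classes are stored in contiguous blocks; if the tweets were shuffled, the label would have to be read from each tweet's hashtag instead.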
A corpus is then built using the “tm” package so that the tf-idf (term frequency-inverse document frequency) technique, which represents the importance of each word, can be applied for preprocessing. After the corpus is built, all words are converted to lowercase. Punctuation, hashtags, and extra white space are then removed, leaving only the words. A bigram tokenizer is created using the RWeka package, and the bigrams are used to build the matrix. Once the matrix is created, the user names are added to it as row names.
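The cleaning and bigram steps above can be sketched in a few lines. The snippet is a Python illustration rather than the tm and RWeka code used in the experiment; the regular expressions, the function names clean and bigrams, and the two example tweets are assumptions made for the sketch.

```python
import re

def clean(text):
    """Lowercase, drop hashtags and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"#\w+", " ", text)      # remove hashtags
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return " ".join(text.split())          # remove extra white space

def bigrams(text):
    """Return the adjacent word pairs of a cleaned string."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

# Hypothetical tweets keyed by user name.
tweets = {
    "user_a": "Big data is BIG, really!  #bigdata",
    "user_b": "R handles big data #rstats",
}

vocab = sorted({bg for t in tweets.values() for bg in bigrams(clean(t))})
# One row per user (the row name), one column per bigram; entries are counts.
matrix = {
    user: [bigrams(clean(t)).count(bg) for bg in vocab]
    for user, t in tweets.items()
}
```

Keying the rows by user name plays the role of the row names in the text, so that users can later be dropped from the matrix and from the label vector together.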
Column sums are computed to remove uncommon bigrams, and a table of them is created. Values greater than 1 are then set to 1, and a threshold is chosen. All rows are kept, together with those columns whose sum is greater than the threshold. Users with a small number of bigrams are then dropped by counting the zeros in each row, and a table is created to explore the data. After these steps, a cutoff of 2 is applied, and a list of authors is created by keeping the users whose bigram count reaches the cutoff. The dropped users are also removed from the hashtag vector. The “e1071” package is then installed, and the output attribute is turned into a factor. The Naïve Bayes model is run and saved in the “NBmod” object. To present the predictions, a vector of predictions is generated from the estimated model, and the predictions are compared with the outcome. Finally, a confusion matrix is built to check the classification.
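The filtering steps above can be sketched as three small functions. This is a Python illustration under stated assumptions, not the authors' R code: it binarizes the counts before summing the columns, uses a column threshold of 1 and a user cutoff of 2, and works on a hypothetical count matrix.

```python
def binarize(matrix):
    """Set every count greater than 1 to 1 (presence or absence)."""
    return [[min(value, 1) for value in row] for row in matrix]

def keep_common_columns(matrix, threshold):
    """Keep all rows, but only columns whose sum exceeds the threshold."""
    sums = [sum(column) for column in zip(*matrix)]
    keep = [j for j, total in enumerate(sums) if total > threshold]
    return [[row[j] for j in keep] for row in matrix]

def keep_active_users(matrix, labels, cutoff):
    """Drop users with fewer than `cutoff` bigrams, along with their labels."""
    kept = [(row, label) for row, label in zip(matrix, labels)
            if sum(row) >= cutoff]
    return [row for row, _ in kept], [label for _, label in kept]

# Hypothetical bigram counts: one row per user, one column per bigram.
counts = [[2, 1, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1], [0, 0, 0, 1]]
labels = [1, 1, 0, 0]

binary = binarize(counts)
common = keep_common_columns(binary, threshold=1)
features, kept_labels = keep_active_users(common, labels, cutoff=2)
```

Filtering the labels in the same pass as the rows keeps the feature matrix and the outcome vector aligned, which is the purpose of dropping users from the hashtag vector as well.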
IV. RESULTS AND DISCUSSION
Enormous volumes of data are now produced by social networks and services such as Twitter, Facebook, and Yahoo. Traditional tools and techniques do not perform well in terms of accuracy, scalability, and performance. Over time, many Machine Learning techniques and algorithms have become available for analyzing such large data through classification. Similarly, many statistical tests and methods exist for extracting relevant attributes. To achieve better performance, the large Twitter data was preprocessed using the statistical technique tf-idf (term frequency-inverse document frequency). The Machine Learning Naïve Bayes classifier was then used to classify the data in the R framework.
The Naïve Bayes classifier was implemented in R together with the tf-idf (term frequency-inverse document frequency) statistical technique, which was applied to the corpus created from the selected tweets. The proposed work gives 83.3% accuracy and a 16.7% testing error. The predictions fall into two classes, prolife and prochoice: a prediction of 0 represents the prolife class and 1 represents the prochoice class.
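Accuracy and testing error are two views of the same confusion matrix, since the error is one minus the accuracy (83.3% and 16.7% in the reported result). The Python sketch below shows the calculation on made-up labels; the ten actual and predicted values are hypothetical and are not the paper's data, so they give 80% rather than the reported figure.

```python
def confusion_matrix(actual, predicted):
    """Return counts[a][p] for binary labels 0 (prolife) and 1 (prochoice)."""
    counts = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts

def accuracy(counts):
    """Share of predictions on the diagonal of the confusion matrix."""
    correct = counts[0][0] + counts[1][1]
    total = sum(sum(row) for row in counts)
    return correct / total

# Hypothetical test labels and predictions (not the paper's data).
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

counts = confusion_matrix(actual, predicted)
acc = accuracy(counts)
error = 1.0 - acc
```

Reading the off-diagonal cells separately shows which class is misclassified more often, which a single accuracy figure cannot reveal.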
V. CONCLUSION
Existing tools do not perform well for the classification of social network data because of the enormous volume of data produced. The R framework, however, was able to store, handle, manage, and analyze such large data. R provides built-in libraries as well as data frames for implementing machine learning classification algorithms. In this paper, a study was presented on the use of R for the classification of social network data. The Naïve Bayes technique was applied to a Twitter dataset. The statistical tf-idf (term frequency-inverse document frequency) technique was used to extract the relevant features, and the data was then classified in R. The Naïve Bayes model was built using the training dataset and then evaluated using the testing dataset. The results indicate 83.3% accuracy for binary classification, which is a promising result for this dataset. In future, we want to try other machine learning techniques on different datasets using the R framework.
REFERENCES
[1] S. Ahmed, M. U. Ali, J. Ferzund, M. A. Sarwar, A. Rehman
and A. Mehmood, "Modern Data Formats for Big
Bioinformatics Data Analytics," International Journal of
Advanced Computer Science and Applications (IJACSA), vol.
8, no. 4, 2017.
[2] M. U. Ali, S. Ahmad and J. Ferzund, "Harnessing the Potential
of Machine Learning for Bioinformatics using Big Data
Tools," International Journal of Computer Science and
Information Security (IJCSIS), vol. 14, no. 10, pp. 668-675,
2016.
[3] A. Rehman, A. Abbas, M. A. Sarwar and J. Ferzund, "Need
and Role of Scala Implementations in Bioinformatics,"
International Journal of Advanced Computer Science and
Applications (IJACSA), vol. 08, no. 02, 2017.
[4] M. A. Sarwar, A. Rehman and J. Ferzund, "Database Search,
Alignment Viewer and Genomics Analysis Tools: Big Data
for Bioinformatics," International Journal of Computer
Science and Information Security (IJCSIS), vol. 14, no. 12, pp.
317-328, 2016.
[5] M. U. Ali, S. Ahmed, J. Ferzund, A. Mehmood and A.
Rehman, "Using PCA and Factor Analysis for Dimensionality
Reduction of Bio-informatics Data," International Journal of
Advanced Computer Science and Applications (IJACSA), vol.
8, no. 5, pp. 415-426, 2017.
[6] L. Mitchell, "A parallel random forest implementation for R,"
Technical report, EPCC, 2011.
[7] T. R. Prajwala, "A Comparative Study on Decision Tree and
Random Forest Using R Tool," International Journal of
Advanced Research in Computer and Communication
Engineering, vol. 4, no. 1, pp. 196-199, 2015.
[8] A. Karatzoglou, D. Meyer and K. Hornik, "Support Vector
Machines in R," Journal of Statistical Software, vol. 15,
no. 9, 2006.
[9] H. Qian, "PivotalR: A Package for Machine Learning on Big
Data," The R Journal, vol. 6, no. 1, pp. 57-67, 2014.
[10] T. P. Jurka, L. Collingwood, A. E. Boydstun, E. Grossman and
W. van Atteveldt, "RTextTools: A Supervised Learning
Package for Text Classification," The R Journal, vol. 5, no. 1,
pp. 6-12, 2013.
[11] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E.
Studerus, G. Casalicchio and Z. M. Jones, "mlr: Machine
Learning in R," Journal of Machine Learning Research, vol.
17, pp. 1-5, 2016.