Using R for Classification of Large Social Network Data
Furqan Nasir
Department of Information Technology
Government College University
Faisalabad, Pakistan
furqan.nasir10@gmail.com

Shahbaz Nazeer
Department of Information Technology
Government College University
Faisalabad, Pakistan
shahbaz_nazir1@yahoo.com

Javed Ferzund
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
jferzund@ciitsahiwal.edu.pk

Usman Zulfiqar
Department of Computer Science
University of Agriculture
Faisalabad, Pakistan
usmanz257@yahoo.com

Shahzad Ahmed
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
shahzadahmad@ciitsahiwal.edu.pk

M. Usman Ali
Department of Computer Science
COMSATS Institute of Information Technology
Sahiwal, Pakistan
usman.sani1439@ciitsahiwal.edu.pk

Fahad Atta
Riphah Institute of Computing and Applied Sciences (RICAS)
Riphah International University
Lahore, Pakistan
fahadatta1@gmail.com
Abstract—The advent of social networks has changed research
in computer science. A massive volume of data is now available
from Twitter, Facebook, email and the IoT (Internet of Things),
so the storage and analysis of these data have become a great
challenge for researchers. Traditional frameworks have proved
inadequate for processing data at this scale. R is an open-source
programming framework developed for data analysis that can
deliver good accuracy, and it allows the analysis to be
implemented in the R programming language. This paper presents
a study on the use of R for the classification of large social
network data. The Naïve Bayes algorithm is used to classify a
large Twitter dataset. The experiment shows that a large amount
of data can be classified adequately using the R framework, with
promising results.
Keywords- Machine Learning, R, Naïve Bayes,
Classification, term frequency-inverse document frequency,
Twitter
I. INTRODUCTION
With the advancement of new techniques and technologies,
massive amounts of data are being generated at small, medium
and large scale in every field, such as electronics, finance,
marketing, bioinformatics and computer science. Social
networks play an important part in generating these data.
Every second, large amounts of data are produced by Twitter,
Facebook, Skype, Messenger, Gmail and WhatsApp, and the
accumulated volume runs from terabytes to petabytes.
Thousands of tweets are generated every second. In a day,
Twitter and Facebook produce about 7 terabytes and 10
terabytes of data, respectively. Twitter users tweet almost
277,000 times every minute. Email users send 204,000,000
messages every minute. Similarly, 72 hours of new video are
uploaded to YouTube every minute. The volume of uncertain
data is also growing, from enterprise data through to sensors
and devices.
Big Data refers to data so large and complex that storing,
handling and managing them is difficult. The 4 V’s of Big
Data are volume, variety, velocity and veracity. Big Data
processing includes several phases: acquisition, extraction,
integration, analysis, interpretation and decision. In the
acquisition phase, a large dataset is selected and then
filtered by creating a data dictionary of its characteristics.
In the extraction phase, the data are transformed by replacing
variables and normalized to remove redundant records. Cleaning
is then performed to improve the quality of the data and to
handle bugs and errors in them. In the integration phase, an
empirical view of the data is formed by combining data from
different sources. Then
standardize the large data for sharing across the enterprise
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
64 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
and ensure the consistency of data by mapping it. In the
analysis phase, DM (Data Mining) is performed to extract
hidden information from the data, and ML (Machine Learning)
algorithms are applied to the data for better analysis
results. The data are visualized in the form of graphs and
plots. In the interpretation phase, knowledge about the
domain of the data is identified and patterns are found using
different tools. These patterns support better analysis of
the data. The decision phase focuses on improving the whole
process using managerial techniques. All of this Big Data
needs to be classified efficiently to improve accuracy,
performance and scalability.
To address these bottlenecks, R offers a powerful statistical
tool that can represent large datasets as bar plots, graphs
and charts. It is also used for time-series analysis. Its
import and export functions help with integration with other
programming languages. It plays a significant role in Machine
Learning and Data Mining. The R tool provides the R
programming language for implementation, and this language is
used to create data frames for large datasets.
Over time, a stack of ML (Machine Learning) algorithms and
techniques has become available, such as k Nearest Neighbor,
Support Vector Machine, Random Forest, Decision Tree and
Naïve Bayes. These techniques and classifiers can be
implemented in the R language using the R framework. Several
statistical techniques and tests are used for feature
selection, which identifies the relevant features in a large
dataset. These techniques include PCA (Principal Component
Analysis), Factor Analysis and tf-idf (term frequency-inverse
document frequency).
The statistical technique tf-idf (term frequency-inverse
document frequency) is used to select features from the large
Twitter data for binary classification. After feature
selection, the ML (Machine Learning) Naïve Bayes algorithm is
used in the R framework to classify the data into two classes.
The objectives of this paper are:
To present the Machine Learning Naïve Bayes classifier for
the analysis of large social data (Twitter)
To classify the data with the R framework
The rest of the paper is organized as follows: Section II
describes related work in this field. Section III explains the
methodology for feature selection and classification on the R
framework. Section IV presents the results obtained from the
experiment and discusses the analysis and classification of
data using R. Section V concludes and outlines further
research on the analysis and classification of social network
datasets with ML algorithms in R.
II. RELATED WORK
Ahmed et al. [1] described a stack of data formats (models)
for different types of data storage. They describe different
data storage models for different types of ML (Machine
Learning) techniques and algorithms applied to large-scale BI
(Bioinformatics) data. Ali et al. [2] highlighted many Machine
Learning algorithms for the classification and clustering of
large data. Their paper covers libraries such as Mahout and
MLlib that are helpful when implementing ML algorithms. It
compares the NB (Naïve Bayes), SVM (Support Vector Machine),
LR (Logistic Regression), kNN (k Nearest Neighbor), Linear
Regression and RF (Random Forest) algorithms and techniques.
Rehman et al. [3] explained the implementation languages of
various BI (Bioinformatics) tools. They highlight the
importance of the Scala language for BI datasets, describe
the platforms supported by different languages and conclude
that the proposed language is better than the traditional
existing languages. Sarwar et al. [4] described many Big Data
based tools for large BI (Bioinformatics) data. The alignment
viewers are Base-by-Base, CINEMA, MEGA, PFAAT, Strap, UGENE,
JSAV, DSASO and FLAK. The database search tools are HMMER,
KLAST, Parasail, BLAST, SAM, FASTA and SWIPE. Ali et al. [5]
applied PCA (Principal Component Analysis) and Factor
Analysis to extract the relevant features; their paper also
describes the genomics dataset used. Mitchell [6] described a
Machine Learning Random Forest algorithm implemented in the R
tool for large datasets. For Microarray data, parallel
generation of the trees is very useful, and objects are sent
using the Message Passing Interface. The paper gives a brief
comparison with traditional work. The experiment uses a
Microarray dataset with 23292 genes. Using the Random Forest
algorithm, efficiency and speed are achieved in the R
statistical tool. The target class is predicted by
constructing multiple trees, and both categorical and
continuous features are classified using the proposed
approach. Prajwala [7] compared the Decision Tree and Random
Forest Machine Learning algorithms using the R tool. The
comparison is based on the execution time of the two
classifiers. The results show that the Decision Tree
algorithm consumes 0.17 and the Random Forest algorithm
consumes 7.68. In that work, the error rate of Random Forest
is lower than that of the Decision Tree algorithm with 256
data samples on the Rattle implementation. Data Mining is the
process of selecting meaningful attributes, and the target
class is predicted using classification. The experiment is
designed to predict weather conditions. Rattle is a Graphical
User Interface for the analysis of large data in the R tool.
Karatzoglou et al. [8] explained the functions of the Machine
Learning Support Vector Machine classifier as implemented in
the R language. Several packages are covered, namely kernlab,
svmpath, klaR and e1071. For the experiments, the iris, musk,
breast cancer and DNA datasets were used. The work compares
the 4 SVM packages in terms of accuracy, execution time and
testing error rate. Many packages are available for
implementing Machine Learning algorithms in the R tool, and
the packages discussed are used for the classification and
regression of large datasets. Qian [9]
explained the PivotalR Machine Learning package for large
datasets. The package is used from within the R tool to help
with Big Data processing tasks. It provides a Graphical User
Interface along with table operations, the MADlib library and
abstraction layers over the Hadoop MapReduce framework. Null
values are handled using the package. PostgreSQL and Greenplum
databases are supported. For parallel implementations of
Machine Learning algorithms, it provides an R wrapper for the
MADlib library, which is useful for structured and
unstructured data. Jurka et al. [10] designed the RTextTools
package for Machine Learning classification. In their work,
many Machine Learning algorithms and techniques, such as
Random Forest, Support Vector Machine, tree, boosting, SLDA,
glmnet and NNET, are used to train different models. The
dataset is classified with the trained models and accuracy is
determined for each Machine Learning technique. Measures such
as precision and recall represent the classification
accuracy. Cross-validation is used for every classifier. A
minimum of 4GB of memory is required for installation. Input
is read from CSV or text files. In the R tool, a matrix is
constructed and sparse data are removed. Training and testing
sizes are specified with the help of a container. Bischl et
al. [11] explained the mlr Machine Learning package in the R
language. Their paper describes multiple applications of the
package, such as training and testing models, constructing
optimized models, defining learning tasks and predicting on
different tasks. It offers a wrapper methodology for steps
before and after prediction and testing. The package provides
parallelization and resampling, which are weak points of
existing packages. It is very useful for implementing many
Machine Learning algorithms in the R language for
classification, clustering and prediction, and it performs
well for attribute selection and for binary and multi-class
classification. The package is based on an object-oriented
framework.
III. METHODOLOGY
A. Data Collection
For the experiment, a Twitter dataset collected through the
Twitter API is used. The whole dataset has 3183 rows and 6
columns (attributes).
B. Feature Selection
The R tool was used to select the relevant features from the
dataset. For this purpose, the tf-idf (term frequency-inverse
document frequency) technique was applied to the large Twitter
data. It is a statistical technique that describes the
importance of an attribute (term) within a collection of
attributes. Term frequency measures how often a term occurs in
a document, while inverse document frequency lowers the weight
of terms that occur in most documents and thereby raises the
relative weight of the more meaningful features.
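The weighting described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' R code. The function name tf_idf, the toy documents and the specific variant (relative term frequency multiplied by the natural logarithm of N over document frequency) are assumptions chosen for illustration, since the paper does not state which tf-idf variant was used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy "tweets", already lowercased and split on whitespace.
docs = [
    "big data needs big tools".split(),
    "r handles data analysis".split(),
    "tweets are data".split(),
]
weights = tf_idf(docs)
```

In this toy corpus the term "data" occurs in every document, so its weight is zero, while "big" occurs twice in a single document and receives the largest weight there.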
C. Naïve Bayes
In this study, the Naïve Bayes algorithm is used to classify
the large Twitter data. The Naïve Bayes classifier is a
scalable method: the number of parameters it needs grows
linearly with the number of features. It is a supervised
learning model that classifies instances, described by their
attributes (features), into two or more classes. First, the
model is trained on labeled data. It is based on a probability
model in which each instance is represented as a vector of
attribute values and each attribute is assigned a
class-conditional probability. The NB (Naïve Bayes) classifier
combines this probability model with a decision rule. Several
event models are used for estimation, such as Gaussian,
Multinomial and Bernoulli. Text classification is easily
performed with this classifier. The algorithm predicts classes
faster than many other classification algorithms and
techniques. The steps of the Naïve Bayes algorithm are shown
in Figure 1.
Figure 1: Naive Bayes Algorithm
In the NB (Naïve Bayes) model, the class probabilities and
conditional probabilities are stored in a file. Once these
probabilities have been computed, predictions are made using
the NB model. The classifier can be implemented in many
programming languages, such as C++, Java, R and Python. The R
language offers a stack of built-in libraries and methods for
the analysis of large datasets. The main applications of the
algorithm include semantic analysis, recommendation systems
and spam filtering.
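The training and prediction steps described in this section can be sketched as follows. This is a minimal from-scratch Bernoulli variant in Python, not the e1071 implementation used in the experiment. The function names, the Laplace smoothing parameter alpha and the toy feature matrix are assumptions made for illustration. The model stores a prior per class and a conditional probability per feature and class, and the decision rule picks the class with the highest posterior log-probability.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature present | class) with Laplace smoothing."""
    classes = sorted(set(y))
    n_features = len(X[0])
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Smoothed probability that feature j is present in class c.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest posterior log-probability."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xj, p in zip(x, probs):
            score += math.log(p if xj else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical binary feature-presence rows for four users.
# Labels follow the paper's coding: 1 = prochoice, 0 = prolife.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
model = train_bernoulli_nb(X, y)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a feature that was never seen in one class from forcing that class's probability to zero.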
D. Platform
For the experiments, R is used as the framework for analyzing
large data in terms of statistics and reporting. It supports
modular programming, and procedures are integrated using the
R programming language. It offers both data storage and data
analysis. Its main data structures include arrays, lists and
matrices, and the R language supports many data types, such
as factors, data frames, logical, complex, lists and arrays.
Vectors combine elements of the same type into one object.
Lists contain elements of different types. Data frames store
columns of different types. Different kinds of operators are
available, such as arithmetic, comparison, logical and
miscellaneous operators.
In the R language, decision-making statements help to design
the logic. These are the if statement (single selection), the
if...else statement (double selection) and the switch
statement (multiple selection). Looping constructs such as
repeat, while and for loops play an important role. R provides
user-defined and built-in functions for analysis and other
operations.
Figure 2: Internal Structure of R Framework in the Perspective of Machine Learning
The R framework supports a stack of ML (Machine Learning) and
DM (Data Mining) tasks. In the Machine Learning domain,
Classification, Regression (Prediction) and Rule Mining are
widely performed with R. For different analyses, a stack of R
packages is used, such as RODBC, gmodels, class, tm,
wordcloud, e1071, C50, rpart, neuralnet and kernlab. To select
the relevant attributes from a large dataset, statistical
tests such as Principal Component Analysis and Factor Analysis
are used in the R language. Different data storage formats can
also be handled using the R language.
Many ML (Machine Learning) classification techniques, such as
k Nearest Neighbor, Support Vector Machine, Naïve Bayes,
Decision Tree, Regression Tree and Random Forest, are
implemented in the R programming language for the prediction
and analysis of large data from social networks like Twitter.
These techniques use multiple R libraries and methods for
their graphical representation.
For the experiment, the Naïve Bayes classifier is used to
classify the large Twitter data in the R framework. The
internal structure of the R framework from the perspective of
Machine Learning algorithms is given in Figure 2.
E. Data Loading and Analysis
First, the Twitter data are obtained for the experiment. To
do this, a new application is created in the Twitter account.
The application is then authenticated so that it can access
the available information about tweets. This information is
available in digital form and access to it is authorized
through the API (Application Programming Interface). In the R
tool, the “twitteR” package is installed. On Windows, the
“cacert.pem” file is downloaded and saved in the C: directory.
The R objects are created with the help of the consumer secret
key, which is generated automatically when the application is
created. The
key information is passed to the “OAuthFactory” function
together with the access, request and authentication URLs
(Uniform Resource Locators). On success, the
“registerTwitterOAuth” function returns TRUE. After these
steps, the desired terms are searched with the “searchTwitter”
function, using parameters such as the hashtag (#bigdata), the
number of tweets (n=1500) and the cacert.pem file. The type of
the returned object is then checked with the class( )
function, which shows that it is a list of objects. The list
is then transformed into a data frame with the do.call( )
function. After these steps, the data are saved as a CSV
(Comma Separated Values) file in a valid path.
After loading the Twitter dataset in R, the whole dataset is
preprocessed using the tf-idf (term frequency-inverse document
frequency) technique. After preprocessing, the NB (Naïve
Bayes) algorithm is used to classify the data on the relevant
features (attributes).
For classification, half of the tweets contain the hashtag
“#prolife” and half contain the hashtag “#prochoice”. The
hashtag “#prochoice” is coded as 1 and the hashtag “#prolife”
as 0. The list “abortion_tweets” holds the tweets. A vector
“hash” is then created by repetition: the value 1 is repeated
once per prochoice tweet and combined with a vector of zeros
of the same length as the prolife tweets. Next, a “for” loop
breaks each sentence into words by splitting on whitespace
“ ”. A corpus is then built with the “tm” package so that the
tf-idf (term frequency-inverse document frequency) technique,
which represents the importance of each word, can be applied
during preprocessing. After the corpus is built, all words are
converted to lowercase, and punctuation, hashtags and extra
white space are removed. A bigram tokenizer is created with
the RWeka package, and the bigrams are used to create the
matrix. The user names are then added to the matrix as row
names. Column sums are computed and tabulated in order to
remove uncommon bigrams. Values greater than 1 are set to 1,
and a threshold is chosen: all rows are kept, together with
those columns whose sum is greater than the threshold. Users
with a small number of bigrams are then dropped by counting
the zeros in each row, and the data are explored in a table.
After these steps, a cutoff of 2 bigrams is applied and a list
of authors is created from the users that meet it. The dropped
users are also removed from the hashtag vector. The “e1071”
package is then installed and the outcome attribute is turned
into a factor. The Naïve Bayes model is run and saved in the
“NBmod” object. A vector of predictions is generated from
arguments such as the estimated model and the predictor data,
and a confusion matrix of predictions against outcomes is
built to check the classification.
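The matrix-building steps above (cleaning, bigram tokenization, capping counts at 1, dropping uncommon bigrams and dropping users with too few bigrams) can be sketched as follows. This is an illustrative Python sketch, not the authors' tm and RWeka code. The function names, the parameter names col_threshold and row_cutoff, the default column threshold of 1 (the paper does not state its value), the assumption of one tweet per user and the toy tweets are all hypothetical.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, drop hashtags and punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"#\w+", " ", text)      # hashtags carry the label, so remove them
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation
    return text.split()

def bigrams(tokens):
    """Return adjacent word pairs as 'w1 w2' strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_matrix(tweets, col_threshold=1, row_cutoff=2):
    """Build a binary user x bigram matrix, then prune columns and rows."""
    # Using a set caps every count at 1 (presence or absence of a bigram).
    present = {user: set(bigrams(clean(text))) for user, text in tweets.items()}
    col_sums = Counter(bg for bgs in present.values() for bg in bgs)
    # Keep only bigram columns whose sum is greater than the threshold.
    vocab = sorted(bg for bg, total in col_sums.items() if total > col_threshold)
    rows = {
        user: [1 if bg in bgs else 0 for bg in vocab]
        for user, bgs in present.items()
    }
    # Keep only users that still have at least row_cutoff bigrams.
    rows = {user: row for user, row in rows.items() if sum(row) >= row_cutoff}
    return vocab, rows

# Hypothetical tweets keyed by user name.
tweets = {
    "user_a": "My body, my choice! #prochoice",
    "user_b": "my body my choice always",
    "user_c": "Choose life #prolife",
}
vocab, rows = build_matrix(tweets)
```

Here the bigrams shared by user_a and user_b survive the column threshold, while user_c is dropped because none of that user's bigrams remain after pruning.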
IV. RESULTS AND DISCUSSION
Social networks like Twitter, Facebook and Yahoo now produce
an enormous volume of data. Traditional tools and techniques
do not perform well in terms of accuracy, scalability and
performance. Over time, a stack of Machine Learning techniques
and algorithms has become available for analyzing such large
data through classification. Similarly, a stack of statistical
tests and methods exists for extracting relevant attributes.
To achieve better performance, the large Twitter data were
preprocessed using the statistical technique tf-idf (term
frequency-inverse document frequency). The Machine Learning
Naïve Bayes classifier was then used to classify the data in
the R framework.
The Naïve Bayes classifier is implemented in the R tool
together with the tf-idf (term frequency-inverse document
frequency) statistical technique, which is applied to the
corpus created from the selected tweets. The proposed work
gives 83.3% accuracy and 16.7% testing error. Predictions fall
into two classes, prolife and prochoice: a prediction of 0
represents the prolife class and 1 represents the prochoice
class.
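Accuracy and testing error are complementary quantities read off the confusion matrix, which is why the two reported figures sum to 100%. The sketch below shows the calculation in Python. The function names and the label vectors are hypothetical and are not the paper's data, so the resulting numbers differ from the reported ones.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of cases where the predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

# Hypothetical labels: 1 = prochoice, 0 = prolife.
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)
error = 1 - acc
```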
V. CONCLUSION
Today, existing tools do not perform well in classifying
social network data because of the enormous volume of data
produced. The R framework, however, can store, handle, manage
and analyze such large data. R provides built-in libraries and
data frames for implementing machine learning classification
algorithms. In this paper, a study was presented on the use of
R for the classification of social network data. The Naïve
Bayes technique was applied to the Twitter dataset. The
statistical tf-idf (term frequency-inverse document frequency)
technique was used to extract the relevant features, and the
data were then classified in the R tool. The Naïve Bayes model
was built using the training dataset and then evaluated using
the testing dataset. The results indicate 83.3% accuracy for
binary classification, which is promising for a large dataset.
In future, we want to try other machine learning techniques on
different datasets using the R framework.
REFERENCES
[1] S. Ahmed, M. U. Ali, J. Ferzund, M. A. Sarwar, A. Rehman
and A. Mehmood, "Modern Data Formats for Big
Bioinformatics Data Analytics," International Journal of
Advanced Computer Science and Applications (IJACSA), vol.
8, no. 4, 2017.
[2] M. U. Ali, S. Ahmad and J. Ferzund, "Harnessing the Potential
of Machine Learning for Bioinformatics using Big Data
Tools," International Journal of Computer Science and
Information Security (IJCSIS), vol. 14, no. 10, pp. 668-675,
2016.
[3] A. Rehman, A. Abbas, M. A. Sarwar and J. Ferzund, "Need
and Role of Scala Implementations in Bioinformatics,"
International Journal of Advanced Computer Science and
Applications (IJACSA), vol. 08, no. 02, 2017.
[4] M. A. Sarwar, A. Rehman and J. Ferzund, "Database Search,
Alignment Viewer and Genomics Analysis Tools: Big Data
for Bioinformatics," International Journal of Computer
Science and Information Security (IJCSIS), vol. 14, no. 12, pp.
317-328, 2016.
[5] M. U. Ali, S. Ahmed, J. Ferzund, A. Mehmood and A.
Rehman, "Using PCA and Factor Analysis for Dimensionality
Reduction of Bio-informatics Data," International Journal of
Advanced Computer Science and Applications (IJACSA), vol.
8, no. 5, pp. 415-426, 2017.
[6] L. Mitchell, "A parallel random forest implementation for R,"
Technical report, EPCC, 2011.
[7] T. R. Prajwala, "A Comparative Study on Decision Tree and
Random Forest Using R Tool," International Journal of
Advanced Research in Computer and Communication
Engineering, vol. 4, no. 1, pp. 196-199, 2015.
[8] A. Karatzoglou, D. Meyer and K. Hornik, "Support Vector
Machines in R," JSS Journal of Statistical Software, vol. 15,
no. 9, 2006.
[9] H. Qian, "PivotalR: A Package for Machine Learning on Big
Data," The R Journal, vol. 6, no. 1, pp. 57-67, 2014.
[10] T. P. Jurka, L. Collingwood, A. E. Boydstun, E. Grossman and
W. v. Atteveldt, "RTextTools: A Supervised Learning
Package for Text Classification," The R Journal, vol. 5, no. 1,
pp. 6-12, 2013.
[11] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E.
Studerus, G. Casalicchio and Z. M. Jones, "mlr: Machine
Learning in R," Journal of Machine Learning Research, vol.
17, pp. 1-5, 2017.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
69 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

What's hot

Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
bhagathk
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek out
iaemedu
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Geoffrey Fox
 

What's hot (18)

Character Recognition using Data Mining Technique (Artificial Neural Network)
Character Recognition using Data Mining Technique (Artificial Neural Network)Character Recognition using Data Mining Technique (Artificial Neural Network)
Character Recognition using Data Mining Technique (Artificial Neural Network)
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in python
 
IRJET- Fake News Detection using Logistic Regression
IRJET- Fake News Detection using Logistic RegressionIRJET- Fake News Detection using Logistic Regression
IRJET- Fake News Detection using Logistic Regression
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek out
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
 
DBMS
DBMSDBMS
DBMS
 
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
 
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
 
Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...
 
Prd 1099
Prd 1099Prd 1099
Prd 1099
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
 

Similar to Using R for Classification of Large Social Network Data

Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
Editor IJCATR
 
Multi-Tier Sentiment Analysis System in Big Data Environment
Multi-Tier Sentiment Analysis System in Big Data EnvironmentMulti-Tier Sentiment Analysis System in Big Data Environment
Multi-Tier Sentiment Analysis System in Big Data Environment
IJCSIS Research Publications
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
IJCSIS Research Publications
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
IJERA Editor
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
IJERA Editor
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
ijtsrd
 
