Parametric comparison based on split criterion on classification algorithm

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME
PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON
CLASSIFICATION ALGORITHM IN STREAM DATA MINING
Ms. Madhu S. Shukla*, Dr.K.H.Wandra**, Mr. Kirit R. Rathod***
*(PG-CE Student, Department of Computer Engineering),
(C.U.Shah College of Engineering and Technology, Gujarat, India)
** (Principal, Department of Computer Engineering),
(C.U.Shah College of Engineering and Technology, Gujarat, India)
*** (Assistant Professor, Department of Computer Engineering)
ABSTRACT
Stream data mining is an emerging research topic. Today, a number of applications
generate massive amounts of stream data; examples include sensor networks, real-time
surveillance systems, and telecommunication systems. Such data therefore requires
intelligent processing that supports proper analysis and allows the data to be used in other
tasks as well. Mining stream data is concerned with extracting knowledge structures,
represented as models and patterns, from non-stopping streams of information.
Classification in stream data mining is based on generating a decision tree, which
makes the decision process easy. Given the characteristics of stream data, it becomes
essential to handle large amounts of continuous and changing data accurately. In the
classification process, attribute selection at a non-leaf decision node thus becomes a critical
analytic point. Performance parameters such as speed of classification, accuracy, and
CPU utilization time can be improved if the split criterion is implemented precisely. This paper
presents an implementation of different attribute selection criteria and compares each with
the alternative method.
Keywords: Stream, Stream Data Mining, Performance Parameter processing, MOA (Massive
Online Analysis), Split Criterion.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING
& TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 2, March – April (2013), pp. 459-470
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
1. INTRODUCTION
The characteristics of stream data are also its challenges. Because of its huge size,
its continuous nature, and the speed with which it changes, stream data requires a real-time
response, produced after analysis of the data. Because the data is so large, an algorithm
that accesses it is restricted to a single scan of the data.
Data mining uses different algorithms for different mining tasks such as
classification, clustering, and pattern recognition. In the same way, stream data mining
uses different algorithms for different mining tasks. Algorithms for classification of
stream data include the Hoeffding Tree, VFDT (Very Fast Decision Tree), and CVFDT
(Concept-adapting Very Fast Decision Tree). These classification algorithms are
based on the Hoeffding bound for decision tree generation: the bound is used to
gather the optimum amount of data so that classification can be done accurately. CVFDT is
also able to detect concept drift, which is again a challenge in stream data
mining. As the size of stream data is extremely large, a method is required for improving the
split criterion at the nodes of the decision tree, so that tree generation is faster,
accuracy is improved, and CPU utilization time is reduced. Two different split
criteria are examined for stream data classification in this paper, and the algorithm is
improved on that basis as part of the research work.
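As a rough illustration of how the Hoeffding bound decides when enough data has been gathered, the sketch below uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), which this paper does not state explicitly. The function names `hoeffding_bound` and `can_split` and the sample numbers are assumptions made for the example only.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split measure, delta is the allowed
    probability of error, and n is the number of examples seen at the node.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain: float, second_gain: float,
              value_range: float, delta: float, n: int) -> bool:
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains for the two best attributes at one node.
few_examples = can_split(0.38, 0.25, value_range=1.0, delta=1e-7, n=100)
many_examples = can_split(0.38, 0.25, value_range=1.0, delta=1e-7, n=1000)
```

With the same observed gains, the node is not split after 100 examples but is split after 1000, because epsilon shrinks as n grows.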
As said earlier, stream data is huge, so to perform any analysis we need to take a
sample of the data so that it can be processed with ease. The sample should be chosen
so that whatever data falls within it is worth analyzing or processing, which means that
maximum knowledge is extracted from the sampled data.
In this paper, the sampling technique used is an adaptive sliding window in a
Hoeffding-bound-based tree algorithm.
2. RELATED WORK
Implementing an algorithm for stream data classification demands better
resource utilization as well as improved accuracy as classification proceeds.
Here we look at the improvement made to an algorithm that is based on concept drift
detection during classification of the data. Drift detection here is done using a windowing
technique.
Sliding Window: This is an advanced technique. It performs detailed analysis over the most
recent data items and over summarized versions of older ones.
The inspiration behind the sliding window is that the user is more concerned with the analysis
of the most recent data. Thus detailed analysis is done over the most recent data items,
while older ones are kept only in summarized form. This idea has been adopted in many
techniques in comprehensive data stream mining systems currently under development.
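The sliding-window idea described above can be sketched as follows. This is a minimal illustration, not the windowing scheme used in the experiments: the class name `SummarizingWindow` and the choice of a running count and sum as the "summarized version" of old items are assumptions.

```python
from collections import deque


class SummarizingWindow:
    """Keep the most recent items in full; keep only a summary of older ones."""

    def __init__(self, size: int):
        self.size = size
        self.recent = deque()   # detailed view of the newest items
        self.old_count = 0      # summary of everything that has left the window
        self.old_sum = 0.0

    def add(self, value: float) -> None:
        self.recent.append(value)
        if len(self.recent) > self.size:
            oldest = self.recent.popleft()
            # Fold the evicted item into the summary instead of storing it.
            self.old_count += 1
            self.old_sum += oldest

    def old_mean(self):
        """Mean of the summarized (evicted) items, or None if there are none."""
        return self.old_sum / self.old_count if self.old_count else None


window = SummarizingWindow(size=3)
for reading in [1, 2, 3, 4, 5]:
    window.add(reading)
```

After the five readings, the window holds the three most recent values in full, and the two oldest survive only as a count and a mean.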
3. CLASSIFICATION PROCESS.
Many data mining algorithms exist in practice. They can be categorized into
three types:
1. Classification
2. Clustering
3. Association

A standard classification system normally has three different phases:
1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification
accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabelled
data.
The three phases are carried out in sequence. See Figure 3.1 for the standard
classification phases.
Fig 3.1: Phases of standard classification systems
3.1. STREAM DATA MINING
Ordinary classification is usually considered in three phases. In the first phase, a
model is built using data, called the training data, for which the property of interest (the class)
is already known (labeled data). In the second phase, the model is used to predict the class of
data (test data), for which the property of interest is known, but which the model has not
previously seen. In the third phase, the model is deployed and used to predict the property of
interest for unlabelled data.
In stream classification, there is only a single stream of data, in which labeled and unlabelled
records occur together. The training/test and deployment phases, therefore,
interleave. Stream classification of unlabelled records could be required from the beginning
of the stream, after some sufficiently long initial sequence of labeled records, or at specific
moments in time or for a specific block of records selected by an external analyst.
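The interleaving of training/test and deployment can be sketched as a loop that tests on each labeled record before learning from it and only predicts for unlabelled records. The majority-class learner, the use of `None` to mark an unlabelled record, and all names below are assumptions chosen to keep the example small; they are not the classifier used in this paper.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent class so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def process_stream(stream, learner):
    """Interleave test, train, and deployment over a single stream."""
    correct = tested = 0
    predictions = []
    for x, y in stream:
        guess = learner.predict(x)
        if y is None:
            predictions.append(guess)   # deployment: unlabelled record
        else:
            tested += 1                 # test on the record first ...
            correct += int(guess == y)
            learner.learn(x, y)         # ... then train on it
    return correct, tested, predictions


stream = [((0,), "a"), ((1,), "a"), ((2,), None), ((3,), "b"), ((4,), "a")]
result = process_stream(stream, MajorityClassLearner())
```

Each labeled record is used exactly once for testing and once for training, which respects the single-scan restriction on stream data.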
4. ATTRIBUTE SELECTION CRITERION IN DECISION TREE:
Selecting an appropriate splitting criterion helps to improve performance along the
measurement dimensions. In data stream mining there are three main performance
measurement dimensions:
- Accuracy
- The amount of space (computer memory) required (model cost, or RAM-hours)
- The time required to learn from training examples and to predict (evaluation time)
These properties may be interdependent: adjusting the time and space used by an
algorithm can influence accuracy. By storing more pre-computed information, such as look
up tables, an algorithm can run faster at the expense of space. An algorithm can also run
faster by processing less information, either by stopping early or storing less, thus having less
data to process. The more time an algorithm has, the more likely it is that accuracy can be
increased.

There are two major attribute selection criteria: Information Gain and Gini Index.
The latter is also known as a binary split criterion. During the late 1970s and
1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree
algorithm known as ID3 [1] (Iterative Dichotomiser). ID3 uses information gain for attribute
selection. The information gain Gain(A) is given as Gain(A) = Info(D) - Info_A(D). We have
developed a new algorithm to calculate information gain. Methodologically, this algorithm is
promising. We have divided the algorithm into two parts: the first part calculates Info(D)
and the second part calculates Gain(A).
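The two-part calculation described above can be sketched as follows. This is an illustrative reading of Gain(A) = Info(D) - Info_A(D), not the authors' implementation; the function names `info` and `gain` are assumptions, and the class counts are those of the worked example in Section 4.1 (a parent of 30 instances split into children of 17 and 13 instances).

```python
import math


def info(class_counts):
    """Part 1: Info(D), the entropy of a set given its per-class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)


def gain(parent_counts, children_counts):
    """Part 2: Gain(A) = Info(D) - Info_A(D).

    Info_A(D) is the size-weighted average entropy of the subsets that
    splitting on attribute A produces.
    """
    total = sum(parent_counts)
    info_after = sum(sum(child) / total * info(child)
                     for child in children_counts)
    return info(parent_counts) - info_after


parent = [14, 16]               # 30 instances in two classes
children = [[13, 4], [1, 12]]   # 17 instances and 13 instances
example_gain = gain(parent, children)
```

For these counts the gain comes out at about 0.38, in agreement with the worked example.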
4.1. Information Gain Calculation
Information Gain = (information before split) - (information after split)
Entropy: A common way to measure impurity is entropy.
• Entropy = $-\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of class i,
computed as the proportion of class i in the set.
• Entropy comes from information theory. The higher the entropy, the more the
information content.
• For continuous data, the value is computed as $(a_i + a_{i+1})/2$.
Calculating Information Gain
Entire population (30 instances), parent entropy:
$$-\left(\frac{14}{30}\cdot\log_2\frac{14}{30}\right) - \left(\frac{16}{30}\cdot\log_2\frac{16}{30}\right) = 0.996$$
17 instances, child entropy:
$$-\left(\frac{13}{17}\cdot\log_2\frac{13}{17}\right) - \left(\frac{4}{17}\cdot\log_2\frac{4}{17}\right) = 0.787$$
13 instances, child entropy:
$$-\left(\frac{1}{13}\cdot\log_2\frac{1}{13}\right) - \left(\frac{12}{13}\cdot\log_2\frac{12}{13}\right) = 0.391$$
(Weighted) Average Entropy of Children:
$$\left(\frac{17}{30}\cdot 0.787\right) + \left(\frac{13}{30}\cdot 0.391\right) = 0.615$$
Information Gain = entropy(parent) - [average entropy(children)]
Information Gain = 0.996 - 0.615 = 0.38
Figure 4.1: Phases of standard classification systems
4.2. Calculating Gini Index
If a data set T contains examples from n classes, the Gini index, gini(T), is defined as
$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative frequency of class j in T. gini(T) is minimized if the classes in T are
skewed.
After splitting T into two subsets T1 and T2 with sizes N1 and N2, the Gini index of the split
data is defined as
$$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$
The attribute providing the smallest $gini_{split}(T)$ is chosen to split the node.
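The Gini index of a set and of a split can be sketched in the same way. The function names are assumptions, and the class counts are borrowed from the information gain example in Section 4.1 purely for illustration; the paper does not report Gini values for them.

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty subset contributes no impurity
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def gini_split(subsets):
    """Size-weighted Gini index of the subsets produced by a split."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * gini(s) for s in subsets)


parent_gini = gini([14, 16])                 # before the split
split_gini = gini_split([[13, 4], [1, 12]])  # after the split
```

A candidate split is preferred when its `gini_split` value is lower than that of the alternatives, which is the case whenever the resulting subsets are more skewed toward a single class.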

5. METHODOLOGY AND PROPOSED ALGORITHM
CVFDT (Concept-adapting Very Fast Decision Tree) is an extended version of
VFDT. It provides the same speed and accuracy advantages, and adds the ability to detect
and respond to changes in the example-generating process. Like other systems of this
kind, CVFDT uses a sliding window of examples to keep its model consistent. Most such
systems need to learn a new model from scratch after new data arrives. Instead,
CVFDT continuously monitors the quality of its model on new data and adjusts those parts
that are no longer correct. Whenever new data arrives, CVFDT increments the counts for the
new data and decrements the counts for the oldest data in the window. If the concept is
stationary, this has no statistical effect. If the concept is changing, however, some splits
will no longer appear best, because on the new data an alternative attribute provides more
gain than the previously chosen one. Whenever this occurs, CVFDT starts to grow an
alternative subtree with the new best attribute at its root. Once the new subtree is more
accurate on new data, it replaces the old subtree.
5.1 CVFDT ALGORITHM (Based on HoeffdingTree)
1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y)
passes through.
5. Add (x, y) to the sliding window of examples.
6. Remove and forget the effect of the oldest example if the sliding window overflows.
7. Call CVFDT Grow.
8. Call Check Split Validity if f examples have been seen since the last checking of
alternate trees.
9. Return HT.
Fig: 5.1 Flow of CVFDT algorithm
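Steps 5 and 6 of the algorithm above, adding an example to the window and forgetting the oldest one, can be sketched with a count table that is incremented and decremented as the window moves. This is only a sketch of the bookkeeping at a single node; it does not grow trees or check split validity, and the class name `WindowedCounts` and the (attribute index, value, class) key layout are assumptions.

```python
from collections import Counter, deque


class WindowedCounts:
    """Sufficient statistics kept consistent with a sliding window of examples."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()  # (attribute index, value, class) -> count

    def add(self, x, y) -> None:
        """Step 5: add (x, y) and increment its counts."""
        self.window.append((x, y))
        for i, value in enumerate(x):
            self.counts[(i, value, y)] += 1
        if len(self.window) > self.window_size:
            self._forget(*self.window.popleft())

    def _forget(self, x, y) -> None:
        """Step 6: remove the effect of the oldest example."""
        for i, value in enumerate(x):
            key = (i, value, y)
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]


stats = WindowedCounts(window_size=2)
stats.add(("sunny", "hot"), "no")
stats.add(("rainy", "hot"), "yes")
stats.add(("sunny", "mild"), "yes")  # pushes the first example out
```

Because the counts always reflect only the examples inside the window, a split measure computed from them follows the current concept rather than the whole history of the stream.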

6. EXPERIMENTAL ANALYSIS WITH OBSERVATION
Different types of data sets were taken, and the CVFDT algorithm was implemented
after importing those data sets into MOA. Various split criteria used in the decision tree
approach were also tested to improve the accuracy of the algorithm. The data sets
used here are in ARFF format. Some of the data are taken from the repository of the
University of California, and some from projects in Spain that work on stream data.
Data sets taken were as follows:
1) Sensor
2) SEA
3) Random Tree Generator
The readings taken here are for the Sensor data. It contains information (temperature,
humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley
Research Lab. The whole stream contains consecutive information recorded over a 2-month
period (1 reading every 1-3 minutes). The sensor ID is used as the class label, so the learning
task of the stream is to correctly identify the sensor ID (1 out of 54 sensors) purely from the
sensor data and the corresponding recording time. As the data stream flows over time, so
do the concepts underlying the stream. For example, the lighting during working hours
is generally stronger than at night, and the temperature of specific sensors (conference room)
may regularly rise during meetings.
Fig: 6.1 MIT Computer Science and Artificial Intelligence Lab data repository
As discussed above, an attribute selection measure is a heuristic for selecting the splitting
criterion that "best" separates the given data. Two common methods used for it are:
1) Entropy based method (i.e. Information Gain)
2) Gini Index
6.1 RANDOM TREE GENERATOR DATA SET RESULTS

Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000 92.6 81.7
200000 93 83
300000 94.7 80.1
400000 96.3 82.2
500000 94.8 80.9
600000 96.9 81.9
700000 96.9 82.6
800000 96.7 82.1
900000 98.7 84
1000000 97.4 77.9
Table-I: Comparison for accuracy in random tree generator
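To summarize Table-I in a single figure per criterion, the mean accuracy over the ten checkpoints can be computed directly from the tabulated values. The averaging is an added convenience and is not a statistic reported by the authors; the variable names are assumptions.

```python
from statistics import mean

# Accuracy values copied from Table-I (random tree generator), one per 100000 instances.
info_gain_accuracy = [92.6, 93, 94.7, 96.3, 94.8, 96.9, 96.9, 96.7, 98.7, 97.4]
gini_index_accuracy = [81.7, 83, 80.1, 82.2, 80.9, 81.9, 82.6, 82.1, 84, 77.9]

mean_info_gain = mean(info_gain_accuracy)
mean_gini_index = mean(gini_index_accuracy)
mean_difference = mean_info_gain - mean_gini_index
```

On this data set the Information Gain column is higher than the Gini Index column at every checkpoint, so the difference of the means is positive.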
6.2 SEA DATA SET RESULTS
Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000 89.8 89.3
200000 92.1 91.6
300000 89.6 89.3
400000 89.1 88.9
500000 88.5 88.5
600000 88.8 88.1
700000 90.6 90.6
800000 89.5 89.3
900000 89.1 89
1000000 89.9 89.9
Table-II: Comparison for accuracy for SEA Data

6.3 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (CPU UTILIZATION)
Learning evaluation instances    Evaluation time (CPU seconds), Info Gain    Evaluation time (CPU seconds), Gini Index
100000 6.676843 8.704856
200000 13.46289 18.67332
300000 20.23333 29.40619
400000 26.97257 39.87386
500000 33.68062 49.63952
600000 40.40426 59.06198
700000 47.0499 67.70443
800000 53.74234 78.0941
900000 59.93558 88.14057
1000000 66.79963 98.48343
1100000 73.27367 107.1727
1200000 79.27971 116.9851
1300000 85.53535 127.016
1400000 91.99379 136.6257
1500000 98.40543 145.2993
1600000 104.3803 152.9278
1700000 110.3083 160.0102
1800000 116.4859 168.1223
1900000 121.9928 174.8459
Table-III: Comparison of CPU Utilization time for SENSOR Data

6.4 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (ACCURACY)
Learning evaluation instances    Classifications correct (percent), Info Gain    Classifications correct (percent), Gini Index
100000 96.3 98.4
200000 68.3 69.7
300000 18 64.4
400000 43.2 67.4
500000 62.8 72.9
600000 92 71
700000 97.9 72.5
800000 97.4 73.9
900000 96.8 73.7
1000000 80.6 68.5
1100000 53.6 71.2
1200000 71 90.3
1300000 84.1 73.1
1400000 78.5 83.9
1500000 96.3 84.9
1600000 50.9 84.9
1700000 24 79
1800000 74.3 87.6
1900000 98 97.8
Table-IV: Comparison of ACCURACY for SENSOR Data

6.5 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (TREE SIZE)
Learning evaluation instances    Tree size (nodes), Info Gain    Tree size (nodes), Gini Index
100000 14 126
200000 30 270
300000 44 396
400000 60 530
500000 76 666
600000 88 800
700000 102 938
800000 122 1076
900000 136 1214
1000000 150 1346
1100000 172 1466
1200000 196 1602
1300000 216 1742
1400000 226 1868
1500000 240 1998
1600000 262 2122
1700000 282 2238
1800000 292 2352
1900000 312 2474
Table-V: Comparison of TREE SIZE for SENSOR Data

6.6 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (LEAVES)
Learning evaluation instances    Tree size (leaves), Info Gain    Tree size (leaves), Gini Index
100000 7 63
200000 15 135
300000 22 198
400000 30 265
500000 38 333
600000 44 400
700000 51 469
800000 61 538
900000 68 607
1000000 75 673
1100000 86 733
1200000 98 801
1300000 108 871
1400000 113 934
1500000 120 999
1600000 131 1061
1700000 141 1119
1800000 146 1176
1900000 156 1237
Table-VI: Comparison of LEAVES for SENSOR Data
6.7 COMPARISON OF ALL DIMENSIONS OF PERFORMANCE TOGETHER FOR SENSOR DATA
Fig 6.2: Comparison of Performance for Sensor Data for every dimension together

7. CONCLUSION
In this paper, we discussed theoretical aspects and practical results of stream
data mining classification algorithms with different split criteria. The comparison across
different data sets shows the results. Hoeffding trees with the windowing technique and
Information Gain as the split criterion spend the least time on learning and result in higher
accuracy than with Gini Index. Memory utilization, accuracy, and CPU utilization, which are
crucial factors for stream data, are discussed practically in this paper, with observations.
Classification generates a decision tree, and the tree generated with Information Gain as the
split criterion is also smaller, as shown in the tables, along with a dramatic change in
accuracy and CPU utilization.
REFERENCES
[1] Elena Ikonomovska, Suzana Loskovska, Dejan Gjorgjevik, “A Survey of Stream Data
Mining”, Eighth National Conference with International Participation - ETAI 2007.
[2] S. Muthukrishnan, “Data Streams: Algorithms and Applications”, Proceedings of the
Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, “Mining Data
Streams: A Review”, Centre for Distributed Systems and Software Engineering, Monash
University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, “A General Method for Scaling Up Machine Learning
Algorithms and its Application to Clustering”, Proceedings of the Eighteenth International
Conference on Machine Learning, 2001, Williamstown, MA, Morgan Kaufmann.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M.
Klein, M. Vasa, and D. Handy, “VEDAS: A Mobile and Distributed Data Stream Mining
System for Real-Time Vehicle Monitoring”, Proceedings of SIAM International Conference
on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, “Adaptive Parameter-free Learning from Evolving Data
Streams”, Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, “Mining Stream with Concept Drift”, Master’s thesis, Poznan
University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, “An Analysis of Data Mining: Past,
Present and Future”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 1, 2012, pp. 1 - 9, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, “A Literature Review
on the Data Mining And Information Security”, International Journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141 - 146, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
