Integrated Intelligent Research (IIR), International Journal of Business Intelligent
Volume: 04, Issue: 02, December 2015, Pages 62-68
ISSN: 2278-2400
Performance Comparison of Decision Tree Algorithms to Find Out the Reason for Students' Absenteeism at the Undergraduate Level in a College for an Academic Year
G. Suresh 1, K. Arunmozhi Arasan 2, S. Muthukumaran 3
1 Assistant Professor, PG and Research Dept. of Computer Applications, St. Joseph's College of Arts and Science (Autonomous), Cuddalore.
2 HOD and Assistant Professor, Department of Computer Science and Applications, Siga College of Management and Computer Science, Villupuram.
3 Assistant Professor, Department of Computer Science and Applications, Siga College of Management and Computer Science, Villupuram.
Email: sureshg2233@yahoo.co.in, arunlucks@yahoo.co.in, muthulecturer@rediffmail.com
Abstract- Educational data mining is used to study the data available in the educational field and bring out the hidden knowledge from it. Classification methods like decision trees and rule mining can be applied to educational data for predicting students' behavior. This paper focuses on finding the suitable algorithm that yields the best result in finding out the reason behind students' absenteeism in an academic year. The first step in this process is to gather student data using a questionnaire. The data were collected from 123 undergraduate students of a private college situated in a semi-rural area. The second step is to clean the data so that it is appropriate for mining and to choose the relevant attributes. In the final step, three different decision tree induction algorithms, namely ID3 (Iterative Dichotomiser), C4.5 and CART (Classification and Regression Tree), were applied to the same data sample collected using the questionnaire. The results were compared to find the algorithm that yields the best result in predicting the reason for students' absenteeism.

Keywords: Data Mining, Decision Tree Induction, ID3, C4.5, CART algorithm.
I. INTRODUCTION
Currently many educational institutions, especially small and medium education institutions, are facing problems with lack of attendance among students [1]. Students who have less than 80% attendance are not permitted by the concerned universities to appear for the semester exams. In recent years all educational institutions have been facing this lack-of-attendance problem [2]. Hence, this research aims to find a suitable decision tree algorithm for predicting the reason for students' lack of attendance.
II. LITERATURE SURVEY
Ekkachai Naenudorn and Jatsada Singthongchai presented [3,4] their study on student recruiting in higher education institutions. The objectives of that study were to test the validity of the model derived from decision rules and to find the right algorithm for the data classification task, by comparing four algorithms (J48, ID3, Naïve Bayes and OneR) with the goal of predicting the features of students who are likely to undergo the student admission process. Sunita B. Aher and Lobo L.M.R.J. [5] presented a paper in which they compared five classification algorithms to choose the best classification algorithm for a course recommendation system. These five classification algorithms are ADTree, Simple CART, J48, ZeroR and Naïve Bayes. They compared these algorithms using the open source data mining tool Weka and presented the results [6]. Dursun Delen, Glenn Walker and Amit Kadam [7] presented a paper on predicting breast cancer survivability; they used two popular data mining algorithms (artificial neural networks and decision trees) along with the most commonly used logistic regression method to develop the prediction models using a large data set [8]. They also used 10-fold cross-validation for performance comparison, and the results indicated that the decision tree (C5) is the best predictor, with 93.6% accuracy on the holdout sample.
III. BACKGROUND KNOWLEDGE
Decision tree induction is the learning of decision trees from class-labeled training tuples [9]. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A. Decision Tree Induction Algorithm
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) [10]. ID3 adopts a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner. A basic decision tree algorithm is summarized below.
Algorithm: Generate decision tree. Generate a decision tree
from the training tuples of data partition D.
Input:
- Data partition, D, which is a set of training tuples and their associated class labels.
- attribute_list, the set of candidate attributes.
- Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and possibly either a split point or a splitting subset.
Output: A decision tree.
Method:
1. Create a node N;
2. If the tuples in D are all of the same class C, then
3. Return N as a leaf node labeled with the class C;
4. If attribute_list is empty, then
5. Return N as a leaf node labeled with the majority class in D; // majority voting
6. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
7. Label node N with splitting_criterion;
8. If splitting_attribute is discrete-valued and multiway splits are allowed, then // not restricted to binary trees
9. attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
10. For each outcome j of splitting_criterion: // partition the tuples and grow subtrees for each partition
11. Let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. If Dj is empty, then
13. Attach a leaf labeled with the majority class in D to node N;
14. Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
End for
15. Return N;
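Purely as an illustration of this scheme (the paper uses the Tanagra tool and does not provide code), a minimal Python sketch of the top-down recursive induction might look as follows; select_attribute is a hypothetical pluggable function standing in for Attribute_selection_method, so that information gain, gain ratio or Gini reduction can be used, and each row is assumed to be a dictionary of attribute values:

```python
from collections import Counter

def build_tree(rows, labels, attributes, select_attribute):
    """Top-down recursive decision-tree induction (illustrative sketch)."""
    # All tuples belong to the same class -> return a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No candidate attributes left -> leaf labeled with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(rows, labels, attributes)   # splitting attribute
    remaining = [a for a in attributes if a != best]    # remove the splitting attribute
    node = {best: {}}
    for value in {row[best] for row in rows}:           # one branch per outcome
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node[best][value] = build_tree(sub_rows, sub_labels, remaining, select_attribute)
    return node
```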
B. Information Gain
Entropy:
ID3 uses information gain, an entropy-based measure [11], as its attribute selection measure. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify an example in dataset D is given by

Info(D) = -∑_{i=1}^{m} p_i · log2(p_i)

The information still needed to arrive at an exact classification of the examples after partitioning D on attribute A is given by

Info_A(D) = ∑_{j=1}^{v} (|Dj|/|D|) · Info(Dj)

The term |Dj|/|D| acts as the weight of the j-th partition, and Info_A(D) is the expected information required to classify an example from dataset D based on the partitioning by A [12]. The information gain is defined as the difference between the original information requirement and the new requirement, that is,

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
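As an illustration (the paper itself works in Tanagra rather than code), these two measures can be written as a short Python sketch; labels is assumed to be the list of class values of the tuples in D and attr_values the corresponding values of a candidate attribute A:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain(A) = Info(D) - Info_A(D), weighting each partition by |Dj|/|D|."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attr_values, labels):
        partitions.setdefault(value, []).append(label)
    info_a = sum((len(part) / total) * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```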
C. C4.5 Algorithm
Gain Ratio:
The gain ratio [13] applies a kind of normalization to information gain using a split information value, defined analogously to Info(D) as

SplitInfo_A(D) = -∑_{j=1}^{v} (|Dj|/|D|) · log2(|Dj|/|D|)

This value differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
D. C4.5
C4.5 is the extension of the ID3 algorithm [15]. It uses the gain ratio as additional information to select the best attribute to act as the root node, i.e., the attribute with the maximum gain ratio. The general algorithm for building the decision tree is:
Algorithm
1. Check for the base cases.
2. For each attribute a, find the normalized information gain (gain ratio) from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
E. Classification and Regression Tree (CART)
Classification and Regression Trees (CART) was introduced by Breiman et al. It is a statistical technique that can select, from a large number of explanatory variables (x), those that are most important in determining the response variable (y) to be explained [16]. The CART steps can be summarized as follows.
1. Assign all objects to the root node.
2. Split each explanatory variable at all its possible split points (that is, between all the values observed for that variable in the considered node).
3. For each split point, split the parent node into two child
nodes by separating the objects with values lower and
higher than the split point for the considered explanatory
variable.
4. Select the variable and split point with the highest
reduction of impurity.
5. Perform the split of the parent node into two child nodes
according to the selected split point.
6. Repeat steps 2-5, using each node as a new parent node,
until the tree has maximum size.
7. Prune the tree back using cross-validation to select the
optimal sized tree.
For a node with n objects the impurity is then defined as:

Impurity = ∑_{i=1}^{n} (y_i - ȳ)^2

The Gini index of a node with n objects and c possible classes is defined as:

Gini = 1 - ∑_{j=1}^{c} (n_j/n)^2
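As a small illustrative sketch (not from the paper), the Gini index of a node and the weighted Gini impurity of a candidate split can be computed in Python as:

```python
from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum_j (n_j / n)^2 over the class counts of a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(attr_values, labels):
    """Weighted Gini impurity of the child nodes after splitting on an attribute."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attr_values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / n * gini_index(part) for part in partitions.values())
```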
IV. DATA COLLECTION AND SYSTEM
FRAMEWORK
The data are collected from a private college at Ulundurpet in
Villupuram district. Among the 123 students 85 were male and
38 were female candidates.
Figure 1: System Framework
A. Experimental Setup and Evaluation
Tanagra is open source software used for data mining. It supports all the basic data formats like *.xls, *.txt, etc., and it is very user friendly [17]. The data set is loaded into Tanagra. The analysis of the data is done on the basis of gender, i.e., the reason why a male student takes leave and the reason why a female student takes leave from college, by applying entropy and information gain.
Table 1: Total records in our dataset
MALE  FEMALE  TOTAL
85    38      123
Info(D) = -∑_{i=1}^{m} p_i · log2(p_i)
        = -(85/123)·log2(85/123) - (38/123)·log2(38/123)
        = 0.892
Table 2: Total records for the attribute Job
JOB                 MALE  FEMALE  TOTAL
Strongly agree      25    1       26
Agree               28    2       30
Neutral             6     8       14
Disagree            14    13      27
Strongly disagree   12    14      26
TOTAL               85    38      123
The gain for the attribute part-time job is

Info_job(D) = ∑_{j=1}^{v} (|Dj|/|D|) · Info(Dj)
  = (26/123)·[-(25/26)·log2(25/26) - (1/26)·log2(1/26)]
  + (30/123)·[-(28/30)·log2(28/30) - (2/30)·log2(2/30)]
  + (14/123)·[-(6/14)·log2(6/14) - (8/14)·log2(8/14)]
  + (27/123)·[-(14/27)·log2(14/27) - (13/27)·log2(13/27)]
  + (26/123)·[-(12/26)·log2(12/26) - (14/26)·log2(14/26)]
  = 0.050 + 0.086 + 0.112 + 0.219 + 0.210
  = 0.678

Gain(job) = Info(D) - Info_job(D) = 0.892 - 0.678 = 0.214
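For verification only, the figures above can be reproduced with a few lines of Python using the counts from Tables 1 and 2 (a sketch, not part of the paper's Tanagra workflow):

```python
import math

def info(counts):
    """Info(D) = -sum p_i * log2(p_i) for a list of class counts, e.g. [male, female]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

info_d = info([85, 38])                                            # whole dataset (Table 1), ~0.892
job_partitions = [[25, 1], [28, 2], [6, 8], [14, 13], [12, 14]]    # rows of Table 2
info_job = sum(sum(p) / 123 * info(p) for p in job_partitions)     # ~0.678
gain_job = info_d - info_job                                       # ~0.214
print(round(info_d, 3), round(info_job, 3), round(gain_job, 3))
```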
The ID3 algorithm uses the entropy and information gain values to select the root node. The information gain of the attribute Job is the highest in the dataset, and the remaining attributes have lower values.

Figure 2: Decision Tree (ID3) in Tanagra
The Job attribute is taken as the root node of the tree, and the 123 records are split by the branches of Job into Strongly Agree (26 records), Agree (30 records), Neutral (14 records), Disagree (27 records) and Strongly Disagree (26 records). The same process is applied to each resulting partition, and the decision tree shown in Figure 2 is formed. The C4.5 algorithm uses an extension of information gain known as the gain ratio [18]; it applies a kind of normalization to information gain using a split information value. The split information is calculated as below:
SplitInfo_job(D) = -∑_{j=1}^{v} (|Dj|/|D|) · log2(|Dj|/|D|)
  = -(26/123)·log2(26/123) - (30/123)·log2(30/123) - (14/123)·log2(14/123) - (27/123)·log2(27/123) - (26/123)·log2(26/123)
  = 0.4740 + 0.4965 + 0.3568 + 0.4802 + 0.4740
  = 2.2815
The gain ratio for the part-time job attribute is

GainRatio(job) = Gain(job) / SplitInfo_job(D) = 0.214 / 2.2815 = 0.0938
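Again for verification only, a short Python sketch using the partition sizes of the Job attribute assumed from Table 2 (the split information computed exactly differs from the hand sum only in the last decimal place):

```python
import math

sizes = [26, 30, 14, 27, 26]   # partition sizes of the Job attribute, |D| = 123
split_info = -sum(s / 123 * math.log2(s / 123) for s in sizes)   # ~2.281
gain_ratio = 0.214 / split_info                                  # ~0.0938
print(round(split_info, 3), round(gain_ratio, 4))
```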
The result obtained from Tanagra using the C4.5 algorithm is shown in the figure below. Here the Job attribute is taken as the root node of the decision tree because it has the maximum gain ratio.

Figure 3: Decision Tree (C4.5) in Tanagra
The Gini index is used in the CART algorithm. The Gini index value for the Job attribute is calculated as below.

Gini = 1 - ∑_{j=1}^{c} (n_j/n)^2

Gini(D) = 1 - (85/123)^2 - (38/123)^2 = 0.428
Figure 4: Decision Tree (CART) in Tanagra
The Gini value of the split on the attribute part-time job is

Gini_job(D) = (26/123)·[1 - (25/26)^2 - (1/26)^2] + (30/123)·[1 - (28/30)^2 - (2/30)^2] + (14/123)·[1 - (6/14)^2 - (8/14)^2] + (27/123)·[1 - (14/27)^2 - (13/27)^2] + (26/123)·[1 - (12/26)^2 - (14/26)^2]
  = 0.0156 + 0.0302 + 0.0553 + 0.1096 + 0.1050
  = 0.3157

The reduction in impurity obtained by splitting on Job is therefore
Gini(job) = 0.428 - 0.3157 = 0.1123
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index after the split) is selected as the splitting attribute. Here, the Job attribute has the minimum Gini index value in the dataset.
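A short verification sketch in Python (illustrative only); the small differences from the hand calculation above come from the rounding of intermediate terms:

```python
def gini(counts):
    """Gini = 1 - sum (n_j / n)^2 for a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

gini_d = gini([85, 38])                                            # ~0.43 for the whole dataset
job_partitions = [[25, 1], [28, 2], [6, 8], [14, 13], [12, 14]]    # rows of Table 2
gini_job = sum(sum(p) / 123 * gini(p) for p in job_partitions)     # ~0.32
print(round(gini_d, 3), round(gini_job, 3), round(gini_d - gini_job, 3))  # reduction ~0.11
```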
B. Performance Measures
The confusion matrix is a tool for the analysis of a classifier. Given m classes, a confusion matrix is a table of at least size m by m. An entry CM(i, j) in the first m rows and m columns indicates the number of tuples of class i that were labelled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM(1,1) to entry CM(m,m), with the rest of the entries being close to zero.
The performance of the ID3, C4.5 and CART decision tree classification algorithms was evaluated by the accuracy (%) on the student leave data set [19]. The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier.

Table 3: Confusion Matrix
            Predicted C1      Predicted C2
Actual C1   True Positives    False Negatives
Actual C2   False Positives   True Negatives

Table 4: Confusion Matrix of C4.5 for our dataset
         Male   Female   Sum
Male     81     4        85
Female   9      29       38
Sum      90     33       123
Accuracy(M) = (TP + TN) / (TP + TN + FP + FN)

For our data set, using the C4.5 algorithm we have

Accuracy(M) = (81 + 29) / (81 + 29 + 9 + 4) = 0.8943
A true positive (TP) occurs when the classifier correctly labels a class-1 tuple as class 1. A true negative (TN) occurs when the classifier correctly labels a class-2 tuple as class 2. A false positive (FP) occurs when the classifier incorrectly labels a class-2 tuple as class 1. A false negative (FN) occurs when the classifier incorrectly labels a class-1 tuple as class 2.
Error Rate = 1 - Accuracy(M)

For our data set, using the C4.5 algorithm we have

Error Rate = 1 - 0.8943 = 0.1057

The sensitivity measures the proportion of actual positives which are correctly identified.

Sensitivity = TP / (TP + FN)

For our data set, using the C4.5 algorithm we have

Sensitivity = 81 / (81 + 4) = 0.9529
The specificity measures the proportion of negatives which are
correctly identified.
Specificity = TN / (TN + FP)

For our data set, using the C4.5 algorithm we have

Specificity = 29 / (29 + 9) = 0.7632
Precision is the percentage of true positives (TP) compared to
the total number of cases classified as positive events[13].
Precision = TP / (TP + FP)

1 - Precision = 1 - TP / (TP + FP)

For our data set, using the C4.5 algorithm we have

Precision = 81 / (81 + 9) = 0.9000

1 - Precision = 1 - 0.9000 = 0.1000
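For illustration, the C4.5 entries of Table 5 can be reproduced from the confusion matrix in Table 4 with a short Python sketch (not part of the paper, which reads these values from Tanagra):

```python
# Confusion matrix of C4.5 on the student leave data set (Table 4).
tp, fn = 81, 4   # actual male tuples predicted as male / female
fp, tn = 9, 29   # actual female tuples predicted as male / female

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # ~0.8943
error_rate  = 1 - accuracy                      # ~0.1057
sensitivity = tp / (tp + fn)                    # recall, ~0.9529
specificity = tn / (tn + fp)                    # ~0.7632
precision   = tp / (tp + fp)                    # ~0.9000
print(accuracy, error_rate, sensitivity, specificity, precision)
```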
The following table values are calculated using the confusion matrix obtained from the Tanagra tool.
Table 5: Comparison values of the three Decision Tree
Algorithms
Measures ID3 C-RT C4.5
Accuracy 0.7317 0.8211 0.8943
Error Rate 0.2683 0.1789 0.1057
Recall 0.8706 0.9176 0.9529
Specificity 0.4211 0.6053 0.7632
Precision 0.7708 0.8387 0.9000
1-Precision 0.2292 0.1613 0.1000
Table 5 shows that the C4.5 algorithm has the highest accuracy rate (89.43%) and the lowest error rate (10.57%) among the three algorithms. ID3 and CART have accuracy rates of 73.17% and 82.11%, respectively.
The C4.5 algorithm also gave the highest specificity (0.7632, i.e., 76.32% of negative-class tuples correctly classified) and the highest sensitivity (0.9529, i.e., 95.29% of positive-class tuples correctly classified).
C. Rule Extraction From C4.5 Decision Tree
To extract rules from a decision tree, one rule is created for
each path from the root to a leaf node[20]. Each splitting
criterion along a given path is logically ANDed to form the
rule antecedent (IF part). The leaf node holds the class
prediction forming the rule consequent (THEN part).
IF condition THEN conclusion
Using the above method we extract the rules obtained from the
decision tree using the C4.5 algorithm.
R1: If (job = strongly agree) → (gender = male) [26/123]
Coverage(R1) = 26/123 = 21.13 %
R2: If (job = agree) → (gender = male) [30/123]
Coverage(R2) = 30/123 = 24.39 %
R3: If (job = neutral) ∧ (staff problem) → (gender = male) [6/123]
Coverage(R3) = 6/123 = 4.87 %
R4: If (job = neutral) ∧ (staff problem) → (gender = female) [8/123]
Coverage(R4) = 8/123 = 6.5 %
R5: If (job = disagree) ∧ (occasionally) → (gender = male) [12/123]
Coverage(R5) = 12/123 = 9.75 %
R6: If (job = disagree) ∧ (occasionally) → (gender = female) [15/123]
Coverage(R6) = 15/123 = 12.19 %
R7: If (job = strongly disagree) ∧ (accident) → (gender = male) [16/123]
Coverage(R7) = 16/123 = 13 %
R8: If (job = strongly disagree) ∧ (accident) → (gender = female) [10/123]
Coverage(R8) = 10/123 = 8.13 %
D. Decision Tree Obtained Using Tanagra Tool
Table 6: Decision Tree obtained from Tanagra tool
S.No.  Name of the algorithm  Decision tree obtained from the Tanagra tool

1. ID3
- Job in [B]
  - examstudy in [A] then Gender = A (85.71 % of 7 examples)
  - examstudy in [C] then Gender = A (0.00 % of 0 examples)
  - examstudy in [B] then Gender = A (100.00 % of 10 examples)
  - examstudy in [D] then Gender = A (100.00 % of 11 examples)
  - examstudy in [E] then Gender = A (50.00 % of 2 examples)
- Job in [E]
  - Fess in [B] then Gender = B (100.00 % of 2 examples)
  - Fess in [E] then Gender = A (54.55 % of 11 examples)
  - Fess in [A] then Gender = B (54.55 % of 11 examples)
  - Fess in [C] then Gender = A (50.00 % of 2 examples)
  - Fess in [D] then Gender = A (0.00 % of 0 examples)
- Job in [A]
  - Dept in [BCA] then Gender = A (93.33 % of 15 examples)
  - Dept in [BBA] then Gender = A (100.00 % of 10 examples)
  - Dept in [BCOM] then Gender = A (100.00 % of 1 examples)
- Job in [C] then Gender = B (57.14 % of 14 examples)
- Job in [D] then Gender = A (51.85 % of 27 examples)

2. C4.5
- Job in [B] then Gender = A (93.33 % of 30 examples)
- Job in [E]
  - Accident in [E] then Gender = B (85.71 % of 7 examples)
  - Accident in [B] then Gender = A (100.00 % of 2 examples)
  - Accident in [D] then Gender = B (100.00 % of 1 examples)
  - Accident in [A] then Gender = A (64.29 % of 14 examples)
  - Accident in [C] then Gender = B (100.00 % of 2 examples)
- Job in [A] then Gender = A (96.15 % of 26 examples)
- Job in [C]
  - Stafproblm in [E] then Gender = A (83.33 % of 6 examples)
  - Stafproblm in [A] then Gender = B (100.00 % of 1 examples)
  - Stafproblm in [C] then Gender = A (0.00 % of 0 examples)
  - Stafproblm in [D] then Gender = B (85.71 % of 7 examples)
  - Stafproblm in [B] then Gender = A (0.00 % of 0 examples)
- Job in [D]
  - Occasionaly in [E] then Gender = A (100.00 % of 6 examples)
  - Occasionaly in [C] then Gender = A (100.00 % of 3 examples)
  - Occasionaly in [D] then Gender = B (86.67 % of 15 examples)
  - Occasionaly in [A] then Gender = A (100.00 % of 2 examples)
  - Occasionaly in [B] then Gender = A (100.00 % of 1 examples)

3. CART
- Job in [B,A] then Gender = A (92.86 % of 26 examples)
- Job in [E,C,D]
  - misbus in [B,D] then Gender = B (83.33 % of 3 examples)
  - misbus in [E,C,A] then Gender = A (63.64 % of 12 examples)
V. RESULTS AND DISCUSSION
From the above computation and the decision trees obtained using the Tanagra tool, the following results were obtained using the C4.5 decision tree algorithm.
- 56 male students strongly agree that they take leave for the reason of going to a job.
- 6 male students and 8 female students who were neutral about going to a job agree that they take leave because of problems with the staff.
- 12 male and 15 female students who disagree about going to a job agree that they take leave occasionally for no reason.
- 16 male and 10 female students who strongly disagree about going to a job agree that they take leave if they meet with an accident.
When the same procedure is carried out to predict the reason for student absenteeism using the ID3 and CART decision tree algorithms, the comparison shows that the C4.5 algorithm has relatively high performance.
VI. CONCLUSION
The results obtained from Tanagra clearly show that the C4.5 algorithm has relatively high performance when compared with the other two decision tree algorithms, namely ID3 and CART. Hence, to achieve the best results when classifying a small dataset, the C4.5 algorithm is the best choice among the three decision tree algorithms.
REFERENCES
[1] Waheed, T., Bonnell, R. B., Prasher, S. O., & Paulet, E., "Measuring performance in precision agriculture: CART—A decision tree approach", Agricultural Water Management, Vol. 84, Issue 1, 2006, pp. 173-185.
[2] Naenudorn, E., Singthongchai, J., Kerdprasop, N., & Kerdprasop, K., "Classification model induction for student recruiting", Latest Advances in Educational Technologies, 2012, pp. 117-122.
[3] Aher, S. B., & Lobo, L. M. R. J., "Comparative Study of Classification Algorithms", International Journal of Information Technology, Vol. 5, Issue 2, 2012, pp. 239-243.
[4] Hsieh, A. R., Hsiao, C. L., Chang, S. W., Wang, H. M., & Fann, C. S., "On the use of multifactor dimensionality reduction (MDR) and classification and regression tree (CART) to identify haplotype-haplotype interactions in genetic studies", Genomics, Vol. 9, Issue 2, 2012, pp. 77-85.
[5] Delen, D., Walker, G., & Kadam, A., "Predicting breast cancer survivability: a comparison of three data mining methods", Artificial Intelligence in Medicine, Vol. 34, Issue 2, 2005, pp. 113-127.
[6] Kumar, S. A., & Vijayalakshmi, M. N., "Efficiency of decision trees in predicting student's academic performance", First International Conference on Computer Science, Engineering and Applications, CS and IT, Vol. 2, 2011, pp. 335-343.
[7] Han, J., Kamber, M., & Pei, J., "Data Mining: Concepts and Techniques", Southeast Asia Edition, Morgan Kaufmann, 2006.
[8] Sun, H., "Research on Student Learning Result System based on Data Mining", IJCSNS, Vol. 10, Issue 4, 2010, p. 203.
[9] Sehgal, L., Mohan, N., & Sandhu, P. S., "Quality prediction of function based software using decision tree approach", International Conference on Computer Engineering and Multimedia Technologies (ICCEMT), 2012, pp. 43-47.
[10] Hsu, Chung-Chian, Yan-Ping Huang, and Keng-Wei Chang, "Extended Naive Bayes classifier for mixed data", Expert Systems with Applications, Vol. 35, Issue 3, 2008, pp. 1080-1083.
[11] Questier, F., "The use of CART and multivariate regression trees for supervised and unsupervised feature selection", Chemometrics and Intelligent Laboratory Systems, Vol. 76, Issue 1, 2005, pp. 45-54.
[12] Zighed, Djamel A., "Decision trees with optimal joint partitioning", International Journal of Intelligent Systems, Vol. 20, Issue 7, 2005, pp. 693-718.
[13] Deconinck, Eric, "Classification trees based on infrared spectroscopic data to discriminate between genuine and counterfeit medicines", Journal of Pharmaceutical and Biomedical Analysis, Vol. 57, 2012, pp. 68-75.
[14] Sanders, Peter, "Analysis of nearest neighbor load balancing algorithms for random loads", Parallel Computing, Vol. 25, Issue 8, 1999, pp. 1013-1033.
[15] García, Salvador, "Evolutionary-based selection of generalized instances for imbalanced classification", Knowledge-Based Systems, Vol. 25, Issue 1, 2012, pp. 3-12.
[16] Mohamed, Marghny H., "Rules extraction from constructively trained neural networks based on genetic algorithms", Neurocomputing, Vol. 74, Issue 17, 2011, pp. 3180-3192.
[17] Wang, L. M., Li, X. L., Cao, C. H., & Yuan, S. M., "Combining decision tree and Naive Bayes for classification", Knowledge-Based Systems, Vol. 19, Issue 7, 2006, pp. 511-515.
[18] Talaie, A., Esmaili, N., Taguchi, T., Romagnoli, J. A., & Talaie, F., "Application of an engineering algorithm with software code (C4.5) for specific ion detection in the chemical industries", Advances in Engineering Software, Vol. 30, Issue 1, 1999, pp. 13-19.
[19] Lu, S. H., Chiang, D. A., Keh, H. C., & Huang, H. H., "Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values", Knowledge-Based Systems, Vol. 23, Issue 6, 2010, pp. 598-604.
[20] Vlahou, Antonia, "Diagnosis of ovarian cancer using decision tree classification of mass spectral data", BioMed Research International, Vol. 5, 2003, pp. 308-314.

Mais conteúdo relacionado

Semelhante a Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for Student�s Absenteeism at the Undergraduate Level in a College for an Academic Year

Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
Editor IJCATR
 
A Survey of Modern Data Classification Techniques
A Survey of Modern Data Classification TechniquesA Survey of Modern Data Classification Techniques
A Survey of Modern Data Classification Techniques
ijsrd.com
 

Semelhante a Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for Student�s Absenteeism at the Undergraduate Level in a College for an Academic Year (20)

A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
 
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
Predicting Students Performance using K-Median Clustering
Predicting Students Performance using  K-Median ClusteringPredicting Students Performance using  K-Median Clustering
Predicting Students Performance using K-Median Clustering
 
Classfication Basic.ppt
Classfication Basic.pptClassfication Basic.ppt
Classfication Basic.ppt
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
 
A Survey of Modern Data Classification Techniques
A Survey of Modern Data Classification TechniquesA Survey of Modern Data Classification Techniques
A Survey of Modern Data Classification Techniques
 
winbis1005
winbis1005winbis1005
winbis1005
 
L016136369
L016136369L016136369
L016136369
 
Scalable decision tree based on fuzzy partitioning and an incremental approach
Scalable decision tree based on fuzzy partitioning and an  incremental approachScalable decision tree based on fuzzy partitioning and an  incremental approach
Scalable decision tree based on fuzzy partitioning and an incremental approach
 
Deployment of ID3 decision tree algorithm for placement prediction
Deployment of ID3 decision tree algorithm for placement predictionDeployment of ID3 decision tree algorithm for placement prediction
Deployment of ID3 decision tree algorithm for placement prediction
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Hx3115011506
Hx3115011506Hx3115011506
Hx3115011506
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
A046010107
A046010107A046010107
A046010107
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms Comparison
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms Comparison
 
Chapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.pptChapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.ppt
 

Mais de ijcnes

Holistic Forecasting of Onset of Diabetes through Data Mining Techniques
Holistic Forecasting of Onset of Diabetes through Data Mining TechniquesHolistic Forecasting of Onset of Diabetes through Data Mining Techniques
Holistic Forecasting of Onset of Diabetes through Data Mining Techniques
ijcnes
 
Secured Seamless Wi-Fi Enhancement in Dynamic Vehicles
Secured Seamless Wi-Fi Enhancement in Dynamic VehiclesSecured Seamless Wi-Fi Enhancement in Dynamic Vehicles
Secured Seamless Wi-Fi Enhancement in Dynamic Vehicles
ijcnes
 

Mais de ijcnes (20)

A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...
 
Economic Growth of Information Technology (It) Industry on the Indian Economy
Economic Growth of Information Technology (It) Industry on the Indian EconomyEconomic Growth of Information Technology (It) Industry on the Indian Economy
Economic Growth of Information Technology (It) Industry on the Indian Economy
 
An analysis of Mobile Learning Implementation in Shinas College of Technology...
An analysis of Mobile Learning Implementation in Shinas College of Technology...An analysis of Mobile Learning Implementation in Shinas College of Technology...
An analysis of Mobile Learning Implementation in Shinas College of Technology...
 
A Survey on the Security Issues of Software Defined Networking Tool in Cloud ...
A Survey on the Security Issues of Software Defined Networking Tool in Cloud ...A Survey on the Security Issues of Software Defined Networking Tool in Cloud ...
A Survey on the Security Issues of Software Defined Networking Tool in Cloud ...
 
Challenges of E-government in Oman
Challenges of E-government in OmanChallenges of E-government in Oman
Challenges of E-government in Oman
 
Power Management in Micro grid Using Hybrid Energy Storage System
Power Management in Micro grid Using Hybrid Energy Storage SystemPower Management in Micro grid Using Hybrid Energy Storage System
Power Management in Micro grid Using Hybrid Energy Storage System
 
Holistic Forecasting of Onset of Diabetes through Data Mining Techniques
Holistic Forecasting of Onset of Diabetes through Data Mining TechniquesHolistic Forecasting of Onset of Diabetes through Data Mining Techniques
Holistic Forecasting of Onset of Diabetes through Data Mining Techniques
 
A Survey on Disease Prediction from Retinal Colour Fundus Images using Image ...
A Survey on Disease Prediction from Retinal Colour Fundus Images using Image ...A Survey on Disease Prediction from Retinal Colour Fundus Images using Image ...
A Survey on Disease Prediction from Retinal Colour Fundus Images using Image ...
 
Feature Extraction in Content based Image Retrieval
Feature Extraction in Content based Image RetrievalFeature Extraction in Content based Image Retrieval
Feature Extraction in Content based Image Retrieval
 
Challenges and Mechanisms for Securing Data in Mobile Cloud Computing
Challenges and Mechanisms for Securing Data in Mobile Cloud ComputingChallenges and Mechanisms for Securing Data in Mobile Cloud Computing
Challenges and Mechanisms for Securing Data in Mobile Cloud Computing
 
Detection of Node Activity and Selfish & Malicious Behavioral Patterns using ...
Detection of Node Activity and Selfish & Malicious Behavioral Patterns using ...Detection of Node Activity and Selfish & Malicious Behavioral Patterns using ...
Detection of Node Activity and Selfish & Malicious Behavioral Patterns using ...
 
Optimal Channel and Relay Assignment in Ofdmbased Multi-Relay Multi-Pair Two-...
Optimal Channel and Relay Assignment in Ofdmbased Multi-Relay Multi-Pair Two-...Optimal Channel and Relay Assignment in Ofdmbased Multi-Relay Multi-Pair Two-...
Optimal Channel and Relay Assignment in Ofdmbased Multi-Relay Multi-Pair Two-...
 
An Effective and Scalable AODV for Wireless Ad hoc Sensor Networks
An Effective and Scalable AODV for Wireless Ad hoc Sensor NetworksAn Effective and Scalable AODV for Wireless Ad hoc Sensor Networks
An Effective and Scalable AODV for Wireless Ad hoc Sensor Networks
 
Secured Seamless Wi-Fi Enhancement in Dynamic Vehicles
Secured Seamless Wi-Fi Enhancement in Dynamic VehiclesSecured Seamless Wi-Fi Enhancement in Dynamic Vehicles
Secured Seamless Wi-Fi Enhancement in Dynamic Vehicles
 
Virtual Position based Olsr Protocol for Wireless Sensor Networks
Virtual Position based Olsr Protocol for Wireless Sensor NetworksVirtual Position based Olsr Protocol for Wireless Sensor Networks
Virtual Position based Olsr Protocol for Wireless Sensor Networks
 
Mitigation and control of Defeating Jammers using P-1 Factorization
Mitigation and control of Defeating Jammers using P-1 FactorizationMitigation and control of Defeating Jammers using P-1 Factorization
Mitigation and control of Defeating Jammers using P-1 Factorization
 
An analysis and impact factors on Agriculture field using Data Mining Techniques
An analysis and impact factors on Agriculture field using Data Mining TechniquesAn analysis and impact factors on Agriculture field using Data Mining Techniques
An analysis and impact factors on Agriculture field using Data Mining Techniques
 
A Study on Code Smell Detection with Refactoring Tools in Object Oriented Lan...
A Study on Code Smell Detection with Refactoring Tools in Object Oriented Lan...A Study on Code Smell Detection with Refactoring Tools in Object Oriented Lan...
A Study on Code Smell Detection with Refactoring Tools in Object Oriented Lan...
 
Priority Based Multi Sen Car Technique in WSN
Priority Based Multi Sen Car Technique in WSNPriority Based Multi Sen Car Technique in WSN
Priority Based Multi Sen Car Technique in WSN
 
Semantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based SystemSemantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based System
 

Último

DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 

Último (20)

DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 

Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for Student�s Absenteeism at the Undergraduate Level in a College for an Academic Year

  • 1. Integrated Intelligent Research(IIR) International Journal of Business Intelligent Volume: 04 Issue: 02 December 2015,Pages No.62- 68 ISSN: 2278-2400 62 Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for Student’s Absenteeism at the Undergraduate Level in a College for an Academic Year G. Suresh1 , K. Arunmozhi Arasan2 , S. Muthukumaran3 1 Assistant Professor, PG and Research Dept. of Computer Applications, St. Joseph’s College of Arts and Science (Autonomous), Cuddalore. 2 HOD and Assistant Professor, Department of Computer Science and Applications, Siga College of Management and Computer Science, Villupuram. 3 Assistant Professor, Department of Computer Science and Applications, Siga College of Management and Computer Science, Villupuram. Email:sureshg2233@yahoo.co.in, arunlucks@yahoo.co.in, muthulecturer@rediffmail.com Abstract- Educational data mining is used to study the data available in the educational field and bring out the hidden knowledge from it. Classification methods like decision trees, rule mining can be applied on the educational data for predicting the students behavior. This paper focuses on finding thesuitablealgorithm which yields the best result to find out the reason behind students absenteeism in an academic year. The first step in this processis to gather students data by using questionnaire.The datais collected from 123 under graduate students from a private college which is situated in a semi- rural area. The second step is to clean the data which is appropriate for mining purpose and choose the relevant attributes. In the final step, three different Decision tree induction algorithms namely, ID3(Iterative Dichotomiser), C4.5 and CART(Classification and Regression Tree)were applied for comparison of results for the same data sample collected using questionnaire. The results were compared to find the algorithm which yields the best result in predicting the reason for student s absenteeism. Keywords: Data Mining, Decision Tree Induction, ID3, C4.5 and CARTalgorithm. I. INTRODUCTION Currently many educational institutions especially small- medium education institutions are facing problems with the lack of attendance among the students[1]. The students who possess less than 80% percentage of attendance will not be permitted by the concerned universities to appear for the semester exams. In the recent years all educational institutions are facing this lack of attendance problem[2]. Hence,this research aims to find suitable decision tree algorithm in predicting the reason for student lack of attendance. II. LITERATURE SURVEY Ekkachai Naenudorn and Jatsada Singthongchaip resented[3,4] their study on student recruiting on higher education institutions. The objectives of this study are to test the validity of the model derived fromdecision rules and to find the right algorithm for data classification task. From comparison of 4 algorithms; J48,Id3, Naïve Bayes and OneR, with the goal to predict the features of students who are likely to undergo thestudent admission process.Sunita B. Aher and Lobo L.M.R.J[5]presented their paper in that they compare the five classification algorithm to choose the best classification algorithm for CourseRecommendation system. These five classification algorithms are ADTree, Simple Cart, J48, ZeroR& Naive Bays Classification Algorithm. 
They compare these six algorithms using open source data mining tool Weka& present the result[6].DursunDelen, Glenn Walker and AmitKadam[7]presented a paper on predicting the breast cancer survivability; they used two popular data mining algorithms (artificial neural networks and decision trees) along with a most commonly used logistic regression method to develop the prediction models using a large data set[8]. They also used 10-fold cross-validation methods for performance comparison purposes and the results indicated that the decision tree (C5) is the best predictor with 93.6% accuracy on the holdout sample. III. BACKGROUND KNOWLEDGE Decision tree induction is the learning of decision trees from class-labeled training tuples[9]. A decision tree is a flow chart like tree structure, where each internal node(non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node(or terminal node) holds a class label. The topmost node in a tree is the root node. A. Decision Tree Induction Algorithm During the late 1970s and early 1980s J.Ross Quinlan a researcher in machine learning developed a decision tree algorithm known as ID3(Iterative Dichotomiser)[10]. ID3 adopt a greedy(i.e. nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide- and-conquer manner. A basic decision tree algorithm is summarized below. Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D. Input:  Data partition, D, which is a set of training tuples and their associated class lables.
  • 2. Integrated Intelligent Research(IIR) International Journal of Business Intelligent Volume: 04 Issue: 02 December 2015,Pages No.62- 68 ISSN: 2278-2400 63  Attribute_list, the set of candidate attributes.  Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and possibly either a split point or splitting subset. Output: A decision tree Method: 1. Create a node N: 2. If tuples in D are all of the same class, C then 3. Return N as a leaf node labeled with the class C 4. Ifattribute_list is empty then 5. Return N as a leaf node labeled with the majority class in D; //majority voting 6. Apply Attribute_selection_method 7. (D, attribute_list) to find the “best” splitting_criterion”; 8. Label node N with splitting_criterion; 9. If splitting_attribute is discrete-valued and multiway splits allowed then //not restricted to binary trees 10. Attribute_listattribute_list-splitting attribute; //remove splitting_attribute 11. For each outcome j of splitting_criterion //partition the tuples and grow subtrees for each partition 12. Let Dj be the set of data tuples in D satisfying outcome j; //a partition 13. If Dj is empty then 14. Attach a leaf labeled with the majority class in D to node N; 15. Else attach the node returned by Generate_decision_tree(Dj,attribute_list)to node N; End for 16. Return N; B. Information Gain Entropy: Entropy[11] uses information gain as its attributes selection measure. The attribute with highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify an example in dataset D is given by Info(D)= -∑ 𝑝𝑖𝑙𝑜𝑔2(𝑝𝑖) 𝑚 𝑖=1 Partitioning to produce an exact classification of the examples was done by InfoA(D)= ∑ |𝐷𝑗| |𝐷| 𝑣 𝑗=1 * Info(Dj) The term |D1|/|D| acts as the weight of the jth partition Infoi(D) is the expressed information required to classify an example from dataset D based on the partitioning by A[12]. The information gain is defined as the difference between the original information requirement and the new requirement that is. Gain (A) =Info (D) -InfoA(D) The attribute A with the highest information gain (Gain (A)) is chosen as the splitting attribute at node N. C. C4.5 Algorithm GAIN RATIO The Gain ratio[13] applies a kind of normalization to information gain using a split information value defined analogously with Info(D) as SplitInfoA(D)=-∑ |𝐷𝑗| |𝐷| 𝑣 𝑗=1 ∗ log2( |𝐷𝑗| |𝐷| ) Gain ratio differs from Information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as GainRatio(A)= 𝐺𝑎𝑖𝑛(𝐴) 𝑆𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜(𝐴) The attribute with the maximum gain ratio is selected as the splitting attribute. D. C4.5 The C4.5 is the extension of the ID3 algorithm[15]. It uses the gain ratio as additional information to select the best attribute to act as root node which has the maximum gain ratio.The general algorithm for building the decision tree is: Algorithm 1. Check for the base cases 2. For each attribute a A). Find the normalized information ratio from splitting on a 3. Let a-best be the attribute with the highest normalized information gain. 4. Create a decision node that splits on a-best. 5. Recursively on the sub lists obtained by splitting on a- best, and add those nodes as children of node. E. 
Classification and Regression Tree(Cart) Classification and Regression Trees (CART) was introduced by Breiman et al. It is a statistical technique that can select from a large number of explanatory variables(x) those that are most important in determining the response variable (y) to be explained[16]. The CART steps can be summarized as follows. 1. Assign all objects to root node. 2. Split each explanatory variable at all its possible split points (that it is in between all the values observed for that variable in the considered node.) 3. For each split point, split the parent node into two child nodes by separating the objects with values lower and higher than the split point for the considered explanatory variable. 4. Select the variable and split point with the highest reduction of impurity. 5. Perform the split of the parent node into two child nodes according to the selected split point. 6. Repeat steps 2-5, using each node as a new parent node, until the tree has maximum size. 7. Prune the tree back using cross-validation to select the optimal sized tree. For a node with n objects the impurity is then defined as: Impurity=∑ (𝑦𝑖 − 𝑦 ̅) 𝑛 𝑖=1 2 The Gini index of a node with n objects and c possible classes is defined as: Gini= 1- ∑ ( 𝑛𝑗 𝑛 ) 𝑐 𝑗=1 2 IV. DATA COLLECTION AND SYSTEM FRAMEWORK
  • 3. Integrated Intelligent Research(IIR) International Journal of Business Intelligent Volume: 04 Issue: 02 December 2015,Pages No.62- 68 ISSN: 2278-2400 64 The data are collected from a private college at Ulundurpet in Villupuram district. Among the 123 students 85 were male and 38 were female candidates. Figure 1: System FrameWork A. Experimental Setup and Evaluation Tanagra is open source software used for data mining. It supports all the basic type of data formats like *.xls, *.txt etc. and it is very user friendly[17]. The data set is implemented in Tanagra.The analysis of the data is done on the basis of gender i.e. the reason why a male student take leave and the reason why a female student take leave to the college by applying Entropy and Information gain Table 1:Total records in our dataset MALE FEMALE TOTAL 85 38 123 Info(D)= -∑ 𝑝𝑖𝑙𝑜𝑔2(𝑝𝑖) 𝑚 𝑖=1 = ( 85 123 log2 85 123 - 38 123 log2 38 123 ) =0.892 Table 2:Total records for the attribute Job JOB MALE FEMALE TOTAL Strong agree 25 1 26 Agree 28 2 30 Neutral 6 8 14 Disagree 14 13 27 Strong Disagree 12 14 26 TOTAL 85 38 123 The gain for the attribute part time job is InfoA(D)= ∑ |𝐷𝑗| |𝐷| 𝑣 𝑗=1 * Info(Dj) = 26 123 ( 25 26 log2 25 26 - 1 26 log2 1 26 ) + 30 123 ( 28 30 log2 28 30 - 2 30 log2 2 30 ) + 14 123 ( 6 14 log2 6 14 - 8 14 log2 8 14 ) + 27 123 ( 14 27 log2 14 27 - 13 27 log2 13 27 ) + 26 123 ( 12 26 log2 12 26 - 14 26 log2 14 26 ) =0.050+0.086+0.112+0.219+0.210 =0.678 Gain(A)=Info(D)- InfoA(D) =0.892-0.678 =0.214 The ID3 algorithm uses the entropy and information gain values to select the root node. The information gain for the attribute job has the highest value in the database and the rest of the attributes has the least values. ‘ Figure 2:Decision Tree(ID3) in Tanagra The job attribute is taken as the root node of the tree and the 123 records are split by the branches of the job as strongly agree 26 records and Agree 30 records and Neutral 14 records and Disagree 27 records and Strongly Disagree 26 records and the process is applied for the each table and decision tree is formed as shown in figure3.C4.5 algorithm uses an extension to information known as gain ratio[18] and it applies a kind of normalization to information gain using a Split information value. The split information is calculated as below, SplitInfoA(D) = - ∑ |𝐷𝑗| |𝐷| 𝑣 𝑗=1 ∗ log2( |𝐷𝑗| |𝐷| ) = - 26 123 ∗ log2( 26 123 ) - 30 123 ∗ log2( 30 123 ) - 14 123 ∗ log2( 14 123 ) - 27 123 ∗ log2 ( 27 123 ) - 26 123 ∗ log2( 26 123 ) = 04740+0.4965+0.3568+0.4802+0.4740 = 2.2815 The Gain Ratio for the part time job attribute is GainRatio(A)= 𝐺𝑎𝑖𝑛(𝐴) 𝑆𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜(𝐴) = 0.214 2.2815 =0.0938 The results obtained from Tanagra using C4.5 algorithm is shown in the figure below. Here job attribute is taken as the rootnodebecauseit has the maximum gain ratio and it is used as the root node for the above decision tree. Figure 3:Decision Tree(C4.5) in Tanagra Gini index is used in the CART algorithm. The Giniindex value for the job attribute is calculated as below. Gini= 1- ∑ ( 𝑛𝑗 𝑛 ) 𝑐 𝑗=1 2 Gini(D) =1- ( 85 123 ) 2 -( 38 123 ) 2 = 0.428
  • 4. Integrated Intelligent Research(IIR) International Journal of Business Intelligent Volume: 04 Issue: 02 December 2015,Pages No.62- 68 ISSN: 2278-2400 65 Figure 4:Decision Tree(CART) in Tanagra The Gini for the attribute part time job is =( 26 123 ) [1 − ( 25 26 ) 2 − ( 1 26 ) 2 ] + ( 30 123 ) [1 − ( 28 30 ) 2 − ( 2 30 ) 2 ] + ( 14 123 ) [1 − ( 6 14 ) 2 − ( 8 14 ) 2 ] + ( 27 123 ) [1 − ( 14 27 ) 2 − ( 13 27 ) 2 ] + ( 26 123 ) [1 − ( 12 26 ) 2 − ( 14 26 ) 2 ] + =0.0156+0.0302+0.0553+0.1096+0.1050 =0.3157 Gini(job) = 0.428-0.3157 = 0.1123 The attribute that maximizes the reduction impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute. Here, job attribute has the minimum Gini index value in the database. B. Performance Measures The confusion matrix is a tool for the analysis of a classifier. Given m classes, a confusion matrix is a table of at least size m by m. An entry CMi, j in the first m rows and m columns indicates the number of tuples of classithat were labelled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tupleswould be represented along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m with the rest of the entries being to close to zero. Table 3: Confusion Matrix The performance of the ID3, C4.5 and CART decision tree classification algorithms were evaluated by Table 4: Confusion Matrix of C4.5 for our dataset the accuracy % on the student leave data set[19]. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. Accuracy (M)= 𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁 For our data set usingC4.5 algorithm we have, Accuracy (M) = 81+29 81+29+9+4 = 0.8943 A true positive(TP) occurs when a classifier correctly classified class1. A true negative (TN) occurs when a classifier correctly classified class2. A false positive (FP) occurs when a classifier incorrectly classified class1. A false negative (FN) occurs when a classifier incorrectly classified class2. Error Rate = 1- Accuracy (M) For our data set usingC4.5 algorithm we have, Error Rate = 1 - 0.8943 = 0.1057 The sensitivity measures the proportion of the actual positives which are correctly identified. Sensitivity = 𝑇𝑃 𝑇𝑃+𝐹𝑁 For our data set usingC4.5 algorithm we have, Sensitivity = 81 81+4 = 0.9529 The specificity measures the proportion of negatives which are correctly identified. Specificity = 𝑇𝑁 𝑇𝑁+𝐹𝑃 For our data set usingC4.5 algorithm we have, Specificity = 29 29+9 = 0.7632 Precision is the percentage of true positives (TP) compared to the total number of cases classified as positive events[13]. Precision = 𝑇𝑃 𝑇𝑃+𝐹𝑃 1-Precision = 1 − 𝑇𝑃 𝑇𝑃+𝐹𝑃 For our data set usingC4.5 algorithm we have, Precision = 81 81+9 = 0.9000 1-Precision = 1- 0.9000 = 0.1000 The following table values are calculated by using the confusion matrix obtained from Tanagra tool. Table 5: Comparison values of the three Decision Tree Algorithms Measures ID3 C-RT C4.5 Accuracy 0.7317 0.8211 0.8943 Error Rate 0.2683 0.1789 0.1057 Recall 0.8706 0.9176 0.9529 Specificity 0.4211 0.6053 0.7632 Precision 0.7708 0.8387 0.9000 1-Precision 0.2292 0.1613 0.1000 The table4 shows that the C4.5 algorithm has the highest accuracy rate (89.43%) and lowest error rate (10.57%) among the three algorithms. The ID3 and C4.5 has the accuracy rates 73.17% and 82.11% respectively. C1 C2 C1 C2 True Positives False Positives False Negatives True Negatives Male Female sum Male 81 4 85 Female 9 29 38 90 33 123
The C4.5 algorithm also gave the highest specificity (0.7632, i.e. 76.32% of negative-class tuples correctly classified) and the highest sensitivity (0.9529, i.e. 95.29% of positive-class tuples correctly classified).

C. Rule Extraction From C4.5 Decision Tree
To extract rules from a decision tree, one rule is created for each path from the root to a leaf node [20]. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (IF part), and the leaf node holds the class prediction, forming the rule consequent (THEN part):

IF condition THEN conclusion

Using this method, the following rules are extracted from the decision tree built with the C4.5 algorithm; a small sketch of the extraction procedure is given after the rules.

R1: IF (job = strongly agree) THEN (gender = male) [26/123]
    Coverage(R1) = 26/123 = 21.13 %
R2: IF (job = agree) THEN (gender = male) [30/123]
    Coverage(R2) = 30/123 = 24.39 %
R3: IF (job = neutral) ∧ (staff problem) THEN (gender = male) [6/123]
    Coverage(R3) = 6/123 = 4.87 %
R4: IF (job = neutral) ∧ (staff problem) THEN (gender = female) [8/123]
    Coverage(R4) = 8/123 = 6.50 %
R5: IF (job = disagree) ∧ (occasionally) THEN (gender = male) [12/123]
    Coverage(R5) = 12/123 = 9.75 %
R6: IF (job = disagree) ∧ (occasionally) THEN (gender = female) [15/123]
    Coverage(R6) = 15/123 = 12.19 %
R7: IF (job = strongly disagree) ∧ (accident) THEN (gender = male) [16/123]
    Coverage(R7) = 16/123 = 13.01 %
R8: IF (job = strongly disagree) ∧ (accident) THEN (gender = female) [10/123]
    Coverage(R8) = 10/123 = 8.13 %
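The extraction step can be sketched in a few lines of Python. The nested-dictionary tree below is a hypothetical, partial encoding of the C4.5 tree (only three of the five job branches are included); Tanagra does not expose the tree in this form, so the structure and names are ours for illustration only.

```python
# Hypothetical partial encoding of the C4.5 tree: internal nodes carry the
# splitting attribute and branches, leaves carry the predicted class and count.
tree = {
    "attribute": "job",
    "branches": {
        "strongly agree": {"class": "male", "count": 26},
        "agree": {"class": "male", "count": 30},
        "neutral": {
            "attribute": "staff problem",
            "branches": {
                "agree": {"class": "male", "count": 6},
                "disagree": {"class": "female", "count": 8},
            },
        },
    },
}

def extract_rules(node, conditions=(), total=123):
    """Walk every root-to-leaf path; AND the splits into the IF part and
    take the leaf class as the THEN part, reporting coverage = count/total."""
    if "class" in node:                                   # leaf node -> emit one rule
        antecedent = " AND ".join(conditions) or "TRUE"
        coverage = 100.0 * node["count"] / total
        yield f"IF {antecedent} THEN gender = {node['class']} [{coverage:.2f}% coverage]"
        return
    for value, child in node["branches"].items():         # internal node -> recurse
        yield from extract_rules(child, conditions + (f"{node['attribute']} = {value}",), total)

for rule in extract_rules(tree):
    print(rule)
```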
D. Decision Tree Obtained Using Tanagra Tool

Table 6: Decision trees obtained from the Tanagra tool

1. ID3
   - Job in [B]
     - examstudy in [A] then Gender = A (85.71 % of 7 examples)
     - examstudy in [C] then Gender = A (0.00 % of 0 examples)
     - examstudy in [B] then Gender = A (100.00 % of 10 examples)
     - examstudy in [D] then Gender = A (100.00 % of 11 examples)
     - examstudy in [E] then Gender = A (50.00 % of 2 examples)
   - Job in [E]
     - Fess in [B] then Gender = B (100.00 % of 2 examples)
     - Fess in [E] then Gender = A (54.55 % of 11 examples)
     - Fess in [A] then Gender = B (54.55 % of 11 examples)
     - Fess in [C] then Gender = A (50.00 % of 2 examples)
     - Fess in [D] then Gender = A (0.00 % of 0 examples)
   - Job in [A]
     - Dept in [BCA] then Gender = A (93.33 % of 15 examples)
     - Dept in [BBA] then Gender = A (100.00 % of 10 examples)
     - Dept in [BCOM] then Gender = A (100.00 % of 1 example)
   - Job in [C] then Gender = B (57.14 % of 14 examples)
   - Job in [D] then Gender = A (51.85 % of 27 examples)

2. C4.5
   - Job in [B] then Gender = A (93.33 % of 30 examples)
   - Job in [E]
     - Accident in [E] then Gender = B (85.71 % of 7 examples)
     - Accident in [B] then Gender = A (100.00 % of 2 examples)
     - Accident in [D] then Gender = B (100.00 % of 1 example)
     - Accident in [A] then Gender = A (64.29 % of 14 examples)
     - Accident in [C] then Gender = B (100.00 % of 2 examples)
   - Job in [A] then Gender = A (96.15 % of 26 examples)
   - Job in [C]
     - Stafproblm in [E] then Gender = A (83.33 % of 6 examples)
     - Stafproblm in [A] then Gender = B (100.00 % of 1 example)
     - Stafproblm in [C] then Gender = A (0.00 % of 0 examples)
     - Stafproblm in [D] then Gender = B (85.71 % of 7 examples)
     - Stafproblm in [B] then Gender = A (0.00 % of 0 examples)
   - Job in [D]
     - Occasionaly in [E] then Gender = A (100.00 % of 6 examples)
     - Occasionaly in [C] then Gender = A (100.00 % of 3 examples)
     - Occasionaly in [D] then Gender = B (86.67 % of 15 examples)
     - Occasionaly in [A] then Gender = A (100.00 % of 2 examples)
     - Occasionaly in [B] then Gender = A (100.00 % of 1 example)

3. CART
   - Job in [B,A] then Gender = A (92.86 % of 26 examples)
   - Job in [E,C,D]
     - misbus in [B,D] then Gender = B (83.33 % of 3 examples)
     - misbus in [E,C,A] then Gender = A (63.64 % of 12 examples)

V. RESULTS AND DISCUSSION
From the above computations and the decision trees obtained with the Tanagra tool, the following observations follow from the C4.5 decision tree:
 - 56 male students agree or strongly agree that they take leave because they go to a part-time job.
 - 6 male and 8 female students who were neutral about going to a job state that they take leave because of problems with the staff.
 - 12 male and 15 female students who disagree about going to a job state that they occasionally take leave for no particular reason.
 - 16 male and 10 female students who strongly disagree about going to a job state that they take leave when they have met with an accident.
When the same procedure is carried out to predict the reason for student absenteeism using the ID3 and CART decision tree algorithms, C4.5 shows relatively higher performance than both.

VI. CONCLUSION
The results obtained from Tanagra clearly show that the C4.5 algorithm performs better than the other two decision tree algorithms, ID3 and CART. Hence, for classifying a small data set such as this one, C4.5 is the best choice among the three decision tree algorithms.

REFERENCES
[1] Waheed, T., Bonnell, R. B., Prasher, S. O., & Paulet, E., "Measuring performance in precision agriculture: CART—A decision tree approach", Agricultural Water Management, Vol. 84, Issue 1, 2006, pp. 173-185.
[2] Naenudorn, E., Singthongchai, J., Kerdprasop, N., & Kerdprasop, K., "Classification model induction for student recruiting", Latest Advances in Educational Technologies, 2012, pp. 117-122.
[3] Aher, S. B., & Lobo, L. M. R. J., "Comparative Study of Classification Algorithms", International Journal of Information Technology, Vol. 5, Issue 2, 2012, pp. 239-243.
[4] Hsieh, A. R., Hsiao, C. L., Chang, S. W., Wang, H. M., & Fann, C. S., "On the use of multifactor dimensionality reduction (MDR) and classification and regression tree (CART) to identify haplotype–haplotype interactions in genetic studies", Genomics, Vol. 9, Issue 2, 2012, pp. 77-85.
[5] Delen, D., Walker, G., & Kadam, A., "Predicting breast cancer survivability: a comparison of three data mining methods", Artificial Intelligence in Medicine, Vol. 34, Issue 2, 2005, pp. 113-127.
[6] Kumar, S. A., & Vijayalakshmi, M. N., "Efficiency of decision trees in predicting student's academic performance", In First International Conference on Computer Science, Engineering and Applications, CS and IT, Vol. 2, 2011, pp. 335-343.
[7] Han, J., Kamber, M., & Pei, J., "Data Mining, Southeast Asia Edition: Concepts and Techniques", Morgan Kaufmann, 2006.
[8] Sun, H., "Research on Student Learning Result System based on Data Mining", IJCSNS, Vol. 10, Issue 4, 2010, p. 203.
[9] Sehgal, L., Mohan, N., & Sandhu, P. S., "Quality prediction of function based software using decision tree approach", In International Conference on Computer Engineering and Multimedia Technologies (ICCEMT), 2012, pp. 43-47.
[10] Hsu, C.-C., Huang, Y.-P., & Chang, K.-W., "Extended Naive Bayes classifier for mixed data", Expert Systems with Applications, Vol. 35, Issue 3, 2008, pp. 1080-1083.
[11] Questier, F., "The use of CART and multivariate regression trees for supervised and unsupervised feature selection", Chemometrics and Intelligent Laboratory Systems, Vol. 76, Issue 1, 2005, pp. 45-54.
[12] Zighed, D. A., "Decision trees with optimal joint partitioning", International Journal of Intelligent Systems, Vol. 20, Issue 7, 2005, pp. 693-718.
[13] Deconinck, E., "Classification trees based on infrared spectroscopic data to discriminate between genuine and counterfeit medicines", Journal of Pharmaceutical and Biomedical Analysis, Vol. 57, 2012, pp. 68-75.
[14] Sanders, P., "Analysis of nearest neighbor load balancing algorithms for random loads", Parallel Computing, Vol. 25, Issue 8, 1999, pp. 1013-1033.
[15] García, S., "Evolutionary-based selection of generalized instances for imbalanced classification", Knowledge-Based Systems, Vol. 25, Issue 1, 2012, pp. 3-12.
[16] Mohamed, M. H., "Rules extraction from constructively trained neural networks based on genetic algorithms", Neurocomputing, Vol. 74, Issue 17, 2011, pp. 3180-3192.
[17] Wang, L. M., Li, X. L., Cao, C. H., & Yuan, S. M., "Combining decision tree and Naive Bayes for classification", Knowledge-Based Systems, Vol. 19, Issue 7, 2006, pp. 511-515.
[18] Talaie, A., Esmaili, N., Taguchi, T., Romagnoli, J. A., & Talaie, F., "Application of an engineering algorithm with software code (C4.5) for specific ion detection in the chemical industries", Advances in Engineering Software, Vol. 30, Issue 1, 1999, pp. 13-19.
[19] Lu, S. H., Chiang, D. A., Keh, H. C., & Huang, H. H., "Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values", Knowledge-Based Systems, Vol. 23, Issue 6, 2010, pp. 598-604.
[20] Vlahou, A., "Diagnosis of ovarian cancer using decision tree classification of mass spectral data", BioMed Research International, Vol. 5, 2003, pp. 308-314.