10BM60080 - Weka Term Paper

Weka Term Paper
Submission
ITB Assignment

Sathiyaseelan M
10BM60080

Table of Contents
1. Classification via Decision Trees
1.1 Car Evaluation Database
1.2 J48 pruned classification tree
1.3 Summary of Results
1.4 Simplified Decision Tree
1.5 Test Set
2 K-Means Clustering
2.1 Bank Database
2.2 Summary of Results
2.3 Cluster Explanation

1. Classification via Decision Trees

The Car Evaluation Database describes cars by six attributes: buying price, maintenance price,
number of doors, number of persons, size of the luggage boot and safety. Certain attributes related
to structural information are removed to simplify the analysis. Because the underlying concept
structure is known, this database may be particularly useful for testing constructive induction and
structure discovery methods.

1.1 Car Evaluation Database
This model evaluates cars according to the following concept structure.

PRICE
  buying    buying price
  maint     price of the maintenance
TECHNICAL CHARACTERISTICS
  ... (removed for simplification of analysis)
COMFORT
  doors     number of doors
  persons   capacity in terms of persons to carry
  lug_boot  the size of luggage boot
SAFETY
  safety    estimated safety of the car

Number of Instances: 1728

Attribute Values

buying    v-high, high, med, low
maint     v-high, high, med, low
doors     2, 3, 4, 5-more
persons   2, 4, more
lug_boot  small, med, big
safety    low, med, high

class     N      N[%]         meaning
---------------------------------------------
unacc     1210   (70.023 %)   Unacceptable
acc        384   (22.222 %)   Acceptable
good        69   ( 3.993 %)   Good
v-good      65   ( 3.762 %)   Very Good
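
The percentage column follows directly from the class counts. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative and not part of the Weka output.

```python
# Class counts as listed in the class distribution table.
counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}

total = sum(counts.values())  # total number of instances
shares = {label: 100.0 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<7} {counts[label]:>5}  ({pct:6.3f} %)")
```

The distribution is heavily skewed: a classifier that always predicted unacc would already be right for about 70% of the instances, which is a useful baseline when reading the accuracy figures.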

1.2 J48 pruned classification tree

J48 (Weka's implementation of the C4.5 algorithm) is used for classification.
Test mode: 10-fold cross-validation, with the minimum number of objects per leaf set to 2.

| | | maint = med: acc (12.0/1.0)
| | | maint = low: acc (12.0/1.0)
| | buying = high
| | | maint = high: acc (12.0/1.0)
| | | maint = med: acc (12.0/1.0)
| | | maint = low: acc (12.0/1.0)
| | buying = med
| | | maint = vhigh: acc (12.0/1.0)
| | | maint = high: acc (12.0/1.0)
| | | maint = med
| | | | lug_boot = small: acc (4.0/1.0)
| | | | lug_boot = med: vgood (4.0/1.0)
| | | maint = low
| | | | lug_boot = small: good (4.0/1.0)
| | buying = low
| | | maint = vhigh: acc (12.0/1.0)
| | | maint = high
| | | | lug_boot = small: acc (4.0/1.0)
| | | maint = med
| | | maint = low

Number of Leaves : 131
Size of the tree : 182
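
The 10-fold cross-validation used as the test mode can be sketched as follows. This is only an illustration of how 1728 instances are divided into ten train/test splits; the function name `k_fold_indices` and the seed are hypothetical, and the sketch does not stratify the folds by class, which Weka's evaluation normally does.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Split instance indices into k disjoint test folds (not stratified)."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(1728, k=10)
for i, test_fold in enumerate(folds):
    train_size = 1728 - len(test_fold)
    print(f"fold {i}: train on {train_size}, test on {len(test_fold)}")
```

Each instance is used for testing exactly once, and the reported accuracy is the share of correct predictions pooled over all ten test folds.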

1.3 Summary of Results

Correctly Classified Instances        1596        92.3611 %
Incorrectly Classified Instances       132         7.6389 %
Kappa statistic                          0.8343
Mean absolute error                      0.0421
Root mean squared error                  0.1718
Relative absolute error                 18.3833 %
Root relative squared error             50.8176 %
Coverage of cases (0.95 level)          97.2222 %
Mean rel. region size (0.95 level)      29.1088 %
Total Number of Instances             1728

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 0.962    0.064      0.972   0.962      0.967     0.983  unacc
                 0.867    0.047      0.841   0.867      0.854     0.962  acc
                 0.609    0.011      0.689   0.609      0.646     0.918  good
                 0.877     0.01       0.77   0.877       0.82     0.995  vgood
Weighted Avg.    0.924    0.056      0.924   0.924      0.924     0.976

=== Confusion Matrix ===

                  Unacceptable (a)  Acceptable (b)  Good (c)  Very Good (d)
Unacceptable (a)              1164              43         3              0
Acceptable (b)                  33             333        11              7
Good (c)                         0              17        42             10
Very Good (d)                    0               3         5             57

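Rows are the actual classes and columns are the predicted classes. The diagonal elements are the
correctly classified instances; all off-diagonal elements are misclassifications.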
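
The headline figures in the summary can be recomputed from the confusion matrix alone. The sketch below is a plain Python check, assuming rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the results: rows = actual, columns = predicted.
labels = ["unacc", "acc", "good", "vgood"]
matrix = [
    [1164, 43, 3, 0],
    [33, 333, 11, 7],
    [0, 17, 42, 10],
    [0, 3, 5, 57],
]

n = sum(sum(row) for row in matrix)
row_totals = [sum(row) for row in matrix]        # actual instances per class
col_totals = [sum(col) for col in zip(*matrix)]  # predicted instances per class
correct = sum(matrix[i][i] for i in range(len(labels)))

accuracy = correct / n
# Agreement expected by chance, from the row and column marginals.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
kappa = (accuracy - expected) / (1 - expected)

recall = [matrix[i][i] / row_totals[i] for i in range(len(labels))]
precision = [matrix[i][i] / col_totals[i] for i in range(len(labels))]

print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:<6} precision = {p:.3f}  recall = {r:.3f}")
```

The diagonal sums to 1596 of 1728 instances, which gives the reported 92.3611% accuracy and a kappa of 0.8343. The weakest class is good, where only 42 of 69 instances are recovered.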

1.4 Simplified Decision Tree

When the run was repeated with 10 folds and the minimum number of objects per leaf raised to 25 (to
simplify the classification tree), accuracy fell to 81.3079%. The reduction in accuracy is due to
the stricter minimum number of objects per leaf, which yields a smaller, coarser tree. The
simplified classification tree is shown below.

[Figure: simplified classification tree]

1.5 Test Set

Applied to the test set, the classifier correctly classified 94.87% of the instances.

2 K-Means Clustering

This example illustrates the use of k-means with Weka.

2.1 Bank Database

The sample data set for this example comes from a bank that records each customer's age, gender,
region type, income, marital status, number of children, car ownership and mortgage status. The
bank wants to find the savings pattern of its customers by age group. The data set has 600
instances and 8 attributes, with the corresponding values listed below.

@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {NO,YES}
@attribute children {0,1,2,3}
@attribute car {NO,YES}
@attribute mortgage {NO,YES}

2.2 Summary of Results

Number of iterations: 5
Within cluster sum of squared errors: 1201.3638013812113
Missing values globally replaced with mean/mode

Time taken to build model (full training data) : 0.08 seconds

Clustered Instances

0    163 ( 27%)
1    100 ( 17%)
2    159 ( 27%)
3    178 ( 30%)

The figure below plots customer age against income for the four clusters.

[Figure: age vs. income, by cluster]

The plot gives a glimpse of the clusters. Age and income appear to be significant variables in
determining the clusters.

2.3 Cluster Explanation

Cluster centroids are the mean vectors for each cluster: each dimension value in the centroid is
the mean value for that dimension in the cluster (for nominal attributes such as sex, the most
frequent value). Centroids can therefore be used to characterize the clusters. For example, a
centroid with Sex=Male for cluster 0 implies that the cluster is centered on the male population;
it does not imply that the cluster contains only males.
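
The point about centroids can be made concrete with a small sketch. It assumes that numeric attributes are summarized by their mean and nominal attributes by their most frequent value; the function name `centroid` and the three sample rows are hypothetical and not taken from the bank data.

```python
from collections import Counter
from statistics import mean

def centroid(rows, numeric_fields):
    """Mean for numeric fields, most frequent value (mode) for nominal ones."""
    result = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if field in numeric_fields:
            result[field] = mean(values)
        else:
            result[field] = Counter(values).most_common(1)[0][0]
    return result

# Hypothetical members of one cluster (not taken from the bank data).
cluster = [
    {"age": 33, "sex": "MALE", "car": "NO"},
    {"age": 36, "sex": "MALE", "car": "NO"},
    {"age": 36, "sex": "FEMALE", "car": "YES"},
]
summary = centroid(cluster, numeric_fields={"age"})
print(summary)
```

The summary reports sex as MALE and car as NO even though one of the three members is a woman who owns a car, which is why a centroid describes the typical member rather than every member.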

Cluster 0: Predominantly male, aged around 35, residing in the inner city, with no car and no
children. This cluster consists of men in the early stages of their careers.

Cluster 1: Predominantly female, aged around 53, residing in rural areas, with a car and children.
They also earn more than the other clusters. This cluster consists predominantly of women in their
fifties.

Cluster 2: Predominantly male, aged around 43, residing in the inner city, with a car and children.
They also earn more than cluster 0. This cluster consists predominantly of men in their forties,
further along in their careers.

Cluster 3: Predominantly female, aged around 40, residing in towns, with no car and no children.
They also earn less than the women in cluster 1. This cluster consists predominantly of women
around forty.
