Table of Contents
1. Classification via Decision Trees
1.1 Car Evaluation Database
1.2 J48 pruned classification tree
1.3 Summary of Results
1.4 Simplified Decision Tree
1.5 Test Set
2. K-Means Clustering
2.1 Bank Database
2.2 Summary of Results
2.3 Cluster Explanation
1. Classification via Decision Trees
The Car Evaluation Database contains data on six attributes: buying price, maintenance price,
number of persons, number of doors, safety, and size of the luggage boot. Certain attributes related to
structural information are removed to simplify the analysis. Because the underlying concept structure is
known, this database may be particularly useful for testing constructive induction and structure discovery
methods.
1.1 Car Evaluation Database
This model evaluates cars according to the following concept structure.
PRICE
  buying     buying price
  maint      price of the maintenance
TECHNICAL CHARACTERISTICS
  ……. (Removed for simplification of analysis)
  COMFORT
    doors      number of doors
    persons    capacity in terms of persons to carry
    lug_boot   the size of luggage boot
  SAFETY
    safety     estimated safety of the car
Number of Instances: 1728
Attribute Values
buying v-high, high, med, low
maint v-high, high, med, low
doors 2, 3, 4, 5-more
persons 2, 4, more
lug_boot small, med, big
safety low, med, high
class     N      N[%]         Meaning
---------------------------------------------
unacc     1210   (70.023 %)   Unacceptable
acc        384   (22.222 %)   Acceptable
good        69   ( 3.993 %)   Good
v-good      65   ( 3.762 %)   Very Good
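As a quick check on the distribution above, the percentages and the majority-class baseline can be recomputed from the counts. This is a minimal Python sketch; the variable names are illustrative and are not part of the original Weka analysis.

```python
# Class counts copied from the distribution table above.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}

total = sum(class_counts.values())  # total number of instances

# Share of each class as a percentage of all instances.
percentages = {label: 100.0 * n / total for label, n in class_counts.items()}

# A classifier that always predicts the most frequent class
# would reach exactly this accuracy.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = percentages[majority_label]

for label, pct in percentages.items():
    print(f"{label:7s} {class_counts[label]:5d} ({pct:6.3f} %)")
print(f"majority-class baseline: {majority_label} at {baseline_accuracy:.3f} %")
```

Because about 70% of the cars are unacceptable, any accuracy reported for a classifier on this data is best read against that baseline rather than against 0%.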
J48 (Weka's implementation of the C4.5 algorithm) is used for classification.
Test mode: 10-fold cross-validation, with the minimum number of objects per leaf set to 2.
=== Confusion Matrix ===
                    Unacceptable (a)   Acceptable (b)   Good (c)   Very Good (d)
Unacceptable (a)                1164               43          3               0
Acceptable (b)                    33              333         11               7
Good (c)                           0               17         42              10
Very Good (d)                      0                3          5              57
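Each row sums to the count of one actual class, so rows are actual classes and columns are predicted
classes. Diagonal elements are the correctly classified instances; all off-diagonal elements are
misclassifications.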
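The overall accuracy and per-class figures follow directly from the confusion matrix. The sketch below is a minimal Python illustration, assuming that rows are actual classes and columns are predicted classes (the row sums match the class counts listed earlier); the short label names are chosen here for convenience.

```python
# Confusion matrix copied from the table above.
# Assumption: rows = actual class, columns = predicted class.
labels = ["unacc", "acc", "good", "v-good"]
matrix = [
    [1164, 43, 3, 0],
    [33, 333, 11, 7],
    [0, 17, 42, 10],
    [0, 3, 5, 57],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal
accuracy = correct / total

# Recall: fraction of each actual class that was predicted correctly.
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(labels))}

# Precision: fraction of each predicted class that really belongs to it.
column_sums = [sum(row[j] for row in matrix) for j in range(len(labels))]
precision = {labels[j]: matrix[j][j] / column_sums[j] for j in range(len(labels))}

print(f"accuracy: {accuracy:.4%} ({correct} of {total})")
for name in labels:
    print(f"{name:7s} recall={recall[name]:.3f} precision={precision[name]:.3f}")
```

With these numbers, 1596 of 1728 instances lie on the diagonal, which is about 92.36% accuracy. The per-class values show that most errors occur between neighbouring classes, such as Good versus Acceptable.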
1.4 Simplified Decision Tree
When the run was repeated with the number of folds = 10 and the minimum number of objects = 25 (to
simplify the classification tree), it produced an accuracy of 81.3079%. The reduction in accuracy is due to
the higher minimum number of objects per leaf, which yields a smaller, coarser tree. Below is the
simplified version of the classification tree.
1.5 Test Set
When applied to the test set, the classifier correctly classified 94.87% of the instances.
2. K-Means Clustering
This example illustrates the use of k-means with Weka.
2.1 Bank Database
The sample data set used for this example comes from a bank that records each customer's age, gender,
region type, income, marital status, number of children, car ownership, and mortgage. The bank wants to
find the savings pattern of its customers of the age group. The data set has 600 instances and 8 attributes,
with the corresponding values listed below.
@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {NO,YES}
@attribute children {0,1,2,3}
@attribute car {NO,YES}
@attribute mortgage {NO,YES}
2.2 Summary of Results
Number of iterations: 5
Within cluster sum of squared errors: 1201.3638013812113
Missing values globally replaced with mean/mode
Time taken to build model (full training data) : 0.08 seconds
Clustered Instances
0 163 ( 27%)
1 100 ( 17%)
2 159 ( 27%)
3 178 ( 30%)
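The "Number of iterations" and "Within cluster sum of squared errors" values above come from the k-means loop itself. The sketch below is a minimal pure-Python version of that loop (Lloyd's algorithm) on a tiny hypothetical set of (age, income in thousands) points. It is not the Weka implementation: Weka's SimpleKMeans also handles nominal attributes and applies its own distance handling, so the numbers here are not comparable with the output above. All function names and data points are assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two numeric points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct starting points
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    # Within-cluster sum of squared errors for the final assignment.
    sse = sum(
        squared_distance(p, centroids[i])
        for i, members in enumerate(clusters)
        for p in members
    )
    return centroids, clusters, iteration, sse

# Hypothetical (age, income in thousands) points, NOT taken from the bank data.
points = [(25, 20.0), (27, 22.0), (30, 21.0), (52, 48.0), (55, 50.0), (58, 47.0)]
centroids, clusters, iterations, sse = k_means(points, k=2)
print("iterations:", iterations)
print("within-cluster SSE:", round(sse, 3))
for members in clusters:
    print(len(members), members)
```

A lower within-cluster sum of squared errors means the points sit closer to their centroids. The value always falls as k grows, so it is useful for comparing runs with the same k rather than for choosing k on its own.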
The figure below shows the plot of customer age vs. income for the various clusters.
The figure gives a glimpse of the clusters. It can be observed that age and income are significant
variables in determining the clusters.
2.3 Cluster Explanation
Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid
represents the mean value for that dimension in the cluster or, for a nominal attribute such as sex, its
most frequent value). Thus, centroids can be used to characterize the clusters. For example, Sex=Male in
the centroid of cluster 0 implies that this cluster is centered around the male population; it does not
imply that the cluster contains only males.
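The point about reading a centroid can be made concrete with a small example. The sketch below computes a centroid for a hypothetical cluster, using the mean for numeric attributes and the most frequent value (mode) for nominal ones. The rows and the `centroid` helper are assumptions made for illustration and are not taken from the bank data or from Weka.

```python
from collections import Counter
from statistics import mean

# Hypothetical cluster members; NOT rows from the bank data set.
members = [
    {"age": 33, "sex": "MALE", "income": 21000.0},
    {"age": 36, "sex": "MALE", "income": 24000.0},
    {"age": 35, "sex": "FEMALE", "income": 22500.0},
    {"age": 36, "sex": "MALE", "income": 20500.0},
]

def centroid(rows, numeric, nominal):
    """Summarize a cluster: mean of numeric attributes, mode of nominal ones."""
    summary = {name: mean(row[name] for row in rows) for name in numeric}
    for name in nominal:
        counts = Counter(row[name] for row in rows)
        summary[name] = counts.most_common(1)[0][0]
    return summary

center = centroid(members, numeric=["age", "income"], nominal=["sex"])
male_share = sum(row["sex"] == "MALE" for row in members) / len(members)

print(center)
print(f"share of males in the cluster: {male_share:.0%}")
```

Here the centroid reports sex as MALE even though one of the four members is female, which is why a centroid value describes where a cluster is centered and not what every member looks like.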
Cluster 0: Consists predominantly of males aged around 35 who reside in the inner city and have no car
or children. This cluster consists of men in the early stages of their careers.
Cluster 1: Consists predominantly of females aged around 53 who reside in rural areas and have a car
and children. They also earn more than the other clusters. This cluster predominantly consists of
women in their fifties.
Cluster 2: Consists predominantly of males aged around 43 who reside in the inner city and have a car
and children. They also earn more than cluster 0. This cluster predominantly consists of men in their
forties at a later stage of their careers.
Cluster 3: Consists predominantly of females aged around 40 who reside in town and have no car or
children. They also earn less than the women in cluster 1. This cluster predominantly consists of
women in their early forties.