Premeditated Initial Points for K-Means Clustering
PL. Chithra
Department of Computer Science
University of Madras
Chennai, India
chitrasp2001@yahoo.com
Jeyapriya.U
Department of Computer Science
University of Madras
Chennai, India
jeyapriya75@gmail.com
Abstract — K-means clustering is an iterative procedure that is highly sensitive to the choice of initial centroids. Because the initial centroids are chosen randomly, the resulting clusters can change from run to run. This paper addresses this problem by replacing the random selection with a premeditated selection of initial centroids. Using the iris, abalone and wine data sets, we demonstrate that computing the initial centroids with the proposed method and feeding them to the k-means algorithm improves the clustering performance. The clustering is also identical in every run, since the initial centroids are obtained through the premeditated method rather than random selection.
Keywords — clustering algorithm; centroids; unsupervised learning; k-means algorithm; proposed k-means algorithm
I. INTRODUCTION
Clustering algorithms help to organize data logically. Clustering is useful in many exploratory pattern-analysis, grouping, decision-making, and machine-learning situations. In many cases there is very little prior information available about the data, and the analyst must make as few initial assumptions as possible; clustering algorithms are then well suited for a preliminary analysis. Clustering is a fundamental data-analysis task, and the recent dramatic increase in the availability of data and computing technology has made it only more central. When data is clustered, elements of the same cluster share similar characteristics and differ widely from elements of other clusters. Various clustering methods are available: clustering algorithms can be partition based, hierarchical, grid based, density based or model based. The most popular, simple and efficient partition-based method is k-means clustering.
The k-means algorithm [3, 4] produces effective clusters for many practical applications, but the computational complexity of the original algorithm is very high, especially for large data sets. A further weakness is that the random choice of initial centroids yields different clusterings for the same data set. Several researchers have worked on improving the efficiency of the k-means algorithm. Clustering, done properly, helps in many areas, education in particular: educational institutions now hold vast amounts of data, and analysing it is important. Clustering plays a vital role here, as students can be grouped by their interests, their learning comforts and so on, and the teaching and learning process can be adapted to the clusters.
This paper presents a method for improving the efficiency of the k-means clustering algorithm. The rest of the paper is organized as follows. Section II reviews related work on k-means clustering, Section III describes the original k-means clustering algorithm, and Section IV presents the proposed k-means algorithm, followed by experimental results in Section V. The paper concludes by reiterating the performance improvement of the proposed k-means clustering algorithm.
II. RELATED WORK
Several attempts have been made by researchers to improve the effectiveness and efficiency of the k-means algorithm. One variant is the k-modes method, which replaces the means of clusters by modes and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function [5]. The k-prototypes algorithm [5] integrates the k-means and k-modes methods for clustering the data. Fang Yuan et al. [8] proposed a systematic method for finding the initial centroids: the distances between every pair of data points are evaluated first, and the initial centroids are constructed from data points that are similar. The centroids obtained by this method are consistent with the distribution of the data, so it produces clusters with better precision than the original k-means algorithm. Fahim A. M. et al. [9] proposed an efficient method for assigning data points to clusters. In Fahim's approach, the distance from each data point to its nearest cluster is recorded. At the next iteration, the distance to that previously nearest cluster is recomputed; if the new distance is less than or equal to the previous one, the point stays in its cluster and the distances to the other clusters need not be computed again. This method still presumes, however, that the initial centroids are determined randomly, as in the original k-means algorithm, so there is no guarantee of the accuracy of the final clusters.
Grigorios Tzortzis et al. [10] address the initialization problem of k-means with the MinMax k-means algorithm, a method that assigns weights to the clusters relative to their variance and optimizes a weighted version of the k-means objective. The weights are learned together with the cluster assignments through an iterative procedure. The proposed weighting scheme limits the emergence of large-variance clusters and allows high-quality solutions to be uncovered systematically, irrespective of the initialization. Guojun Gan et al. [11] propose the KMOR algorithm, which extends k-means to perform data clustering and outlier detection simultaneously.

International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
278 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
III. K-MEANS CLUSTERING ALGORITHM
K-means is a simple, unsupervised, partition-based clustering algorithm that groups similar members into the same cluster. K-means assumes that all data are available at a particular instance in time, which is no longer the case now that Big Data has evolved: data arrives continuously, and clustering must accommodate every new instance. Because the initial centroids are selected randomly in k-means clustering, the clustering has to be redone whenever a new instance arrives.
The k-means clustering algorithm is affected by two main factors, namely the choice of the initial centroids and the number of clusters. Several methods have been proposed in the literature to address the problems that affect the performance of k-means clustering.
The k-means clustering algorithm starts the process with k random centroids and consists of two phases. The first phase defines the k centroids. The second phase takes each data point of the given data set and associates it with the nearest centroid: the Euclidean distance between the data point and each centroid is calculated, and the point joins the cluster at minimum distance. Once all data points are included in one cluster or another, the centroids are recalculated. With the k new centroids, a new association between data points and centroids is created as before, and this continues until the centroids no longer change. Because the centroids are randomly chosen in the first step, the clustering differs from run to run.
Pseudo Code 1: K-means Clustering
Input:
D = {d1, d2, ..., dn} // the set of n data items
K // the number of desired clusters
Output: A set of K clusters
Start
Fix the number of clusters (K) to be formed
Randomly choose K centroids
Do
Assign each object to the nearest cluster
Recalculate the K centroids
While the centroids change
End
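Pseudo Code 1 can be sketched in Python, for instance with NumPy (an illustrative implementation, not the authors' code; the function name and signature are assumptions):

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=None):
    """Original k-means: random initial centroids, iterate until stable."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Randomly choose K distinct data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate the K centroids as the mean of each cluster
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # "while the centroids change"
            break
        centroids = new_centroids
    return labels, centroids
```

Because the initial centroids depend on the random seed, different seeds can yield different final clusterings, which is exactly the behaviour the proposed method removes.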
Figure 1 shows the clustering plot for the iris dataset using the original k-means.

The k-means algorithm is computationally expensive: it requires time proportional to the product of the number of data items, the number of clusters and the number of iterations.
IV. PROPOSED K-MEANS ALGORITHM

Instead of choosing the initial centroids randomly, a procedural way of finding the centres can be followed, so that the clustering remains the same for every run and the number of iterations is also reduced.
Pseudo Code 2: Proposed K-means Clustering
Input:
D = {d1, d2, ..., dn} // the set of n data items
K // the number of desired clusters
Output: A set of K clusters
Start
Determine the initial centroids of the clusters by using Pseudo Code 3
Do
Assign each object to the nearest cluster
Recalculate the K centroids
While the centroids change
End
Pseudo Code 3: Finding the initial centroids
Input:
D = {d1, d2, ..., dn} // the set of n data items
K // the number of desired clusters
Output: A set of K centroids
Start
Calculate the range (R) of the selected features of the dataset
G = R / K
Calculate the centroids as follows:
Centroid 1 (C1) = minimum observation + integer(G / 2)
For i = 2 to K:
Ci = Ci-1 + G
End
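A sketch of Pseudo Code 3 in Python (illustrative only; applying the rule feature-wise to every feature is an assumption, since the paper leaves the choice of "selected features" open):

```python
import numpy as np

def premeditated_centroids(data, k):
    """Spread k initial centroids evenly across the feature ranges,
    as in Pseudo Code 3, instead of picking them at random."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)              # minimum observation per feature
    g = (data.max(axis=0) - lo) / k    # gap G = R / K, with range R
    c1 = lo + np.floor(g / 2)          # C1 = minimum + integer(G / 2)
    return np.array([c1 + i * g for i in range(k)])  # Ci = C(i-1) + G
```

Since the centroids are a deterministic function of the data, every run starts from the same points and produces the same clustering.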
FIGURE 1. PLOT OF IRIS DATASET USING ORIGINAL K-MEANS

The above algorithm does not require many iterations, since the centroids are calculated well in the first stage and the grouping of data points does not change from run to run. Once the clustering of the data is over, a new instance can be fitted into the appropriate cluster from its attribute values, instead of redoing the entire clustering. As can be seen, the iris dataset required a minimum of 2 iterations under the original k-means, whereas the proposed k-means clustering algorithm converged in the first iteration. Figure 2 shows the clustering plot for the iris dataset using the proposed k-means.
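The incremental step described above, fitting a new instance into an existing cluster without re-clustering, might look like this (an illustrative sketch; the function name is an assumption):

```python
import numpy as np

def assign_new_instance(x, centroids):
    """Return the index of the existing cluster whose centroid is
    nearest (Euclidean distance) to the new data point x."""
    d = np.linalg.norm(np.asarray(centroids, dtype=float)
                       - np.asarray(x, dtype=float), axis=1)
    return int(d.argmin())
```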
Table 1 shows the actual cluster distribution in the different datasets.

TABLE 1. DETAILS OF IRIS, WINE AND ABALONE DATASETS

Dataset    Instances   Features   Cluster 1   Cluster 2   Cluster 3
Iris       150         2          50          50          50
Wine       150         13         59          71          20
Abalone    4177        9          2768        448         960
V. EXPERIMENTAL RESULTS

The proposed k-means approach was validated using a confusion matrix, to measure the clustering performance, and the silhouette value, to interpret and validate the consistency within the clusters. The silhouette technique gives a concise, visual representation of how well each object lies within its cluster.
The confusion matrices reveal that the proposed k-means is better than the original k-means. As the initial centroids are calculated rather than randomly selected, the clustering remains the same no matter how many times it is run. The number of iterations was also found to decrease in some cases.
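One plausible way to turn such a confusion matrix into the accuracy percentages reported later (Table 5) is to credit each true class to the cluster that captures most of its members; this is an assumption about how the figures were derived, not a method stated in the paper:

```python
import numpy as np

def clustering_accuracy(confusion):
    """Fraction of objects falling in the majority cluster of their
    true class (row maxima over the grand total)."""
    m = np.asarray(confusion, dtype=float)
    return m.max(axis=1).sum() / m.sum()
```

Applied to the iris matrices of Table 2, this gives 142/150 ≈ 94.7% for the original k-means and 144/150 = 96% for the proposed k-means, matching the iris rows of Table 5.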
Tables 2, 3 and 4 illustrate the performance of clustering with the original and the proposed k-means algorithm on the iris, wine and abalone datasets respectively. In every case the performance of the proposed k-means was better than that of the original k-means.
TABLE 2: PERFORMANCE IN IRIS DATASET

             ORIGINAL K-MEANS                  PROPOSED K-MEANS
             Cluster 1  Cluster 2  Cluster 3   Cluster 1  Cluster 2  Cluster 3
Setosa       50         0          0           50         0          0
Versicolor   0          48         2           0          48         2
Virginica    0          6          44          0          4          46
TABLE 3: PERFORMANCE IN WINE DATASET

     ORIGINAL K-MEANS      PROPOSED K-MEANS
     1     2     3         1     2     3
1    23    1     35        1     35    23
2    0     64    7         64    7     0
3    0     17    3         17    3     0
TABLE 4: PERFORMANCE IN ABALONE DATASET

     ORIGINAL K-MEANS         PROPOSED K-MEANS
     1      2      3          1      2      3
1    1648   0      1121       2274   495    0
2    0      364    596        0      699    261
3    448    0      0          448    0      0
VI. CONCLUSION

One of the problems in k-means clustering is the random selection of the initial centroids: because the clustering depends on them, it changes from one run to another. In this work we have proposed a better way of calculating the initial centroids for k-means clustering. The proposed algorithm keeps the clustering stable, since the initial centroids are calculated rather than randomly selected. The experimental results showed that the proposed k-means algorithm works well in terms of accuracy, and the clustering is identical over several runs, which is not the case with traditional k-means.
REFERENCES
[1] Pena, J. M., Lozano, J. A., Larranaga, P., "An empirical comparison of four initialization methods for the K-Means algorithm", Pattern Recognition Letters 20 (1999), pp. 1027-1040.
[2] Chakrabarti, K., Mehrotra, S., "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", in Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, pp. 89-100, 2000.
TABLE 5. PERFORMANCE COMPARISON TABLE

Algorithm          Dataset   Clustering Accuracy (%)   Average Silhouette Width
k-means            Iris      94.7                      0.66
k-means            Abalone   76.7                      0.43
k-means            Wine      76                        0.57
Proposed k-means   Iris      96                        0.89
Proposed k-means   Abalone   88.9                      0.45
Proposed k-means   Wine      77.4                      0.59

FIGURE 2. PLOT OF IRIS DATASET USING PROPOSED K-MEANS
[3] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, An Imprint of Elsevier, 2006.
[4] Margaret H. Dunham, “Data Mining- Introductory and Advanced
Concepts”, Pearson Education, 2006
[5] Huang Z, “Extensions to the k-means algorithm for clustering large data
sets with categorical values,” Data Mining and Knowledge Discovery,
(2):283–304, 1998.
[6] Chaturvedi J. C. A, Green P, “K-modes clustering,” J. Classification,
(18):35–55, 2001.
[7] Daxin Jiang, Chun Tang and Aidong Zhang, "Cluster Analysis for Gene
Expression Data,” IEEE Transactions on Data and Knowledge
Engineering, 16(11): 1370-1386, 2004
[8] Yuan F, Meng Z. H, Zhang H. X and Dong C. R, “A New Algorithm to
Get the Initial Centroids,” Proc. of the 3rd International Conference on
Machine Learning and Cybernetics, pages 26–29, August 2004
[9] Fahim A.M, Salem A. M, Torkey A and Ramadan M. A, “An Efficient
enhanced k-means clustering algorithm,” Journal of Zhejiang University,
10(7):1626–1633, 2006.
[10] Grigorios Tzortzis and Aristidis Likas , “The MinMax k-Means
clustering algorithm”, An Imprint of Elsevier, 2014
[11] Guojun Gan and Michael Kwok-Po Ng, “k-means clustering with
outlier removal”, Pattern Recognition Letters 90 (2017) 8–14
[12] Kun Niu, Zhipeng Gao, Haizhen Jiao, Nanjie Deng, “K-Means+: A
Developed Clustering Algorithm for Big Data”, Proceedings of CCIS
2016