Premeditated initial points for K-Means Clustering
PL. Chithra
Department of Computer Science
University of Madras
Chennai, India
chitrasp2001@yahoo.com
Jeyapriya.U
Department of Computer Science
University of Madras
Chennai, India
jeyapriya75@gmail.com
Abstract — K-means clustering uses an iterative procedure that is highly sensitive to, and dependent upon, the initial centroids. Since the initial centroids in k-means clustering are chosen randomly, the resulting clusters also change with the initial centroids. This paper addresses the problem of random centroid selection, and the consequent instability of the clusters, through a premeditated selection of the initial centroids. We use the iris, abalone and wine datasets to demonstrate that finding the initial centroids with the proposed method and using them in the k-means algorithm improves the clustering performance. The clustering also remains the same in every run, as the initial centroids are selected not randomly but through the premeditated method.

Keywords — clustering algorithm; centroids; unsupervised learning; k-means algorithm; proposed k-means algorithm
I. INTRODUCTION
Clustering algorithms help to organize data logically. Clustering is useful in many exploratory pattern-analysis, grouping, decision-making, and machine-learning situations. In many cases there is very little prior information available about the data, and the analyst must make as few assumptions about the data as possible. Clustering algorithms are especially useful in such cases for a preliminary analysis of the data; clustering is a fundamental data analysis task. The recent dramatic increase in the availability of data and of computing technology has made clustering an even more prominent task. When data is clustered, the elements of a cluster share similar characteristics and differ widely from the elements of other clusters. Various clustering methods are available: clustering algorithms can be partition-based, hierarchy-based, grid-based, density-based or model-based. The most popular, simple and efficient partition-based clustering method is k-means clustering.
The k-means algorithm [3, 4] is effective in producing clusters for many practical applications, but the computational complexity of the original k-means algorithm is very high, especially for large datasets. A further challenge is that the algorithm produces different clusterings for the same dataset depending on the random choice of the initial centroids. Several researchers have worked on improving the efficiency of the k-means algorithm. Clustering, done properly, helps in various areas, especially in education. The data available in educational institutions is very vast nowadays, and hence its analysis is important. Clustering plays a vital role with educational data: students can be clustered based on their interests, their learning comforts and so on, and the teaching and learning process can be adapted to the clusters.
This paper presents a method for improving the efficiency of the k-means clustering algorithm. The rest of the paper is organized as follows. Section II discusses related work on k-means clustering, Section III describes the original k-means clustering algorithm, and Section IV presents the proposed k-means algorithm, followed by experimental results in Section V. The paper concludes by reiterating the improvement in performance of the proposed k-means clustering algorithm.
II. RELATED WORK

Several attempts have been made by researchers to improve the effectiveness and efficiency of the k-means algorithm. One variant of the k-means algorithm is the k-modes method, which replaces the means of the clusters with modes and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function [5]. The k-prototypes algorithm [5] integrates the k-means and k-modes methods for clustering data. Fang Yuan et al. [8] proposed a systematic method for finding the initial centroids: first the distances between every pair of data points are evaluated, and the initial centroids are then constructed from data points that are similar to one another. The centroids obtained by this method are consistent with the distribution of the data, and hence the method produces clusters with better precision than the original k-means algorithm. Fahim A. M. et al. [9] proposed an efficient method for assigning data points to clusters. In Fahim's approach, the distance from each data point to its nearest cluster is recorded; at the next iteration, the distance to the previously nearest cluster is calculated, and if the new distance is less than or equal to the previous one, the point stays in its cluster and its distances to the other clusters need not be computed again. However, this method still assumes that the initial centroids are determined randomly, as in the original k-means algorithm, so there is no guarantee of the accuracy of the final clusters.
Grigorios Tzortzis et al. [10] address the initialization problem of k-means by proposing the MinMax k-means algorithm, a method that assigns weights to the clusters relative to their variance and optimizes a weighted version of the k-means objective. The weights are learned together with the cluster assignments through an iterative procedure. The proposed weighting scheme limits the emergence of large-variance clusters and allows high-quality solutions to be uncovered systematically, irrespective of the initialization. Guojun Gan et al. [11] propose the KMOR algorithm, which extends k-means to perform data clustering and outlier detection simultaneously.
III. K-MEANS CLUSTERING ALGORITHM
K-means is a simple, unsupervised, partition-based clustering algorithm that groups similar members into a cluster. K-means assumes that all data are available at a particular instant in time. This is no longer the case since the advent of Big Data: data arrives continuously, and clustering needs to account for every new instance. Because the initial centroids are selected randomly in k-means clustering, the clustering has to be redone whenever a new instance arrives.

The k-means clustering algorithm is affected by two main factors, namely the choice of the initial centroids and the number of clusters. Several methods have been proposed in the literature to address the problems that affect the performance of k-means clustering.
The k-means clustering algorithm starts the process with k random clusters. The algorithm consists of two phases. The first phase defines the k centroids. The second phase takes each data point of the given dataset and associates it with the nearest centroid: the Euclidean distance between each data point and the centroids is calculated, and the data point is assigned to the cluster at minimum distance. Once all data points have been assigned to one cluster or another, the centroids are recalculated. With the k new centroids, a new association between the data points and the centroids is created as before, and this continues until the centroids no longer change. Because the centroids are chosen randomly in the first step, the clustering is different for every run.
Pseudo Code 1: K-means Clustering

Input:
    D = {d1, d2, ..., dn}  // the set of n data items
    K                      // the number of desired clusters
Output: A set of K clusters
Start
    Fix the number of clusters (K) to be formed
    Randomly choose K centroids
    Do
        Assign each object to the nearest cluster
        Recalculate the K centroids
    While the centroids change
End
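For concreteness, the following is a minimal Python sketch of this procedure (our own illustration, not code from the paper). It assumes the data is a NumPy array of shape (n_samples, n_features); the optional init argument, used again in Section IV, lets a caller supply fixed initial centroids instead of random ones.

import numpy as np

def kmeans(data, k, init=None, max_iter=100, seed=None):
    """Plain k-means (Pseudo Code 1); init may supply fixed initial centroids."""
    if init is None:
        rng = np.random.default_rng(seed)
        # Randomly choose K distinct data points as the initial centroids.
        init = data[rng.choice(len(data), size=k, replace=False)]
    centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest centroid (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # the centroids no longer change
        centroids = new
    return labels, centroids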
Figure 1 shows the clustering plot for the iris dataset using the original k-means.

FIGURE 1. PLOT OF IRIS DATASET USING ORIGINAL K-MEANS
The k-means algorithm is computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters and the number of iterations.
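To make this cost concrete with the datasets used here: a single iteration over the abalone dataset (4177 instances, k = 3 clusters; see Table 1) already requires 4177 × 3 = 12531 distance computations, and this cost recurs at every iteration, which is why reducing the number of iterations matters.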
IV. PROPOSED K-MEANS ALGORITHM
Instead of choosing the initial centroids randomly, a procedural way of finding the centres can be followed, so that the clustering remains the same for every run and the number of iterations is also reduced.
Pseudo Code 2: Proposed K-means Clustering

Input:
    D = {d1, d2, ..., dn}  // the set of n data items
    K                      // the number of desired clusters
Output: A set of K clusters
Start
    Determine the initial centroids of the clusters using Pseudo Code 3
    Do
        Assign each object to the nearest cluster
        Recalculate the K centroids
    While the centroids change
End
Pseudo Code 3: Finding the initial centroids

Input:
    D = {d1, d2, ..., dn}  // the set of n data items
    K                      // the number of desired clusters
Output: A set of K centroids
Start
    Calculate the range (R) of the selected features of the dataset
    G = R / K
    Calculate the centroids as follows:
        Centroid 1: C1 = minimum observation + integer(G / 2)
        For i from 2 to K:
            Ci = Ci-1 + G
End
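The following is a minimal Python sketch of this initialization (our own reading of Pseudo Code 3). The paper does not state whether the range is taken over the features jointly or individually; here we assume R and the minimum observation are computed per selected feature, so the centroids are spaced evenly along each feature's observed range, and integer(.) is read as the floor function.

import numpy as np

def premeditated_centroids(data, k):
    """Deterministic initial centroids following Pseudo Code 3 (per-feature reading)."""
    lo = data.min(axis=0)                # minimum observation per feature
    g = (data.max(axis=0) - lo) / k      # G = R / K, with R the per-feature range
    c1 = lo + np.floor(g / 2)            # C1 = minimum observation + integer(G / 2)
    # Ci = C(i-1) + G for i = 2..K
    return np.array([c1 + i * g for i in range(k)])

These centroids can be passed as the init argument of the earlier kmeans sketch in place of the random initialization; since they are a deterministic function of the data, every run then produces the same clusters.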
The proposed algorithm does not require many iterations, as the centroids are calculated deterministically in the first stage, so the grouping of the data points does not change from run to run. Once the clustering of the data is over and a new instance comes in, the new instance can be fitted into the appropriate cluster according to its attribute values, instead of redoing the entire clustering.
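As a sketch of this incremental use (our own illustration; the paper does not give a procedure for it), a new instance is simply assigned to the cluster with the nearest existing centroid:

import numpy as np

def assign_new_instance(x, centroids):
    """Place a new instance into the nearest existing cluster,
    avoiding a full re-clustering of the data."""
    return int(np.argmin(np.linalg.norm(centroids - np.asarray(x), axis=1)))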
As can be seen, the iris dataset required a minimum of 2 iterations under the original k-means, whereas the proposed k-means clustering algorithm converged in the first iteration. Figure 2 shows the clustering plot for the iris dataset using the proposed k-means.

FIGURE 2. PLOT OF IRIS DATASET USING IMPROVED K-MEANS
Table 1 shows the actual cluster distribution in the different datasets.

TABLE 1. DETAILS OF IRIS, WINE AND ABALONE DATASETS

Dataset    Instances   Features   Cluster 1   Cluster 2   Cluster 3
Iris          150          2          50          50          50
Wine          150         13          59          71          20
Abalone      4177          9        2768         448         960
V. EXPERIMENTAL RESULTS
The proposed k-means approach was validated using a confusion matrix, to measure the clustering performance, and the silhouette value, to interpret and validate the consistency within the clusters of data. The silhouette technique provides a succinct visual representation of how well each object lies within its cluster.
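As an illustration of this validation step, the following sketch uses scikit-learn (an assumption of ours; the paper does not name its tooling) and presumes the true class labels are integer-coded to match the cluster indices, as with the three iris species:

import numpy as np
from sklearn.metrics import confusion_matrix, silhouette_score

def evaluate_clustering(data, labels, true_labels):
    """Confusion matrix against the known classes, plus the mean
    silhouette width of the clustering (a value in [-1, 1])."""
    cm = confusion_matrix(true_labels, labels)  # rows: true class, columns: cluster
    sil = silhouette_score(data, labels)
    return cm, sil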
The confusion matrices reveal that the proposed k-means is better than the original k-means. Because the initial centroids are calculated rather than randomly selected, the clustering remains the same over any number of runs. The number of iterations is also reduced in some cases.
Tables 2, 3 and 4 illustrate the clustering performance of the original and the proposed k-means algorithms on the iris, wine and abalone datasets respectively. The clustering was carried out using both k-means and the proposed k-means, and the performance was found to be better with the proposed k-means than with the original k-means. Table 5 summarizes the clustering accuracy and the average silhouette width of both algorithms on the three datasets.
TABLE 2: PERFORMANCE IN IRIS DATASET

                    ORIGINAL K-MEANS                   PROPOSED K-MEANS
             Cluster 1  Cluster 2  Cluster 3    Cluster 1  Cluster 2  Cluster 3
Setosa           50         0          0            50         0          0
Versicolor        0        48          2             0        48          2
Virginica         0         6         44             0         4         46
TABLE 3: PERFORMANCE IN WINE DATASET

       ORIGINAL K-MEANS        PROPOSED K-MEANS
         1      2      3         1      2      3
1       23      1     35         1     35     23
2        0     64      7        64      7      0
3        0     17      3        17      3      0
TABLE 4: PERFORMANCE IN ABALONE DATASET

       ORIGINAL K-MEANS           PROPOSED K-MEANS
         1      2      3            1      2      3
1     1648      0   1121         2274    495      0
2        0    364    596            0    699    261
3      448      0      0          448      0      0

TABLE 5. PERFORMANCE COMPARISON TABLE

Algorithm           Dataset    Clustering Accuracy (%)   Average Silhouette Width
k-means             Iris               94.7                      0.66
k-means             Abalone            76.7                      0.43
k-means             Wine               76.0                      0.57
Proposed k-means    Iris               96.0                      0.89
Proposed k-means    Abalone            88.9                      0.45
Proposed k-means    Wine               77.4                      0.59
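The accuracy figures in Table 5 can be checked against the confusion matrices above by taking the diagonal (correctly grouped instances) over the total. For the iris dataset, for example:

import numpy as np

# Iris confusion matrices from Table 2 (rows: species, columns: clusters).
original = np.array([[50, 0, 0], [0, 48, 2], [0, 6, 44]])
proposed = np.array([[50, 0, 0], [0, 48, 2], [0, 4, 46]])

for name, cm in (("original", original), ("proposed", proposed)):
    # trace/sum gives 142/150 = 0.947 and 144/150 = 0.960, matching Table 5.
    print(name, cm.trace() / cm.sum())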
VI. CONCLUSION
One of the problems in k-means clustering is the random selection of the initial centroids: because of it, the clustering changes from one run to another. In this work we have proposed a better way of calculating the initial centroids for k-means clustering. The proposed algorithm keeps the clustering stable, since the initial centroids are calculated rather than randomly selected. The experimental results showed that the proposed k-means algorithm worked well in terms of accuracy, and its clustering was identical across several runs, which is not the case with the traditional k-means.
REFERENCES
[1] Pena, J. M., Lozano, J. A., Larranaga, P., "An empirical comparison of four initialization methods for the K-Means algorithm," Pattern Recognition Letters 20 (1999), pp. 1027-1040.
[2] Chakrabarti, K., Mehrotra, S., "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces," in Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, pp. 89-100, 2000.
[3] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, an imprint of Elsevier, 2006.
[4] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics," Pearson Education, 2006.
[5] Huang Z., "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, (2):283-304, 1998.
[6] Chaturvedi A., Green P., Carroll J. D., "K-modes clustering," Journal of Classification, (18):35-55, 2001.
[7] Daxin Jiang, Chun Tang and Aidong Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, 16(11):1370-1386, 2004.
[8] Yuan F., Meng Z. H., Zhang H. X. and Dong C. R., "A New Algorithm to Get the Initial Centroids," in Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pp. 26-29, August 2004.
[9] Fahim A. M., Salem A. M., Torkey A. and Ramadan M. A., "An efficient enhanced k-means clustering algorithm," Journal of Zhejiang University, 10(7):1626-1633, 2006.
[10] Grigorios Tzortzis and Aristidis Likas, "The MinMax k-means clustering algorithm," Pattern Recognition, 2014.
[11] Guojun Gan and Michael Kwok-Po Ng, "k-means clustering with outlier removal," Pattern Recognition Letters 90 (2017), pp. 8-14.
[12] Kun Niu, Zhipeng Gao, Haizhen Jiao, Nanjie Deng, "K-Means+: A Developed Clustering Algorithm for Big Data," in Proceedings of CCIS 2016.