numerical data for each attribute to a series of items. For example, there are established conventions for considering a person young, adult, or elderly with respect to age. A set of rules is created for each continuous numerical attribute using the knowledge of medical domain experts. A rule engine is used to map continuous numerical data to items using these rules.

We have used a domain dictionary approach to transform the data for which medical domain expert knowledge is not applicable into numerical form. As the cardinality of attributes other than continuous numeric data is not high in the medical domain, these attribute values are mapped to integer values using medical domain dictionaries. Therefore, the mapping process is divided into two phases. Phase 1: a rule base is constructed based on the knowledge of medical domain experts, and dictionaries are constructed for attributes where domain expert knowledge is not applicable. Phase 2: attribute values are mapped to integer values using the corresponding rule base and the dictionaries.
[Figure 1: data transformation flow. A dictionary is generated for each categorical attribute, a rule base is built from medical domain knowledge, and the actual data are mapped to integer items using the rule base and dictionaries, giving data suitable for knowledge discovery.]

Dictionary of Diagnosis attribute (original value -> mapped value): Headache -> 1, Fever -> 2
Dictionary of Smoke attribute (original value -> mapped value): Yes -> 1, No -> 2

Rule base:
    If age <= 12 then 1
    If 13 <= age <= 60 then 2
    If 60 <= age then 3
    If smoke = y then 1
    If smoke = n then 2
    If Sex = M then 1
    If Sex = F then 2

Actual data:
    Patient ID   Age   Smoke   Diagnosis
    1020D        33    Yes     Headache
    1021D        63    No      Fever

Mapped data:
    Patient ID   Age   Smoke   Diagnosis
    1020D        2     1       1
    1021D        3     2       2

Figure 1. Data transformation of medical data
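The two mapping phases can be illustrated with a minimal Python sketch. It is not the authors' implementation (which was written in C#). The function names `age_to_item`, `build_dictionary` and `transform` are hypothetical, the dictionaries are assumed to number values in first-seen order, and the age rules are assumed to be checked in order so that age 60 maps to item 2.

```python
def age_to_item(age):
    """Rule base for the continuous Age attribute (thresholds taken from Figure 1)."""
    if age <= 12:
        return 1
    if age <= 60:  # assumption: rules are tried in order, so 60 maps to item 2
        return 2
    return 3


def build_dictionary(values):
    """Phase 1: give each distinct categorical value the next integer, in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


# Actual data from Figure 1.
patients = [
    {"id": "1020D", "age": 33, "smoke": "Yes", "diagnosis": "Headache"},
    {"id": "1021D", "age": 63, "smoke": "No", "diagnosis": "Fever"},
]

# Phase 1: dictionaries for the attributes handled without expert rules.
smoke_dict = build_dictionary(p["smoke"] for p in patients)
diagnosis_dict = build_dictionary(p["diagnosis"] for p in patients)


def transform(patient):
    """Phase 2: map one record to integer items using the rule base and dictionaries."""
    return {
        "id": patient["id"],
        "age": age_to_item(patient["age"]),
        "smoke": smoke_dict[patient["smoke"]],
        "diagnosis": diagnosis_dict[patient["diagnosis"]],
    }


mapped = [transform(p) for p in patients]
for row in mapped:
    print(row)
```

With the two records of Figure 1 this reproduces the mapped rows (2, 1, 1) and (3, 2, 2).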
3. The proposed algorithm

Figure 2 shows the proposed hybrid-partitioning algorithm, which can handle both continuous and discrete data and performs clustering based on the anticipated likelihood attributes, with the core attributes of the disease kept in each data point. In this algorithm, the user specifies which attributes make up the data point for a patient and which attributes participate in the clustering process. The goal of the algorithm is to form clusters from which likelihood can be found. Healthcare data are sparse, as doctors perform only a few different clinical lab tests for a patient over his or her lifetime. It is therefore natural that many patients do not have all the anticipated attributes for likelihood. When a patient lacks one or more anticipated attributes for likelihood, keeping this patient in the clustering process would make the clusters useless for finding likelihood. Therefore, we ignore that patient in the clustering process.

3.1. Updating cluster center

We need to update the k cluster centers dynamically in order to minimize the intra-cluster distance of patients. Here k is the number of clusters we would like to make, Pi is the ith attribute value of a patient and Ci is the ith mean-mode value of cluster C. As the patient attributes are both continuous and discrete, each cluster center is an array of both average and mode values, where the average is computed for continuous attributes and the mode for discrete attributes. The mean is computed for each continuous attribute by averaging that attribute over the data points in the cluster. The mode is computed for each discrete attribute by taking the most frequent value of that attribute among the data points in the cluster.

3.2. Dissimilarity measure

The object dissimilarity measure is derived from both numeric and categorical attributes. For discrete features, the dissimilarity measure between two data points depends on the number of different values in
Algorithm: Partition patients to find likelihood of disease based on MeanMode value of patients.
1. Read the metadata about which attributes will appear only in the clustering process.
2. Partition the patient data into k clusters at random and assign each partition to a cluster. To retrieve patient data, use the corresponding RetrieveAllPatientsRecord() for each data model.
3. Repeat
   3.1 Call UpdateMeanModeofClusters(K, M) to update the Mean-Mode value of the k clusters
   3.2 Move patient Pi to the cluster with the least distance, finding the distance between a patient and a cluster using the function Distance(P, C, m);
   Until no patient is moved

Procedure UpdateMeanModeofClusters(K: Number of clusters, M: Medical attributes)
1. For each cluster c ∈ K
   1.1 i = 0
   1.2 For each attribute A ∈ M where A can appear in clustering
       1.2.1 If A is a continuous attribute
             1.2.1.1 MeanModec[i] = Find the mean among the attribute named A values of data points in cluster c.
       1.2.2 Else if A is a categorical attribute
             1.2.2.1 MeanModec[i] = Find the mode among the attribute named A values of data points in cluster c.
       1.2.3 i++;

Procedure Distance(P: Patient, C: Cluster, m: Number of attributes)
// Here Pi represents the ith attribute value of Patient P and Ci represents the ith MeanMode value of Cluster C
1. D1 = 0; D2 = 0;
2. For i = 1 to m where the ith attribute value of Patient can appear in clustering
   2.1 If Pi is continuous
       2.1.1 Then D1 = D1 + (Pi - Ci)^2
   2.2 Else (categorical)
       2.2.1 Then D2 = D2 + NumberofOnes(Pi ^ Ci);
3. d = SQRT(D1) + D2;
4. return d;

Figure 2. Constrained k-Means-Mode clustering algorithm
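The two procedures of Figure 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. Data points are assumed to be tuples of already-mapped values, `None` is assumed to mark a missing attribute, and the categorical term counts one for each mismatching attribute (a plain Hamming count) rather than calling `NumberofOnes(Pi ^ Ci)`. The names `drop_incomplete`, `mean_mode`, `distance` and `nearest_cluster` are hypothetical.

```python
import math
from collections import Counter


def drop_incomplete(points):
    """Ignore patients that lack any clustering attribute (None marks a missing value)."""
    return [p for p in points if None not in p]


def mean_mode(points, is_continuous):
    """Cluster center: mean of each continuous attribute, mode of each categorical one."""
    center = []
    for i, continuous in enumerate(is_continuous):
        column = [p[i] for p in points]
        if continuous:
            center.append(sum(column) / len(column))
        else:
            center.append(Counter(column).most_common(1)[0][0])
    return center


def distance(point, center, is_continuous):
    """SQRT(D1) + D2: Euclidean part for continuous, mismatch count for categorical."""
    d1 = 0.0
    d2 = 0
    for value, c, continuous in zip(point, center, is_continuous):
        if continuous:
            d1 += (value - c) ** 2
        elif value != c:
            d2 += 1
    return math.sqrt(d1) + d2


def nearest_cluster(point, centers, is_continuous):
    """Index of the center with the least distance to the point (step 3.2)."""
    return min(range(len(centers)),
               key=lambda c: distance(point, centers[c], is_continuous))


# Hypothetical attributes: (age, smoke item, diagnosis item).
is_continuous = (True, False, False)
cluster = drop_incomplete([(30.0, 1, 2), (40.0, 1, 1), (50.0, 2, 1), (None, 1, 1)])
center = mean_mode(cluster, is_continuous)
print(center)                                         # [40.0, 1, 1]
print(distance((33.0, 1, 2), center, is_continuous))  # 8.0
```

In the usage lines, the age difference contributes sqrt(49) = 7 and the single mismatching diagnosis item contributes 1, so the distance is 8.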
each categorical feature. For continuous features, the dissimilarity measure between two data points depends on the Euclidean distance. We use the following two functions to measure dissimilarity: the Hamming distance function for categorical objects and the Euclidean distance function for continuous data. To measure the distance between two objects based on several features, we test for each feature whether it is discrete or continuous. If the feature is continuous, the squared difference used by the Euclidean distance is added to D1; if the feature is discrete, the dissimilarity is measured using the Hamming distance and added to D2. The resultant distance is computed by adding the square root of D1 to D2. The computational complexity of the algorithm is O((I+1) k p), where p is the number of patients, k is the number of clusters and I is the number of iterations.

Let the anticipated likelihood attributes be $A = \{a_1, a_2, a_3, \ldots\}$. Let the core attributes of the disease be $B = \{b_1, b_2, b_3, \ldots\}$. In the clustering process, only the anticipated likelihood attributes participate. The anticipated likelihood attributes consist of both continuous and categorical attributes. Let the first $y$ attributes of $A$ be continuous and the remaining attributes be categorical, with $z$ attributes in total. Let the anticipated likelihood attributes of two data points be $X$ and $X'$. The dissimilarity between the anticipated likelihood attributes of two data points is the sum of the dissimilarity of the continuous attributes and the dissimilarity of the categorical attributes. Distance is measured using the Euclidean distance function for continuous attributes. The distance between $X$ and $X'$ based on continuous attributes is

$$\sqrt{D_1} = \sqrt{\sum_{i=1}^{y} (x_i - x'_i)^2}$$

where $x_i \in X$ and $x'_i \in X'$. Distance is measured using the Hamming distance function for categorical attributes. The distance between $X$ and $X'$ based on categorical attributes is

$$D_2 = \sum_{j=y+1}^{z} \delta(x_j, x'_j), \qquad \delta(x_j, x'_j) = \begin{cases} 0 & \text{if } x_j = x'_j \\ 1 & \text{otherwise} \end{cases}$$

3.3. Likelihood

Likelihood is the probability of a specified outcome. After clustering with the constrained K-Means-Mode algorithm we get a set of clusters, $C = \{c_1, c_2, c_3, \ldots, c_k\}$. Each cluster contains a set of data points, which consist of anticipated likelihood attributes and core attributes of the disease. The data points of cluster $c_j$ are $D_j = \{d_{j1}, d_{j2}, d_{j3}, \ldots, d_{ju}\}$. A set of boolean functions on the core attributes of the disease determines whether a data point shows the presence of the disease or not. Let the set of boolean functions be $F = \{f_1, f_2, f_3, \ldots, f_v\}$. A data point $d_t$ has presence of the disease if $\bigwedge_{i=1}^{v} f_i(d_t)$ is true. In a cluster, the number of data points that have presence of the disease is $\sum_{t=1}^{u} \prod_{i=1}^{v} f_i(d_{jt})$. The total number of data points in the cluster is $u$. So the likelihood of cluster $c_j$ for the disease is

$$\frac{\sum_{t=1}^{u} \prod_{i=1}^{v} f_i(d_{jt})}{u}$$

where $f_i$ is a function that returns either one or zero.

Here each cluster is represented by the mean and mode value of that cluster. We now give the mean-mode value of a cluster c. The mean is calculated over the continuous attributes and the mode over the categorical attributes. Let the mean-mode value of a cluster be $MM = \{mm_1, mm_2, mm_3, \ldots, mm_z\}$, where $z$ is the number of attributes in the clustering process. Let the first $y$ attributes of $MM$ be continuous and the remaining $z - y$ be categorical. The continuous part of the mean-mode value is $mm_i$, $i = 1, \ldots, y$, the mean of the ith attribute values of cluster c. The categorical part of the mean-mode value is $mm_j$, $j = y+1, \ldots, z$, the mode of the jth attribute values of cluster c.

4. Results and discussion

The experiments were run on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and 3 GB of main memory. The operating system was Microsoft Vista and the implementation language was C#. We used two datasets to verify our method. The first is a patient dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances with 514 attributes (150 discrete and 364 numerical attributes). The Patient Dataset was clustered into 5 classes (Very High Risk, High Risk, Medium Risk, Low Risk, No Risk) using the proposed algorithm to find the likelihood of diabetes. The second is the Zoo Data Set [15] from the UCI Machine Learning Repository, which has characteristics similar to medical data. It contains 101 instances in 7 classes {mammal, bird, reptile, fish, amphibian, insect, invertebrate}, each described by 18 attributes (16 discrete and 2 numerical attributes). Each test result is the average of 10 trials. Likelihood is the probability of a specified disease. Here average likelihood is the average of all cluster likelihoods. Actual likelihood is the actual probability of the disease in the data, found using a brute-force approach. Accuracy is the ratio between the average likelihood and the actual likelihood.
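The likelihood and accuracy measures defined above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. It assumes that a data point shows the disease only when every boolean function returns true, and the attribute names `lab` and `symptom`, the two boolean functions and the toy clusters are all hypothetical.

```python
def has_disease(point, functions):
    """Assumption: the disease is present only if every boolean function holds."""
    return all(f(point) for f in functions)


def cluster_likelihood(points, functions):
    """Fraction of the data points in one cluster that show the disease."""
    return sum(has_disease(p, functions) for p in points) / len(points)


def average_likelihood(clusters, functions):
    """Average of the per-cluster likelihoods."""
    values = [cluster_likelihood(c, functions) for c in clusters]
    return sum(values) / len(values)


def actual_likelihood(clusters, functions):
    """Brute-force probability of the disease over all data points."""
    all_points = [p for c in clusters for p in c]
    return cluster_likelihood(all_points, functions)


# Hypothetical boolean functions on hypothetical core attributes.
functions = [lambda d: d["lab"] > 100, lambda d: d["symptom"] == 1]

clusters = [
    [{"lab": 150, "symptom": 1}, {"lab": 90, "symptom": 1}],
    [{"lab": 160, "symptom": 1}, {"lab": 170, "symptom": 1},
     {"lab": 130, "symptom": 1}, {"lab": 85, "symptom": 0}],
]

average = average_likelihood(clusters, functions)  # (0.5 + 0.75) / 2
actual = actual_likelihood(clusters, functions)    # 4 of 6 data points
accuracy = average / actual
print(round(accuracy, 4))  # 0.9375
```

Because accuracy is defined as a ratio, it can exceed 1 when the per-cluster average overestimates the pooled probability.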
[Figure 3: chart of accuracy (0 to 1) against the number of boolean functions (64, 47, 33) for K-Means, K-Mode, K-Means-Mode, K-Means with BK, K-Mode with BK and K-Means-Mode with BK.]
Figure 3. Accuracy of test result for the patient dataset to find likelihood of diabetic
For the Patient Dataset, used to find the likelihood of diabetes, Figure 3 presents accuracy results for the K-Means, K-Mode, K-Means-Mode, K-Means with background knowledge (BK), K-Mode with BK and K-Means-Mode with BK algorithms over the number of boolean functions. The number of boolean functions for each presented result is also indicated. An average accuracy of 95.1% is achieved using the medical background information together with the hybrid clustering algorithm. The K-Means algorithm with BK achieves an average accuracy of 56%, and K-Means without BK achieves 17.7%. K-Mode without BK and K-Mode with BK both perform worse than their K-Means counterparts, averaging 12.1% and 30.2% accuracy respectively. As illustrated in Figure 3, the proposed method improves accuracy by about 39-40 percentage points over K-Means with BK and by about 64-65 percentage points over K-Mode with BK. It also gives much better accuracy than plain K-Means and K-Mode, by about 77-78 percentage points over K-Means and about 82-83 percentage points over K-Mode. An average accuracy of 30.2-56% can thus be achieved by K-Means or K-Mode using the background information alone. The K-Means-Mode algorithm without background knowledge achieves an average accuracy of about 28%. This demonstrates that neither the medical background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.
[Figure 4: chart of accuracy (0 to 1) against the number of boolean functions (87, 63, 12) for K-Means, K-Mode, K-Means-Mode, K-Means with BK, K-Mode with BK and K-Means-Mode with BK.]
Figure 4. Accuracy of test result for the zoo data set
For the Zoo Data Set [15], Figure 4 shows accuracy results for the K-Means, K-Mode, K-Means-Mode, K-Means with background knowledge (BK), K-Mode with BK and K-Means-Mode with BK algorithms over the number of boolean functions. The number of boolean functions for each presented result is also indicated. It also demonstrates that neither the medical background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.

5. Conclusion

Clustering medical data is important, as the results of such analysis can be used to improve patient care and treatment. We have proposed a clustering method for medical data that predicts the likelihood of diseases by combining the k-means and k-mode algorithms and incorporating medical background knowledge. It clusters both numerical and categorical data efficiently and allows the user to specify constraints on which attributes participate in the clustering process and which attributes are selected as the data point. The method has been applied to a real-world medical data set and to the Zoo Data Set from the UCI Machine Learning Repository, and we have shown significant improvements in accuracy. We draw the following conclusion from the work: neither the medical background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.

6. References

[1] P. B. Torben and J. S. Christian, "Research Issues in Clinical Data Warehousing," in Proceedings of the 10th International Conference on Scientific and Statistical Database Management, Capri, 1998, pp. 43-52.
[2] J. B. Macqueen, "Some methods of classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Berkeley, CA, 1967, pp. 281-297.
[3] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 283-304, September 1998.
[4] O. M. San, V. N. Huynh, and Y. Nakamori, "An Alternative Extension of the k-Means Algorithm for Clustering Categorical Data," JAMCS, vol. 14, no. 2, pp. 241-247, 2004.
[5] H. C. Hongch and D. Y. Yeung, "Locally linear metric adaptation for semi-supervised clustering," in Proceedings of the twenty-first international conference on Machine learning, Banff, Alberta, Canada, 2004, pp. 153-160.
[6] H. C. Hongch and D. Y. Yeung, "Locally linear metric adaptation with application to semi-supervised clustering and image retrieval," Pattern Recognition, vol. 39, no. 7, pp. 1253-1264, July 2006.
[7] K. Shin and A. Abraham, "Two Phase Semi-supervised Clustering Using Background Knowledge," Lecture Notes in Computer Science, vol. 4224, pp. 707-712, September 2006.
[8] M. S. Baghshaha and S. B. Shourakib, "Kernel-based metric learning for semi-supervised clustering," Neurocomputing, December 2009.
[9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained K-means Clustering with Background Knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.
[10] G. Y. Hang, D. Zhang, J. Ren, and C. Hu, "A Hierarchical Clustering Algorithm Based on K-Means with Constraints," in Fourth International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan, 2009, pp. 1479-1482.
[11] K. Li, Z. Cao, L. Cao, and R. Zhao, "A novel semi-supervised fuzzy C-means clustering method," in Proceedings of the 21st annual international conference on Chinese control and decision conference, Guilin, China, 2009, pp. 3804-3808.
[12] S. B. Patil and Y. S. Kumaraswamy, "Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction," International Journal of Computer Science and Network Security, vol. 9, no. 2, pp. 228-235, February 2009.
[13] S. B. Patil and Y. S. Kumaraswamy, "Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network," European Journal of Scientific Research, vol. 31, no. 4, pp. 642-656, 2009.
[14] M. Sacha. (2008) Clustering of an aperiodical medical data. http://www.mareksacha.com/blog/clustering-of-an-aperiodical-medical-data.
[15] Zoo Data Set. (n.d.). Retrieved 03 01, 2010, from Machine Learning Repository: http://archive.ics.uci.edu/ml/support/Zoo