SlideShare uma empresa Scribd logo
1 de 6
Baixar para ler offline
Clustering Medical Data to Predict the Likelihood of Diseases

                        Razan Paul, Abu Sayed Md. Latiful Hoque
                    Department of Computer Science and Engineering,
       Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
                  razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd


                      Abstract                                   numerical attributes. In [3-4], the authors extend k-
                                                                 means algorithm to partition large data sets with
    Several studies show that background knowledge               categorical objects. K-means [2] and K-modes [3-4]
of a domain can improve the results of clustering                clustering algorithms are recognized techniques to
algorithms. In this paper, we illustrate how to use              partition large data sets based on numerical attributes
the background knowledge of medical domain in                    and categorical attributes respectively. To find
clustering process to predict the likelihood of                  likelihood of disease we need a clustering algorithm,
diseases. To find the likelihood of diseases,                    which can partition objects consisting of both
clustering has to be done based on anticipated                   numerical and categorical attributes and can set
likelihood attributes with core attributes of disease            constraint on presence or absence of items in
in data point. To find the likelihood of diseases, we            clustering process and on datapoint.
have proposed constraint k-Means-Mode clustering                    A number of work [5-11] has proposed different
algorithm. Attributes of Medical data are both                   technique to address a variant of the conventional
continuous and categorical. The developed                        clustering problem. These works include clustering
algorithm can handle both continuous and discrete                in the presence of information about the problem
data and perform clustering based on anticipated                 domain or some background knowledge. Here our
likelihood attributes with core attributes of disease            proposed algorithm performs clustering in the
in data point. We have demonstrated its effectiveness            presence of information about the medical domain to
by testing it for a real world patient data set.                 predict the likelihood of diseases. However, the
                                                                 technique to use medical background knowledge in
1. Introduction                                                  our proposed algorithm is different from the
                                                                 techniques [5-11].
   Clustering is an attractive approach for finding                 For Heart Attack Prediction, in [12-14] authors
similarities in data and putting similar data into               have performed clustering on the preprocessed data
groups. Due to high dimensionality of medical data               warehouse using K-means clustering algorithm. The
[1], if clustering is done based on all the attributes of        data for Heart Attack Prediction are a mixture of
medical domain, resultant clusters will not be useful            continuous and discrete data. However, K-means
because they are medically irrelevant, contain                   cannot cluster categorical attributes. Therefore, the
redundant information. Moreover, this property                   approaches [12-13]will not work to predict Heart
makes likelihood analysis hard and the partitioning              Attack. In [14], the author performs clustering
process slow. To find the likelihood of a disease                aperiodical medical data, which are both continuous
clustering has to be done based on anticipated                   and discrete, using K-means clustering algorithm.
likelihood attributes with core attributes of disease in
data point. For example, clustering a large number of            2. Mapping complex medical data to
patients with selecting age, weight, sex, smoke,                 mineable items
HbA1c% as data point and allowing only age,
weight, sex, smoke in clustering process, we can find               For knowledge discovery, the medical data have
clusters partitioned by age, weight, sex, smoke. This            to be transformed into a suitable transaction format
way we get clusters that have similar age, weight,               to discover knowledge. We have addressed the
sex, smoke value. Then analyzing each cluster based              problem of mapping complex medical data to items
on HbA1c% can give likelihood information of                     using domain dictionary and rule base as shown in
diabetes.                                                        figure 1. The medical data are types of categorical,
   Attributes of Medical data are both continuous                continuous numerical data, boolean, interval,
and categorical. K-means clustering [2] is widely                percentage, fraction and ratio. Medical domain
used technique to partition large data sets with                 experts have the knowledge of how to map ranges of




978-1-4244-7571-1/10/$26.00 ©2010 IEEE                      44
numerical data for each attribute to a series of items.          cardinality of attributes except continuous numeric
For example, there are certain conventions to                    data are not high in medical domain, these attribute
consider a person is young, adult, or elder with                 values are mapped to integer values using medical
respect to age. A set of rules is created for each               domain dictionaries. Therefore, the mapping process
continuous numerical attribute using the knowledge               is divided in two phases. Phase 1: a rule base is
of medical domain experts. A rule engine is used to              constructed based on the knowledge of medical
map continuous numerical data to items using these               domain experts and dictionaries are constructed for
developed rules.                                                 attributes where domain expert knowledge is not
                                                                 applicable, Phase 2: attribute values are mapped to
   We have used domain dictionary approach to                    integer values using the corresponding rule base and
transform the data, for which medical domain expert              the dictionaries.
knowledge is not applicable, to numerical form. As
                                                            Original       Mapped            Original    Mapped
                           Generate dictionary for          value          value             value       value
                           each categorical attribute       Headache        1                Yes          1
                                                            Fever          2                 No          2
      PatientActual Data
               Age Smoke         Diagnosis                      Dictionary of                       Dictionary of
       ID                                                       Diagnosis attribute                 Smoke attribute
      1020D 33        Yes        Headache
      1021D 63        No         Fever                                                Map to integer items using
                                                                                      rule base and dictionaries
             Actual data

                                  If age <= 12 then 1
         Medical                  If 13<=age<=60 then 2
         domain                   If 60 <=age then 3                        Patient      Age    Smoke     Diagnosis
         knowledge                If smoke = y then 1                        ID
                                  If smoke = n then 2                       1020D        2      1         1
                                  If Sex = M then 1
                                                                            1021D        3      2         2
                                  If Sex = F then 2

                                         Rule Base                           Data suitable for Knowledge Discovery


                                  Figure 1. Data transformation of medical data
                                                                 3.1. Updating cluster center

                                                                     We need to update the k clusters centre
3. The proposed algorithm                                        dynamically in order to minimize the intra cluster
                                                                 distance of patients. Here k is the number of clusters
    Figure 2 shows the proposed hybrid-partitioning
                                                                 we would like to make and Pi is the ith patient
algorithm, which can handle both continuous and
                                                                 attribute and Ci is the ith mean-mode value of cluster
discrete data and perform clustering based on
                                                                 C. As the patient attributes are both continuous and
anticipated likelihood attributes with core attributes
                                                                 discrete, each cluster center is an array of both
of disease in data point. In this algorithm, the user
                                                                 average and mode values where average and mode
will set which attributes will be used as data point for
                                                                 are computed for continuous and discrete attributes
a patient and which attributes will participate in
                                                                 respectively. Mean is computed for each continuous
clustering process. The goal of this algorithm is
                                                                 attribute by calculating average of that attribute
making clusters to find likelihood. Healthcare data
                                                                 among       the data points in that cluster. Mode is
are sparse as doctors perform only few different
                                                                  computed for each discrete attribute by calculating
clinical lab tests for a patient over his lifetime. This
                                                                 maximum frequent value of that attribute among the
is natural many patients have not all anticipated
                                                                 data points in that cluster.
attributes for likelihood. When a patient does not
have one or more anticipated attributes for
likelihood, keeping this patient in clustering process           3.2. Dissimilarity measure
will make clusters useless to find likelihood.
Therefore, we are ignoring that patient in the                      The object dissimilarity measure is derived from
clustering process.                                              both numeric and categorical attributes. For discrete
                                                                 features, the dissimilarity measure between two data
                                                                 point depends on the number of different values in




                                                           45
Algorithm: Partition patients to find likelihood of                        1.2.1 If A is continuous attribute
  disease based on MeanMode value of patients.                                      1.2.1.1 MeanModec [i] = Find the mean
  1. Read the metadata about which attributes will only                    among the attribute named A values of data points
  appear in clustering process.                                            in cluster c.
   2. Partition patient data into k cluster in random and assign              1.2.2 else If A is category attribute
  each partition to each cluster. To retrieve paient data use                       1.2.2.1 MeanModec [i] = Find the mode
  the corresponding RetrieveAllPatientsRecord() for each                   among the attribute named A values of data points
  data model.                                                              in cluster c.
  3. Repeat                                                                    1.2.3 i++;
             3.1 Call UpdateMeanModeofClusters(K, M ) to
                 update Mean-Mode value of k clusters                  Procedure Distance (P: Patient, C: Cluster, m: Number
             3.2 Move patient Pi to the cluster with least             of attributes)
             distance and find the distance between a patient          //Here Pi represent the ith attribute value of Patient P and Ci
             and a     cluster using the function Distance (P, C,      represents ith MeanMode value of Cluster C
             m);
  Until no patient is moved                                                1. for i = 1 to m where ith attribute value of Patient
                                                                              can appear in clustering
  Procedure UpdateMeanModeofClusters(K: Number of                             1.1 If Pi is continuous
  clusters, M: Medical attributes)                                                    1.1.1 Then D1 = D1+ (Pi - Ci) 2
  1. For each cluster c K                                                     1.2 Else (categorical)
                                                                                 1.2.1 Then D2 = D2 + NumberofOnes (Pi ^ Ci);
    1.1 i = 0                                                                  1.3 d = SQRT (D1) + D2;
    1.2 For each attribute A    M where A can appear in                    2. return d;
        clustering


                             Figure 2. Constraint k-Means-Mode clustering algorithm

each categorical feature. For continuous features, the                 Distance between                      based on continuous
dissimilarity measure between two data point
depends on Euclidean distance. Here we have used                       attributes is         =1 (         )2 where ,            and n
the following two functions to measure dissimilarity:                  is the number of patients. Distance is measured using
hamming                distance function        for categorical        Hamming distance function for categorical attributes.
objects and Euclidean              distance    function     for        Distance between                based on categorical
continuous data. To measure distance between two
                                                                       attributes is     =   ( , ) where             ,     =
objects based on several features, for each feature we
                                                                          0      ==
test whether this feature is discrete or continuous. If
the feature is continuous, distance is measured using                     1
Euclidean distance and added it to D1 and if the
feature is discrete, the dissimilarity is measured                     3.3. Likelihood
using hamming distance and added it to D2. The
resultant distance is computed by adding square root                      Likelihood is the probability of a specified
of D1 with D2. The computational complexity of the                     outcome. After clustering using constrained K-
algorithm is O ((I+1) k p), where p is the number of                   Means-Mode algorithm we get a set of clusters,
patients, k the number of clusters and I is the number                 C = {c1 , c2 , c3 ,      ck }. Each cluster contains a set
of iterations.                                                         of data points, which consist of anticipated
     Let the anticipated likelihood attributes be            =         likelihood attributes and core attributes of disease.
{ 1 , 2 , 3 , . . . }. Let the core attributes of disease,             Data       points         for      cluster     cj            is
      = { 1, 2, 3,                    }. In the clustering             Dj = {dj1 , dj2 , dj3 , . . dju }. There are a set of
process, only anticipated likelihood attributes                        boolean functions on core attributes of disease to
participate. The anticipated likelihood attributes                     determine whether a data point has the presence of a
consist of both continuous and categorical attribute.                  disease or not. Let the set of boolean functions be
Let first          attributes of      are continuous and the           F = {f1 , f2 , f3 ,    fv }. A data point dt has presence
remaining                 attributes are categorical. Let the          of the disease if v fi (dt ) == true for the data
                                                                                                 i=1
anticipated likelihood attributes of two data points                   point. In a cluster, the number of data points which
are                  . Dissimilarity between the anticipated           has presence of the disease is u             v
                                                                                                                j=1 i=1 fi (dj ). The
likelihood attributes of two data points is the sum of                 number of total data points in the cluster is u u.       j=1
dissimilarity of continuous attribute and dissimilarity                So likelihood of a cluster for the disease is
of categorical attribute. Distance is measured using
Euclidian distance function for continuous attributes.




                                                                  46
u      v
 j=1    i=1 f i d j                                                     Microsoft Vista and implementation language was
       u u            where fi is the function, which returns
       j=1                                                              c#. We used 2 datasets to verify our method. The
either one or zero.                                                     first data set of interest is patient dataset collected
     Here each cluster is represented by the mean                       and preprocessed from Bangladeshi hospitals, which
mode value of that cluster. Now we will find the                        has 50273 instances with 514 attributes (included
equation of mean mode value of a cluster c. Mean is                     150 discrete and 364 numerical attributes). The
calculated among the continuous attributes and mode                     Patient Dataset was clustered in 5 classes (Very High
is calculated among the categorical attributes. Let the                 Risk, High Risk, Medium Risk, Low Risk, No Risk)
mean mode value of a cluster be MM =                                    using proposed algorithm to find likelihood of
 mm1 , mm2 , mm3 ,      mmz where z is the number                       Diabetic. The next data set of interest is the Zoo Data
of attributes in the clustering process. Let first y                    Set [15] from UCI Machine Learning Repository,
attributes of MM are continuous and remaining                           which has the similar characteristics like medical
z y are categorical. The continuous part of mean                        data. It contains 101 instances with 7 classes
mode value is MMi i=1, y = the mean among ith                           {mammal, bird, reptile, fish, amphibian, insect, and
 attribute values of cluster c. The categorical part of                 invertebrate}, each described by 18 attributes
                                                                        (included 16 discrete and 2 numerical attributes). We
mean mode value is MMj              = the mode
                                  j=y+1, z                              have taken an average value from 10 trials for each
among jth attribute values of cluster c.                                of the test result. Likelihood is the probability of a
                                                                        specified disease. Here average likelihood is the
4. Results and discussion                                               average of all cluster likelihood. Actual likelihood is
                                                                        the actual probability of the disease in the data,
  The experiments were done using PC with core 2                        which has been found using brute force approach.
duo processor with a clock rate of 1.8 GHz and 3GB                      Accuracy is the ratio between average likelihood and
of main memory. The operating system was                                actual likelihood.



                                                K-Means                     K-Mode
                                                K-Means with BK             K-Mode with BK
                                                K-Means-Mode                K-Means-Mode with BK
                                                   1
                                     Accuracy




                                                 0.5




                                                   0
                                                           64             47              33
                                                             Number of boolean functions

          Figure 3. Accuracy of test result for the patient dataset to find likelihood of diabetic


   For the Patient Dataset to find likelihood of                        without background knowledge achieves an average
Diabetic, Figure 3 presents accuracy results for K-                     accuracy of 17.7%. Both K-mode without
Means, K-Mode, K-Means-Mode, K-Means with                               background knowledge and K-mode with
background knowledge (BK), K-Mode with BK and                           background knowledge perform much worse,
K-Means-Mode with BK algorithms over the number                         averaging 12.1% and 30.2 accuracy respectively. The
of boolean functions. The number of boolean                             proposed method gives better results about 39-40%
functions for each presented result is also indicated.                  over k-means with background knowledge as
It shows that an average accuracy of 95.1% is                           illustrated in Figures 1 and about 64-65% over k-
achieved using the medical background information                       mode with background knowledge as illustrated in
and hybrid clustering algorithm. K-means algorithm                      Figures 1. The proposed method also gives much
with background knowledge (BK) achieves an                              better accuracy when compared to the k-means and
average accuracy of 56%. K-means algorithm                              K-Mode with about 77-78% over k-means and about




                                                                   47
82-83% over k-mode. It shows that an average                  28%. What this demonstrates is that neither the
accuracy of 30.2-56% can be achieved by K-Means               medical background information nor hybrid-
or K-Mode using the background information alone.             clustering algorithm alone performs very well, but
K-Means-Mode algorithm without background                     combining the two effectively produces excellent
knowledge achieves an average accuracy of about               results.



                                       K-Means                       K-Mode
                                       K-Means with BK               K-Mode with BK
                                       K-Means-Mode                  K-Means-Mode with BK
                                        1
                            Accuracy




                                       0.5




                                        0
                                                87              63                12
                                                   Number of boolean functions

                         Figure 4. Accuracy of test result for the zoo data set

   For the Zoo Data Set [15], Figure 4 shows                  combining the two effectively produces excellent
accuracy results for K-Means, K-Mode K-Means-                 results.
Mode, K-Means with background knowledge (BK),
K-Mode with BK and K-Means-Mode with BK                       6. References
algorithms over the number of boolean functions.
The number of boolean functions for each presented            [1] P. B. Torben and J. S. Christian, "Research Issues in
result is also indicated. It also demonstrates that               Clinical Data Warehousing," in Proceedings of the
neither the medical background information nor                    10th International Conference on Scientific and
hybrid-clustering algorithm alone performs very                   Statistical Database Management , Capri, 1998, p.
well, but combining the two effectively produces                  43 52.
excellent results.                                            [2] J. B. Macqueen, "Some methods of classification and
                                                                  analysis of multivariate observations," in Proceedings
5. Conclusion                                                     of the Fifth Berkeley Symposium on Mathematical
                                                                  Statistics and Probability, vol. 1, Berkelely, CA,
                                                                  1967, p. 281 297.
   Clustering medical data is important as the results
of such analysis can be used for improving patient            [3] Z. Huang, "Extensions to the k-Means Algorithm for
                                                                  Clustering Large Data Sets with Categorical Values,"
care and treatment. We have proposed a clustering
                                                                  Data Mining and Knowledge Discovery, vol. 2, no. 2,
method for medical data to predict the likelihood of              pp. 283 - 304, September 1998.
diseases by combining k-means and k-mode
                                                              [4] O. M. San, V. N. Huynh, and Y. Nakamori, "An
algorithm and incorporating medical background                    Alternative Extension of the k-Means Algorithm for
knowledge. It clusters both numerical and categorical             Clustering Categorical Data," JAMCS, vol. 14, no. 2,
data efficiently and allows user to specify constraint            pp. 241-247, 2004.
on what attributes will participate in clustering             [5] H. C. Hongch and D. Y. Yeung, "Locally linear
process and what attributes will be selected as data              metric adaptation for semi-supervised clustering," in
point. The method has also been applied to a real                 Proceedings of the twenty-first international
world medical data set and the Zoo Data Set from                  conference on Machine learning, Banff, Alberta,
UCI Machine Learning Repository. We have shown                    Canada, 2004, pp. 153--160.
significant improvements in accuracy. We have the             [6] H. C. Hongch and D. Y. Yeung, "Locally linear
following conclusions from the work: neither the                  metric adaptation with application to semi-supervised
medical background information nor hybrid-                        clustering and image retrieval," Pattern Recognition,
clustering algorithm alone performs very well, but                vol. 39, no. 7, pp. 1253-1264, July 2006.




                                                         48
[7] K. Shin and A. Abraham, "Two Phase Semi-                            Computer Science and Network Security, vol. 9, no.
     supervised      Clustering     Using      Background               2, pp. 228-235, February 2009.
     Knowledge," Lecture Notes in Computer Science,                [13] S. B. Patil and Y.S. Kumaraswamy, "Intelligent and
     vol. 4224, pp. 707-712, September 2006.                            Effective Heart Attack Prediction System Using Data
[8] M. S. Baghshaha and S. B. Shourakib, "Kernel-based                  Mining and Artificial Neural Network," European
     metric learning for semi-supervised clustering,"                   Journal of Scientific Research, vol. 31, no. 4, pp. 642-
     Neurocomputing, December 2009.                                     656, 2009.
[9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl,             [14] M. Sacha. (2008) Clustering of an aperiodical medical
     "Constrained K-means Clustering with Background                    data.
     Knowledge," in Proceedings of the Eighteenth                        http://www.mareksacha.com/blog/clustering-of-an-
     International Conference on Machine Learning, 2001,                aperiodical-medical-data.
     pp. 577 - 584.                                                [15] Zoo Data Set. (n.d.). Retrieved 03 01, 2010, from
[10] G. Y. Hang, D. Zhang, J. Ren, and C. Hu, "A                        Machine Learning Repository:
     Hierarchical Clustering Algorithm Based on K-Means                 http://archive.ics.uci.edu/ml/support/Zoo
     with Constraints," in Fourth International Conference
     on Innovative Computing, Information and Control,
     Kaohsiung, Taiwan, 2009, pp. 1479-1482.
[11] K. Li, Z. Cao, L. Cao, and R. Zhao, "A novel semi-
     supervised fuzzy C-means clustering method," in
     Proceedings of the 21st annual international
     conference on Chinese control and decision
     conference, Guilin, China, 2009, pp. 3804-3808.
[12] S. B. Patil and Y. S. Kumaraswamy, "Extraction of
     Significant Patterns from Heart Disease Warehouses
     for Heart Attack Prediction," International Journal of




                                                              49

Mais conteúdo relacionado

Mais procurados

Heart Disease Prediction Using Associative Relational Classification Techniq...
Heart Disease Prediction Using Associative Relational  Classification Techniq...Heart Disease Prediction Using Associative Relational  Classification Techniq...
Heart Disease Prediction Using Associative Relational Classification Techniq...IJMER
 
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A SurveyPrediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A Surveyrahulmonikasharma
 
A Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic RegressionA Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic Regressionijtsrd
 
Detection of heart diseases by data mining
Detection of heart diseases by data miningDetection of heart diseases by data mining
Detection of heart diseases by data miningAbheepsa Pattnaik
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseA Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseijtsrd
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSivagowry Shathesh
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.SUJIT SHIBAPRASAD MAITY
 
Heart Disease Prediction Using Data Mining Techniques
Heart Disease Prediction Using Data Mining TechniquesHeart Disease Prediction Using Data Mining Techniques
Heart Disease Prediction Using Data Mining TechniquesIJRES Journal
 
Disease Prediction And Doctor Appointment system
Disease Prediction And Doctor Appointment  systemDisease Prediction And Doctor Appointment  system
Disease Prediction And Doctor Appointment systemKOYELMAJUMDAR1
 
A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeIOSR Journals
 
Smart health disease prediction python django
Smart health disease prediction python djangoSmart health disease prediction python django
Smart health disease prediction python djangoShaikSalman28
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSivagowry Shathesh
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMamiteshg
 
Heart disease prediction
Heart disease predictionHeart disease prediction
Heart disease predictionAriful Haque
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesJagdeep Singh Malhi
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine LearningIRJET Journal
 
A data mining approach for prediction of heart disease using neural networks
A data mining approach for prediction of heart disease using neural networksA data mining approach for prediction of heart disease using neural networks
A data mining approach for prediction of heart disease using neural networksIAEME Publication
 
Data mining techniques on heart failure diagnosis
Data mining techniques on heart failure diagnosisData mining techniques on heart failure diagnosis
Data mining techniques on heart failure diagnosisSteve Iduye
 
Machine learning in disease diagnosis
Machine learning in disease diagnosisMachine learning in disease diagnosis
Machine learning in disease diagnosisSushrutaMishra1
 

Mais procurados (20)

Heart Disease Prediction Using Associative Relational Classification Techniq...
Heart Disease Prediction Using Associative Relational  Classification Techniq...Heart Disease Prediction Using Associative Relational  Classification Techniq...
Heart Disease Prediction Using Associative Relational Classification Techniq...
 
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A SurveyPrediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
 
A Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic RegressionA Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic Regression
 
Detection of heart diseases by data mining
Detection of heart diseases by data miningDetection of heart diseases by data mining
Detection of heart diseases by data mining
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseA Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease prediction
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
 
Heart Disease Prediction Using Data Mining Techniques
Heart Disease Prediction Using Data Mining TechniquesHeart Disease Prediction Using Data Mining Techniques
Heart Disease Prediction Using Data Mining Techniques
 
Disease Prediction And Doctor Appointment system
Disease Prediction And Doctor Appointment  systemDisease Prediction And Doctor Appointment  system
Disease Prediction And Doctor Appointment system
 
A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision Tree
 
Smart health disease prediction python django
Smart health disease prediction python djangoSmart health disease prediction python django
Smart health disease prediction python django
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease prediction
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
 
Heart disease prediction
Heart disease predictionHeart disease prediction
Heart disease prediction
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart Diseases
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
A data mining approach for prediction of heart disease using neural networks
A data mining approach for prediction of heart disease using neural networksA data mining approach for prediction of heart disease using neural networks
A data mining approach for prediction of heart disease using neural networks
 
Data mining techniques on heart failure diagnosis
Data mining techniques on heart failure diagnosisData mining techniques on heart failure diagnosis
Data mining techniques on heart failure diagnosis
 
Machine learning in disease diagnosis
Machine learning in disease diagnosisMachine learning in disease diagnosis
Machine learning in disease diagnosis
 

Semelhante a Clustering Medical Data to Predict the Likelihood of Diseases

Finding Symmetric Association Rules to Support Medical Qualitative Research
Finding Symmetric Association Rules to Support Medical Qualitative ResearchFinding Symmetric Association Rules to Support Medical Qualitative Research
Finding Symmetric Association Rules to Support Medical Qualitative Researchrazanpaul
 
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive SurveyPrognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Surveyijtsrd
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkdbpublications
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
 
Android Based Questionnaires Application for Heart Disease Prediction System
Android Based Questionnaires Application for Heart Disease Prediction SystemAndroid Based Questionnaires Application for Heart Disease Prediction System
Android Based Questionnaires Application for Heart Disease Prediction Systemijtsrd
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
 
Project on disease prediction
Project on disease predictionProject on disease prediction
Project on disease predictionKOYELMAJUMDAR1
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statisticsSagar Kamble
 
IRJET- Prediction of Heart Disease using RNN Algorithm
IRJET- Prediction of Heart Disease using RNN AlgorithmIRJET- Prediction of Heart Disease using RNN Algorithm
IRJET- Prediction of Heart Disease using RNN AlgorithmIRJET Journal
 
Analysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetAnalysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetIRJET Journal
 
Paper id 212014112
Paper id 212014112Paper id 212014112
Paper id 212014112IJRAT
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAIRCC Publishing Corporation
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISijcsit
 
Intelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIntelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIRJET Journal
 
IRJET- Disease Prediction using Machine Learning
IRJET-  	  Disease Prediction using Machine LearningIRJET-  	  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine LearningIRJET Journal
 
Prediction of Neurological Disorder using Classification Approach
Prediction of Neurological Disorder using Classification ApproachPrediction of Neurological Disorder using Classification Approach
Prediction of Neurological Disorder using Classification ApproachBRNSSPublicationHubI
 
Early Identification of Diseases Based on Responsible Attribute using Data Mi...
Early Identification of Diseases Based on Responsible Attribute using Data Mi...Early Identification of Diseases Based on Responsible Attribute using Data Mi...
Early Identification of Diseases Based on Responsible Attribute using Data Mi...IRJET Journal
 
H0333039042
H0333039042H0333039042
H0333039042theijes
 

Semelhante a Clustering Medical Data to Predict the Likelihood of Diseases (20)

Finding Symmetric Association Rules to Support Medical Qualitative Research
Finding Symmetric Association Rules to Support Medical Qualitative ResearchFinding Symmetric Association Rules to Support Medical Qualitative Research
Finding Symmetric Association Rules to Support Medical Qualitative Research
 
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive SurveyPrognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on spark
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
Android Based Questionnaires Application for Heart Disease Prediction System
Android Based Questionnaires Application for Heart Disease Prediction SystemAndroid Based Questionnaires Application for Heart Disease Prediction System
Android Based Questionnaires Application for Heart Disease Prediction System
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
Project on disease prediction
Project on disease predictionProject on disease prediction
Project on disease prediction
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
 
Secondary Use of Healthcare Data for Translational Research
Secondary Use of Healthcare Data for Translational ResearchSecondary Use of Healthcare Data for Translational Research
Secondary Use of Healthcare Data for Translational Research
 
IRJET- Prediction of Heart Disease using RNN Algorithm
IRJET- Prediction of Heart Disease using RNN AlgorithmIRJET- Prediction of Heart Disease using RNN Algorithm
IRJET- Prediction of Heart Disease using RNN Algorithm
 
Analysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetAnalysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease Dataset
 
Paper id 212014112
Paper id 212014112Paper id 212014112
Paper id 212014112
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
 
Intelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIntelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosis
 
IRJET- Disease Prediction using Machine Learning
IRJET-  	  Disease Prediction using Machine LearningIRJET-  	  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
Prediction of Neurological Disorder using Classification Approach
Prediction of Neurological Disorder using Classification ApproachPrediction of Neurological Disorder using Classification Approach
Prediction of Neurological Disorder using Classification Approach
 
Early Identification of Diseases Based on Responsible Attribute using Data Mi...
Early Identification of Diseases Based on Responsible Attribute using Data Mi...Early Identification of Diseases Based on Responsible Attribute using Data Mi...
Early Identification of Diseases Based on Responsible Attribute using Data Mi...
 
H0333039042
H0333039042H0333039042
H0333039042
 

Último

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Clustering Medical Data to Predict the Likelihood of Diseases

  • 1. Clustering Medical Data to Predict the Likelihood of Diseases Razan Paul, Abu Sayed Md. Latiful Hoque Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd Abstract numerical attributes. In [3-4], the authors extend k- means algorithm to partition large data sets with Several studies show that background knowledge categorical objects. K-means [2] and K-modes [3-4] of a domain can improve the results of clustering clustering algorithms are recognized techniques to algorithms. In this paper, we illustrate how to use partition large data sets based on numerical attributes the background knowledge of medical domain in and categorical attributes respectively. To find clustering process to predict the likelihood of likelihood of disease we need a clustering algorithm, diseases. To find the likelihood of diseases, which can partition objects consisting of both clustering has to be done based on anticipated numerical and categorical attributes and can set likelihood attributes with core attributes of disease constraint on presence or absence of items in in data point. To find the likelihood of diseases, we clustering process and on datapoint. have proposed constraint k-Means-Mode clustering A number of work [5-11] has proposed different algorithm. Attributes of Medical data are both technique to address a variant of the conventional continuous and categorical. The developed clustering problem. These works include clustering algorithm can handle both continuous and discrete in the presence of information about the problem data and perform clustering based on anticipated domain or some background knowledge. Here our likelihood attributes with core attributes of disease proposed algorithm performs clustering in the in data point. We have demonstrated its effectiveness presence of information about the medical domain to by testing it for a real world patient data set. predict the likelihood of diseases. However, the technique to use medical background knowledge in 1. Introduction our proposed algorithm is different from the techniques [5-11]. Clustering is an attractive approach for finding For Heart Attack Prediction, in [12-14] authors similarities in data and putting similar data into have performed clustering on the preprocessed data groups. Due to high dimensionality of medical data warehouse using K-means clustering algorithm. The [1], if clustering is done based on all the attributes of data for Heart Attack Prediction are a mixture of medical domain, resultant clusters will not be useful continuous and discrete data. However, K-means because they are medically irrelevant, contain cannot cluster categorical attributes. Therefore, the redundant information. Moreover, this property approaches [12-13]will not work to predict Heart makes likelihood analysis hard and the partitioning Attack. In [14], the author performs clustering process slow. To find the likelihood of a disease aperiodical medical data, which are both continuous clustering has to be done based on anticipated and discrete, using K-means clustering algorithm. likelihood attributes with core attributes of disease in data point. For example, clustering a large number of 2. Mapping complex medical data to patients with selecting age, weight, sex, smoke, mineable items HbA1c% as data point and allowing only age, weight, sex, smoke in clustering process, we can find For knowledge discovery, the medical data have clusters partitioned by age, weight, sex, smoke. This to be transformed into a suitable transaction format way we get clusters that have similar age, weight, to discover knowledge. We have addressed the sex, smoke value. Then analyzing each cluster based problem of mapping complex medical data to items on HbA1c% can give likelihood information of using domain dictionary and rule base as shown in diabetes. figure 1. The medical data are types of categorical, Attributes of Medical data are both continuous continuous numerical data, boolean, interval, and categorical. K-means clustering [2] is widely percentage, fraction and ratio. Medical domain used technique to partition large data sets with experts have the knowledge of how to map ranges of 978-1-4244-7571-1/10/$26.00 ©2010 IEEE 44
  • 2. numerical data for each attribute to a series of items. cardinality of attributes except continuous numeric For example, there are certain conventions to data are not high in medical domain, these attribute consider a person is young, adult, or elder with values are mapped to integer values using medical respect to age. A set of rules is created for each domain dictionaries. Therefore, the mapping process continuous numerical attribute using the knowledge is divided in two phases. Phase 1: a rule base is of medical domain experts. A rule engine is used to constructed based on the knowledge of medical map continuous numerical data to items using these domain experts and dictionaries are constructed for developed rules. attributes where domain expert knowledge is not applicable, Phase 2: attribute values are mapped to We have used domain dictionary approach to integer values using the corresponding rule base and transform the data, for which medical domain expert the dictionaries. knowledge is not applicable, to numerical form. As Original Mapped Original Mapped Generate dictionary for value value value value each categorical attribute Headache 1 Yes 1 Fever 2 No 2 PatientActual Data Age Smoke Diagnosis Dictionary of Dictionary of ID Diagnosis attribute Smoke attribute 1020D 33 Yes Headache 1021D 63 No Fever Map to integer items using rule base and dictionaries Actual data If age <= 12 then 1 Medical If 13<=age<=60 then 2 domain If 60 <=age then 3 Patient Age Smoke Diagnosis knowledge If smoke = y then 1 ID If smoke = n then 2 1020D 2 1 1 If Sex = M then 1 1021D 3 2 2 If Sex = F then 2 Rule Base Data suitable for Knowledge Discovery Figure 1. Data transformation of medical data 3.1. Updating cluster center We need to update the k clusters centre 3. The proposed algorithm dynamically in order to minimize the intra cluster distance of patients. Here k is the number of clusters Figure 2 shows the proposed hybrid-partitioning we would like to make and Pi is the ith patient algorithm, which can handle both continuous and attribute and Ci is the ith mean-mode value of cluster discrete data and perform clustering based on C. As the patient attributes are both continuous and anticipated likelihood attributes with core attributes discrete, each cluster center is an array of both of disease in data point. In this algorithm, the user average and mode values where average and mode will set which attributes will be used as data point for are computed for continuous and discrete attributes a patient and which attributes will participate in respectively. Mean is computed for each continuous clustering process. The goal of this algorithm is attribute by calculating average of that attribute making clusters to find likelihood. Healthcare data among the data points in that cluster. Mode is are sparse as doctors perform only few different computed for each discrete attribute by calculating clinical lab tests for a patient over his lifetime. This maximum frequent value of that attribute among the is natural many patients have not all anticipated data points in that cluster. attributes for likelihood. When a patient does not have one or more anticipated attributes for likelihood, keeping this patient in clustering process 3.2. Dissimilarity measure will make clusters useless to find likelihood. Therefore, we are ignoring that patient in the The object dissimilarity measure is derived from clustering process. both numeric and categorical attributes. For discrete features, the dissimilarity measure between two data point depends on the number of different values in 45
  • 3. Algorithm: Partition patients to find likelihood of 1.2.1 If A is continuous attribute disease based on MeanMode value of patients. 1.2.1.1 MeanModec [i] = Find the mean 1. Read the metadata about which attributes will only among the attribute named A values of data points appear in clustering process. in cluster c. 2. Partition patient data into k cluster in random and assign 1.2.2 else If A is category attribute each partition to each cluster. To retrieve paient data use 1.2.2.1 MeanModec [i] = Find the mode the corresponding RetrieveAllPatientsRecord() for each among the attribute named A values of data points data model. in cluster c. 3. Repeat 1.2.3 i++; 3.1 Call UpdateMeanModeofClusters(K, M ) to update Mean-Mode value of k clusters Procedure Distance (P: Patient, C: Cluster, m: Number 3.2 Move patient Pi to the cluster with least of attributes) distance and find the distance between a patient //Here Pi represent the ith attribute value of Patient P and Ci and a cluster using the function Distance (P, C, represents ith MeanMode value of Cluster C m); Until no patient is moved 1. for i = 1 to m where ith attribute value of Patient can appear in clustering Procedure UpdateMeanModeofClusters(K: Number of 1.1 If Pi is continuous clusters, M: Medical attributes) 1.1.1 Then D1 = D1+ (Pi - Ci) 2 1. For each cluster c K 1.2 Else (categorical) 1.2.1 Then D2 = D2 + NumberofOnes (Pi ^ Ci); 1.1 i = 0 1.3 d = SQRT (D1) + D2; 1.2 For each attribute A M where A can appear in 2. return d; clustering Figure 2. Constraint k-Means-Mode clustering algorithm each categorical feature. For continuous features, the Distance between based on continuous dissimilarity measure between two data point depends on Euclidean distance. Here we have used attributes is =1 ( )2 where , and n the following two functions to measure dissimilarity: is the number of patients. Distance is measured using hamming distance function for categorical Hamming distance function for categorical attributes. objects and Euclidean distance function for Distance between based on categorical continuous data. To measure distance between two attributes is = ( , ) where , = objects based on several features, for each feature we 0 == test whether this feature is discrete or continuous. If the feature is continuous, distance is measured using 1 Euclidean distance and added it to D1 and if the feature is discrete, the dissimilarity is measured 3.3. Likelihood using hamming distance and added it to D2. The resultant distance is computed by adding square root Likelihood is the probability of a specified of D1 with D2. The computational complexity of the outcome. After clustering using constrained K- algorithm is O ((I+1) k p), where p is the number of Means-Mode algorithm we get a set of clusters, patients, k the number of clusters and I is the number C = {c1 , c2 , c3 , ck }. Each cluster contains a set of iterations. of data points, which consist of anticipated Let the anticipated likelihood attributes be = likelihood attributes and core attributes of disease. { 1 , 2 , 3 , . . . }. Let the core attributes of disease, Data points for cluster cj is = { 1, 2, 3, }. In the clustering Dj = {dj1 , dj2 , dj3 , . . dju }. There are a set of process, only anticipated likelihood attributes boolean functions on core attributes of disease to participate. The anticipated likelihood attributes determine whether a data point has the presence of a consist of both continuous and categorical attribute. disease or not. Let the set of boolean functions be Let first attributes of are continuous and the F = {f1 , f2 , f3 , fv }. A data point dt has presence remaining attributes are categorical. Let the of the disease if v fi (dt ) == true for the data i=1 anticipated likelihood attributes of two data points point. In a cluster, the number of data points which are . Dissimilarity between the anticipated has presence of the disease is u v j=1 i=1 fi (dj ). The likelihood attributes of two data points is the sum of number of total data points in the cluster is u u. j=1 dissimilarity of continuous attribute and dissimilarity So likelihood of a cluster for the disease is of categorical attribute. Distance is measured using Euclidian distance function for continuous attributes. 46
  • 4. u v j=1 i=1 f i d j Microsoft Vista and implementation language was u u where fi is the function, which returns j=1 c#. We used 2 datasets to verify our method. The either one or zero. first data set of interest is patient dataset collected Here each cluster is represented by the mean and preprocessed from Bangladeshi hospitals, which mode value of that cluster. Now we will find the has 50273 instances with 514 attributes (included equation of mean mode value of a cluster c. Mean is 150 discrete and 364 numerical attributes). The calculated among the continuous attributes and mode Patient Dataset was clustered in 5 classes (Very High is calculated among the categorical attributes. Let the Risk, High Risk, Medium Risk, Low Risk, No Risk) mean mode value of a cluster be MM = using proposed algorithm to find likelihood of mm1 , mm2 , mm3 , mmz where z is the number Diabetic. The next data set of interest is the Zoo Data of attributes in the clustering process. Let first y Set [15] from UCI Machine Learning Repository, attributes of MM are continuous and remaining which has the similar characteristics like medical z y are categorical. The continuous part of mean data. It contains 101 instances with 7 classes mode value is MMi i=1, y = the mean among ith {mammal, bird, reptile, fish, amphibian, insect, and attribute values of cluster c. The categorical part of invertebrate}, each described by 18 attributes (included 16 discrete and 2 numerical attributes). We mean mode value is MMj = the mode j=y+1, z have taken an average value from 10 trials for each among jth attribute values of cluster c. of the test result. Likelihood is the probability of a specified disease. Here average likelihood is the 4. Results and discussion average of all cluster likelihood. Actual likelihood is the actual probability of the disease in the data, The experiments were done using PC with core 2 which has been found using brute force approach. duo processor with a clock rate of 1.8 GHz and 3GB Accuracy is the ratio between average likelihood and of main memory. The operating system was actual likelihood. K-Means K-Mode K-Means with BK K-Mode with BK K-Means-Mode K-Means-Mode with BK 1 Accuracy 0.5 0 64 47 33 Number of boolean functions Figure 3. Accuracy of test result for the patient dataset to find likelihood of diabetic For the Patient Dataset to find likelihood of without background knowledge achieves an average Diabetic, Figure 3 presents accuracy results for K- accuracy of 17.7%. Both K-mode without Means, K-Mode, K-Means-Mode, K-Means with background knowledge and K-mode with background knowledge (BK), K-Mode with BK and background knowledge perform much worse, K-Means-Mode with BK algorithms over the number averaging 12.1% and 30.2 accuracy respectively. The of boolean functions. The number of boolean proposed method gives better results about 39-40% functions for each presented result is also indicated. over k-means with background knowledge as It shows that an average accuracy of 95.1% is illustrated in Figures 1 and about 64-65% over k- achieved using the medical background information mode with background knowledge as illustrated in and hybrid clustering algorithm. K-means algorithm Figures 1. The proposed method also gives much with background knowledge (BK) achieves an better accuracy when compared to the k-means and average accuracy of 56%. K-means algorithm K-Mode with about 77-78% over k-means and about 47
  • 5. 82-83% over k-mode. It shows that an average 28%. What this demonstrates is that neither the accuracy of 30.2-56% can be achieved by K-Means medical background information nor hybrid- or K-Mode using the background information alone. clustering algorithm alone performs very well, but K-Means-Mode algorithm without background combining the two effectively produces excellent knowledge achieves an average accuracy of about results. K-Means K-Mode K-Means with BK K-Mode with BK K-Means-Mode K-Means-Mode with BK 1 Accuracy 0.5 0 87 63 12 Number of boolean functions Figure 4. Accuracy of test result for the zoo data set For the Zoo Data Set [15], Figure 4 shows combining the two effectively produces excellent accuracy results for K-Means, K-Mode K-Means- results. Mode, K-Means with background knowledge (BK), K-Mode with BK and K-Means-Mode with BK 6. References algorithms over the number of boolean functions. The number of boolean functions for each presented [1] P. B. Torben and J. S. Christian, "Research Issues in result is also indicated. It also demonstrates that Clinical Data Warehousing," in Proceedings of the neither the medical background information nor 10th International Conference on Scientific and hybrid-clustering algorithm alone performs very Statistical Database Management , Capri, 1998, p. well, but combining the two effectively produces 43 52. excellent results. [2] J. B. Macqueen, "Some methods of classification and analysis of multivariate observations," in Proceedings 5. Conclusion of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Berkelely, CA, 1967, p. 281 297. Clustering medical data is important as the results of such analysis can be used for improving patient [3] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," care and treatment. We have proposed a clustering Data Mining and Knowledge Discovery, vol. 2, no. 2, method for medical data to predict the likelihood of pp. 283 - 304, September 1998. diseases by combining k-means and k-mode [4] O. M. San, V. N. Huynh, and Y. Nakamori, "An algorithm and incorporating medical background Alternative Extension of the k-Means Algorithm for knowledge. It clusters both numerical and categorical Clustering Categorical Data," JAMCS, vol. 14, no. 2, data efficiently and allows user to specify constraint pp. 241-247, 2004. on what attributes will participate in clustering [5] H. C. Hongch and D. Y. Yeung, "Locally linear process and what attributes will be selected as data metric adaptation for semi-supervised clustering," in point. The method has also been applied to a real Proceedings of the twenty-first international world medical data set and the Zoo Data Set from conference on Machine learning, Banff, Alberta, UCI Machine Learning Repository. We have shown Canada, 2004, pp. 153--160. significant improvements in accuracy. We have the [6] H. C. Hongch and D. Y. Yeung, "Locally linear following conclusions from the work: neither the metric adaptation with application to semi-supervised medical background information nor hybrid- clustering and image retrieval," Pattern Recognition, clustering algorithm alone performs very well, but vol. 39, no. 7, pp. 1253-1264, July 2006. 48
  • 6. [7] K. Shin and A. Abraham, "Two Phase Semi- Computer Science and Network Security, vol. 9, no. supervised Clustering Using Background 2, pp. 228-235, February 2009. Knowledge," Lecture Notes in Computer Science, [13] S. B. Patil and Y.S. Kumaraswamy, "Intelligent and vol. 4224, pp. 707-712, September 2006. Effective Heart Attack Prediction System Using Data [8] M. S. Baghshaha and S. B. Shourakib, "Kernel-based Mining and Artificial Neural Network," European metric learning for semi-supervised clustering," Journal of Scientific Research, vol. 31, no. 4, pp. 642- Neurocomputing, December 2009. 656, 2009. [9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, [14] M. Sacha. (2008) Clustering of an aperiodical medical "Constrained K-means Clustering with Background data. Knowledge," in Proceedings of the Eighteenth http://www.mareksacha.com/blog/clustering-of-an- International Conference on Machine Learning, 2001, aperiodical-medical-data. pp. 577 - 584. [15] Zoo Data Set. (n.d.). Retrieved 03 01, 2010, from [10] G. Y. Hang, D. Zhang, J. Ren, and C. Hu, "A Machine Learning Repository: Hierarchical Clustering Algorithm Based on K-Means http://archive.ics.uci.edu/ml/support/Zoo with Constraints," in Fourth International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan, 2009, pp. 1479-1482. [11] K. Li, Z. Cao, L. Cao, and R. Zhao, "A novel semi- supervised fuzzy C-means clustering method," in Proceedings of the 21st annual international conference on Chinese control and decision conference, Guilin, China, 2009, pp. 3804-3808. [12] S. B. Patil and Y. S. Kumaraswamy, "Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction," International Journal of 49