SlideShare a Scribd company logo
1 of 3
Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 02 Issue: 01 June 2013 Page No.1-3
ISSN: 2278-2419
1
Identification of Differentially Expressed Genes by
unsupervised Learning Method
P.Venkatesan1
,J.I. Jamal Fathima2
1
Department of statistics National Institute for Research in Tuberculosis, Indian Council for Medical Research,chennai
2
Research Scholar Department of Statistics National Institute for Research , Indian Council for Medical Research, Chennai
Email:Venkaticmr@gmail.com, jf_ayub@yahoo.co.in
Abstract-Microarrays are one of the latest breakthroughs in
experimental molecular biology that allow monitoring of gene
expression of tens of thousands of genes in parallel. Micro
array analysis include many stages. Extracting samples from
the cells, getting the gene expression matrix from the raw data,
and data normalization which are low level analysis.Cluster
analysis for genome-wide expression data from DNA micro
array data is described as a high level analysis that uses
standard statistical algorithms to arrange genes according to
similarity patterns of expression levels. This paper presents a
method for the number of clusters using the divisive
hierarchical clustering, and k-means clustering of significant
genes. The goal of this method is to identify genes that are
strongly associated with disease in 12607 genes. Gene filtering
is applied to identify the clusters. k-means shows that about
four to seven genes or less than one percent of the genes
account for the disease group which are the outliers, more than
seventy percent falls as undefined group. The hierarchical
clustering dendo gram shows clusters at two levels which
shows again less than one percent of the genes are
differentially expressed.
I. INTRODUCTION
Cluster analysis is a high level analysis .It is traditionally used
in phylogenetic research and has been adopted in micro array
analysis.DNA micro array experiment allows us to measure the
expression levels of thousands of genes simultaneously under
various conditions. The primary objective of supervised
clustering (Carr et al 1997,Cho et al 1998,Szabho et al 2002) is
to predict the functions of unknown genes by comparing the
gene expression patterns of unknown genes. Gene clustering is
one of the widely used statistical tools in microarray data
analysis. Due to the high dimensionality of the data set; data
mining methods are applied to summarize the information for a
synthetic interpretation .Sandrine Dudoit and Rebert
Gentleman(2002). Unsupervised learning technique aims in
detecting the relation between the tissues or genes and within
them(Hastie et al 2002). Thus in cluster analysis similarity
between samples or genes have been identified. Unsupervised
clustering is otherwise called hierarchical clustering, with
samples /gene clustering within clusters (Eisen et al
2000).Grouping of genes having similar expression pattern is
finding at most interest in micro array data analysis. Many
clustering procedures have shown success in micro array gene
clustering, most of them belong to the family of heuristic
clustering algorithm, which are based on the assumption that
the whole set of micro array data is a finite mixture of a certain
type of distributions with different parameter. The commonly
used are hierarchical clustering (Carr et al 1997),k-
means(Tavazoie et al1999),model based clustering (Das Gupta
and Raftery 1998),Support Vector Machine(SVM)(Brown et al
2000).
II. MATERIAL AND METHODS
The main goal is to find clusters of samples or clusters of
genes such that observations within a cluster are more similar
to each other then they are in different clusters. Gene clustering
has two main goals: (i) Understanding the gene function. (ii)
Discovery of co-regulated genes. This paper concentrates on
the second goal. To achieve this we compare the genes having
similar expression profiles in a set of conditions among
different tissues or different times into clusters. Class
discovery depends on the gender or tissue type or by the
combination of both. In micro array data analysis class
discovery process is dominated by sample characteristics –
patients gender or tissue type. In Our method we construct
clusters of samples in the data set by k-means/median applying
Euclidean and Pearson distance measure to find the number of
clusters.Let x1,X2,...Xp be the p-dimensional vector where each
Xi is a n-dimension ,that isit gives the estimate of ith
gene of
pth
sample.
We take the log2 ratio of the pre value where pre value is
obtained as the ratio of tumour to normal tissue.The data
corresponding to the log intensity ratios of ith
gene in jth
experiment is defined as :xij =log(prevalue), where each xi is a
p-dimensional vector π‘₯𝑖 = (π‘₯𝑖1, π‘₯𝑖2 … . . π‘₯𝑖𝑝) .Expression data
are analysed in a matrix form, each row represents a gene and
each column a sample .The matrix entries π‘₯𝑖𝑗 corresponding to
the expression values of ith
gene in the jth
sample where
(i=1,2,....12607 genes) and j=1,..4. Sample subsets).many
software’s are available for performing cluster analysis on the
matrix β€˜x’. To perform analysis the matrix X is transformed to
XT
=Y, such that yij= xji. (1). The node of the cluster dendo
gram represents the entire data point as a single cluster.
Hierarchical clustering has different approaches-single linkage,
complete linkage and average linkage (Gorden 1999).
A. Gene Filtering
As the data size is huge, gene filtering plays a vital role in
removing the noise or unexpressed genes. Gene filtering can be
done at a threshold of a statistic. An arbitrary threshold based
on a value for t-statistic can be applied to filter put a certain
percentage of the genes. In addition to identifying the groups
of related genes, hierarchical clustering helps in screening out
the uninteresting genes. Removal of uninteresting genes is
especially important in micro array data where the number of
gene is very large.
Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 02 Issue: 01 June 2013 Page No.1-3
ISSN: 2278-2419
2
B. K-means clustering
(i)’k’ specified in advance
(ii) Select a set of k-points as clusters seed these seeds
represent a first guess at the centroids of the k-means.
(iii)Assign each individual observation to the clusters defined
whose centroid is nearest. The centroids are recalculated for
the clusters receiving the new object and for the clusters losing
the object.
(iv)Repeat (i)-(ii) until no changes occur in the cluster
composition.K-means clustering does not give ordering of
objects within a cluster .The final assignment depends on the
initial selection of seed points. As the number of clusters β€˜k’ is
changed, the cluster grouping may change. K-means clustering
assigns each gene to only one clusters, genes assigned to the
same clusters may not be necessarily have similar expression
patterns.
C. Hierarchical Clustering
Given a set of data the hierarchical clustering algorithm aims in
finding the groups or clusters in the data. There are two general
forms of hierarchical clusters may be either agglomerative
(bottom-up) or divisisive (top-bottom).In agglomerative
approach each data point initially forms a cluster and the two
closest cluster are merged at each step . In divisisive method
one large cluster which contains all the data points and splits
off a cluster at each step. Cluster analysis depends on some
measurement of similarity or dissimilarity between two vectors
to be clustered. Commonly used method in cluster
analysis(BenDoretal 1999,Tavazoie et al 1999,Brazma and
Vilo 2000,Horimoto and Toh 2001)have defined the
dissimilarity between two objects ,i and j as their (i)Euclidean
distance,(ii)Linear correlation co-efficient.. Here we have
adopted Divisive method for clustering and the distance
measure is the Euclidean distance given by 𝐷𝑖𝑗
(i)Euclidean distance:𝐷𝑖𝑗 = βˆšβˆ‘ (π‘₯𝑖𝑗 βˆ’ π‘₯π‘—π‘˜)2
𝑛
𝑗=1
(ii)Distance based on correlation:𝑅𝑖𝑗 =
βˆ‘ (π‘₯π‘–π‘™βˆ’π‘₯𝑖
Μ…Μ…Μ…)(π‘₯Μ…π‘—π‘–βˆ’π‘₯̅𝑗)
𝑗
βˆšβˆ‘ (π‘₯π‘–π‘™βˆ’π‘₯𝑖
Μ…Μ…Μ…)2
𝑙 βˆšβˆ‘ (π‘₯Μ…π‘—π‘–βˆ’π‘₯̅𝑗)
2
𝑗
With β€˜n’ data points the hierarchical clustering merges the two
closest clusters providing a nested sequence of clusters .The
clustering results are displayed in a dendogram .Nodes or
clusters forming lower on the dendogram are closest together
and the upper nodes represent the clusters that are far apart.
Each data point beginning as a single cluster as a single cluster
,the leaves (the terminating node at the bottom of the
dendogram).Each node represent a data point, while the nodes
in the interior point of a cluster represent cluster of more than
one data point..In single linkage the distance between any two
clusters of points is the smallest distance between any point in
cluster one and two. In complete linkage the largest distance
between any point between the two clusters is taken as the
intercluster distance. Average linkage takes the average of all
pair wise distances between the points in the first cluster and
second cluster (Sokal and Michener 1958).Ward (1963)
examines the sum of squared distance from each point to the
mean of its cluster and merges the two clusters and is adopted
to obtain the clusters in this work.
III. MICRO ARRAY DATABASE
The database consists of log ratios of gene expression values of
breast cancer FBN1 cell expression profiles-cDNA micro array
(G4100A)fibrillin1,organism:Homosapiens with ID Ref
no6775502.Data consist of 12607 unique clones .The
expression profiling of breast cancer cell lines HCC1954 and
MDA-MB-436 in reference to mammary epithelial cells.
IV. APPLICATION TO MICRO ARRAY DATA
Data studied was obtained from NCBI GENE OMNIBUS. The
log intensities of 12607 genes in breast cancer data set. Data
filtering is carried out by leaf and stem method. We have made
a comparative study based on k-means clustering hierarchical
clustering
A. K-means clustering
By applying k- means clustering on the log ratio intensities one
can see that 5 to 12% of the genes accounts for the total
variation or differentially expressed.. The results obtained from
k-means clustering for k=3, 4, 5 are given in the following
table.Distance measure: Euclidean distance.
Table 1
Cluster
number
Cluster
size
Gene Percentage in each
cluster
=3
1 2 1.3*
2 8 5.2***
3 145 93.5
K=4
1 2 1.3*
2 3 1.9**
3 15 9.7
4 135 87.1
K=5
1 2 1.3*
2 3 1.9**
3 10 6.5
4 21 13.5
5 119 76.8
*Gene id - {16990, 8822}
** Gene id- {938, 2859, 18236}
*** Gene id - {938, 2859, 12298, 1268, 13256, 13258, 2610,
18236}
The results of table shows thatgenes with gene id:{938, 2859,
18236,16990, 8822} are the most expressive genes or the
genes which accounts for maximum variability. Altogether out
of 12607 genes five genes are most differentially expressed or
significant. As the cluster number is increased the within
variance decreases and between variance increases leading to
lesser error within the clusters. Hence the percentage of
misclassification is 84.4% or 85% approximately.
Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 02 Issue: 01 June 2013 Page No.1-3
ISSN: 2278-2419
3
A. Hierarchical clustering
The cluster dendogram by Ward’s method is obtained for the
gene expression data. The results of which are shown below.
method is obtained for the gene expression data. The results of
which are shown below.
Figure 1
The cluster dendogram of hierarchical clustering is obtained by
divisisive method .The sample of 12607 genes are considered
as one cluster whichsplits up to different clusters step by step
applying the Euclidean distance .From the dendogram we can
see if the cluster tree is cut at height=250 we get two cluster
where the second cluster accounts for less than ten percent of
genes and the second cluster for the remaining. If the tree is cut
at height =1500 the number of clusters are three .Cluster two
and three accounts for less than twenty percentage of genes
and the remaining forms one large cluster.
V. SUMMARY
We have illustrated here how k-means clustering performs
better when compared to hierarchical clustering. Though
hierarchical clustering gives a pictorial representation the
dendogram which is easier to identify the clusters it becomes
difficult when the size of the observations (genes) is very large.
From the results of the k-means and Hierarchical clustering
one can see that out of the 12607 genes less than one
percentthat is a total of five to fifteen genes only are
differentially expressed or significant. K-means clustering
gives a better interpretation of the clusters compared
hierarchical as the numbers of genes belonging to the different
clusters are clearly identified where hierarchical clustering
gives a pictorial representation (a dendogram) of the different
clusters. When the data is very large (huge gene size)
identification of the genes belonging to each cluster becomes
difficult.
REFERENCE
[1] Pand Hatfield G.W. (2002), DNA Micro arrays and Gene expression
from Experiments to Data Analysis and Modelling. Cambridge
University Press
[2] Bishop C. M. ,(2006),Pattern Reorganization and machine learning -
Springer
[3] Brown PO, Botstein D,(1999), Exploring the new world of the genome
with DNA microarrays.Nat Genet;21(1 Suppl):33-7. Review.
[4] Cuverhouse Robert, DuncanJilland Shannon William
(2003),AnalyzingMicro array Data using Cluster Analysis-
Pharmogenomics 41-51.
[5] Darlene R,Goldstein ,DebashisGhosh and Erin M
Conlon(2002),Statistical Issues in Clustering-StatisticaSinica 12219-
240.
[6] Dudoit S, Gentleman RC, Quackenbush J,(2003), Open source software
for the analysis of Microarray data. Bio techniques. Mar;Suppl:45-51.
[7] Everitt,B.S.(1993),Cluster Analysis. Arnold,London.
[8] EisenM,SpellmanP,Brown Po et al (1998), Cluster analysis and display
of genome-wide expression patterns.Proc.National Academic Science
95;14863-14868.
[9] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M.,
Mesirov, J.,Coller, H., Loh,M., Downing, J., and Caligiuri, M. (1999).
Molecular 10.HartiganJ.A .,Mohanty.S.(1992), The Run test for
Multimodality .Journal of Classification ,9:63-70.
[10] Hartigan.J,Wong.M (1979),A K-means clustering algorithm, Applied
Statistics 28,pg100-108
[11] Hastie T ,Tibshirani R et al(2001),The elements of statistical learning
,Spinger ,New York City.
[12] LevyBelitska: (2006),a generalized clustering problem-Published by
The Berkeley Electronic press.
[13] Yeung .K.Y.,Fraley .C,Murna .A,Raftery A.E and Ruzzo
W.L(2001),Model Based clustering and data transformations for gene
expression data; Bioinformatics ,17(10)977-987.
[14] Ward J.(1963),Hierarchical groupings to optimize an objective function
.Journal of the American Statistical Association 58:234-44.

More Related Content

Similar to Identification of Differentially Expressed Genes by unsupervised Learning Method

COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Β 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesCSCJournals
Β 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1Double Check ĆŐNSULTING
Β 
Session ii g1 lab genomics and gene expression mmc-corr
Session ii g1 lab genomics and gene expression mmc-corrSession ii g1 lab genomics and gene expression mmc-corr
Session ii g1 lab genomics and gene expression mmc-corrUSD Bioinformatics
Β 
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)IJCSEA Journal
Β 
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
Β 
How to analyse large data sets
How to analyse large data setsHow to analyse large data sets
How to analyse large data setsimprovemed
Β 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
Β 
A comparative study of clustering and biclustering of microarray data
A comparative study of clustering and biclustering of microarray dataA comparative study of clustering and biclustering of microarray data
A comparative study of clustering and biclustering of microarray dataijcsit
Β 
Microarray and its application
Microarray and its applicationMicroarray and its application
Microarray and its applicationprateek kumar
Β 
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data ClusteringAn Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data Clusteringijeei-iaes
Β 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
Β 
Efficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceEfficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceIJSTA
Β 
Hierarchical clustering of multi class data (the zoo dataset)
Hierarchical clustering of multi class data (the zoo dataset)Hierarchical clustering of multi class data (the zoo dataset)
Hierarchical clustering of multi class data (the zoo dataset)Raid Mahbouba
Β 
Classification_and_Ordination_Methods_as_a_Tool.pdf
Classification_and_Ordination_Methods_as_a_Tool.pdfClassification_and_Ordination_Methods_as_a_Tool.pdf
Classification_and_Ordination_Methods_as_a_Tool.pdfAgathaHaselvin
Β 
Gene expression profiling i
Gene expression profiling  iGene expression profiling  i
Gene expression profiling iPrasanthperceptron
Β 
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...AM Publications
Β 
Data reduction techniques for high dimensional biological data
Data reduction techniques for high dimensional biological dataData reduction techniques for high dimensional biological data
Data reduction techniques for high dimensional biological dataeSAT Journals
Β 
Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesIOSR Journals
Β 

Similar to Identification of Differentially Expressed Genes by unsupervised Learning Method (20)

COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
Β 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
Β 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
Β 
Session ii g1 lab genomics and gene expression mmc-corr
Session ii g1 lab genomics and gene expression mmc-corrSession ii g1 lab genomics and gene expression mmc-corr
Session ii g1 lab genomics and gene expression mmc-corr
Β 
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
Β 
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...
Β 
How to analyse large data sets
How to analyse large data setsHow to analyse large data sets
How to analyse large data sets
Β 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
Β 
A comparative study of clustering and biclustering of microarray data
A comparative study of clustering and biclustering of microarray dataA comparative study of clustering and biclustering of microarray data
A comparative study of clustering and biclustering of microarray data
Β 
Microarray and its application
Microarray and its applicationMicroarray and its application
Microarray and its application
Β 
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data ClusteringAn Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
Β 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
Β 
Efficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceEfficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Β 
Hierarchical clustering of multi class data (the zoo dataset)
Hierarchical clustering of multi class data (the zoo dataset)Hierarchical clustering of multi class data (the zoo dataset)
Hierarchical clustering of multi class data (the zoo dataset)
Β 
Classification_and_Ordination_Methods_as_a_Tool.pdf
Classification_and_Ordination_Methods_as_a_Tool.pdfClassification_and_Ordination_Methods_as_a_Tool.pdf
Classification_and_Ordination_Methods_as_a_Tool.pdf
Β 
Gene expression profiling i
Gene expression profiling  iGene expression profiling  i
Gene expression profiling i
Β 
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...
Β 
Data reduction techniques for high dimensional biological data
Data reduction techniques for high dimensional biological dataData reduction techniques for high dimensional biological data
Data reduction techniques for high dimensional biological data
Β 
Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering Techniques
Β 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Β 

More from praveena06

Methods to Maximize the well being and Vitality of Moribund Communities
Methods to Maximize the well being and Vitality of Moribund CommunitiesMethods to Maximize the well being and Vitality of Moribund Communities
Methods to Maximize the well being and Vitality of Moribund Communitiespraveena06
Β 
V1_I2_2012_Paper7.docx
V1_I2_2012_Paper7.docxV1_I2_2012_Paper7.docx
V1_I2_2012_Paper7.docxpraveena06
Β 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docpraveena06
Β 
V1_I2_2012_Paper5.docx
V1_I2_2012_Paper5.docxV1_I2_2012_Paper5.docx
V1_I2_2012_Paper5.docxpraveena06
Β 
V1_I2_2012_Paper4.doc
V1_I2_2012_Paper4.docV1_I2_2012_Paper4.doc
V1_I2_2012_Paper4.docpraveena06
Β 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docpraveena06
Β 
V1_I2_2012_Paper2.docx
V1_I2_2012_Paper2.docxV1_I2_2012_Paper2.docx
V1_I2_2012_Paper2.docxpraveena06
Β 
V1_I2_2012_Paper1.doc
V1_I2_2012_Paper1.docV1_I2_2012_Paper1.doc
V1_I2_2012_Paper1.docpraveena06
Β 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docpraveena06
Β 
V1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.docV1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.docpraveena06
Β 
V1_I1_2012_Paper4.doc
V1_I1_2012_Paper4.docV1_I1_2012_Paper4.doc
V1_I1_2012_Paper4.docpraveena06
Β 
V1_I1_2012_Paper3.docx
V1_I1_2012_Paper3.docxV1_I1_2012_Paper3.docx
V1_I1_2012_Paper3.docxpraveena06
Β 
V1_I1_2012_Paper2.docx
V1_I1_2012_Paper2.docxV1_I1_2012_Paper2.docx
V1_I1_2012_Paper2.docxpraveena06
Β 
V1_I1_2012_Paper1.doc
V1_I1_2012_Paper1.docV1_I1_2012_Paper1.doc
V1_I1_2012_Paper1.docpraveena06
Β 

More from praveena06 (14)

Methods to Maximize the well being and Vitality of Moribund Communities
Methods to Maximize the well being and Vitality of Moribund CommunitiesMethods to Maximize the well being and Vitality of Moribund Communities
Methods to Maximize the well being and Vitality of Moribund Communities
Β 
V1_I2_2012_Paper7.docx
V1_I2_2012_Paper7.docxV1_I2_2012_Paper7.docx
V1_I2_2012_Paper7.docx
Β 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.doc
Β 
V1_I2_2012_Paper5.docx
V1_I2_2012_Paper5.docxV1_I2_2012_Paper5.docx
V1_I2_2012_Paper5.docx
Β 
V1_I2_2012_Paper4.doc
V1_I2_2012_Paper4.docV1_I2_2012_Paper4.doc
V1_I2_2012_Paper4.doc
Β 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.doc
Β 
V1_I2_2012_Paper2.docx
V1_I2_2012_Paper2.docxV1_I2_2012_Paper2.docx
V1_I2_2012_Paper2.docx
Β 
V1_I2_2012_Paper1.doc
V1_I2_2012_Paper1.docV1_I2_2012_Paper1.doc
V1_I2_2012_Paper1.doc
Β 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.doc
Β 
V1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.docV1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.doc
Β 
V1_I1_2012_Paper4.doc
V1_I1_2012_Paper4.docV1_I1_2012_Paper4.doc
V1_I1_2012_Paper4.doc
Β 
V1_I1_2012_Paper3.docx
V1_I1_2012_Paper3.docxV1_I1_2012_Paper3.docx
V1_I1_2012_Paper3.docx
Β 
V1_I1_2012_Paper2.docx
V1_I1_2012_Paper2.docxV1_I1_2012_Paper2.docx
V1_I1_2012_Paper2.docx
Β 
V1_I1_2012_Paper1.doc
V1_I1_2012_Paper1.docV1_I1_2012_Paper1.doc
V1_I1_2012_Paper1.doc
Β 

Recently uploaded

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
Β 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
Β 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
Β 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
Β 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
Β 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
Β 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
Β 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
Β 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
Β 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
Β 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
Β 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
Β 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
Β 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
Β 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
Β 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
Β 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
Β 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
Β 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
Β 

Recently uploaded (20)

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
Β 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
Β 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Β 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
Β 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
Β 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
Β 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
Β 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
Β 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
Β 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
Β 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
Β 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Β 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
Β 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Β 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
Β 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
Β 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
Β 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
Β 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Β 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
Β 

Identification of Differentially Expressed Genes by unsupervised Learning Method

  • 1. Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications Volume: 02 Issue: 01 June 2013 Page No.1-3 ISSN: 2278-2419 1 Identification of Differentially Expressed Genes by unsupervised Learning Method P.Venkatesan1 ,J.I. Jamal Fathima2 1 Department of statistics National Institute for Research in Tuberculosis, Indian Council for Medical Research,chennai 2 Research Scholar Department of Statistics National Institute for Research , Indian Council for Medical Research, Chennai Email:Venkaticmr@gmail.com, jf_ayub@yahoo.co.in Abstract-Microarrays are one of the latest breakthroughs in experimental molecular biology that allow monitoring of gene expression of tens of thousands of genes in parallel. Micro array analysis include many stages. Extracting samples from the cells, getting the gene expression matrix from the raw data, and data normalization which are low level analysis.Cluster analysis for genome-wide expression data from DNA micro array data is described as a high level analysis that uses standard statistical algorithms to arrange genes according to similarity patterns of expression levels. This paper presents a method for the number of clusters using the divisive hierarchical clustering, and k-means clustering of significant genes. The goal of this method is to identify genes that are strongly associated with disease in 12607 genes. Gene filtering is applied to identify the clusters. k-means shows that about four to seven genes or less than one percent of the genes account for the disease group which are the outliers, more than seventy percent falls as undefined group. The hierarchical clustering dendo gram shows clusters at two levels which shows again less than one percent of the genes are differentially expressed. I. INTRODUCTION Cluster analysis is a high level analysis .It is traditionally used in phylogenetic research and has been adopted in micro array analysis.DNA micro array experiment allows us to measure the expression levels of thousands of genes simultaneously under various conditions. The primary objective of supervised clustering (Carr et al 1997,Cho et al 1998,Szabho et al 2002) is to predict the functions of unknown genes by comparing the gene expression patterns of unknown genes. Gene clustering is one of the widely used statistical tools in microarray data analysis. Due to the high dimensionality of the data set; data mining methods are applied to summarize the information for a synthetic interpretation .Sandrine Dudoit and Rebert Gentleman(2002). Unsupervised learning technique aims in detecting the relation between the tissues or genes and within them(Hastie et al 2002). Thus in cluster analysis similarity between samples or genes have been identified. Unsupervised clustering is otherwise called hierarchical clustering, with samples /gene clustering within clusters (Eisen et al 2000).Grouping of genes having similar expression pattern is finding at most interest in micro array data analysis. Many clustering procedures have shown success in micro array gene clustering, most of them belong to the family of heuristic clustering algorithm, which are based on the assumption that the whole set of micro array data is a finite mixture of a certain type of distributions with different parameter. The commonly used are hierarchical clustering (Carr et al 1997),k- means(Tavazoie et al1999),model based clustering (Das Gupta and Raftery 1998),Support Vector Machine(SVM)(Brown et al 2000). II. MATERIAL AND METHODS The main goal is to find clusters of samples or clusters of genes such that observations within a cluster are more similar to each other then they are in different clusters. Gene clustering has two main goals: (i) Understanding the gene function. (ii) Discovery of co-regulated genes. This paper concentrates on the second goal. To achieve this we compare the genes having similar expression profiles in a set of conditions among different tissues or different times into clusters. Class discovery depends on the gender or tissue type or by the combination of both. In micro array data analysis class discovery process is dominated by sample characteristics – patients gender or tissue type. In Our method we construct clusters of samples in the data set by k-means/median applying Euclidean and Pearson distance measure to find the number of clusters.Let x1,X2,...Xp be the p-dimensional vector where each Xi is a n-dimension ,that isit gives the estimate of ith gene of pth sample. We take the log2 ratio of the pre value where pre value is obtained as the ratio of tumour to normal tissue.The data corresponding to the log intensity ratios of ith gene in jth experiment is defined as :xij =log(prevalue), where each xi is a p-dimensional vector π‘₯𝑖 = (π‘₯𝑖1, π‘₯𝑖2 … . . π‘₯𝑖𝑝) .Expression data are analysed in a matrix form, each row represents a gene and each column a sample .The matrix entries π‘₯𝑖𝑗 corresponding to the expression values of ith gene in the jth sample where (i=1,2,....12607 genes) and j=1,..4. Sample subsets).many software’s are available for performing cluster analysis on the matrix β€˜x’. To perform analysis the matrix X is transformed to XT =Y, such that yij= xji. (1). The node of the cluster dendo gram represents the entire data point as a single cluster. Hierarchical clustering has different approaches-single linkage, complete linkage and average linkage (Gorden 1999). A. Gene Filtering As the data size is huge, gene filtering plays a vital role in removing the noise or unexpressed genes. Gene filtering can be done at a threshold of a statistic. An arbitrary threshold based on a value for t-statistic can be applied to filter put a certain percentage of the genes. In addition to identifying the groups of related genes, hierarchical clustering helps in screening out the uninteresting genes. Removal of uninteresting genes is especially important in micro array data where the number of gene is very large.
  • 2. Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications Volume: 02 Issue: 01 June 2013 Page No.1-3 ISSN: 2278-2419 2 B. K-means clustering (i)’k’ specified in advance (ii) Select a set of k-points as clusters seed these seeds represent a first guess at the centroids of the k-means. (iii)Assign each individual observation to the clusters defined whose centroid is nearest. The centroids are recalculated for the clusters receiving the new object and for the clusters losing the object. (iv)Repeat (i)-(ii) until no changes occur in the cluster composition.K-means clustering does not give ordering of objects within a cluster .The final assignment depends on the initial selection of seed points. As the number of clusters β€˜k’ is changed, the cluster grouping may change. K-means clustering assigns each gene to only one clusters, genes assigned to the same clusters may not be necessarily have similar expression patterns. C. Hierarchical Clustering Given a set of data the hierarchical clustering algorithm aims in finding the groups or clusters in the data. There are two general forms of hierarchical clusters may be either agglomerative (bottom-up) or divisisive (top-bottom).In agglomerative approach each data point initially forms a cluster and the two closest cluster are merged at each step . In divisisive method one large cluster which contains all the data points and splits off a cluster at each step. Cluster analysis depends on some measurement of similarity or dissimilarity between two vectors to be clustered. Commonly used method in cluster analysis(BenDoretal 1999,Tavazoie et al 1999,Brazma and Vilo 2000,Horimoto and Toh 2001)have defined the dissimilarity between two objects ,i and j as their (i)Euclidean distance,(ii)Linear correlation co-efficient.. Here we have adopted Divisive method for clustering and the distance measure is the Euclidean distance given by 𝐷𝑖𝑗 (i)Euclidean distance:𝐷𝑖𝑗 = βˆšβˆ‘ (π‘₯𝑖𝑗 βˆ’ π‘₯π‘—π‘˜)2 𝑛 𝑗=1 (ii)Distance based on correlation:𝑅𝑖𝑗 = βˆ‘ (π‘₯π‘–π‘™βˆ’π‘₯𝑖 Μ…Μ…Μ…)(π‘₯Μ…π‘—π‘–βˆ’π‘₯̅𝑗) 𝑗 βˆšβˆ‘ (π‘₯π‘–π‘™βˆ’π‘₯𝑖 Μ…Μ…Μ…)2 𝑙 βˆšβˆ‘ (π‘₯Μ…π‘—π‘–βˆ’π‘₯̅𝑗) 2 𝑗 With β€˜n’ data points the hierarchical clustering merges the two closest clusters providing a nested sequence of clusters .The clustering results are displayed in a dendogram .Nodes or clusters forming lower on the dendogram are closest together and the upper nodes represent the clusters that are far apart. Each data point beginning as a single cluster as a single cluster ,the leaves (the terminating node at the bottom of the dendogram).Each node represent a data point, while the nodes in the interior point of a cluster represent cluster of more than one data point..In single linkage the distance between any two clusters of points is the smallest distance between any point in cluster one and two. In complete linkage the largest distance between any point between the two clusters is taken as the intercluster distance. Average linkage takes the average of all pair wise distances between the points in the first cluster and second cluster (Sokal and Michener 1958).Ward (1963) examines the sum of squared distance from each point to the mean of its cluster and merges the two clusters and is adopted to obtain the clusters in this work. III. MICRO ARRAY DATABASE The database consists of log ratios of gene expression values of breast cancer FBN1 cell expression profiles-cDNA micro array (G4100A)fibrillin1,organism:Homosapiens with ID Ref no6775502.Data consist of 12607 unique clones .The expression profiling of breast cancer cell lines HCC1954 and MDA-MB-436 in reference to mammary epithelial cells. IV. APPLICATION TO MICRO ARRAY DATA Data studied was obtained from NCBI GENE OMNIBUS. The log intensities of 12607 genes in breast cancer data set. Data filtering is carried out by leaf and stem method. We have made a comparative study based on k-means clustering hierarchical clustering A. K-means clustering By applying k- means clustering on the log ratio intensities one can see that 5 to 12% of the genes accounts for the total variation or differentially expressed.. The results obtained from k-means clustering for k=3, 4, 5 are given in the following table.Distance measure: Euclidean distance. Table 1 Cluster number Cluster size Gene Percentage in each cluster =3 1 2 1.3* 2 8 5.2*** 3 145 93.5 K=4 1 2 1.3* 2 3 1.9** 3 15 9.7 4 135 87.1 K=5 1 2 1.3* 2 3 1.9** 3 10 6.5 4 21 13.5 5 119 76.8 *Gene id - {16990, 8822} ** Gene id- {938, 2859, 18236} *** Gene id - {938, 2859, 12298, 1268, 13256, 13258, 2610, 18236} The results of table shows thatgenes with gene id:{938, 2859, 18236,16990, 8822} are the most expressive genes or the genes which accounts for maximum variability. Altogether out of 12607 genes five genes are most differentially expressed or significant. As the cluster number is increased the within variance decreases and between variance increases leading to lesser error within the clusters. Hence the percentage of misclassification is 84.4% or 85% approximately.
  • 3. Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications Volume: 02 Issue: 01 June 2013 Page No.1-3 ISSN: 2278-2419 3 A. Hierarchical clustering The cluster dendogram by Ward’s method is obtained for the gene expression data. The results of which are shown below. method is obtained for the gene expression data. The results of which are shown below. Figure 1 The cluster dendogram of hierarchical clustering is obtained by divisisive method .The sample of 12607 genes are considered as one cluster whichsplits up to different clusters step by step applying the Euclidean distance .From the dendogram we can see if the cluster tree is cut at height=250 we get two cluster where the second cluster accounts for less than ten percent of genes and the second cluster for the remaining. If the tree is cut at height =1500 the number of clusters are three .Cluster two and three accounts for less than twenty percentage of genes and the remaining forms one large cluster. V. SUMMARY We have illustrated here how k-means clustering performs better when compared to hierarchical clustering. Though hierarchical clustering gives a pictorial representation the dendogram which is easier to identify the clusters it becomes difficult when the size of the observations (genes) is very large. From the results of the k-means and Hierarchical clustering one can see that out of the 12607 genes less than one percentthat is a total of five to fifteen genes only are differentially expressed or significant. K-means clustering gives a better interpretation of the clusters compared hierarchical as the numbers of genes belonging to the different clusters are clearly identified where hierarchical clustering gives a pictorial representation (a dendogram) of the different clusters. When the data is very large (huge gene size) identification of the genes belonging to each cluster becomes difficult. REFERENCE [1] Pand Hatfield G.W. (2002), DNA Micro arrays and Gene expression from Experiments to Data Analysis and Modelling. Cambridge University Press [2] Bishop C. M. ,(2006),Pattern Reorganization and machine learning - Springer [3] Brown PO, Botstein D,(1999), Exploring the new world of the genome with DNA microarrays.Nat Genet;21(1 Suppl):33-7. Review. [4] Cuverhouse Robert, DuncanJilland Shannon William (2003),AnalyzingMicro array Data using Cluster Analysis- Pharmogenomics 41-51. [5] Darlene R,Goldstein ,DebashisGhosh and Erin M Conlon(2002),Statistical Issues in Clustering-StatisticaSinica 12219- 240. [6] Dudoit S, Gentleman RC, Quackenbush J,(2003), Open source software for the analysis of Microarray data. Bio techniques. Mar;Suppl:45-51. [7] Everitt,B.S.(1993),Cluster Analysis. Arnold,London. [8] EisenM,SpellmanP,Brown Po et al (1998), Cluster analysis and display of genome-wide expression patterns.Proc.National Academic Science 95;14863-14868. [9] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.,Coller, H., Loh,M., Downing, J., and Caligiuri, M. (1999). Molecular 10.HartiganJ.A .,Mohanty.S.(1992), The Run test for Multimodality .Journal of Classification ,9:63-70. [10] Hartigan.J,Wong.M (1979),A K-means clustering algorithm, Applied Statistics 28,pg100-108 [11] Hastie T ,Tibshirani R et al(2001),The elements of statistical learning ,Spinger ,New York City. [12] LevyBelitska: (2006),a generalized clustering problem-Published by The Berkeley Electronic press. [13] Yeung .K.Y.,Fraley .C,Murna .A,Raftery A.E and Ruzzo W.L(2001),Model Based clustering and data transformations for gene expression data; Bioinformatics ,17(10)977-987. [14] Ward J.(1963),Hierarchical groupings to optimize an objective function .Journal of the American Statistical Association 58:234-44.