Survey and Evaluation of
Methods for Tissue Classification Using
Gene Expression Data
Per-Olof Fjällström
Affibody
P.O. Box 20137
SE-161 02 Bromma, SWEDEN
Abstract
Microarray experiments allow us to simultaneously monitor the expression levels in cells of thousands
of genes. This may lead to both a better understanding of biological mechanisms and to more accurate
diagnosis methods. For example, for diagnostic purposes it would be very valuable if we could develop
class prediction methods that, given a collection of labeled gene expression profiles, accurately predict
the label of an unlabeled profile.
Many general-purpose class prediction methods are known from the statistical learning literature, and it
seems reasonable to first evaluate these methods before starting to develop gene-expression-specific
methods. In this report, we describe and compare a number of well-known class prediction methods, as
well as a number of methods for identifying so-called informative genes, that is, genes whose expression levels show a strong correlation with given phenotypes.
To evaluate the methods, we have implemented them in MATLAB and applied them to both real and simulated gene expression data sets. The results of our evaluation suggest that simple, well-known classification methods, such as nearest neighbor classification and linear discriminant analysis, are very competitive. Not only are they easy to implement and fast to execute, but for most data sets they are also as accurate as, or more accurate than, more advanced general- or specific-purpose methods. (These results agree with the results presented in Dudoit, Fridlyand and Speed (2002) and Ben-Dor et al (2000).) Our evaluation of methods for finding informative genes indicates that they produce similar results. Interestingly, we also found that constructing predictors using only the most informative genes may sometimes lead to worse prediction accuracy than using all genes.
1 Introduction
Suppose that we have $m$ mRNA tissue samples. Each sample $X_j$, $j = 1, 2, \ldots, m$, consists of mRNA expression levels $X_{ij}$, $i = 1, 2, \ldots, n$, measured from (the same) $n$ genes. With each sample is associated a class label $l_j$ such that $l_j \in \{1, 2, \ldots, K\}$. The class labels have usually not been determined by examining the mRNA samples, but by examining morphological and clinical data about the patient from which the sample was taken. Often $K = 2$, in which case the class labels may correspond to sick or healthy, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), etc.
Is it possible to use the given mRNA samples to quickly and correctly classify unlabeled samples? More specifically, can we devise a class prediction function $\hat{f}(Y)$ by which we can accurately label an unknown sample $Y$?
One might ask why this is an interesting problem. After all, if one can determine if
some patients have ALL or AML without using gene expression data, one should be
able to do so in all cases. According to Golub et al (1999), current clinical methods
for distinguishing between ALL and AML are complicated but still not perfect. Unfortunately, distinguishing ALL from AML is crucial for successful treatment. Therefore, one reason for studying the above problem is to find more accurate diagnosis
methods. Another reason, also mentioned by Golub et al, is that the classification
rules may provide a better understanding of the underlying biological mechanisms.
The problem of predicting the class of an unlabeled entity, given a set of labeled entities, occurs in many applications, and many methods for so-called supervised learning (or discriminant analysis) have been proposed. (See e.g. Mitchell (1997).) The large number of such methods is itself a problem. Which of the available well-known methods are most appropriate for classifying mRNA samples? Are they at all appropriate? If not, how can we improve them?
Recently, a number of papers on methods for tissue classification using gene expression data have been published. Some of them compare well-known methods (e.g. Ben-Dor et al (2000), and Dudoit et al (2002)), while others appear to propose new methods (e.g. Golub et al (1999)). The purpose of this report is similar to that of the former category of papers.
1.1 A framework for classifying gene expression profiles
Radmacher et al (2001) propose a general framework for class prediction using gene
expression profiles. Their framework consists of four steps:
1. Evaluation of the appropriateness of the given data for class prediction. To be
appropriate each tissue sample must have a class label, and the class labels
should not be based on the gene expression profiles of the samples, e.g. the
class labels should not be derived by clustering the gene expression profiles.
2. Selection of classification (and gene selection) method. This step entails selecting one or more methods that should be accurate, and simple to implement and use.
3. Cross-validated class prediction. The accuracy of the methods selected in the previous step has to be evaluated. Since the number of samples usually is relatively small, Radmacher et al recommend using leave-one-out cross-validation, that is, each sample is left out one at a time and its label is predicted based on the remaining samples. The smaller the error rate, the better the classification method.
4. Assessing the significance of the cross-validation results. According to Radmacher et al, small error rates can be achieved even when there is no systematic difference in expression profiles between classes. They recommend a permutation test to assess the significance of an observed cross-validation error rate. More specifically, for each permutation of the class labels, perform cross-validation as described above, and record the error rate. The proportion of the error rates that are smaller than or equal to the observed error rate serves as the significance level of the observed error rate. If the significance level is smaller than 0.05, Radmacher et al reject the hypothesis that there is no systematic difference in expression profiles between classes. (In practice, it is too time-consuming to examine every permutation. Instead, Radmacher et al estimate the significance level by examining 2000 randomly selected permutations.)
In this report, we will follow the recommendations of Radmacher et al.
2 Data preprocessing
The mRNA expression levels have been measured using either cDNA microchips or
high-density oligonucleotide chips. Due to outliers, missing data points, etc. the “raw”
mRNA data must usually be preprocessed in various ways:
1. Thresholding, i.e. if necessary increase or decrease expression levels such
that all levels lie between specified lower and upper thresholds.
2. Filtering, e.g. removal of genes with too many missing data points or genes
that vary too little between samples.
3. Log-transformation: it seems to be standard to log-transform either the ratio
of red-green intensities (cDNA microchips) or the difference between average
PM and MM (oligonucleotide chips).
4. Standardization (of columns or rows);
5. Missing value imputation: e.g. using the k-NN approach described by
Troyanskaya et al (2001).
As an example, according to Dudoit et al (2002), the following data preprocessing steps were applied by Golub et al (1999) on the leukemia dataset¹:
1. Thresholding: floor of 100 and ceiling of 16,000.
2. Filtering: exclusion of genes with max/min≤5 or (max-min)≤500, where max
and min refer to the maximum and minimum intensities for a particular gene
across all the mRNA samples.
3. Base 10 logarithm transformation.
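As an illustration, the three steps above could be carried out in MATLAB roughly as follows (a minimal sketch; the function and variable names are ours, and X is assumed to be a genes-by-samples matrix of raw intensities):

function [Xp, keep] = preprocess_expression(X)
% Sketch of the preprocessing used for the leukemia data: thresholding,
% filtering and base-10 log transformation of a genes-by-samples matrix X.
Xp = min(max(X, 100), 16000);            % 1. floor of 100, ceiling of 16,000
mx = max(Xp, [], 2);                     % per-gene maximum across samples
mn = min(Xp, [], 2);                     % per-gene minimum across samples
keep = (mx ./ mn > 5) & (mx - mn > 500); % 2. drop genes with max/min <= 5 or max-min <= 500
Xp = log10(Xp(keep, :));                 % 3. base-10 logarithm
end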
3 Gene selection methods
Class prediction using gene expression samples differs from many other applications in that $m$, the number of labeled entities, is much smaller than $n$, the number of features. Usually, the number of mRNA samples is less than a hundred, while there can be tens of thousands of genes.
In this section, brief descriptions of a number of gene selection (or ranking) methods are given. The purpose of these methods is to identify the genes whose expression levels are informative with respect to class membership. For example, a gene that is strongly down-regulated in all AML samples and strongly up-regulated in all ALL samples is clearly informative, whereas a gene that is weakly up-regulated in all samples can hardly qualify as informative.
¹ Downloadable from http://waldo.wi.mit.edu/MPR/data_set_ALL_AML.html
The methods presented below all begin by computing, for each gene, some kind of
gene-correlation score, which is intended to measure how informative a gene is. For
example, the higher the score, the more informative the gene. The next step is (or at
least should be) to assess the significance of the scores. That is, we must decide which
(if any) genes actually can be regarded as informative. Ideally, only informative genes
should be used to construct the class prediction function.
The expression levels are stored in an $n \times m$ matrix $X$, which we refer to as the gene expression matrix. That is, the genes correspond to the rows of the matrix, and the columns correspond to the labeled samples. The class labels are stored in an array $L = (l_1, l_2, \ldots, l_m)$. There are no missing data in $X$, and the data has been properly thresholded, filtered and log-transformed. In the descriptions below, the following definitions are used:
• $\mu_i$ and $\sigma_i$ denote the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$.
• $\mu_i^c$ and $\sigma_i^c$, $c \in \{1, 2, \ldots, K\}$, are the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$, for which $l_j = c$.
• $m_c = |\{j \in \{1, 2, \ldots, m\} : l_j = c\}|$, $c \in \{1, 2, \ldots, K\}$.
3.1 Two-sample t-tests
The t-test score is:
$$T(i, L) = \frac{\mu_i^1 - \mu_i^2}{\sqrt{(\sigma_i^1)^2/m_1 + (\sigma_i^2)^2/m_2}}.$$
The larger the absolute value of a t-test score, the more informative the gene is. To assess the significance of $T(i, L)$, we can estimate its p-value by computing
$$I_{\frac{v}{v + T(i, L)^2}}\!\left(\frac{v}{2}, \frac{1}{2}\right),$$
where $I_x(a, b)$ denotes the incomplete Beta function (an algorithm for computing this function is given in Press et al (1988)) and
$$v = \frac{\left(\frac{(\sigma_i^1)^2}{m_1} + \frac{(\sigma_i^2)^2}{m_2}\right)^2}{\frac{1}{m_1 - 1}\left(\frac{(\sigma_i^1)^2}{m_1}\right)^2 + \frac{1}{m_2 - 1}\left(\frac{(\sigma_i^2)^2}{m_2}\right)^2}.$$
Note that the estimate of the p-value is reliable only if the preprocessed gene expression data can be assumed to be (approximately) normally distributed. Furthermore, since there may be thousands of genes, the p-values should also be adjusted for multiple testing, e.g. using the Bonferroni procedure.
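As a minimal sketch (under our own function and variable names), the score and its p-value could be computed per gene in MATLAB as follows; the built-in betainc is used for the incomplete Beta function:

function [T, p] = welch_t_scores(X, L)
% X is a genes-by-samples matrix, L a label vector with values 1 and 2.
X1 = X(:, L == 1);   X2 = X(:, L == 2);
m1 = size(X1, 2);    m2 = size(X2, 2);
mu1 = mean(X1, 2);   mu2 = mean(X2, 2);
s1 = var(X1, 0, 2);  s2 = var(X2, 0, 2);                % per-gene sample variances
T = (mu1 - mu2) ./ sqrt(s1/m1 + s2/m2);
v = (s1/m1 + s2/m2).^2 ./ ((s1/m1).^2/(m1-1) + (s2/m2).^2/(m2-1));
p = betainc(v ./ (v + T.^2), v/2, 0.5);                 % estimated p-values
end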
3.2 The method of Golub et al
Golub et al (1999) normalize each row of $X$ by first subtracting the row mean, and then dividing by the row standard deviation. That is, the normalized entry is
$$\tilde{X}_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i}.$$
Let $\tilde{\mu}_i^c$ and $\tilde{\sigma}_i^c$, $c \in \{1, 2\}$, be the mean and standard deviation, respectively, of the elements $\tilde{X}_{ij}$, $j = 1, 2, \ldots, m$, such that $l_j = c$. The gene-class correlation score for the $i$th gene is then computed as
$$P(i, L) = \frac{\tilde{\mu}_i^1 - \tilde{\mu}_i^2}{\tilde{\sigma}_i^1 + \tilde{\sigma}_i^2}.$$
Clearly, if a gene has high (low) expression levels for class 1 and low (high) expression levels for class 2 (and the standard deviations are not too large), the corresponding score will be a relatively big positive (negative) number. On the other hand, if a gene has similar expression levels for both classes, the score will be close to zero.
To assess the significance of the gene-class correlation scores, Golub et al perform a
neighborhood analysis. This is done as follows.
Let
$$N_1(L, r) = |\{i : P(i, L) \geq r\}|$$
and
$$N_2(L, r) = |\{i : P(i, L) \leq -r\}|.$$
Suppose for example that $N_1(L, r) = 10$ for $r = 1.3$, that is, there are ten genes with score larger than or equal to 1.3. To decide if this is unusual, Golub et al compute
$$\frac{|\{j : N_1(L_j, r) \geq N_1(L, r)\}|}{B},$$
where $L_j$, $j = 1, 2, \ldots, B$, are random permutations of $L$. (Golub et al use $B = 400$.) If this ratio is small, let's say not larger than 0.05, then the gene-class correlations for the ten genes are regarded as significant at the 5% level.
If the neighborhood analysis shows that there are genes with significant class correlation, the next step is to select a subset of particularly informative genes to use as a prediction set. Golub et al select the $n/2$ genes with smallest (i.e. negative) correlation score, and the $n/2$ genes with largest (i.e. positive) correlation score, where $n$ is a free parameter. Golub et al used $n = 50$. Alternative methods for selecting the prediction set are discussed in Slonim et al (2000).
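A minimal MATLAB sketch of the score (our own names; the neighborhood analysis would then be carried out by recomputing these scores for permuted label vectors):

function P = golub_scores(X, L)
% Gene-class correlation scores P(i,L) for a genes-by-samples matrix X and
% labels L in {1,2}, computed on row-normalized expression values.
m = size(X, 2);
Xn = (X - repmat(mean(X, 2), 1, m)) ./ repmat(std(X, 0, 2), 1, m);
mu1 = mean(Xn(:, L == 1), 2);   s1 = std(Xn(:, L == 1), 0, 2);
mu2 = mean(Xn(:, L == 2), 2);   s2 = std(Xn(:, L == 2), 0, 2);
P = (mu1 - mu2) ./ (s1 + s2);
end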
3.3 The method of Dudoit, Fridlyand and Speed
Dudoit et al (2002) rank genes using the following score:
$$R(i, L) = \frac{\sum_{j=1}^{m}\sum_{c=1}^{K} I(l_j = c)\,(\mu_i^c - \mu_i)^2}{\sum_{j=1}^{m}\sum_{c=1}^{K} I(l_j = c)\,(X_{ij} - \mu_i^c)^2},$$
where $I(\text{condition})$ is 1 if condition is true; otherwise 0. They select the $p$ genes with largest ratio (they use $p$ ranging from 30 to 50). Dudoit et al briefly discuss the usefulness of p-values for $R(i, L)$, but do not describe how to compute them.
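The ratio is essentially a between-to-within sum-of-squares ratio per gene, and could be computed as in the following MATLAB sketch (names are ours):

function R = bw_ratio(X, L)
% Between-to-within sum-of-squares ratio R(i,L) for each row (gene) of X.
classes = unique(L);
mu = mean(X, 2);                               % overall mean per gene
BSS = zeros(size(X, 1), 1);
WSS = zeros(size(X, 1), 1);
for c = classes(:)'
    Xc = X(:, L == c);
    muc = mean(Xc, 2);
    BSS = BSS + size(Xc, 2) * (muc - mu).^2;
    WSS = WSS + sum((Xc - repmat(muc, 1, size(Xc, 2))).^2, 2);
end
R = BSS ./ WSS;
end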
3.4 The TNoM method
Ben-Dor et al (2000) propose the threshold number of misclassification (TNoM) score. The rank vector $v_i = (v_{i1}, v_{i2}, \ldots, v_{im})$ for the $i$th gene is defined as follows. If the sample corresponding to the $k$th smallest member of $\{X_{ij} : j = 1, 2, \ldots, m\}$ belongs to class 1, then $v_{ik} = {+}$; otherwise $v_{ik} = {-}$. $TNoM(v)$ measures to which extent it is possible to divide $v$ into two homogeneous parts. More specifically,
$$TNoM(v) = \min_{x \,\&\, y = v} \left(MC(x) + MC(y)\right),$$
where the minimum is taken over all ways of splitting $v$ into a prefix $x$ followed by a suffix $y$, and $MC(x)$ is the cardinality of the minority element in the vector $x$, i.e. $MC(x) = \min(\#{+}\text{ in } x, \#{-}\text{ in } x)$.
For example, if the $i$th gene is down-regulated in all AML samples but up-regulated in all ALL samples (or vice versa), then $TNoM(v_i)$ is zero. Note that $TNoM(v_i)$ cannot exceed $m/2$. Therefore, if $n$ is large, many genes will have the same TNoM score. Ben-Dor et al describe an exact procedure for computing p-values for TNoM scores.
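A straightforward (if not particularly efficient) MATLAB sketch of the score for a single gene, trying every prefix/suffix split of the samples ordered by expression level (names are ours):

function t = tnom_score(x, L)
% TNoM score for one gene: x holds the expression levels, L the labels (1 or 2).
[~, order] = sort(x);
v = (L(order) == 1);                 % rank vector: true = class 1, false = class 2
m = numel(v);
t = m;
for k = 0:m                          % split into prefix v(1:k) and suffix v(k+1:m)
    prefix = v(1:k);   suffix = v(k+1:end);
    mc = min(sum(prefix), sum(~prefix)) + min(sum(suffix), sum(~suffix));
    t = min(t, mc);
end
end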
3.5 The method of Park, Pagano and Bonetti
Park et al (2001) propose the following score:
$$S(L, i) = \sum_{j : l_j = 2} \; \sum_{k : l_k = 1} h(X_{ik} - X_{ij}),$$
where $h(x)$ is 1 if $x > 0$; otherwise zero. That is, the score for the $i$th gene is computed by first, for each sample belonging to class 2, counting the number of samples belonging to class 1 that have higher expression levels, and then by summing these numbers. Note that if $S(L, i) = 0$, then the $i$th gene has consistently higher expression levels for class 2 than for class 1. Conversely, if $S(L, i) = m_1 m_2$, the expression levels for class 2 are consistently lower than for class 1. P-values are computed by permuting class labels in a manner similar to the neighborhood analysis proposed by Golub et al (1999).
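A minimal MATLAB sketch of the score for a single gene (names are ours):

function S = ppb_score(x, L)
% Park-Pagano-Bonetti score: for every class-2 sample, count the class-1
% samples with a strictly higher expression level, and sum the counts.
x1 = x(L == 1);
x2 = x(L == 2);
S = 0;
for j = 1:numel(x2)
    S = S + sum(x1 > x2(j));
end
end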
4 Classification methods
In this section, we describe a number of methods for constructing and using class prediction functions. The primary input for the construction of a prediction function consists of a training data set: $T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n,\ l_i \in \{1, 2, \ldots, K\}\}$.
The process of constructing a class prediction function based on the training data is
called the training phase.
The training phase may also require the user to specify various parameters. In general,
the more “sophisticated” the method, the more parameters it requires. The simplest
methods have essentially no training phase, while for other methods, the training
phase may be quite time consuming. Once the prediction function is constructed, we can use it to predict the label of an unknown sample $Y \in \mathbb{R}^n$.
4.1 Nearest neighbor classification
Nearest-neighbors (NN) methods are based on some distance measure $d(x, y)$ (e.g. one minus the Pearson correlation) for pairs of samples. To classify a sample $Y$, we first find the $k$ samples in the training set which are closest to $Y$. Then we can use either the majority rule or the distance-weighted rule to decide which class $Y$ should be assigned to.
If $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ are the $k$ closest samples, the majority rule simply assigns $Y$ to the class to which most of them belong, i.e.
$$\hat{f}(Y) = \arg\max_{c \in \{1, 2, \ldots, K\}} \sum_{j=1}^{k} I(l(X_{i_j}) = c),$$
where $l(X_{i_j})$ denotes the class label of $X_{i_j}$.
How do we find an appropriate value for $k$? One possibility, described in Dudoit et al (2002), is to examine several values of $k$ and choose the value giving the smallest cross-validation error.
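A minimal MATLAB sketch of k-NN prediction with the majority rule, here using the Euclidean distance instead of one minus the Pearson correlation (names are ours):

function label = knn_predict(Xtrain, L, y, k)
% Xtrain is genes-by-samples, L the label vector, y the new sample (column).
d = sqrt(sum((Xtrain - repmat(y, 1, size(Xtrain, 2))).^2, 1));
[~, idx] = sort(d);                  % samples ordered by distance to y
votes = L(idx(1:k));                 % labels of the k nearest samples
label = mode(votes);                 % majority rule
end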
4.2 Quadratic and linear discriminant analysis
The quadratic discriminant rule is
$$\hat{f}(Y) = \arg\min_{c \in \{1, 2, \ldots, K\}} \delta_c(Y),$$
where
$$\delta_c(Y) = \frac{1}{2}\log|\Sigma_c| + \frac{1}{2}(Y - \mu^c)^T \Sigma_c^{-1} (Y - \mu^c) - \log \pi_c$$
are called the quadratic discriminant functions, and $\Sigma_c$, $\mu^c$ and $\pi_c$ are the population covariance matrix, population mean vector, and prior probability, respectively, of class $c$. (See e.g. Hastie et al (2001).)
If we assume that all classes have a common population covariance matrix, $\Sigma$, the discriminant functions simplify to the linear discriminant functions:
$$\delta_c(Y) = \frac{1}{2}(Y - \mu^c)^T \Sigma^{-1} (Y - \mu^c) - \log \pi_c.$$
In practice, we do not know the population covariance, mean, and prior probability, but they can be estimated as follows:
Class mean: $\hat{\mu}^c = (\mu_1^c, \mu_2^c, \ldots, \mu_n^c)^T = \frac{1}{m_c}\sum_{j : l_j = c} X_j$,
Class covariance matrix: $\hat{\Sigma}_c = \frac{1}{m_c - 1}\sum_{j : l_j = c} (X_j - \hat{\mu}^c)(X_j - \hat{\mu}^c)^T$,
Common covariance matrix: $\hat{\Sigma} = \frac{1}{m - K}\sum_{c=1}^{K} (m_c - 1)\hat{\Sigma}_c = \frac{1}{m - K}\sum_{c=1}^{K}\sum_{j : l_j = c} (X_j - \hat{\mu}^c)(X_j - \hat{\mu}^c)^T$, and
Class prior probability: $\hat{\pi}_c = \frac{m_c}{m}$.
Dudoit et al (2002) reported surprisingly good results with a simplified version of the linear discriminant function. In this version, only the diagonal elements of the common covariance matrix are used. More specifically,
$$\delta_c(Y) = \sum_{i=1}^{n} \frac{(Y_i - \hat{\mu}_i^c)^2}{\sum_{j=1}^{m}\sum_{k=1}^{K} I(l_j = k)(X_{ij} - \hat{\mu}_i^k)^2} = \sum_{i=1}^{n} w_i (Y_i - \hat{\mu}_i^c)^2.$$
We can interpret $\delta_c(Y)$ as the (squared) "weighted" Euclidean distance between $Y$ and the class mean $\hat{\mu}^c$. More weight is given to genes with expression values close to the class means. This means that even if $Y$ is closer to $\hat{\mu}^1$ than to $\hat{\mu}^2$ (according to the ordinary Euclidean distance measure), $Y$ may still be predicted to belong to class 2 if the class 1 training samples exhibit less variation than the class 2 training samples. Dudoit et al refer to class prediction using this function as diagonal linear discriminant analysis (DLDA).
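A minimal MATLAB sketch of DLDA prediction along these lines, with per-gene class means and a pooled per-gene variance estimated from a genes-by-samples training matrix (names are ours):

function label = dlda_predict(Xtrain, L, y)
% Diagonal linear discriminant analysis: weighted Euclidean distance from y
% to each class mean, weighted by the pooled within-class variance per gene.
classes = unique(L);
[n, m] = size(Xtrain);
mu = zeros(n, numel(classes));
pooled = zeros(n, 1);
for c = 1:numel(classes)
    Xc = Xtrain(:, L == classes(c));
    mu(:, c) = mean(Xc, 2);
    pooled = pooled + sum((Xc - repmat(mu(:, c), 1, size(Xc, 2))).^2, 2);
end
pooled = pooled / (m - numel(classes));
delta = sum((repmat(y, 1, numel(classes)) - mu).^2 ./ ...
            repmat(pooled, 1, numel(classes)), 1);
[~, best] = min(delta);
label = classes(best);
end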
4.3 Weighted gene voting
In section 3.2 it was described how Golub et al (1999) propose to rank genes, and how to select a subset, the so-called prediction set $S$, consisting of the most relevant genes. To predict the class membership of a sample $Y$, each gene $i \in S$ casts a weighted vote:
$$V(i, Y) = P(i, L)(\tilde{y}_i - b),$$
where
$$\tilde{y}_i = \frac{\log y_i - \mu_i}{\sigma_i}, \qquad b = \frac{\tilde{\mu}_i^1 + \tilde{\mu}_i^2}{2},$$
and $y_i$ is the expression level (before log-transformation) of the $i$th gene in sample $Y$. The $\tilde{\mu}_i^1$, $\tilde{\mu}_i^2$, $\mu_i$ and $\sigma_i$ are defined in Section 3.
Note that $V(i, Y) < 0$ either if $P(i, L) < 0$ and $\tilde{y}_i > b$, or if $P(i, L) > 0$ and $\tilde{y}_i < b$. In the first case, since $\tilde{\mu}_i^2 > \tilde{\mu}_i^1$, we see that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. In the second case, since $\tilde{\mu}_i^1 > \tilde{\mu}_i^2$, we see again that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. Golub et al interpret this as a vote for class 2. By similar reasoning, $V(i, Y) > 0$ can be interpreted as a vote for class 1.
Next, we compute $V_1$ and $V_2$, the total votes for each class, and the prediction strength $PS$:
$$V_1 = \sum_{V(i, Y) > 0} V(i, Y), \qquad V_2 = \sum_{V(i, Y) < 0} |V(i, Y)|,$$
and
$$PS = \frac{\max(V_1, V_2) - \min(V_1, V_2)}{V_1 + V_2}.$$
The sample $Y$ is predicted to belong to the class with the largest total vote unless $PS$ is too small. (Golub et al used 0.3 as threshold value, that is, the difference in votes between the "winner" and the "loser" must be at least 30% of the total number of votes.)
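A minimal MATLAB sketch of the voting rule, assuming that the per-gene statistics of the training data, restricted to the prediction set S, have already been computed (all names are ours):

function [label, PS] = weighted_voting_predict(P, mu1t, mu2t, mu, sigma, y)
% P: gene-class correlation scores; mu1t, mu2t: class means of the normalized
% training data; mu, sigma: per-gene mean and standard deviation used for
% normalization; y: raw (not yet log-transformed) expression levels of the
% new sample. All vectors are restricted to the prediction set S.
yt = (log(y) - mu) ./ sigma;         % normalize the new sample
b  = (mu1t + mu2t) / 2;
V  = P .* (yt - b);                  % weighted votes
V1 = sum(V(V > 0));
V2 = sum(abs(V(V < 0)));
PS = (max(V1, V2) - min(V1, V2)) / (V1 + V2);
if V1 >= V2, label = 1; else label = 2; end
if PS < 0.3, label = NaN; end        % prediction withheld when PS is too small
end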
4.4 The Arc-fs boosting method
To construct a prediction function, boosting methods repeatedly apply a weak classification method² to modified versions of the training data, producing a sequence $f_i(Y)$, $i = 1, 2, \ldots, M$, of prediction functions. The resulting prediction function is a weighted vote:
$$\hat{f}(Y) = \mathrm{sign}\left(\sum_{i=1}^{M} \alpha_i f_i(Y)\right),$$
where the $\alpha_i$, $i = 1, 2, \ldots, M$, are computed by the boosting method. (Here we assume that there are only two classes, and that they are labeled with $\pm 1$.)
Several variations on boosting have been proposed. Breiman (1998) presented the following variation (referred to as Arc-fs):
1. Initialize the sampling probabilities: $p_j = 1/m$, $j = 1, 2, \ldots, m$.
2. For $i = 1, 2, \ldots, M$ do:
a. Using the current sampling probabilities, draw $m$ samples (with replacement) from the original training set $S$ to create a new training data set $\hat{S}$.
b. Fit a prediction function $f_i(Y)$ to $\hat{S}$.
c. Compute $\varepsilon_i = \sum_{j=1}^{m} p_j I(f_i(X_j) \neq l_j)$. (If $\varepsilon_i = 0$ or $\varepsilon_i \geq 0.5$, Breiman recommends starting all over at Step 1.)
d. Compute $\beta_i = (1 - \varepsilon_i)/\varepsilon_i$.
² A weak classification method is a method that is guaranteed to perform only slightly better than random guessing.
e. Set
$$p_j = \frac{p_j\, \beta_i^{\,I(f_i(X_j) \neq l_j)}}{\sum_{j'=1}^{m} p_{j'}\, \beta_i^{\,I(f_i(X_{j'}) \neq l_{j'})}}, \qquad j = 1, 2, \ldots, m.$$
3. Output $\hat{f}(Y) = \mathrm{sign}\left(\sum_{i=1}^{M} \log(\beta_i) f_i(Y)\right)$.
As we can see, misclassified samples get their sampling probabilities increased, and thus become more likely to be selected in Step 2a. The idea is that this will force successive prediction functions to improve their accuracy on these samples.
As the weak classification method, one can use a very simple "classification tree":
$$f(Y) = \begin{cases} d & \text{if } Y_i > t, \\ -d & \text{if } Y_i < t, \end{cases}$$
where the parameters $d = \pm 1$, $i$ and $t$ are determined such that $f(Y)$, when applied to the training data, gives the smallest number of errors.
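A minimal MATLAB sketch of such a stump, fitted by exhaustive search over genes and thresholds; the weight vector p lets it be used as the weak learner inside Arc-fs (names are ours):

function [d, i, t] = fit_stump(X, L, p)
% X is genes-by-samples, L holds labels +/-1, p the sampling probabilities
% (use ones(1,m)/m for the unweighted error).
[n, m] = size(X);
bestErr = inf;  d = 1;  i = 1;  t = X(1, 1);
for gene = 1:n
    for thr = sort(X(gene, :))
        for sgn = [-1 1]
            pred = sgn * (2 * (X(gene, :) > thr) - 1);
            e = sum(p(pred ~= L));
            if e < bestErr
                bestErr = e;  d = sgn;  i = gene;  t = thr;
            end
        end
    end
end
end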
4.5 Support vector machine classification
Support vector machines (SVM) (Vapnik (1999)) have become popular classifiers
within the machine learning community. SVMs have been applied to gene expression
data in several publications, e.g. Mukherjee et al (1999), Furey et al (2000), and Ben-
Dor et al (2000).
The training phase of an SVM consists in solving the following optimization problem. Given the training data $S = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n,\ l_i \in \{-1, +1\}\}$, a kernel function $K(x, y)$, and a positive real number $C$, find $\alpha_1, \ldots, \alpha_m$ that
maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j l_i l_j K(X_i, X_j)$
subject to $\sum_{i=1}^{m} \alpha_i l_i = 0$ and $C \geq \alpha_i \geq 0$, $i = 1, \ldots, m$.
The prediction function is $f(Y) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i^* l_i K(X_i, Y) + b^*\right)$, where $\alpha_1^*, \ldots, \alpha_m^*$ are the solutions to the optimization problem, and $b^*$ is chosen such that $l_i f(X_i) = 1$ for any $i$ with $C > \alpha_i^* > 0$.
Examples of kernel functions are:
• Gaussian Radial Basis Function (RBF): $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$.
• Polynomials of degree $d$: $K(x, y) = (x \cdot y + 1)^d$.
• Multi-layer Perceptron function: $K(x, y) = \tanh(\kappa\, x \cdot y - \delta)$ (for some values of $\kappa$ and $\delta$).
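For concreteness, these kernels could be written as MATLAB anonymous functions roughly as follows (a sketch; x and y are column vectors, and the remaining parameters are user-chosen):

rbf  = @(x, y, sigma)        exp(-norm(x - y)^2 / (2 * sigma^2));   % Gaussian RBF
poly = @(x, y, d)            (x' * y + 1)^d;                        % polynomial of degree d
mlp  = @(x, y, kappa, delta) tanh(kappa * (x' * y) - delta);        % multi-layer perceptron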
5 Evaluation
We have implemented all of the gene selection and class prediction methods described above in MATLAB (except the SVM method, where we used the implementation available in the OSU Support Vector Machines (SVMs) Toolbox). In this section, we present the results of applying these methods to various data sets.
5.1 Data sets
Unfortunately, it is not so easy to find appropriate collections of classified mRNA samples. Dudoit et al (2002) use only three such data sets: lymphoma, leukemia, and NCI 60. Ben-Dor et al (2000) use, in addition to the leukemia data, colon and ovarian data. Here, we use only one such collection, the leukemia data; instead, we rely on computer-generated data sets. Of course, these data sets cannot substitute for real data sets, but they allow systematic studies of e.g. sensitivity to "noise" that would be hard to do using only a few real data sets.
5.1.1 Leukemia data
For the leukemia data, gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing 6,817 human genes. There is a total of 72 samples, of which 47 have been classified as ALL and the remaining as AML. After preprocessing as described in section 2, 3,517 genes remain.
5.1.2 Simulation of gene expression data
We developed a MATLAB function for generating "mRNA samples". The input parameters are:
• Total number of genes;
• Number of informative genes;
• Total number of samples;
• Number of samples with class label “1” (the remaining samples are labeled
“2”).
The expression levels of a non-informative gene are normally distributed with standard deviation equal to 1500, and a gene-specific mean. The latter is determined by drawing a random number between 3000 and 5000. For an informative gene, it is first decided (with equal probability) if it is to be high for class 1 and low for class 2, or vice versa. In the first case (the second case is treated analogously), the expression values for class 1 are normally distributed (again with standard deviation 1500) with mean 5000, while the class 2 expression values are normally distributed (again with standard deviation 1500) with mean 3000. The output from the function consists of the gene expression matrix and the class labels array.
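A minimal re-implementation of this generator might look as follows (a MATLAB sketch; the function and argument names are ours, not necessarily those of our actual implementation):

function [X, L] = simulate_expression(nGenes, nInf, nSamples, nClass1)
% Non-informative genes: N(mean, 1500^2) with a gene-specific mean drawn
% uniformly from [3000, 5000]. Informative genes: mean 5000 for one class and
% 3000 for the other, the direction being chosen with equal probability.
L = [ones(1, nClass1), 2 * ones(1, nSamples - nClass1)];
X = zeros(nGenes, nSamples);
for i = 1:nGenes
    if i <= nInf
        if rand < 0.5, means = [5000 3000]; else means = [3000 5000]; end
    else
        means = (3000 + 2000 * rand) * [1 1];
    end
    X(i, :) = means(L) + 1500 * randn(1, nSamples);
end
end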
5.2 Evaluation of gene selection methods
In section 3, we described five gene selection methods: the t-test, the method of Golub et al, the method of Dudoit, Fridlyand and Speed, the TNoM method, and the method of Park, Pagano and Bonetti. In this section, we investigate to which extent these methods agree in their ranking of the "informativeness" of the genes, and if any method can be regarded as more or less correct than the others.
For the sake of brevity, in the following we refer to the methods of:
1. Golub et al as the “GOLUB method”,
2. Dudoit, Fridlyand and Speed as the "DFS method", and
3. Park, Pagano and Bonetti as the “PPB method”.
5.2.1 Leukemia data
Each method first ranked the genes and then selected the 1%, 5%, 10%, 15%, and
20% most informative genes. The sets of selected genes were then compared with
each other. The overall agreement, that is, the proportion of genes selected by all methods, varied around 60%. For example, when each method selected the 1% (i.e. 36) most informative genes, 21 genes were selected by all methods.
The methods were also compared pairwise. The two methods that disagree most are the t-test and the TNoM methods. Their overlap (that is, the extent to which they select the same genes) varies around 66%. (However, the TNoM method tends to disagree also with the other methods.) At the other end of the spectrum, the DFS and GOLUB methods have an overlap that ranges between 78% (1% most informative genes were selected) and 93%.
5.2.2 Computer-generated data
The data consists of 50 samples (with class labels evenly divided over two classes) with expression levels for 1000 genes of which 10% are informative. Each method selected the 1%, 5%, 10%, 15%, and 20% most informative genes, and the sets were compared in the same manner as for the leukemia data. This process was repeated fifty times, and the average overlaps were computed.
The overall agreement varied between 52% (1% most informative genes were selected) and 90% (10% most informative genes were selected). The latter result indicates that all methods successfully identified the informative genes. The pairwise comparisons showed a very high degree of agreement (96% to 100%) between the t-test and DFS methods. Again, the TNoM method disagreed most with the other methods. In particular, it disagreed with the GOLUB method, where the overlap varied between 58% and 92%.
We also investigated how successful the methods were in identifying the genes that had been created to be informative. This was done by determining how many of the 10% most informative genes selected by the methods were also created to be informative. On average, 96 of the 100 most informative genes selected by the t-test and DFS methods had been created to be informative. The TNoM and PPB methods were slightly less successful; on average, they selected 93.5% of the informative genes.
5.2.3 Conclusions
The above results indicate that the methods essentially agree on which genes are most
informative. Therefore, in the following we only use the DFS gene selection method
(except that we use the GOLUB method together with the class prediction method of
Golub et al.)
5.3 Evaluation of class prediction methods
In evaluating the accuracy of the class prediction methods, we follow the framework
proposed by Radmacher et al (as described in section 1). That is, we use leave-one-out
cross-validation (LOOCV). More specifically, to evaluate the accuracy of a method,
we perform the following procedure:
#Errors = 0;
For each $(X_j, l_j)$ in $T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n,\ l_i \in \{1, 2, \ldots, K\}\}$ do:
    Form the training set $\hat{T} = T - \{(X_j, l_j)\}$;
    If required, rank the genes in $\hat{T}$ and select the most informative (let $S$ denote the selected genes);
    Construct the class prediction function $\hat{f}(Y)$ using $\hat{T}_S$ ($\hat{T}_S$ denotes the restriction of $\hat{T}$ to the genes in $S$; if no gene selection was done, $\hat{T}_S = \hat{T}$);
    If $\hat{f}((X_j)_S) \neq l_j$, #Errors = #Errors + 1.
In the following, whenever we refer to the number of errors (or error ratio), it is the number of errors computed by the above procedure that we mean. The procedure itself we refer to as the LOOCV procedure.
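A minimal MATLAB sketch of the LOOCV procedure (without gene selection), parameterized by a prediction function handle such as the DLDA sketch in section 4.2 (names are ours):

function nErrors = loocv(X, L, predict_fn)
% predict_fn has the form label = predict_fn(Xtrain, Ltrain, y).
% Gene selection, if used, must be repeated inside the loop on the training
% set only, exactly as in the procedure above.
m = size(X, 2);
nErrors = 0;
for j = 1:m
    train = setdiff(1:m, j);
    pred = predict_fn(X(:, train), L(train), X(:, j));
    if pred ~= L(j)
        nErrors = nErrors + 1;
    end
end
end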
5.3.1 Prediction without gene selection
We begin by evaluating the accuracy of the methods if no gene selection is performed. That is, we construct the prediction functions using training sets that may contain many irrelevant genes. Observe, however, that in the case of the leukemia data many irrelevant genes were removed by the preprocessing procedure. It is therefore possible that the proportion of informative genes is rather high. For the computer-generated data, on the other hand, we know that, except for the genes that we created to be informative, all other genes are irrelevant.
5.3.1.1 Leukemia data
All of the methods performed quite well on the leukemia data: after some parameter "fine-tuning" no method made more than three errors. Moreover, two samples (numbers 66 and 67) were misclassified by almost all methods, and it is possible that these samples have been incorrectly labeled.
What distinguishes the methods is how fast they run. The following table gives the execution time (in seconds) of the LOOCV procedure for each method:

Method:    Arc-fs (M = 100)   DLDA   GOLUB³   k-NN (k = 5)   SVM (linear kernel)
Time (s):  4156               13     99       13             40
We end this section with some brief comments on some of the methods.
³ GOLUB refers here to the class prediction method proposed by Golub et al (1999).
Arc-fs: As can be seen from the table, the training phase is very time consuming, at least compared to the other methods. (An experiment described in Breiman (1998) requires 4 hours of CPU time.) Therefore, only the following (rather low) M-values were tried: 1, 5, 10, 30, 50, and 100. The number of errors decreased from 7 to 1 as the M-value increased. Note that, since Arc-fs uses random sampling, the number of errors may vary even if the same training data and M-value are used. This is particularly likely if the M-value is small.
k-NN: Various k-values were tried out. For k less than 30, the number of errors varied
between 1 and 3. For higher k-values, the number of errors increased dramatically.
SVM: With the linear polynomial kernel, there were two errors, while the radial basis
kernel resulted in 25 errors! From now on, we use only the linear kernel SVM.
5.3.1.2 Computer-generated data
In the following experiments, the class prediction methods were applied to computer-generated data sets containing 50 samples (equally divided between class 1 and 2) with expression levels for 1000 genes. For each experiment, 25 data sets were generated and the average numbers of errors were recorded.
In the first set of experiments, we varied the number of informative genes. The results
are as follows:
#Informative genes   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 10)   SVM
100                  4.5               0      0       0               0
50                   4                 0.04   0.08    0.2             0.04
20                   5.8               3.2    5.3     7.2             3.2
All methods perform worse as the number of informative genes decreases. For the k-NN method, the accuracy decreases more significantly than for the other methods. The Arc-fs method seems to be rather insensitive to the number of informative genes.
The above results indicate that the methods are quite robust. Even with 90% irrelevant genes, they (except for the Arc-fs method) have zero error rates. The next table shows what happens when the 100 informative genes are made less informative by increasing the standard deviation of their expression levels from 1500 to 2000 and 2500.
Standard deviation   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 5)   SVM
1500                 4.5               0      0       0              0
2000                 7.48              0      0.04    0.2            0
2500                 11.48             0.32   0.4     2.28           0.36
Not surprisingly, the error ratios increase as the standard deviation increases, but the
DLDA, GOLUB and SVM methods still have error ratios close to zero.
5.3.2 Prediction with gene selection
Several researchers claim that by removing irrelevant (or less informative) genes from
the training data, the corresponding class prediction function becomes more accurate.
In this section, we try to experimentally confirm this claim.
5.3.2.1 Leukemia data
We performed class prediction using the 1%, 5%, 10%, 15% and 20% most informative genes. The numbers of prediction errors are summarized in the following table:

% Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
1                  2                  2      3       3              6
5                  2                  2      3       2              3
10                 1                  2      2       3              3
15                 2                  2      2       1              3
20                 3                  2      2       1              3
All genes          3                  1      1       1              2
The main conclusion that we can draw from these results is that gene selection does
not necessarily lead to improved accuracy. In fact, for the leukemia data it is better to
use all genes than the most “informative”! The only method that seems to benefit
from gene selection is the Arc-fs method: with fewer genes, we can increase M and
still get reasonable execution times.
5.3.2.2 Computer-generated data
The results in section 5.3.1.2 show that without gene selection and with 95% irrelevant genes, the error rates are very close to zero. The following experiments were performed in the same way as in section 5.3.1.2, except that 98% of the genes were irrelevant:
% Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
1                  2.32               1.12   1.24    1.64           1.84
5                  2                  0.84   1.16    0.72           1.08
10                 2.48               1.36   1.68    1.80           1.72
15                 2.48               1.76   2.32    2.84           2.32
20                 3.16               2.36   3.2     3.44           2.28
All genes          2.04               2.76   3.96    6.44           2.64
These results show that when a data set contains an extremely high proportion of "irrelevant" genes, then gene selection may actually improve accuracy. All methods (except the Arc-fs method) achieved substantially better error rates when the 5% most informative genes were selected compared with when all genes are used. This is particularly noticeable for the k-NN method, for which the accuracy differs by almost an order of magnitude between optimal gene selection and no gene selection.
5.3.3 Permutation tests
As described in section 1, Radmacher et al recommend assessing cross-validation results by repeatedly permuting class labels and recording the numbers of errors computed by the LOOCV procedure for each permutation. The histogram given below shows the error distribution for a computer-generated data set consisting of 50 samples (equally divided between class 1 and 2) with 1000 genes of which 20 were informative. The labels were permuted 1000 times.
[Figure: histogram of the number of leave-one-out cross-validation errors over the 1000 label permutations.]
As we can see, the error rates for permuted labels are significantly higher than for non-permuted labels. The corresponding histogram for the leukemia data is almost identical to this histogram. The error rates that we have observed above are thus highly significant.
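A minimal MATLAB sketch of the permutation test, reusing the LOOCV sketch from section 5.3 (names are ours):

function pval = permutation_test(X, L, predict_fn, observedErrors, B)
% Permute the class labels B times (e.g. B = 1000), record the LOOCV error
% count for each permutation, and report the proportion of permuted error
% counts that are smaller than or equal to the observed one.
errs = zeros(B, 1);
for b = 1:B
    Lperm = L(randperm(numel(L)));
    errs(b) = loocv(X, Lperm, predict_fn);
end
pval = mean(errs <= observedErrors);
end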
5.3.4 Conclusions
The results presented in sections 5.3.1 and 5.3.2 indicate that all of the evaluated class prediction methods perform quite well (although the Arc-fs method seems to be mainly of theoretical interest). When applied to the leukemia data, the methods are essentially equally accurate. Only by increasing the proportion of irrelevant genes to 95% or higher (or by making the informative genes less informative) in the computer-generated data could we discover any differences between the methods.
If we must declare any method the "winner", it must be the DLDA method. It is fast, easy to implement, and accurate. The k-NN method is also fast and easy to implement, but it is more sensitive to noise than the DLDA method. However, it is unclear if this higher sensitivity makes any difference for real mRNA data. The SVM method is as accurate as DLDA, but not as easy to implement. The method proposed by Golub et al also performs well, but since there are well-known "general purpose" methods that perform equally well or better, it is not clear if their method has contributed anything to the state of the art of cancer classification.
The notion that using only the most informative genes in the training data results in more accurate class prediction functions makes a lot of sense. However, as we have seen, it can actually lead to decreased accuracy. Since the class prediction methods appear to be rather insensitive to noise, it may be better to allow a limited proportion of irrelevant genes than risk removing too many informative genes.
6 References
A. Ben-Dor, L. Bruhn, N. Friedmann, I. Nachman, M. Schummer, Z. Yakhini. Tissue
classification with gene expression profiles. Proc. Fourth Annual Int. Conference on
Computational Molecular Biology (RECOMB), 2000.
L. Breiman. Arcing classifiers. The Annals of Statistics, 26, 801-824, 1998.
M.P.S. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. National Academy of Sciences, 97:262-267, 2000.
J. Deutsch. Algorithm for finding optimal gene sets in microarray prediction. ?
S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. American Statistical Association, March 2002, Vol. 97, No. 457.
T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler.
Support vector machine classification and validation of cancer tissue samples using
microarray expression data. Bioinformatics, 16(10), p. 906-914, 2000.
T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caliguri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, p. 531-537, 1999.
G. Getz, E. Levine and E. Domany. Coupled two-way analysis of gene microarray da-
ta. Proc. National Academy of Sciences, 97:12079-84, 2000.
I. Guyon, J. Weston, S. Barnhil, V. Vapnik. Gene selection for cancer classification
using support vector machines. Submitted to Machine Learning?
T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, J. Alizadeh and L. Stadt. Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University, 2000.
J. Khan, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson and P. Meltzer. Classification and diagnostic prediction of cancers using expression profiling and artificial neural networks. Nature Medicine, 7(6), p. 673-679, 2001.
A. Keller, M. Schummer, L. Hood and W. Ruzzo. Bayesian classification of DNA array expression data. Technical report UW-CSE-2000-08-01, Univ. Washington, 2000.
W. Li and Y. Yang. How many genes are needed for a discriminant microarray data
analysis? lanl physics preprint archive xxx.lanl.gov, arXiv:physics/0104029 v1, 2001.
T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio.
Support vector machines classification of microarray data. Technical report, MIT,
1999.
P.J. Park, M. Pagano and M. Bonetti. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pacific Symposium on Biocomputing, 2001.
P. Pavlidis, J. Weston, J. Cai and W. Grundy. Gene functional classification from heterogeneous data. Proc. Fifth International Conf. Computational Molecular Biology, 2001.
M.D. Radmacher, L.M. McShane and R. Simon. A paradigm for class prediction using gene expression profiles. Technical Report 001, July 2001, National Cancer Institute.
D. Slonim, P. Tamayo, J. Mesirov, T. Golub, E. Lander. Class prediction and discovery using gene expression data. Proc. Fourth Annual Int. Conference on Computational Molecular Biology (RECOMB), 2000.
O. Troyanskaya et al. Missing value estimation methods for DNA microarrays. Bioinformatics, p. 520-525, 2001.
V.N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.
Mais conteúdo relacionado

Mais procurados

Bioinformatics.Assignment
Bioinformatics.AssignmentBioinformatics.Assignment
Bioinformatics.AssignmentNaima Tahsin
 
Recent approaches in quantitative genetics
Recent approaches in  quantitative geneticsRecent approaches in  quantitative genetics
Recent approaches in quantitative geneticsAlex Harley
 
Genevestigator
GenevestigatorGenevestigator
GenevestigatorBITS
 
Metagenomic Data Analysis and Microbial Genomics
Metagenomic Data Analysis and Microbial GenomicsMetagenomic Data Analysis and Microbial Genomics
Metagenomic Data Analysis and Microbial GenomicsFabio Gori
 
Numerical taxonomy
Numerical taxonomyNumerical taxonomy
Numerical taxonomyHafsaJamil1
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsNikesh Narayanan
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MININGANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MININGijbbjournal
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...University of Groningen
 
Genetic diversity analysis
Genetic diversity analysisGenetic diversity analysis
Genetic diversity analysisAKHISHA P. A.
 
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...ijaia
 
A short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysesA short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysestuxette
 
Numerical taxonomy_Plant Taxonomy
Numerical taxonomy_Plant TaxonomyNumerical taxonomy_Plant Taxonomy
Numerical taxonomy_Plant TaxonomySangeeta Das
 
assessment of poly genetic variations and path co-efficient analysis
assessment of poly genetic variations and path co-efficient analysisassessment of poly genetic variations and path co-efficient analysis
assessment of poly genetic variations and path co-efficient analysisMahammed Faizan
 

Mais procurados (18)

paper
paperpaper
paper
 
Bioinformatics.Assignment
Bioinformatics.AssignmentBioinformatics.Assignment
Bioinformatics.Assignment
 
Recent approaches in quantitative genetics
Recent approaches in  quantitative geneticsRecent approaches in  quantitative genetics
Recent approaches in quantitative genetics
 
Genevestigator
GenevestigatorGenevestigator
Genevestigator
 
Metagenomic Data Analysis and Microbial Genomics
Metagenomic Data Analysis and Microbial GenomicsMetagenomic Data Analysis and Microbial Genomics
Metagenomic Data Analysis and Microbial Genomics
 
Numerical taxonomy
Numerical taxonomyNumerical taxonomy
Numerical taxonomy
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MININGANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...
 
Genetic diversity analysis
Genetic diversity analysisGenetic diversity analysis
Genetic diversity analysis
 
iEvobIO
iEvobIO iEvobIO
iEvobIO
 
Numerical taxonomy or taximetrics
Numerical taxonomy or taximetricsNumerical taxonomy or taximetrics
Numerical taxonomy or taximetrics
 
I1803056267
I1803056267I1803056267
I1803056267
 
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...
 
A short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysesA short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analyses
 
Numerical taxonomy_Plant Taxonomy
Numerical taxonomy_Plant TaxonomyNumerical taxonomy_Plant Taxonomy
Numerical taxonomy_Plant Taxonomy
 
assessment of poly genetic variations and path co-efficient analysis
assessment of poly genetic variations and path co-efficient analysisassessment of poly genetic variations and path co-efficient analysis
assessment of poly genetic variations and path co-efficient analysis
 

Destaque

cis98010
cis98010cis98010
cis98010perfj
 
Indian conquistadors
Indian conquistadorsIndian conquistadors
Indian conquistadorsRoccaheather
 
Italy slides for history
Italy slides for historyItaly slides for history
Italy slides for historyRoccaheather
 
3 strategie di web marketing per acquisire clienti online
3 strategie di web marketing per acquisire clienti online3 strategie di web marketing per acquisire clienti online
3 strategie di web marketing per acquisire clienti onlineEnrico Venti
 
cis97003
cis97003cis97003
cis97003perfj
 
Harness The Full Potential Of Mobile Through Paid Search
Harness The Full Potential Of Mobile Through Paid SearchHarness The Full Potential Of Mobile Through Paid Search
Harness The Full Potential Of Mobile Through Paid Searchdmothes
 
The jesuit relations
The jesuit relationsThe jesuit relations
The jesuit relationsRoccaheather
 
Enfoque basado en procesos
Enfoque basado en procesosEnfoque basado en procesos
Enfoque basado en procesosZELEY VELEZ
 
Indian conquistadors
Indian conquistadorsIndian conquistadors
Indian conquistadorsRoccaheather
 
cis98006
cis98006cis98006
cis98006perfj
 
cis97007
cis97007cis97007
cis97007perfj
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersperfj
 
Data Backup, Archiving &amp; Disaster Recovery October 2011
Data Backup, Archiving &amp; Disaster Recovery October 2011Data Backup, Archiving &amp; Disaster Recovery October 2011
Data Backup, Archiving &amp; Disaster Recovery October 2011zaheer756
 

Destaque (16)

cis98010
cis98010cis98010
cis98010
 
Midterm history
Midterm historyMidterm history
Midterm history
 
Indian conquistadors
Indian conquistadorsIndian conquistadors
Indian conquistadors
 
Italy slides for history
Italy slides for historyItaly slides for history
Italy slides for history
 
3 strategie di web marketing per acquisire clienti online
3 strategie di web marketing per acquisire clienti online3 strategie di web marketing per acquisire clienti online
3 strategie di web marketing per acquisire clienti online
 
cis97003
cis97003cis97003
cis97003
 
Harness The Full Potential Of Mobile Through Paid Search
Harness The Full Potential Of Mobile Through Paid SearchHarness The Full Potential Of Mobile Through Paid Search
Harness The Full Potential Of Mobile Through Paid Search
 
The jesuit relations
The jesuit relationsThe jesuit relations
The jesuit relations
 
Enfoque basado en procesos
Enfoque basado en procesosEnfoque basado en procesos
Enfoque basado en procesos
 
Indian conquistadors
Indian conquistadorsIndian conquistadors
Indian conquistadors
 
cis98006
cis98006cis98006
cis98006
 
cis97007
cis97007cis97007
cis97007
 
Earthquakes
EarthquakesEarthquakes
Earthquakes
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clusters
 
Data Backup, Archiving &amp; Disaster Recovery October 2011
Data Backup, Archiving &amp; Disaster Recovery October 2011Data Backup, Archiving &amp; Disaster Recovery October 2011
Data Backup, Archiving &amp; Disaster Recovery October 2011
 
Dead star
Dead starDead star
Dead star
 

Semelhante a Tissue Classification Using Gene Expression Data

A Methodology For Motif Discovery Employing Iterated Cluster Re-Assignment
A Methodology For Motif Discovery Employing Iterated Cluster Re-AssignmentA Methodology For Motif Discovery Employing Iterated Cluster Re-Assignment
A Methodology For Motif Discovery Employing Iterated Cluster Re-AssignmentAngela Tyger
 
Classification of Gene Expression Data by Gene Combination using Fuzzy Logic
Classification of Gene Expression Data by Gene Combination using Fuzzy LogicClassification of Gene Expression Data by Gene Combination using Fuzzy Logic
Classification of Gene Expression Data by Gene Combination using Fuzzy LogicIJARIIE JOURNAL
 
How to analyse large data sets
How to analyse large data setsHow to analyse large data sets
How to analyse large data setsimprovemed
 
Efficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceEfficiency of Using Sequence Discovery for Polymorphism in DNA Sequence
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceIJSTA
 
Sample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap IdentificationSample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap IdentificationPhD Assistance
 
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodGene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
 
Identification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning MethodIdentification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning Methodpraveena06
 
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...IJTET Journal
 
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...rahulmonikasharma
 
Genetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification ProblemsGenetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification ProblemsTarundeep Dhot
 
Comparing prediction accuracy for machine learning and
Comparing prediction accuracy for machine learning andComparing prediction accuracy for machine learning and
Comparing prediction accuracy for machine learning andAlexander Decker
 
Comparing prediction accuracy for machine learning and
Comparing prediction accuracy for machine learning andComparing prediction accuracy for machine learning and
Comparing prediction accuracy for machine learning andAlexander Decker
 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
Tissue Classification Using Gene Expression Data
Such prediction problems occur in many applications, and many methods for so-called supervised learning (or discriminant analysis) have been proposed. (See e.g. Mitchell (1997).) The large number of such methods is itself a problem. Which of the available well-known methods are most appropriate for classifying mRNA samples? Are they at all appropriate? If not, how can we improve them?

Recently, a number of papers on methods for tissue classification using gene expression data have been published. Some of them compare well-known methods (e.g. Ben-Dor et al (2000), and Dudoit et al (2002)), while others appear to propose new methods (e.g. Golub et al (1999)). The purpose of this report is similar to the former category of papers.

1.1 A framework for classifying gene expression profiles

Radmacher et al (2001) propose a general framework for class prediction using gene expression profiles. Their framework consists of four steps:

1. Evaluation of the appropriateness of the given data for class prediction. To be appropriate, each tissue sample must have a class label, and the class labels should not be based on the gene expression profiles of the samples, e.g. the class labels should not be derived by clustering the gene expression profiles.
2. Selection of classification (and gene selection) method. This step entails selecting one or more methods that should be accurate, and simple to implement and use.
3. Cross-validated class prediction. The accuracy of the methods selected in the previous step has to be evaluated. Since the number of samples usually is relatively small, Radmacher et al recommend using leave-one-out cross-validation, that is, each sample is left out one at a time and its label is predicted based on the remaining samples. The smaller the error rate, the better the classification method.
4. Assessing the significance of the cross-validation results. According to Radmacher et al, small error rates can be achieved even when there is no systematic difference in expression profiles between classes. They recommend a permutation test to assess the significance of an observed cross-validation error rate. More specifically, for each permutation of the class labels, perform cross-validation as described above, and record the error rate. The proportion of the error rates that are smaller than or equal to the observed error rate serves as the significance level of the observed error rate. If the significance level is smaller than 0.05, Radmacher et al reject the hypothesis that there is no systematic difference in expression profiles between classes. (In practice, it is too time-consuming to examine every permutation. Instead, Radmacher et al estimate the significance level by examining 2000 randomly selected permutations.)

In this report, we will follow the recommendations of Radmacher et al.
2 Data preprocessing

The mRNA expression levels have been measured using either cDNA microchips or high-density oligonucleotide chips. Due to outliers, missing data points, etc., the “raw” mRNA data must usually be preprocessed in various ways:

1. Thresholding, i.e. if necessary increase or decrease expression levels such that all levels lie between specified lower and upper thresholds.
2. Filtering, e.g. removal of genes with too many missing data points or genes that vary too little between samples.
3. Log-transformation: it seems to be standard to log-transform either the ratio of red-green intensities (cDNA microchips) or the difference between average PM and MM (oligonucleotide chips).
4. Standardization (of columns or rows).
5. Missing value imputation, e.g. using the k-NN approach described by Troyanskaya et al (2001).

As an example, according to Dudoit et al (2002), the following data preprocessing steps were applied by Golub et al (1999) to the leukemia dataset[1]:

1. Thresholding: floor of 100 and ceiling of 16,000.
2. Filtering: exclusion of genes with max/min ≤ 5 or (max − min) ≤ 500, where max and min refer to the maximum and minimum intensities for a particular gene across all the mRNA samples.
3. Base 10 logarithm transformation.

[1] Downloadable from http://waldo.wi.mit.edu/MPR/data_set_ALL_AML.html
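To make these steps concrete, they can be sketched in MATLAB roughly as follows. This is an illustrative sketch only, not the code actually used for the experiments; the variable X (an n-by-m matrix of raw expression levels, genes in rows) is an assumed name.

% X: n-by-m matrix of raw expression levels (genes in rows, samples in columns).
% 1. Thresholding: floor of 100 and ceiling of 16,000.
X = min(max(X, 100), 16000);
% 2. Filtering: exclude genes with max/min <= 5 or (max - min) <= 500 across samples.
mx = max(X, [], 2);
mn = min(X, [], 2);
X  = X((mx ./ mn > 5) & (mx - mn > 500), :);
% 3. Base 10 logarithm transformation.
X = log10(X);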
3 Gene selection methods

Class prediction using gene expression samples differs from many other applications in that m, the number of labeled entities, is much smaller than n, the number of features. Usually, the number of mRNA samples is less than a hundred, while there can be tens of thousands of genes.

In this section, brief descriptions of a number of gene selection (or ranking) methods are given. The purpose of these methods is to identify the genes whose expression levels are informative with respect to class membership. For example, a gene that is strongly down-regulated in all AML samples and strongly up-regulated in all ALL samples is clearly informative, whereas a gene that is weakly up-regulated in all samples can hardly qualify as informative.

The methods presented below all begin by computing, for each gene, some kind of gene-correlation score, which is intended to measure how informative a gene is. For example, the higher the score, the more informative the gene. The next step is (or at least should be) to assess the significance of the scores. That is, we must decide which (if any) genes actually can be regarded as informative. Ideally, only informative genes should be used to construct the class prediction function.

The expression levels are stored in an $n \times m$ matrix X, which we refer to as the gene expression matrix. That is, the genes correspond to the rows of the matrix, and the columns correspond to the labeled samples. The class labels are stored in an array $L = (l_1, l_2, \ldots, l_m)$. There are no missing data in X, and the data has been properly thresholded, filtered and log-transformed. In the descriptions below, the following definitions are used:

• $\mu_i$ and $\sigma_i$ denote the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$.
• $\mu_i^c$ and $\sigma_i^c$, $c \in \{1, 2, \ldots, K\}$, are the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$, for which $l_j = c$.
• $m_c = |\{j \in \{1, 2, \ldots, m\} : l_j = c\}|$, $c \in \{1, 2, \ldots, K\}$.

3.1 Two-sample t-tests

The t-test score is

$$T(i, L) = \frac{\mu_i^1 - \mu_i^2}{\sqrt{(\sigma_i^1)^2 / m_1 + (\sigma_i^2)^2 / m_2}}.$$

The larger the absolute value of a t-test score, the more informative the gene is. To assess the significance of $T(i, L)$, we can estimate its p-value by computing

$$I_{v / (v + T(i, L)^2)}\left(\frac{v}{2}, \frac{1}{2}\right),$$

where $I_x(a, b)$ denotes the incomplete Beta function (an algorithm for computing this function is given in Press et al (1988)) and

$$v = \frac{\left((\sigma_i^1)^2 / m_1 + (\sigma_i^2)^2 / m_2\right)^2}{\dfrac{1}{m_1 - 1}\left((\sigma_i^1)^2 / m_1\right)^2 + \dfrac{1}{m_2 - 1}\left((\sigma_i^2)^2 / m_2\right)^2}.$$

Note that the estimate of the p-value is reliable only if the preprocessed gene expression data can be assumed to be (approximately) normally distributed. Furthermore, since there may be thousands of genes, the p-values should also be adjusted for multiple testing, e.g. using the Bonferroni procedure.
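As an illustration, the score and its p-value for a single gene can be computed with a few lines of MATLAB. The sketch below is illustrative only, with assumed variable names (X, L, gene index i); note that MATLAB's betainc(x, a, b) is the regularized incomplete Beta function $I_x(a, b)$.

% X: n-by-m preprocessed expression matrix, L: 1-by-m class labels in {1,2}, i: gene index.
x1 = X(i, L == 1);  x2 = X(i, L == 2);
m1 = numel(x1);     m2 = numel(x2);
s1 = var(x1);       s2 = var(x2);        % sample variances (sigma_i^c)^2
T  = (mean(x1) - mean(x2)) / sqrt(s1/m1 + s2/m2);
% Degrees of freedom v, as in the formula above.
v  = (s1/m1 + s2/m2)^2 / ((s1/m1)^2/(m1 - 1) + (s2/m2)^2/(m2 - 1));
% Estimated p-value of the observed score.
p  = betainc(v / (v + T^2), v/2, 1/2);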
3.2 The method of Golub et al

Golub et al (1999) normalize each row of X by first subtracting the row mean, and then dividing by the row standard deviation. That is, the normalized entry is

$$\tilde{X}_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i}.$$

Let $\tilde{\mu}_i^c$ and $\tilde{\sigma}_i^c$, $c \in \{1, 2\}$, be the mean and standard deviation, respectively, of the elements $\tilde{X}_{ij}$, $j = 1, 2, \ldots, m$, such that $l_j = c$. The gene-class correlation score for the ith gene is then computed as

$$P(i, L) = \frac{\tilde{\mu}_i^1 - \tilde{\mu}_i^2}{\tilde{\sigma}_i^1 + \tilde{\sigma}_i^2}.$$

Clearly, if a gene has high (low) expression levels for class 1 and low (high) expression levels for class 2 (and the standard deviations are not too large), the corresponding score will be a relatively big positive (negative) number. On the other hand, if a gene has similar expression levels for both classes, the score will be close to zero.

To assess the significance of the gene-class correlation scores, Golub et al perform a neighborhood analysis. This is done as follows. Let $N_1(L, r) = |\{i : P(i, L) \geq r\}|$ and $N_2(L, r) = |\{i : P(i, L) \leq -r\}|$. Suppose for example that $N_1(L, r) = 10$ for $r = 1.3$, that is, there are ten genes with score larger than or equal to 1.3. To decide if this is unusual, Golub et al compute

$$\frac{|\{j : N_1(L^j, r) \geq N_1(L, r)\}|}{B},$$

where $L^j$, $j = 1, 2, \ldots, B$, are random permutations of L. (Golub et al use B = 400.) If this ratio is small, let's say not larger than 0.05, then the gene-class correlations for the ten genes are regarded as significant at the 5% level.

If the neighborhood analysis shows that there are genes with significant class correlation, the next step is to select a subset of particularly informative genes to use as a prediction set. Golub et al select the n/2 genes with smallest (i.e. negative) correlation score, and the n/2 genes with largest (i.e. positive) correlation score, where n is a free parameter (here not the total number of genes). Golub et al used n = 50. Alternative methods for selecting the prediction set are discussed in Slonim et al (2000).
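A minimal MATLAB sketch of the score computation and the selection of a prediction set of n = 50 genes might look as follows. It is illustrative only; Xt is assumed to hold the row-normalized expression levels, and the other names are likewise assumptions.

% Xt: n-by-m matrix of row-normalized levels (tilde-X), L: 1-by-m labels in {1,2}.
mu1 = mean(Xt(:, L == 1), 2);   sd1 = std(Xt(:, L == 1), 0, 2);
mu2 = mean(Xt(:, L == 2), 2);   sd2 = std(Xt(:, L == 2), 0, 2);
P   = (mu1 - mu2) ./ (sd1 + sd2);        % P(i,L) for every gene i
% Prediction set: the 25 genes with largest and the 25 with smallest score.
[~, order] = sort(P, 'descend');
predictionSet = [order(1:25); order(end-24:end)];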
3.3 The method of Dudoit, Fridlyand and Speed

Dudoit et al (2002) rank genes using the following score:

$$R(i, L) = \frac{\sum_{j=1}^m \sum_{c=1}^K I(l_j = c)\,(\mu_i^c - \mu_i)^2}{\sum_{j=1}^m \sum_{c=1}^K I(l_j = c)\,(X_{ij} - \mu_i^c)^2},$$

where $I(\mathrm{condition})$ is 1 if condition is true; otherwise 0. They select the p genes with largest ratio (they use p ranging from 30 to 50). Dudoit et al briefly discuss the usefulness of p-values for $R(i, L)$, but do not describe how to compute them.

3.4 The TNoM method

Ben-Dor et al (2000) propose the threshold number of misclassification (TNoM) score. The rank vector, $v_i = (v_{i1}, v_{i2}, \ldots, v_{im})$, for the ith gene is defined as follows. If the sample corresponding to the kth smallest member of $\{X_{ij} : j = 1, 2, \ldots, m\}$ belongs to class 1, then $v_{ik} = +$; otherwise $v_{ik} = -$. $\mathrm{TNoM}(v)$ measures to which extent it is possible to divide v into two homogeneous parts. More specifically,

$$\mathrm{TNoM}(v) = \min_{x \& y = v} \left( MC(x) + MC(y) \right),$$

where $x \& y$ denotes the concatenation of x and y, and $MC(x)$ is the cardinality of the minority element in the vector x, i.e. $MC(x) = \min(\#+ \text{ in } x,\ \#- \text{ in } x)$.

For example, if the ith gene is down-regulated in all AML samples but up-regulated in all ALL samples (or vice versa), then $\mathrm{TNoM}(v_i)$ is zero. Note that $\mathrm{TNoM}(v_i)$ cannot exceed m/2. Therefore, if n is large, many genes will have the same TNoM score. Ben-Dor et al describe an exact procedure for computing p-values for TNoM scores.

3.5 The method of Park, Pagano and Bonetti

Park et al (2001) propose the following score:

$$S(i, L) = \sum_{k : l_k = 1}\ \sum_{j : l_j = 2} h(X_{ik} - X_{ij}),$$

where $h(x)$ is 1 if $x > 0$; otherwise zero. That is, the score for the ith gene is computed by first, for each sample belonging to class 2, counting the number of samples belonging to class 1 that have higher expression levels, and then by summing these numbers. Note that if $S(i, L) = 0$, then the ith gene has consistently higher expression levels for class 2 than for class 1. Conversely, if $S(i, L) = m_1 m_2$, the expression levels for class 2 are consistently lower than for class 1. P-values are computed by permuting class labels in a manner similar to the neighborhood analysis proposed by Golub et al (1999).
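Of these scores, the TNoM score is the most algorithmic, so a small MATLAB sketch may help. It reflects one reading of the definition above; the function and variable names are assumptions.

function score = tnom(x, L)
% TNoM score for a single gene: x is a 1-by-m vector of expression levels and
% L the corresponding class labels in {1,2}. The rank vector is split into a
% prefix and a suffix, and the minority counts of the two parts are summed.
[~, order] = sort(x);
lab = L(order);                          % labels ordered by increasing expression
m = numel(lab);
score = m;                               % upper bound (the true score is <= m/2)
for k = 0:m                              % split after position k
    pre = lab(1:k);  post = lab(k+1:m);
    mc  = min(sum(pre == 1), sum(pre == 2)) + min(sum(post == 1), sum(post == 2));
    score = min(score, mc);
end
end

Scores for all genes could then be obtained with, for example, scores = arrayfun(@(i) tnom(X(i,:), L), (1:size(X,1))').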
4 Classification methods

In this section, we describe a number of methods for constructing and using class prediction functions. The primary input for the construction of a prediction function consists of a training data set:

$$T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in R^n,\ l_i \in \{1, 2, \ldots, K\}\}.$$

The process of constructing a class prediction function based on the training data is called the training phase. The training phase may also require the user to specify various parameters. In general, the more “sophisticated” the method, the more parameters it requires. The simplest methods have essentially no training phase, while for other methods, the training phase may be quite time consuming. Once the prediction function is constructed, we can use it to predict the label of an unknown sample $Y \in R^n$.

4.1 Nearest neighbor classification

Nearest-neighbors (NN) methods are based on some distance measure $d(x, y)$ (e.g. one minus the Pearson correlation) for pairs of samples. To classify a sample Y, we first find the k samples in the training set that are closest to Y. Then we can use either the majority rule or the distance-weighted rule to decide which class Y should be assigned to. If $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ are the k closest samples, the majority rule simply assigns Y to the class to which most of them belong, i.e.

$$\hat{f}(Y) = \arg\max_{c \in \{1, 2, \ldots, K\}} \sum_{j=1}^k I(l(X_{i_j}) = c),$$

where $l(X_{i_j})$ denotes the class label of $X_{i_j}$. How do we find an appropriate value for k? One possibility, described in Dudoit et al (2002), is to examine several values for k, and choose the value giving the smallest cross-validation error.
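A minimal MATLAB sketch of nearest neighbor classification with the majority rule and one minus the Pearson correlation as distance measure could look as follows (illustrative only; all names are assumptions):

function c = knn_predict(Y, X, L, k)
% Predict the class of sample Y (n-by-1) from training samples X (n-by-m) with
% labels L, using the k nearest neighbors and the majority rule.
m = size(X, 2);
d = zeros(1, m);
for j = 1:m
    r = corrcoef(Y, X(:, j));            % distance: one minus Pearson correlation
    d(j) = 1 - r(1, 2);
end
[~, order] = sort(d);
nearest = L(order(1:k));
classes = unique(L);
votes = arrayfun(@(cc) sum(nearest == cc), classes);
[~, best] = max(votes);
c = classes(best);
end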
4.2 Quadratic and linear discriminant analysis

The quadratic discriminant rule is

$$\hat{f}(Y) = \arg\min_{c \in \{1, 2, \ldots, K\}} \delta_c(Y),$$

where

$$\delta_c(Y) = \frac{1}{2} \log |\Sigma_c| + \frac{1}{2} (Y - \mu^c)^T \Sigma_c^{-1} (Y - \mu^c) - \log \pi_c$$

are called the quadratic discriminant functions, and $\Sigma_c$, $\mu^c$ and $\pi_c$ are population covariance matrix, population mean vector, and prior probability, respectively, of class c. (See e.g. Hastie et al (2001).)

If we assume that all classes have a common population covariance matrix, $\Sigma$, the discrimination functions simplify to the linear discriminant functions:

$$\delta_c(Y) = \frac{1}{2} (Y - \mu^c)^T \Sigma^{-1} (Y - \mu^c) - \log \pi_c.$$

In practice, we do not know the population covariance, mean, and prior probability, but they can be estimated as follows:

Class mean: $\hat{\mu}^c = \dfrac{1}{m_c} \sum_{j : l_j = c} X_j = (\hat{\mu}_1^c, \hat{\mu}_2^c, \ldots, \hat{\mu}_n^c)^T$,

Class covariance matrix: $\hat{\Sigma}_c = \dfrac{1}{m_c - 1} \sum_{j : l_j = c} (X_j - \hat{\mu}^c)(X_j - \hat{\mu}^c)^T$,

Common covariance matrix: $\hat{\Sigma} = \dfrac{1}{m - K} \sum_{c=1}^K (m_c - 1)\hat{\Sigma}_c = \dfrac{1}{m - K} \sum_{c=1}^K \sum_{j : l_j = c} (X_j - \hat{\mu}^c)(X_j - \hat{\mu}^c)^T$, and

Class prior probability: $\hat{\pi}_c = m_c / m$.

Dudoit et al (2002) reported surprisingly good results with a simplified version of the linear discriminant function. In this version, only the diagonal elements of the common covariance matrix are used. More specifically,

$$\delta_c(Y) = \sum_{i=1}^n \frac{(Y_i - \hat{\mu}_i^c)^2}{\sum_{j=1}^m \sum_{k=1}^K I(l_j = k)(X_{ij} - \hat{\mu}_i^k)^2} = \sum_{i=1}^n w_{ii} (Y_i - \hat{\mu}_i^c)^2.$$

We can interpret $\delta_c(Y)$ as the (squared) “weighted” Euclidean distance between Y and the class mean $\hat{\mu}^c$. More weight is given to genes with expression values close to the class means. This means that even if Y is closer to $\hat{\mu}^1$ than to $\hat{\mu}^2$ (according to the ordinary Euclidean distance measure), Y may still be predicted to belong to class 2 if the class 1 training samples exhibit less variation than the class 2 training samples. Dudoit et al refer to class prediction using this function as diagonal linear discriminant analysis (DLDA).
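A DLDA predictor is particularly simple to write down. The MATLAB sketch below is illustrative, with assumed names; it uses the pooled within-class sum of squares per gene as the weight, exactly as in the formula above.

function c = dlda_predict(Y, X, L)
% Diagonal linear discriminant analysis: assign Y (n-by-1) to the class whose
% mean is closest in the weighted Euclidean sense defined above.
classes = unique(L);
K = numel(classes);
n = size(X, 1);
mu  = zeros(n, K);
wss = zeros(n, 1);                        % pooled within-class sum of squares per gene
for k = 1:K
    Xk = X(:, L == classes(k));
    mu(:, k) = mean(Xk, 2);
    wss = wss + sum((Xk - repmat(mu(:, k), 1, size(Xk, 2))).^2, 2);
end
delta = zeros(1, K);
for k = 1:K
    delta(k) = sum((Y - mu(:, k)).^2 ./ wss);
end
[~, best] = min(delta);
c = classes(best);
end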
4.3 Weighted gene voting

In section 3.2 it was described how Golub et al (1999) propose to rank genes, and how to select a subset, the so-called prediction set S, consisting of the most relevant genes. To predict the class membership of a sample Y, each gene $i \in S$ casts a weighted vote:

$$V(i, Y) = P(i, L)(\tilde{y}_i - b_i),$$

where

$$\tilde{y}_i = \frac{\log y_i - \mu_i}{\sigma_i}, \qquad b_i = \frac{\tilde{\mu}_i^1 + \tilde{\mu}_i^2}{2},$$

and $y_i$ is the expression level (before log-transformation) of the ith gene in sample Y. The $\tilde{\mu}_i^1$, $\tilde{\mu}_i^2$, $\mu_i$ and $\sigma_i$ are defined in Section 3.

Note that $V(i, Y) < 0$ either if $P(i, L) < 0$ and $\tilde{y}_i > b_i$, or if $P(i, L) > 0$ and $\tilde{y}_i < b_i$. In the first case, since $\tilde{\mu}_i^2 > \tilde{\mu}_i^1$, we see that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. In the second case, since $\tilde{\mu}_i^1 > \tilde{\mu}_i^2$, we see again that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. Golub et al interpret this as a vote for class 2. By similar reasoning, $V(i, Y) > 0$ can be interpreted as a vote for class 1.

Next, we compute $V_1$ and $V_2$, the total votes for each class, and the prediction strength PS:

$$V_1 = \sum_{i : V(i, Y) > 0} V(i, Y), \qquad V_2 = \sum_{i : V(i, Y) < 0} |V(i, Y)|,$$

and

$$PS = \frac{\max(V_1, V_2) - \min(V_1, V_2)}{V_1 + V_2}.$$

The sample Y is predicted to belong to the class with largest total vote unless PS is too small. (Golub et al used 0.3 as threshold value, that is, the difference in votes between the “winner” and the “loser” must be at least 30% of the total number of votes.)
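The voting step for a new sample can be sketched in a few lines of MATLAB. The sketch is illustrative; all vectors are restricted to the genes in the prediction set S, and the names (y, P, mu1t, mu2t, mu, sigma) are assumptions for quantities defined in sections 3.2 and above.

% y: raw (pre-log) expression levels of the new sample for the genes in S;
% P: gene-class correlation scores; mu1t, mu2t: normalized class means;
% mu, sigma: per-gene mean and standard deviation used for normalization.
yt = (log(y) - mu) ./ sigma;             % normalized expression levels
b  = (mu1t + mu2t) / 2;                  % decision boundaries b_i
V  = P .* (yt - b);                      % votes V(i,Y)
V1 = sum(V(V > 0));                      % total vote for class 1
V2 = sum(abs(V(V < 0)));                 % total vote for class 2
PS = (max(V1, V2) - min(V1, V2)) / (V1 + V2);
if PS < 0.3
    label = NaN;                         % prediction withheld: strength too low
elseif V1 > V2
    label = 1;
else
    label = 2;
end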
4.4 The Arc-fs boosting method

To construct a prediction function, boosting methods repeatedly apply a weak classification method[2] to modified versions of the training data, producing a sequence $f_i(Y)$, $i = 1, 2, \ldots, M$, of prediction functions. The resulting prediction function is a weighted vote:

$$\hat{f}(Y) = \mathrm{sign}\left( \sum_{i=1}^M \alpha_i f_i(Y) \right),$$

where the $\alpha_i$, $i = 1, 2, \ldots, M$, are computed by the boosting method. (Here we assume that there are only two classes, and that they are labeled with $\pm 1$.)

Several variations on boosting have been proposed. Breiman (1998) presented the following variation (referred to as Arc-fs):

1. Initialize the sampling probabilities: $p_j = 1/m$, $j = 1, 2, \ldots, m$.
2. For $i = 1, 2, \ldots, M$ do:
   a. Using the current sampling probabilities, draw m samples (with replacement) from the original training set S to create a new training data set $\hat{S}$.
   b. Fit a prediction function $f_i(Y)$ to $\hat{S}$.
   c. Compute $\varepsilon_i = \sum_{j=1}^m p_j I(l_j \neq f_i(X_j))$. (If $\varepsilon_i = 0$ or $\varepsilon_i \geq 0.5$, Breiman recommends starting all over at Step 1.)
   d. Compute $\beta_i = (1 - \varepsilon_i)/\varepsilon_i$.
   e. Set $p_j \leftarrow \dfrac{p_j\, \beta_i^{I(l_j \neq f_i(X_j))}}{\sum_{j'=1}^m p_{j'}\, \beta_i^{I(l_{j'} \neq f_i(X_{j'}))}}$, $j = 1, 2, \ldots, m$.
3. Output $\hat{f}(Y) = \mathrm{sign}\left( \sum_{i=1}^M \log(\beta_i)\, f_i(Y) \right)$.

As we can see, misclassified samples get their sampling probabilities increased, and thus become more likely to be selected in Step 2a. The idea is that this will force successive prediction functions to improve their accuracy on these samples. As the weak classification method, one can use a very simple “classification tree”:

$$f(Y) = \begin{cases} d, & Y(i) > t \\ -d, & Y(i) < t, \end{cases}$$

where the parameters $d = \pm 1$, i and t are determined such that $f(Y)$, when applied to the training data, gives the smallest number of errors.

[2] A weak classification method is a method that is guaranteed to perform only slightly better than random guessing.
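The following MATLAB sketch implements the steps above with one-level “classification trees” (stumps) as the weak classifier. It is an illustrative sketch, not the code used for the experiments; all function and variable names are assumptions, the restart rule is applied without a cap on the number of restarts, and in practice each function would live in its own .m file.

function model = arcfs_train(X, L, M)
% Arc-fs boosting with decision stumps. X: n-by-m training matrix (genes in rows),
% L: 1-by-m labels in {-1,+1}, M: number of boosting rounds.
[~, m] = size(X);
p = ones(1, m) / m;                      % step 1: sampling probabilities
model.stumps = {};  model.logbeta = [];
i = 1;
while i <= M
    % Step 2a: draw m samples with replacement according to p.
    cdf = cumsum(p);  cdf(end) = 1;
    idx = arrayfun(@(u) find(u <= cdf, 1, 'first'), rand(1, m));
    % Step 2b: fit the weak classifier to the resampled training set.
    s = fit_stump(X(:, idx), L(idx));
    % Step 2c: weighted error on the original training set.
    miss = (stump_predict(s, X) ~= L);
    e = sum(p(miss));
    if e == 0 || e >= 0.5                % restart at step 1, as Breiman recommends
        p = ones(1, m) / m;  i = 1;
        model.stumps = {};  model.logbeta = [];
        continue;
    end
    % Steps 2d-2e: compute beta and update the sampling probabilities.
    b = (1 - e) / e;
    p = p .* b.^miss;  p = p / sum(p);
    model.stumps{i} = s;  model.logbeta(i) = log(b);
    i = i + 1;
end
end

function yhat = arcfs_predict(model, Y)
% Step 3: weighted vote of the fitted weak classifiers.
v = 0;
for i = 1:numel(model.stumps)
    v = v + model.logbeta(i) * stump_predict(model.stumps{i}, Y);
end
yhat = sign(v);
end

function s = fit_stump(X, L)
% Exhaustive search for the gene i, threshold t and sign d with fewest training errors.
[n, m] = size(X);
best = m + 1;
for i = 1:n
    for t = unique(X(i, :))
        for d = [-1 1]
            pred = d * sign(X(i, :) - t);  pred(pred == 0) = d;
            err = sum(pred ~= L);
            if err < best
                best = err;
                s = struct('gene', i, 'thresh', t, 'sign', d);
            end
        end
    end
end
end

function pred = stump_predict(s, X)
% Apply a stump to one or more samples (columns of X).
pred = s.sign * sign(X(s.gene, :) - s.thresh);
pred(pred == 0) = s.sign;
end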
4.5 Support vector machine classification

Support vector machines (SVM) (Vapnik (1999)) have become popular classifiers within the machine learning community. SVMs have been applied to gene expression data in several publications, e.g. Mukherjee et al (1999), Furey et al (2000), and Ben-Dor et al (2000).

The training phase of an SVM consists in solving the following optimization problem. Given the training data $S = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in R^n,\ l_i \in \{-1, +1\}\}$, a kernel function $K(x, y)$, and a positive real number C, find $\alpha_1, \ldots, \alpha_m$ that maximize

$$\sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j l_i l_j K(X_i, X_j)$$

subject to

$$\sum_{i=1}^m \alpha_i l_i = 0, \qquad C \geq \alpha_i \geq 0,\ i = 1, \ldots, m.$$

The prediction function is

$$f(Y) = \mathrm{sgn}\left( \sum_{i=1}^m \alpha_i^* l_i K(X_i, Y) + b^* \right),$$

where $\alpha_1^*, \ldots, \alpha_m^*$ are the solutions to the optimization problem, and $b^*$ is chosen such that $l_i f(X_i) = 1$ for any i with $C > \alpha_i^* > 0$. Examples of kernel functions are:

• Gaussian Radial Basis Function (RBF): $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$.
• Polynomials of degree d: $K(x, y) = (x \cdot y + 1)^d$.
• Multi-layer Perceptron function: $K(x, y) = \tanh(\kappa\, x \cdot y - \delta)$ (for some values of $\kappa$ and $\delta$).

5 Evaluation

We have implemented all of the gene selection and class prediction methods described above in MATLAB (except the SVM method, where we used the implementation available in the OSU Support Vector Machines (SVMs) Toolbox). In this section, we present the results of applying these methods to various data sets.

5.1 Data sets

Unfortunately, it is not so easy to find appropriate collections of classified mRNA samples. Dudoit et al (2002) use only three such data sets: lymphoma, leukemia, and NCI 60. Ben-Dor et al (2000) use, in addition to the leukemia data, colon and ovarian data. Here, we only use one such collection: the leukemia data. Instead, we use computer-generated data sets. Of course, these data sets cannot substitute for real data sets, but they allow systematic studies of e.g. sensitivity to “noise” that would be hard to do using only a few real data sets.

5.1.1 Leukemia data

For the leukemia data, gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing 6,817 human genes. There is a total of 72 samples, of which 47 have been classified as ALL and the remaining as AML. After preprocessing as described in section 2, 3,517 genes remain.

5.1.2 Simulation of gene expression data

We developed a MATLAB function for generating “mRNA samples”. The input parameters are:

• Total number of genes;
• Number of informative genes;
• Total number of samples;
• Number of samples with class label “1” (the remaining samples are labeled “2”).

The expression levels of a non-informative gene are normally distributed with standard deviation equal to 1500, and a gene-specific mean. The latter is determined by drawing a random number between 3000 and 5000. For an informative gene, it is first decided (with equal probability) if it is to be high for class 1 and low for class 2, or vice versa. In the first case (the second case is treated analogously), the expression values for class 1 are normally distributed (again with standard deviation 1500) with mean 5000, while the class 2 expression values are normally distributed (again with standard deviation 1500) with mean 3000. The output from the function consists of the gene expression matrix and the class labels array.
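A generator of this kind follows directly from the description above. The sketch below is a reconstruction for illustration, not the original function; its name and signature are assumptions.

function [X, L] = simulate_expression(nGenes, nInformative, nSamples, nClass1)
% Generate a simulated gene expression matrix as described in section 5.1.2.
% Non-informative genes: normal with sd 1500 and a gene-specific mean drawn
% uniformly between 3000 and 5000. Informative genes: mean 5000 for one class
% and 3000 for the other (chosen with equal probability), sd 1500 in both classes.
L = [ones(1, nClass1), 2*ones(1, nSamples - nClass1)];
X = zeros(nGenes, nSamples);
for i = 1:nGenes
    if i <= nInformative
        if rand < 0.5
            mu = 5000*(L == 1) + 3000*(L == 2);   % high in class 1, low in class 2
        else
            mu = 3000*(L == 1) + 5000*(L == 2);   % low in class 1, high in class 2
        end
    else
        mu = (3000 + 2000*rand) * ones(1, nSamples);
    end
    X(i, :) = mu + 1500*randn(1, nSamples);
end
end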
5.2 Evaluation of gene selection methods

In section 3, we described five gene selection methods: the t-test, the method of Golub et al, the method of Dudoit, Fridlyand and Speed, the TNoM method, and the method of Park, Pagano and Bonetti. In this section, we investigate to which extent these methods agree in their ranking of the “informativeness” of the genes, and if any method can be regarded as more or less correct than the others.

For the sake of brevity, in the following we refer to the methods of:

1. Golub et al as the “GOLUB method”,
2. Dudoit, Fridlyand and Speed as the “DFS method”, and
3. Park, Pagano and Bonetti as the “PPB method”.

5.2.1 Leukemia data

Each method first ranked the genes and then selected the 1%, 5%, 10%, 15%, and 20% most informative genes. The sets of selected genes were then compared with each other. The overall agreement, that is, the proportion of genes selected by all methods, varied around 60%. For example, when each method selected the 1% (i.e. 36) most informative genes, 21 genes were selected by all methods.

The methods were also compared pairwise. The two methods that disagree most are the t-test and the TNoM methods. Their overlap (that is, the extent to which they select the same genes) varies around 66%. (However, the TNoM method tends to disagree also with the other methods.) At the other end of the spectrum, the DFS and GOLUB methods have an overlap that ranges between 78% (when the 1% most informative genes were selected) and 93%.

5.2.2 Computer-generated data

The data consists of 50 samples (with class labels evenly divided over two classes) with expression levels for 1000 genes, of which 10% are informative. Each method selected the 1%, 5%, 10%, 15%, and 20% most informative genes, and the sets were compared in the same manner as for the leukemia data. This process was repeated fifty times, and the average overlaps were computed.

The overall agreement varied from 52% (when the 1% most informative genes were selected) to 90% (when the 10% most informative genes were selected). The latter result indicates that all methods successfully identified the informative genes. The pairwise comparisons showed a very high degree of agreement (96% to 100%) between the t-test and DFS methods. Again, the TNoM method disagreed most with the other methods. In particular, it disagreed with the GOLUB method, where the overlap varied between 58% and 92%.

We also investigated how successful the methods were in identifying the genes that had been created to be informative. This was done by counting how many of the 10% most informative genes selected by the methods were also created to be informative. On average, 96 of the 100 most informative genes selected by the t-test and DFS methods had been created to be informative. The TNoM and PPB methods were slightly less successful; on average, they selected 93.5% of the informative genes.

5.2.3 Conclusions

The above results indicate that the methods essentially agree on which genes are most informative. Therefore, in the following we only use the DFS gene selection method (except that we use the GOLUB method together with the class prediction method of Golub et al).
5.3 Evaluation of class prediction methods

In evaluating the accuracy of the class prediction methods, we follow the framework proposed by Radmacher et al (as described in section 1). That is, we use leave-one-out cross-validation (LOOCV). More specifically, to evaluate the accuracy of a method, we perform the following procedure:

#Errors = 0;
For each $(X_j, l_j)$ in $T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in R^n,\ l_i \in \{1, 2, \ldots, K\}\}$ do:
  Form the training set $\hat{T} = T - \{(X_j, l_j)\}$;
  If required, rank the genes in $\hat{T}$ and select the most informative (let S denote the selected genes);
  Construct the class prediction function $\hat{f}(Y)$ using $\hat{T}_S$ ($\hat{T}_S$ denotes the restriction of $\hat{T}$ to the genes in S; if no gene selection was done, $\hat{T}_S = \hat{T}$);
  If $\hat{f}((X_j)_S) \neq l_j$, #Errors = #Errors + 1.

In the following, whenever we refer to the number of errors (or error ratio), it is the number of errors computed by the above procedure that we mean. The procedure itself we refer to as the LOOCV procedure.
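This procedure translates directly into MATLAB. The sketch below is illustrative only; the classifier and the gene selection are supplied as function handles, and all names (loocv_errors, trainFcn, predictFcn, selectFcn) are assumptions. Passing [] for selectFcn uses all genes.

function nErrors = loocv_errors(X, L, trainFcn, predictFcn, selectFcn)
% Leave-one-out cross-validation as in the LOOCV procedure above.
% trainFcn(X, L) returns a prediction model, predictFcn(model, Y) a class label,
% and selectFcn(X, L) the indices of the selected genes.
m = size(X, 2);
nErrors = 0;
for j = 1:m
    train = setdiff(1:m, j);
    Xt = X(:, train);  Lt = L(train);
    if isempty(selectFcn)
        S = 1:size(X, 1);
    else
        S = selectFcn(Xt, Lt);               % gene selection on the training fold only
    end
    model = trainFcn(Xt(S, :), Lt);
    if predictFcn(model, X(S, j)) ~= L(j)
        nErrors = nErrors + 1;
    end
end
end

For example, with the dlda_predict sketch from section 4.2 one could call loocv_errors(X, L, @(Xt, Lt) struct('X', Xt, 'L', Lt), @(mdl, Y) dlda_predict(Y, mdl.X, mdl.L), []).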
5.3.1 Prediction without gene selection

We begin by evaluating the accuracy of the methods if no gene selection is performed. That is, we construct the prediction functions using training sets that may contain many irrelevant genes. Observe, however, that in the case of the leukemia data many irrelevant genes were removed by the preprocessing procedure. It is therefore possible that the proportion of informative genes is rather high. For the computer-generated data, on the other hand, we know that, except for the genes that we created to be informative, all other genes are irrelevant.

5.3.1.1 Leukemia data

All of the methods performed quite well on the leukemia data: after some parameter “fine-tuning”, no method made more than three errors. Moreover, two samples (numbers 66 and 67) were misclassified by almost all methods, and it is possible that these samples have been incorrectly labeled. What distinguishes the methods is how fast they run. The following table gives the execution time (in seconds) of the LOOCV procedure for each method:

  Method    Arc-fs (M = 100)   DLDA   GOLUB[3]   k-NN (k = 5)   SVM (linear kernel)
  Time (s)  4156               13     99         13             40

[3] GOLUB refers here to the class prediction method proposed by Golub et al (1999).

We end this section with some brief comments on some of the methods.

Arc-fs: As can be seen from the table, the training phase is very time consuming, at least compared to the other methods. (An experiment described in Breiman (1998) requires 4 hours of CPU time.) Therefore, only the following (rather low) M-values were tried: 1, 5, 10, 30, 50, and 100. The number of errors decreased from 7 to 1 as the M-value increased. Note that, since Arc-fs uses random sampling, the number of errors may vary even if the same training data and M-value are used. This is particularly likely if the M-value is small.

k-NN: Various k-values were tried out. For k less than 30, the number of errors varied between 1 and 3. For higher k-values, the number of errors increased dramatically.

SVM: With the linear polynomial kernel, there were two errors, while the radial basis kernel resulted in 25 errors! From now on, we use only the linear kernel SVM.

5.3.1.2 Computer-generated data

In the following experiments, the class prediction methods were applied to computer-generated data sets containing 50 samples (equally divided between classes 1 and 2) with expression levels for 1000 genes. For each experiment, 25 data sets were generated and the average numbers of errors were recorded. In the first set of experiments, we varied the number of informative genes. The results are as follows:

  #Informative genes   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 10)   SVM
  100                  4.5               0      0       0               0
  50                   4                 0.04   0.08    0.2             0.04
  20                   5.8               3.2    5.3     7.2             3.2

All methods perform worse as the number of informative genes decreases. For the k-NN method, the accuracy decreases more significantly than for the other methods. The Arc-fs method seems to be rather insensitive to the number of informative genes.

The above results indicate that the methods are quite robust. Even with 90% irrelevant genes, they (except for the Arc-fs method) have zero error rates. The next table shows what happens when the 100 informative genes are made less informative by increasing the standard deviation of their expression levels from 1500 to 2000 and 2500.

  Standard deviation   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 5)   SVM
  1500                 4.5               0      0       0              0
  2000                 7.48              0      0.04    0.2            0
  2500                 11.48             0.32   0.4     2.28           0.36

Not surprisingly, the error ratios increase as the standard deviation increases, but the DLDA, GOLUB and SVM methods still have error ratios close to zero.
5.3.2 Prediction with gene selection

Several researchers claim that by removing irrelevant (or less informative) genes from the training data, the corresponding class prediction function becomes more accurate. In this section, we try to experimentally confirm this claim.

5.3.2.1 Leukemia data

We performed class prediction using the 1%, 5%, 10%, 15% and 20% most informative genes. The numbers of prediction errors are summarized in the following table:

  % Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
  1                  2                  2      3       3              6
  5                  2                  2      3       2              3
  10                 1                  2      2       3              3
  15                 2                  2      2       1              3
  20                 3                  2      2       1              3
  All genes          3                  1      1       1              2

The main conclusion that we can draw from these results is that gene selection does not necessarily lead to improved accuracy. In fact, for the leukemia data it is better to use all genes than the most “informative”! The only method that seems to benefit from gene selection is the Arc-fs method: with fewer genes, we can increase M and still get reasonable execution times.

5.3.2.2 Computer-generated data

The results in section 5.3.1.2 show that without gene selection and with 95% irrelevant genes, the error rates are very close to zero. The following experiments were performed in the same way as in section 5.3.1.2, except that 98% of the genes were irrelevant:

  % Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
  1                  2.32               1.12   1.24    1.64           1.84
  5                  2                  0.84   1.16    0.72           1.08
  10                 2.48               1.36   1.68    1.80           1.72
  15                 2.48               1.76   2.32    2.84           2.32
  20                 3.16               2.36   3.2     3.44           2.28
  All genes          2.04               2.76   3.96    6.44           2.64

These results show that when a data set contains an extremely high proportion of “irrelevant” genes, then gene selection may actually improve accuracy. All methods (except the Arc-fs method) achieved substantially better error rates when the 5% most informative genes were selected compared with when all genes are used. This is particularly noticeable for the k-NN method, for which the accuracy differs by almost an order of magnitude between optimal gene selection and no gene selection.
5.3.3 Permutation tests

As described in section 1, Radmacher et al recommend assessing cross-validation results by repeatedly permuting class labels and recording the numbers of errors computed by the LOOCV procedure for each permutation. The histogram given below shows the error distribution for a computer-generated data set consisting of 50 samples (equally divided between classes 1 and 2) with 1000 genes, of which 20 were informative. The labels were permuted 1000 times.

[Figure: histogram of the number of leave-one-out cross-validation errors over the 1000 label permutations.]

As we can see, the error rates for permuted labels are significantly higher than for non-permuted labels. The corresponding histogram for the leukemia data is almost identical to this histogram. The error rates that we have observed above are thus highly significant.
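A permutation test of this kind can reuse the hypothetical loocv_errors sketch from section 5.3. Again, this is an illustration only; trainFcn, predictFcn and selectFcn are assumed to be defined as in that sketch.

% e0: observed number of LOOCV errors for the true labels L.
e0 = loocv_errors(X, L, trainFcn, predictFcn, selectFcn);
B = 1000;                                % number of random label permutations
e = zeros(1, B);
for b = 1:B
    Lp = L(randperm(numel(L)));          % permute the class labels
    e(b) = loocv_errors(X, Lp, trainFcn, predictFcn, selectFcn);
end
significance = mean(e <= e0);            % proportion doing at least as well as observed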
5.3.4 Conclusions

The results presented in sections 5.3.1 and 5.3.2 indicate that all of the evaluated class prediction methods perform quite well (although the Arc-fs method seems to be mainly of theoretical interest). When applied to the leukemia data, the methods are essentially equally accurate. Only by increasing the proportion of irrelevant genes to 95% or higher (or by making the informative genes less informative) in the computer-generated data could we discover any differences between the methods.

If we must declare any method the “winner”, it must be the DLDA method. It is fast, easy to implement, and accurate. The k-NN method is also fast and easy to implement, but is more sensitive to noise than the DLDA method. However, it is unclear if this higher sensitivity makes any difference for real mRNA data. The SVM method is as accurate as DLDA, but not as easy to implement. The method proposed by Golub et al also performs well, but since there are well-known “general purpose” methods that perform equally well or better, it is not clear if their method has contributed anything to the state-of-the-art of cancer classification.

The notion that using only the most informative genes in the training data results in more accurate class prediction functions makes a lot of sense. However, as we have seen, it can actually lead to decreased accuracy. Since the class prediction methods appear to be rather insensitive to noise, it may be better to allow a limited proportion of irrelevant genes than to risk removing too many informative genes.

6 References

A. Ben-Dor, L. Bruhn, N. Friedmann, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. Proc. Fourth Annual Int. Conference on Computational Molecular Biology (RECOMB), 2000.

L. Breiman. Arcing classifiers. The Annals of Statistics, 26, 801-824, 1998.

M.P.S. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. National Academy of Sciences, 97:262-267, 2000.

J. Deutsch. Algorithm for finding optimal gene sets in microarray prediction. ?

S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. American Statistical Association, March 2002, Vol. 97, No. 457.

T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), p. 906-914, 2000.

T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, p. 531-537, 1999.

G. Getz, E. Levine, and E. Domany. Coupled two-way analysis of gene microarray data. Proc. National Academy of Sciences, 97:12079-84, 2000.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Submitted to Machine Learning?

T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, J. Alizadeh, and L. Staudt. Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University, 2000.

J. Khan, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and P. Meltzer. Classification and diagnostic prediction of cancers using expression profiling and artificial neural networks. Nature Medicine, 7(6), p. 673-679, 2001.

A. Keller, M. Schummer, L. Hood, and W. Ruzzo. Bayesian classification of DNA array expression data. Technical report UW-CSE-2000-08-01, Univ. Washington, 2000.

W. Li and Y. Yang. How many genes are needed for a discriminant microarray data analysis? lanl physics preprint archive xxx.lanl.gov, arXiv:physics/0104029 v1, 2001.

T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.

S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machines classification of microarray data. Technical report, MIT, 1999.
P.J. Park, M. Pagano, and M. Bonetti. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pacific Symposium on Biocomputing, 2001.

P. Pavlidis, J. Weston, J. Cai, and W. Grundy. Gene functional classification from heterogeneous data. Proc. Fifth International Conf. Computational Molecular Biology, 2001.

M.D. Radmacher, L.M. McShane, and R. Simon. A paradigm for class prediction using gene expression profiles. Technical Report 001, July 2001, National Cancer Institute.

D. Slonim, P. Tamayo, J. Mesirov, T. Golub, and E. Lander. Class prediction and discovery using gene expression data. Proc. Fourth Annual Int. Conference on Computational Molecular Biology (RECOMB), 2000.

O. Troyanskaya et al. Missing value estimation methods for DNA microarrays. Bioinformatics, p. 520-525, 2001.

V.N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.