Learning Bayesian Belief Network Classifiers for Proteome Analyst
CMPUT 551 Term Project Report
Zhiyong Lu
James Redford
Xiaomeng Wu
April 26, 2002
Table of Contents
1. ABSTRACT
2. INTRODUCTION
2.1 Description of the Task
2.2 Motivation
2.3 The Proteome Analyst
2.4 Our Solutions
2.5 Problems and Challenges
3. RELATED WORK
3.1 Proteome Analyst
3.2 NB vs. TAN
3.3 Discriminative Learning
4. APPROACHES
4.1 Overview
4.2 NB (generative vs. discriminative)
4.3 TAN
4.4 Neural Networks
4.5 Wrapper (Information Content)
4.6 Other approaches
5. EMPIRICAL ANALYSIS
5.1 Experimental Setup
5.1.1 Background on the Data Set
5.1.2 Training and Testing
5.2 Comparison of NB, TAN, and NN
5.3 Generative vs. Discriminative
5.4 Feature Selection—Wrapper
5.5 Miscellaneous Learning algorithms
5.6 Computational Efficiency
6. CONCLUSIONS and FUTURE WORK
7. REFERENCES
8. APPENDIX
1. Abstract
In this course project, we investigate several machine learning techniques for
a specific task: protein function prediction in Proteome Analyst. Naïve Bayes
has been applied to this problem with considerable success. However, it makes
many assumptions about data distributions that are clearly not true of
real-world proteome data. We empirically evaluate several variants of Naïve
Bayes, varying both the method by which parameters are learned (generative vs.
discriminative learning) and the BN structure (Naïve Bayes vs. TAN). We also
implement a Neural Network algorithm and use other existing tools, such as the
WEKA data mining system, to perform an empirical analysis of these systems on
the protein function prediction problem.
This report is organized as follows. In Section 2 we introduce the task in our
project, along with our motivation and the challenges we face. In Section 3 we
review previous work on Proteome Analyst and discuss some alternative solutions
to the classification problem. In Section 4 we present the detailed concepts of
these machine learning techniques and our implementations. In Section 5 we
examine the proteome classification application in detail and show the
comparative results of the different techniques. We conclude and point out some
future research directions in Section 6. Finally, the Appendix contains all the
experimental data used in this report.
2. Introduction
2.1 Description of the Task
Recently, more than 60 bacterial genomes and 5 eukaryotic genomes have
been completed. This explosion of DNA sequence data is leading to a
concomitant explosion in protein sequence data. Unfortunately, the function of
over half of these protein sequences is unknown. Therefore, the protein
function prediction problem has emerged as an interesting research topic in
bioinformatics. In our project, we are given protein sequences with known
classes; our goal is to apply several machine learning techniques to predict
the classes of unknown protein sequences. This is a typical machine learning
classification problem: learn from existing experience to perform the task
better.
2.2 Motivation
Typically it takes months or even years to determine the function of even a
single protein using standard biochemical approaches. A much quicker
alternative is to use computational techniques to predict protein functions.
Although there are many existing algorithms, such as Naïve Bayes, available
for protein function prediction, these often make assumptions about data
distributions that are clearly not true of real-world proteome data. The
challenge is that we need more general algorithms that do not rely on such
assumptions and still achieve high-throughput performance, in terms of both
classification accuracy and execution time.
2.3 The Proteome Analyst
Proteome Analyst is an application designed by the PENCE group at the
University of Alberta that carries out protein classification. The input to the
Proteome Analyst is a protein sequence, and the output is a predicted
classification. Figure 2.5.1 shows the architecture of the Proteome Analyst.
The input protein sequence is initially fed through PsiBlast, which is a tool that
does sequence alignment against a database, in this case SwissProt. The
three best alignment matches, called homologues, returned by PsiBlast are in
turn passed into a tokenizer. The tokenizer retrieves text descriptions of the
homologues from the SwissProt database and then extracts a number of text
tokens from these descriptions. These tokens are used as input into the
classifier. Currently, the PENCE classifier is implemented as a Naïve
Bayesian network (NB). The features used by the NB are binary and
correspond to the tokens. If a token exists in the input sequence’s
description then the value of the corresponding feature is 1; otherwise the
value is 0. The output of the NB is the classification of the input sequence.
[Figure: Protein Sequence → PsiBlast → Homologues → Tokenizer → Tokens →
Classifier → Classification, with the SwissProt database feeding PsiBlast and
the tokenizer]
Figure 2.5.1: Data flow architecture of the Proteome Analyst. Boxes, ovals, and arrows represent data,
filters, and data flow respectively. SwissProt is a database.
For our project, we are only concerned with the classifier portion of the
Proteome Analyst. We used data files that were already tokenized and the
data records already converted into classified vectors of binary features. See
Table 2.5.1 for an example.
Class F1 F2 F3 F4
A 1 0 0 0
A 0 1 0 1
B 1 0 1 0
B 0 1 1 0
B 0 0 1 1
B 0 0 1 1
B 1 1 1 1
B 0 0 1 0
Table 2.5.1: An example of the format of the data files used in our project.
2.4 Our Solutions
Naïve Bayes has been applied to protein function prediction with considerable
success by the PENCE group at the University of Alberta. We therefore focus on
two areas: the method by which parameters are learned, and the structure of
the BN. We also explore other machine learning techniques, such as Neural
Networks and Support Vector Machines (SVMs), for this specific problem. Our
goal is to find the classifier with the best performance in terms of both
classification accuracy and execution time in our empirical analysis. The
following is a summary of the machine learning techniques applied in our
project:
Naïve Bayes (Generative vs. Discriminative Learning)
TAN (Tree-augmented Naïve Bayes)
Neural Networks
Decision Tree, Rule Learner… (Using WEKA data mining system)
Support Vector Machine
2.5 Problems and Challenges
Our evaluation of these different machine learning techniques mainly involves
two criteria:
Classification Accuracy
Execution Time
During our experiments on real data, we found that, overall, the Naïve
Bayesian classifier outperforms the other techniques, even though it achieves
neither the best classification accuracy nor the shortest execution time in
our empirical study. Most of the other techniques perform better than Naïve
Bayes in one respect but lose significantly in the other. For example, the
Decision Tree classifier consistently achieves 5 to 10 percentage points
higher accuracy than Naïve Bayes, but it takes more than 5 times as long to
train. On the other hand, OneR in WEKA, another classifier, is easily trained
but has an accuracy of only 30%, which makes it unsuitable for our task.
Interestingly, we found an alternative approach, SVM (Support Vector Machine),
that achieves better classification accuracy with execution time comparable to
that of Naïve Bayes.
3. Related Work
3.1 Proteome Analyst
PA (Proteome Analyst) is an application designed by the PENCE group at the
University of Alberta that performs protein classification. Currently, a PA
user can upload a proteome that consists of an arbitrary number of protein
sequences in
FastA format. A PA user can configure PA to perform several function
prediction operations and can set up a workflow that will apply these
operations in various orders, under various conditions.
PA can be configured to use homology sequence comparison to compare
each protein against a database of sequences with known functions. Any
sequence with high sequence identity can then be assigned the function of its
homologues and removed from further analysis (or not). One or more
classification-based function predictors (built using machine learning
techniques) can also be applied to any sequence.
More importantly, PA users can easily train their own custom classification-
based predictors and apply them to their sequences. Many other function
prediction operations are currently being developed and will be added to PA.
3.2 NB vs. TAN
The NB and TAN components of this project were primarily based on work
done by Friedman, Geiger, and Goldszmidt as described in their 1997 paper
“Bayesian Network Classifiers” [1]. Friedman et al. compare NBs to TANs on
a variety of data sets. They found that in most cases TAN methods were
more accurate than Naïve Bayesian methods. Our goal is to determine whether
TANs are more accurate than NBs for the PENCE data sets.
Jia You and Russ Greiner, from the University of Alberta, have also done work
on comparing different Bayesian classifiers, including NB and TAN classifiers
[4].
3.3 Discriminative Learning
Naïve Bayes and TAN are two different types of belief net structure; both
first learn a network structure and then fill in the CPtable attached to each
node. Essentially, these learners use the parameters that maximize the
likelihood of the training samples [6]. Their goal is to produce a model as
close as possible to the distribution of the data, which is the core idea of
“generative classification”.
In general, there are two ways to make classification decisions: generative
learning and discriminative learning. Generative learning builds a model over
the input examples in each class and classifies based on how well the
resulting class-conditional models explain a new input example. The other
method, discriminative learning, views the classification problem from a quite
different angle. It aims to maximize classification accuracy rather than build
the most accurate model of the underlying distribution.
Thus, after obtaining a fixed structure, the effort goes into seeking the
parameters that maximize the conditional likelihood of the class label ci
given the instance ei. R. Greiner and W. Zhou have done related research on
discriminative parameter learning of belief net classifiers in the general
case and found that this kind of learning works effectively over a wide
variety of situations [5].
4. APPROACHES
4.1 Overview
In our project, more than one machine learning technique has been adopted,
each with its own advantages and viewpoint.
Among probabilistic learners, we have implemented Naïve Bayesian networks
(generative and discriminative) and TAN. (Note: our implementation differs
from the existing system the PENCE group used; all the code is built from
scratch, and the id3 file format is adopted.) For these two classifiers,
various experiments have been carried out by tuning different parameters.
One important characteristic of our project is dealing with datasets that have
thousands of features. All three datasets we investigate (ecoli, yeast, and
fly) have more than 1500 features. Problems arise when dealing with
applications with many features: irrelevant features provide little
information, and noisy features can make results worse. Moreover, standard
algorithms do not scale well. We adopted the “wrapper” approach (with
information content) in our project to handle the feature selection problem.
For neural networks, we extended existing code to fit our project. We have
also tried several other techniques using available implementations, such as
the WEKA suite and SVM (for multiple classes).
For all of the experiments with the above algorithms, cross-validation is the
technique we use to obtain a reliable testing accuracy. We also take running
efficiency into account in terms of execution time. Finally, comparisons of
these different algorithms will be presented.
4.2 NB (generative vs. discriminative)
4.2.1 Overview
Naïve Bayesian networks are among the most practical learning methods for
classification problems. They are applicable when dealing with large training
data sets in which the attributes that describe instances are conditionally
independent given the classification labels.
The structure of a Naïve Bayesian network is simple and elegant, based on the
assumption that attributes are independent given the class labels. Nodes are
variables, and links between nodes represent causal dependencies. In an NB,
the node for the class label serves as the root of the tree, all the features
are child nodes of the root, and there are no links among the child nodes.
Each node has an attached CPtable, which holds the parameters to learn for
this structure. Every entry in a CPtable is of the form P(child|parent).
In generative learning, the CPtable entries are populated with empirical
frequency counts. In discriminative learning for a given fixed structure (NB
here), the CPtable is updated after each labeled training query, trying to
optimize the classification error score.
Inference in an NB is based on Bayes’ theorem and is carried out by picking
argmax_vj [ P(vj) Π_i P(ai|vj) ], where the ai are the attribute values and vj
is a class label.
4.2.2 Learning Structure
The Naïve Bayes learning algorithm proceeds as follows:
For each target value Vj
    P’(Vj) ← estimate P(Vj)
    For each attribute value ai of each attribute A
        P’(ai|Vj) ← estimate P(ai|Vj)
Indeed, at this point we have also filled in each CPtable needed for
generative learning.
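As a concrete illustration, the generative step above can be sketched for binary features as follows. This is our own minimal sketch, not the PENCE code; the function names and the Laplace-style initialization of the counts to 1 (mirroring the smoothing used for TAN later in this report) are illustrative assumptions:

```python
from collections import defaultdict

def train_nb(records):
    """records: list of (class_label, binary_feature_vector) pairs."""
    class_counts = defaultdict(int)
    # feat_counts[c][i] = number of class-c records with feature i equal to 1,
    # initialized to 1 (Laplace smoothing) to avoid zero probabilities.
    feat_counts = {}
    n_feats = len(records[0][1])
    for c, feats in records:
        class_counts[c] += 1
        if c not in feat_counts:
            feat_counts[c] = [1] * n_feats
        for i, v in enumerate(feats):
            feat_counts[c][i] += v
    total = len(records)
    priors = {c: n / total for c, n in class_counts.items()}
    # P(f_i = 1 | c); the +2 in the denominator accounts for the two buckets
    cond = {c: [feat_counts[c][i] / (class_counts[c] + 2)
                for i in range(n_feats)] for c in class_counts}
    return priors, cond

def classify_nb(priors, cond, feats):
    """Pick argmax_c P(c) * prod_i P(f_i | c)."""
    best, best_score = None, -1.0
    for c in priors:
        score = priors[c]
        for i, v in enumerate(feats):
            p1 = cond[c][i]
            score *= p1 if v == 1 else (1 - p1)
        if score > best_score:
            best, best_score = c, score
    return best
```

For instance, training on the eight records of Table 2.5.1 and classifying the vector (1, 1, 1, 0) selects class B.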
4.2.3 Discriminative Learning
As mentioned above, the parameters set in generative learning need not
maximize classification accuracy. However, a good classifier is one that
produces the appropriate answers to unlabeled instances as often as
possible [5]. The “classification error” is usually defined as:
    Err = P( class(e) ≠ c )  over labeled instances <e,c>
This can be approximated by the empirical score over a sample S:
    Err’ = (1/|S|) Σ_{<e,c> ∈ S} [ class(e) ≠ c ]
In discriminative learning for Naïve Bayesian Network, the goal is to learn the
Cptable entries for the given NB structure to produce the smallest empirical
error score above.
Therefore, the “log conditional likelihood” of the given NB over the
distribution of labeled instances is used:
    LCL = Σ_{<e,c>} P(e, c) log P(c|e)
Similarly, this log conditional likelihood can be approximated over a sample S
by:
    LCL’ = (1/|S|) Σ_{<e,c> ∈ S} log P(c|e)
To obtain the CPtable entries with the optimal conditional likelihood, a
simple gradient-descent algorithm is used [5]. The CPtable is initialized in
the usual (frequency-count) way, and the empirical error score is then
improved by changing the value of each CPtable entry.
In the implementation, “softmax” parameters are adopted. Their advantage is
that they preserve the probability properties: each entry stays in the range
between 0 and 1, and each conditional distribution sums to one.
Therefore, similar to the weight updates in a neural net, given a set of
labeled queries in the training phase, the learning algorithm moves in the
direction of the total derivative, which is the sum of the individual
derivatives. For a single labeled instance <e,c>, the partial derivative with
respect to the softmax parameter behind the CPtable entry θ_r|f is:
    P(r, f | e, c) − P(r, f | e) − θ_r|f [ P(f | e, c) − P(f | e) ]
In the specific NB structure, computing the derivative is relatively
inexpensive, because for each CPtable investigated, the parent is the class
label node; this special belief network structure greatly reduces the
computational complexity of the implementation.
There are also other speed-up techniques, such as “line search” to determine
the learning rate, and conjugate gradient. We did not try these, but we did
take advantage of the observation that when R is independent of C given E,
the derivative is zero [5].
4.3 TAN
4.3.1 Overview
Tree-augmented Naïve Bayesian networks (TANs [1]) are one approach we took
in this project. TANs are similar to regular NBs, but the features of a TAN
are organized into a tree structure. An example is given in Figure 4.3.1.
[Figure: a Class node with the five features F1–F5 as children, plus tree
edges among the features]
Figure 4.3.1: An example of a tree augmented naïve Bayesian network
The CP tables of a TAN are also similar to those of an NB. The difference is
that all of the CP tables for the feature nodes, except for the root node of
the tree, have an extra column to account for the extra parent. Figure 4.3.2
shows an example.
Class Parent Fi = 0 Fi = 1
C1 0 .556 .444
C1 1 .200 .800
C2 0 .101 .899
C2 1 .750 .250
Figure 4.3.2: An example of a CP table for a TAN
4.3.2 Learning Structure
Below is the algorithm we used to learn the TAN structure, taken from [1].
1. Calculate the conditional mutual information Ip between every two
features F1 and F2, given the classification C:
Ip(F1; F2 | C) = Σ_{f1,f2,c} P(F1=f1, F2=f2 | C=c) log [ P(F1=f1, F2=f2 | C=c) / ( P(F1=f1 | C=c) P(F2=f2 | C=c) ) ]
In our case we avoid zero counts by initializing the count buckets for
P(f1, f2 | c) to 1 instead of 0.
2. Construct a complete undirected graph, where every feature is a node
in the graph. Set the weights of the edges in the graph to the
corresponding Ip values between features.
3. Extract the maximum weighted spanning tree from the graph. In our
case, we used Kruskal’s minimum weighted spanning tree algorithm [2]
and modified it slightly to find the maximum weighted spanning tree.
4. Choose a node to be the root and direct all edges in the spanning tree
away from it, creating a tree. In our case we chose the feature with the
highest information gain to be the root node.
5. Add the classification node and make it a parent of all of the feature
nodes.
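Steps 1 to 3 can be sketched as follows, assuming binary features. The count buckets start at 1 as in step 1, the base-10 log matches the worked numbers in the example of section 4.3.4, and the function names are our own:

```python
import math
from collections import defaultdict

def cond_mutual_info(records, i, j):
    """Ip(Fi; Fj | C) with the (c, fi, fj) count buckets initialized to 1."""
    pair = defaultdict(lambda: 1)
    for c, feats in records:
        pair[(c, feats[i], feats[j])] += 1
    ip = 0.0
    for c in {c for c, _ in records}:
        n_c = sum(pair[(c, a, b)] for a in (0, 1) for b in (0, 1))
        for a in (0, 1):
            for b in (0, 1):
                p_ab = pair[(c, a, b)] / n_c                      # P(a, b | c)
                p_a = sum(pair[(c, a, y)] for y in (0, 1)) / n_c  # P(a | c)
                p_b = sum(pair[(c, x, b)] for x in (0, 1)) / n_c  # P(b | c)
                ip += p_ab * math.log10(p_ab / (p_a * p_b))
    return ip

def max_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm over (weight, u, v) edges, greedily taking the
    heaviest edge that does not close a cycle (steps 2 and 3)."""
    parent = list(range(n_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```

Run over the data of Table 4.3.1, `cond_mutual_info` reproduces Ip(F1; F2 | C) ≈ 0.0306, and the extracted spanning tree has the same total weight as the one in the example (ties among equal-weight edges may be broken differently).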
4.3.3 Learning CP Parameters and Classification
Given a data record with m features f1, f2, … , fm, classification picks
    Class = argmax_c { P(c) Π_i P(fi | p(fi), c) }
where 1 ≤ i ≤ m, and p(fi) is the value of feature fi’s parent. We consider
the root node to be its own parent ( p(froot) = froot ). Rewriting the
conditional probabilities as ratios, and then as counts:
    Class = argmax_c { P(c) Π_i P(fi, p(fi), c) / P(p(fi), c) }
    Class = argmax_c { nc Π_i ncijk / ncik }
where
    nc is the number of records with class c,
    ncijk is the number of records with class c where fi = j and p(fi) = k,
    ncik is the number of records with class c where p(fi) = k.
We simply count up the nc, ncijk, and ncik values to learn the CP table
entries. Again, to avoid problems when these counts are 0, we simply
initialize all entries to 1.
4.3.4 Example
Let’s say we are given the data in Table 4.3.1. First we determine the
structure of the TAN. Step 1 is to calculate the conditional mutual
information between every two features.
Class F1 F2 F3 F4
A 1 0 0 0
A 0 1 0 1
B 1 0 1 0
B 0 1 1 0
B 0 0 1 1
B 0 0 1 1
B 1 1 1 1
B 0 0 1 0
Table 4.3.1: Data for the TAN example
P(F1=0, F2=0 | C=A) = 1/6     P(F1=0, F2=1 | C=A) = 2/6
P(F1=1, F2=0 | C=A) = 2/6     P(F1=1, F2=1 | C=A) = 1/6
P(F1=0, F2=0 | C=B) = 4/10    P(F1=0, F2=1 | C=B) = 2/10
P(F1=1, F2=0 | C=B) = 2/10    P(F1=1, F2=1 | C=B) = 2/10
P(F1=0 | C=A) = 3/6     P(F1=1 | C=A) = 3/6
P(F1=0 | C=B) = 6/10    P(F1=1 | C=B) = 4/10
P(F2=0 | C=A) = 3/6     P(F2=1 | C=A) = 3/6
P(F2=0 | C=B) = 6/10    P(F2=1 | C=B) = 4/10
Note that we started each of the above buckets at 1 instead of 0 before
counting. That explains why the denominators are 6 and 10, instead of 2 and
6 respectively.
So using the above values we get
Ip(F1; F2 | C) = 1/6 log( 1/6 / (3/6 * 3/6) ) + 2/6 log( 2/6 / (3/6 * 3/6) ) +
2/6 log( 2/6 / (3/6 * 3/6) ) + 1/6 log( 1/6 / (3/6 * 3/6) ) +
4/10 log( 4/10 / (6/10 * 6/10) ) + 2/10 log( 2/10 / (6/10 * 4/10) ) +
2/10 log( 2/10 / (4/10 * 6/10) ) + 2/10 log( 2/10 / (4/10 * 4/10) )
= -0.0293485 + 0.0416462 + 0.0416462 + -0.0293485 +
0.018303 + -0.0158362 + -0.0158362 + 0.019382
Ip(F1; F2 | C) = 0.0306079
And similarly we get
Ip(F1; F3 | C) = 0.0022286
Ip(F1; F4 | C) = 0.0245954
Ip(F2; F3 | C) = 0.0022286
Ip(F2; F4 | C) = 0.0245954
Ip(F3; F4 | C) = 0
Step 2 is to create a complete undirected graph where the features are the
nodes and the Ip values are the edge weights. A graphical representation of
this graph is shown in figure 4.3.3. In our implementation we represent the
graph as an array of triplets <n1, n2, w> where n1 and n2 are the nodes that
the edge connects and w is the weight of the edge.
Graph = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246>, <2, 3, 0.0022>,
<2, 4, 0.0246>, <3, 4, 0> }
[Figure: complete graph on F1–F4 with edge weights .0306 (F1–F2),
.0022 (F1–F3), .0246 (F1–F4), .0022 (F2–F3), .0246 (F2–F4), and 0 (F3–F4)]
Figure 4.3.3: The conditional mutual information graph for the TAN example.
Step 3 is to extract a maximum weighted spanning tree from the graph. Our
algorithm generates the following max span tree, also shown in figure 4.3.4.
MaxSpanTree = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246> }
It is easy to verify that this is indeed a maximum weighted spanning tree.
[Figure: spanning tree with F1 joined to F2 (.0306), F3 (.0022), and
F4 (.0246)]
Figure 4.3.4: A maximum weighted spanning tree for the TAN example
In step 4 we choose the feature with the highest information content to be the
root node. The information contents of the features are given in Table 4.3.2.
We see that feature F4 has the highest information content, so it becomes the
root node. The following formula was used to calculate the information
contents:
Gain(F) = - P(F=0) log2( P(F=0) ) - P(F=1) log2( P(F=1) )
Step five involves simply adding the classification node as a parent of all
the feature nodes. Figure 4.3.5 shows the final TAN structure.
Feature   Information Content
F1        0.954434
F2        0.954434
F3        0.811278
F4        1.000000
Table 4.3.2: The information content of the features for the TAN example
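The information-content formula above is the binary entropy of a feature and can be sketched in a few lines (the helper name is our own):

```python
import math

def info_content(records, j):
    """Gain(F) = -P(F=0) log2 P(F=0) - P(F=1) log2 P(F=1), raw frequencies."""
    n = len(records)
    p1 = sum(feats[j] for _, feats in records) / n
    # 0 * log2(0) is taken as 0
    return sum(-p * math.log2(p) for p in (p1, 1 - p1) if p > 0)
```

On the data of Table 4.3.1 this reproduces the values in Table 4.3.2, e.g. 0.954434 for F1 and 1.000000 for F4.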
Now that the structure is set, we need to learn the CP table entries. The
parameters required are the ncijk, ncik, and nc values described in
section 4.3.3. Tables 4.3.3, 4.3.4, and 4.3.5 show the CP tables that contain
the ncijk, ncik, and nc entries respectively, for our example. Again, remember
that the ncijk entries were initialized to 1, not 0.
[Figure: the Class node is a parent of every feature; F4 is the root of the
feature tree, with child F1, whose children are F2 and F3]
Figure 4.3.5: The final TAN structure in the TAN example
P(F1 | F4, Class):               P(F2 | F1, Class):
Class  F4  F1=0  F1=1            Class  F1  F2=0  F2=1
A      0   1     2               A      0   1     2
A      1   2     1               A      1   2     1
B      0   3     2               B      0   4     2
B      1   3     2               B      1   2     2

P(F3 | F1, Class):               P(F4 | F4, Class):
Class  F1  F3=0  F3=1            Class  F4  F4=0  F4=1
A      0   2     1               A      0   2     1
A      1   2     1               A      1   1     2
B      0   1     5               B      0   4     1
B      1   1     3               B      1   1     4
Table 4.3.3: The ncijk CP table entries for the TAN example
Class F1 = 0 F1 = 1 Class F2 = 0 F2 = 1
A 3 3 A 3 3
B 6 4 B 6 4
Class F3 = 0 F3 = 1 Class F4 = 0 F4 = 1
A 4 2 A 3 3
B 2 8 B 5 5
Table 4.3.4: The ncik CP table entries for the TAN example
Class = A 6
Class = B 10
Table 4.3.5: The nc CP table entries for the TAN example
Now that both the structure and the CP table entries have been learned, we
can attempt to classify new instances. Consider the following unclassified
record:
Class F1 F2 F3 F4
? 1 1 1 0
P(Class = A) = nA * (nA110 * nA211 * nA311 * nA400) / (nA40 * nA11 * nA11 * nA40)
= 6 * (2 * 1 * 1 * 2) / (3 * 3 * 3 * 3)
= 0.296
P(Class = B) = nB * (nB110 * nB211 * nB311 * nB400) / (nB40 * nB11 * nB11 * nB40)
= 10 * (2 * 2 * 3 * 4) / (5 * 4 * 4 * 5)
= 1.200
Therefore we classify this example as ‘B’, since the (unnormalized) score for
B is greater than the score for A.
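The counting and scoring in this example can be sketched end-to-end. The tree is passed as a parent map (the root is its own parent), every ncijk bucket starts at 1, and nc and ncik are obtained by summing buckets; the names and representation are our own illustrative choices:

```python
from collections import defaultdict

def tan_counts(records, parent, n_feats):
    """Count the n_cijk buckets, every bucket initialized to 1 to avoid zeros."""
    n_cijk = defaultdict(int)
    classes = sorted({c for c, _ in records})
    for c in classes:
        for i in range(n_feats):
            for j in (0, 1):
                for k in (0, 1):
                    n_cijk[(c, i, j, k)] = 1
    for c, feats in records:
        for i in range(n_feats):
            n_cijk[(c, i, feats[i], feats[parent[i]])] += 1
    return classes, n_cijk

def tan_classify(classes, n_cijk, parent, n_feats, feats):
    """Score each class by n_c * prod_i n_cijk / n_cik and take the argmax."""
    root = next(i for i in range(n_feats) if parent[i] == i)
    best, best_score = None, -1.0
    for c in classes:
        # n_c and n_cik are derived by summing the smoothed n_cijk buckets
        n_c = sum(n_cijk[(c, root, j, k)] for j in (0, 1) for k in (0, 1))
        score = float(n_c)
        for i in range(n_feats):
            k = feats[parent[i]]
            n_cik = sum(n_cijk[(c, i, j, k)] for j in (0, 1))
            score *= n_cijk[(c, i, feats[i], k)] / n_cik
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```

With the Table 4.3.1 data and parents F4→F1, F1→F2, F1→F3, classifying the record (1, 1, 1, 0) yields the scores 0.296 for A and 1.2 for B, matching the hand calculation.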
4.3.5 Validation
We validated our TAN implementation by running it with the above example
and analyzing the verbose debugging output. We verified that the results
from that run were identical to results given in the above example.
4.4 Neural Nets
4.4.1 Overview
Artificial neural network learning provides a practical method for learning real-
valued and vector-valued functions over continuous and discrete-valued
attributes, in a way that is robust to noise in the training data. The
Backpropagation algorithm [3] is the most common network learning method
and has been successfully applied to a variety of learning tasks, such as
handwriting recognition and robot control. Neural nets are one of the major
techniques covered in the class.
4.4.2 Implementation
Unlike Naïve Bayes and TAN, which we implemented from scratch, our
implementation of neural nets is based on our assignment 4 from class. We
modified the Backpropagation algorithm, which was originally written for the
problem of face recognition. The major modification we made was to the input:
instead of using input nodes that represented images, we changed them to
represent the different features of our protein sequences. For the output
nodes, instead of representing the user’s head position or user id, etc., we
use them to represent the different class labels. Lastly, we changed the code
for estimating the classification accuracy, since the two problems are quite
different in this respect. For the initial value of each input node, our
strategy is: if a feature appears in a particular sequence, then the value of
that input node is 1; otherwise, it is set to 0. Correspondingly, for the
output nodes, we set to 1 the one of the 14 output nodes that represents the
correct class of the current sequence, and set the other 13 output nodes to 0.
The unit weights are set randomly in the beginning.
4.4.3 Example
For a specific protein sequence, the number of input nodes is the number of
features. The number of hidden nodes can be specified as a parameter. The
number of output nodes in all experiments is 14, since we have 14 different
classes in every dataset. Each output node represents one of the classes in
{a, b, c, d, e, f, g, h, i, j, k, l, m, n}.
[Figure: a feed-forward network with an input layer, one hidden layer, and an
output layer]
Figure 4.4.1 Learned Hidden Layer Representation
4.5 Wrapper (Information Content)
4.5.1 Overview
For our particular task, the data set scales up to thousands of features. Even
worse, some of these features are irrelevant and provide little to no
information. Also the features can be noisy. Standard algorithms do not scale
well with number of features, so the approach we use is “Wrapper”: Try
different subsets of features on learner, estimating performance of algorithm
with respect to each subset, and keep subset that performs best. Before
selecting the subset, we preprocess (weight) each feature according to its
mutual information content given by the formula below.
W_j = Σ_v Σ_c P(y=c, f_j=v) log [ P(y=c, f_j=v) / ( P(y=c) P(f_j=v) ) ]
We can see that this formula treats all the features independently.
4.5.2 Implementation
Step 1: Calculate the information content of each feature
We read in all of the training records first and then use the above formula to
compute the mutual information content for each feature. When this
preprocessing step is finished, we can begin to train the classifier in the next
step.
Step 2: Try different subsets of features
We begin by using all of the features to train the classifier. Then, in each
round, we remove the 5% of features with the lowest information content and
retrain the classifier. After 20 rounds, no features remain. We compare the
classification accuracies of these 20 rounds and choose the subset of
features that produced the highest prediction accuracy. If two features have
the same information content, we order them arbitrarily.
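Both steps can be sketched as below, with a caller-supplied `train_and_score(subset)` standing in for retraining a classifier and measuring its validation accuracy. The mutual information here uses a base-2 log; the absolute values in Table 4.5.1 appear to use a different base, but the resulting ranking of features is the same. All names and the 5% drop rate are illustrative:

```python
import math
from collections import Counter

def mutual_info(records, j):
    """W_j from the formula in section 4.5.1 (base-2 log; 0*log 0 taken as 0)."""
    n = len(records)
    joint = Counter((c, feats[j]) for c, feats in records)
    class_n = Counter(c for c, _ in records)
    value_n = Counter(feats[j] for _, feats in records)
    w = 0.0
    for (c, v), cnt in joint.items():
        p_cv = cnt / n
        w += p_cv * math.log2(p_cv / ((class_n[c] / n) * (value_n[v] / n)))
    return w

def wrapper_rounds(records, n_feats, train_and_score, drop_frac=0.05):
    """Step 2: repeatedly drop the lowest-information features, retrain via
    train_and_score(subset) -> accuracy, and keep the best-scoring subset."""
    ranked = sorted(range(n_feats), key=lambda j: mutual_info(records, j))
    step = max(1, int(n_feats * drop_frac))
    subset, best_subset, best_acc = list(ranked), None, -1.0
    while subset:
        acc = train_and_score(subset)
        if acc > best_acc:
            best_subset, best_acc = list(subset), acc
        subset = subset[step:]  # drop the lowest-information features first
    return best_subset, best_acc
```

On the eight-sequence example of section 4.5.3, feature #3 scores 0 and feature #8 scores highest, matching the ranking in Table 4.5.1.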
4.5.3 Example
Let us consider the following problem:
Suppose we have eight protein sequences in total, and each sequence has
exactly eight features. These eight sequences span four classes: {C, P, R, M}.
In the following table, for a particular protein sequence, the entry for
feature i is 1 if feature i appears, and 0 otherwise. For example, in the
first protein sequence, features I, II, and III appear and the others do not.
Seq. class I II III IV V VI VII VIII
1 C 1 1 1 0 0 0 0 0
2 C 1 1 1 1 0 0 0 0
3 P 0 0 1 0 0 1 0 1
4 P 0 0 1 0 0 0 1 1
5 R 1 1 1 0 0 0 0 0
6 R 0 1 1 0 0 1 0 0
7 M 0 0 1 0 1 1 0 0
8 M 0 1 1 1 0 0 0 0
Info. 0.352 0.352 0 0.156 0.147 0.102 0.147 0.406
Table 4.5.1: The information content of the eight features in the eight sequences
The last row shows the information content of each feature, computed by the
formula given above. As we can see, feature #3’s information content is 0,
which shows that this feature contains the least information about the data.
This is expected, since it appears in all eight sequences. On the other hand,
whenever feature #8 appears, the corresponding class is P in our example;
therefore, it is a significant discriminating feature in the data, and
accordingly its information content is the highest in this case.
The wrapper works by training a classifier using all the features in the
first round. In the following rounds, it removes a fixed number of features
per round, starting with those of lowest information content. For example, if
we decide to remove one feature at a time in our example, then we iterate
through 8 rounds, first removing feature #3 since its information content is
0, then feature #6 since 0.102 is the smallest among the remaining features,
and so on. In the last round, only feature #8 remains. We then choose the
subset of features with the highest accuracy observed during the eight
rounds.
4.6 Other approaches
4.6.1 Overview
Besides the primary techniques we implemented (Naïve Bayes, TAN, and neural
nets), we also applied some others, including both traditional techniques,
such as decision trees and rule learners, and a more recent approach, SVMs.
A decision tree is a class discriminator that recursively partitions the
training set until each partition consists entirely or predominantly of
examples from one class. Each non-leaf node of the tree contains a split
point, a test on one or more features, that determines how the data is
partitioned. It is the first classifier we learned in our class.
A rule learner is an alternative classifier, which can be built directly by
reading rules off a decision tree: generate a rule for each leaf by taking
the conjunction of all the tests encountered on the path from the root to
that leaf. The advantage of a rule learner is that it is easy to understand,
but sometimes it becomes more complex than necessary.
SVM (Support Vector Machine) is a method for creating functions from a
set of labeled training data. The function can be a classification function
or a general regression function. For classification, an SVM operates by
finding a hyper-surface in the space of possible inputs. This hyper-surface
attempts to split the positive examples from the negative examples, and the
split is chosen to have the largest distance from the hyper-surface to the
nearest of the positive and negative examples. Intuitively, this makes the
classification correct for testing data that is near, but not identical, to
the training data. SVMs are mature and widely used in NLP (Natural Language
Processing) problems such as text categorization.
4.6.2 Existing Tools
Instead of implementing all of the classifiers by ourselves, we chose to use
some existing machine learning tools to make life easier.
WEKA
Both the decision tree and rule learner classifiers are used through WEKA.
WEKA is a collection of machine learning algorithms for solving real-world
data mining problems. It is written in Java and runs on almost any platform. It
includes almost all of the existing classification schemes. It has decision trees,
rule learners and naïve Bayes. However, we will show in the next section that
WEKA does not seem capable of dealing with our datasets very well.
Libsvm
Libsvm is a simple, easy-to-use, and efficient software for SVM classification
and regression. Although WEKA has the SVM classifier, it only deals with
binary classifications, which is inappropriate for our task since we have 14
classes in our datasets. The most appealing feature of Libsvm is that it
supports multi-class classification. In addition, it can solve C-SVM
classification, nu-SVM classification, one-class-SVM, epsilon-SVM regression,
and nu-SVM regression.
5. Empirical Analysis
5.1 Experimental Setup
5.1.1 Background on the Data Set
Our three data sets were provided by the PENCE group at the University of
Alberta. Each data set contains thousands of protein sequences with known
classes. For each sequence, there are more than one thousand features. For
example, the Ecoli data set has more than two thousand sequences with
about 1500 features. See table 5.1.1.
Data Set   # of classes   # of sequences   # of features
Ecoli      14             2370             1504
Yeast      14             2539             1555
Fly        14             3823             1906
Table 5.1.1: The three data sets: Ecoli, Yeast, and Fly
5.1.2 Training and Testing
We train the classifier on each of the three datasets separately with different
techniques. We use 5-fold cross validation to compute the validation accuracy.
We implement Naïve Bayes, TAN, and Neural Nets in C. The WEKA code is
written in Java. Libsvm has both C and Java versions; we simply use the C
version in our experiments. All experiments are run on a machine at our
graduate office, an i686 machine running Linux 7.0 with 415MB of swap
memory.
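The 5-fold cross-validation procedure we use can be sketched as follows; the majority-class baseline below is just a placeholder for any of our classifiers, and the toy data is made up:

```python
from collections import Counter

def kfold_accuracy(data, k, train, predict):
    """Hold out each of k folds in turn; train on the rest, test on the fold."""
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct += sum(predict(model, x) == y for x, y in held_out)
        total += len(held_out)
    return correct / total

# Toy data and a majority-class baseline classifier (illustrative only).
data = [(i, 1) for i in range(7)] + [(i, 0) for i in range(7, 10)]
train = lambda rows: Counter(y for _, y in rows).most_common(1)[0][0]
predict = lambda model, x: model
print(kfold_accuracy(data, 5, train, predict))  # 0.7
```

Every accuracy figure reported in this section is a pooled accuracy of this form: each example is predicted exactly once, by the model trained without its fold.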
5.2 Comparison of NB, TAN, and NN
Figure 5.1.1 shows a comparison of Naïve Bayesian, Tree Augmented Naïve
Bayesian, and Neural Net classifiers. The accuracies given were obtained
using 5-fold cross validation.
[Figure 5.1.1 contains three bar charts: "Comparison of diff. classifiers
without wrapper" and "Comparison of diff. classifiers with wrapper", plotting
validation accuracy (0-100) for NB, TAN, and NN on Ecoli, Yeast, and Fly,
and "Effect of feature selection on different classifiers", plotting the
percentage of accuracy improvement (0-20) for the same classifiers and
data sets.]
Figure 5.1.1: A comparison of the accuracy of NB, TAN, and NN. The first graph shows the
comparative accuracies without using the wrapper. The second graph shows the maximum accuracy of
each method using the wrapper. The third graph shows the increase in accuracy when the wrapper is
used.
We see that the accuracies of the NB and TAN classifiers are roughly
equal, both with and without the wrapper, for all three data sets. Given that
TANs are more complicated to implement and take longer to train than NBs
[1], it is likely more practical to use NBs for the PENCE data rather than
TANs.
Neural networks perform noticeably better than both NBs and TANs in terms
of accuracy on all three datasets. This suggests that neural network
classifiers could be a promising area of future research for the Proteome
Analyst tool.
The third graph shows the percentage improvement in accuracy obtained by
using the wrapper. We note that the wrapper seems to have a similar effect
on both NB and TAN classifiers, while it does not help at all in the NN case
for the Yeast and Fly datasets.
5.3 Generative vs. Discriminative
The first observation is that discriminative learning enhances classification
accuracy. R. Greiner and W. Zhou have shown that discriminative learning is
more robust to incorrect assumptions than generative learning [5].
The second observation is that discriminative learning is more
computationally intensive than generative learning, since it updates every
entry in the CPTable on each pass and must cope with the high
dimensionality of our data.
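The cost difference between the two styles can be sketched on binary features. The counting pass below is the generative step; the gradient loop plays the role of discriminative learning, fitting P(class | features) directly through the same linear decision rule. The toy data is made up, and the logistic weights here are a stand-in for our implementation's CPTable updates:

```python
import math

# Made-up binary-feature data: (features, class).
data = [([1, 0, 1], 0), ([1, 1, 0], 0), ([0, 1, 1], 1), ([0, 0, 1], 1)]

# Generative learning: a single counting pass over the data (cheap).
def count_params(rows, n_feats, alpha=1.0):
    prior = [alpha, alpha]
    counts = [[alpha] * n_feats for _ in range(2)]  # smoothed feature counts
    for x, y in rows:
        prior[y] += 1
        for f, v in enumerate(x):
            counts[y][f] += v
    return prior, counts

# Discriminative learning: iterative gradient ascent on the conditional
# log-likelihood of the labels; every weight is touched on every example.
def gradient_params(rows, n_feats, lr=0.5, epochs=200):
    w, b = [0.0] * n_feats, 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = y - p                      # gradient of the log-likelihood
            b += lr * err
            for i in range(n_feats):
                w[i] += lr * err * x[i]
    return w, b

w, b = gradient_params(data, 3)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
         for x, _ in data]
print(preds)  # matches the labels [0, 0, 1, 1] on this separable toy set
```

The counting learner finishes in one pass regardless of accuracy, while the gradient learner revisits every parameter on every epoch; with our roughly 1500 features per sequence, that repeated full-table update is exactly where the extra cost of discriminative learning comes from.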
5.4 Feature Selection—Wrapper
As we observed before, each protein sequence of our data set has more than
a thousand features; therefore, we use the “wrapper” feature selection
technique to remove less relevant features. The following graphs show how
the wrapper works on our three implementations: Naïve Bayes, TAN and
Neural Nets. From Figure 5.4.1, we can see that:
For both Naïve Bayes and TAN, the wrapper helps a lot. When 75%-85% of
the features are removed, both achieve their best classification accuracy.
With only 25% of the features remaining, the Naïve Bayes classifier reaches
an accuracy close to 80%, which is about 15% higher than when all the
features are used to train a classifier.
For the Neural Nets, however, the wrapper only helps on Ecoli, and even
there the improvement is less significant than for Naïve Bayes or TAN. For
the other two data sets (Yeast and Fly), the wrapper does not help at all:
the accuracy consistently decreases as the number of features goes down.
[Figure 5.4.1 contains three charts, "NB classifier with kfold = 5", "TAN
classifier with kfold = 5", and "NN classifier with kfold = 5", each plotting
validation accuracy (0-100) against the percentage of tokens removed
(0-100%) for Ecoli, Yeast, and Fly.]
Figure 5.4.1: The effect of using wrapper as feature selection. The first graph shows how wrapper
works on NB. The second graph shows how wrapper works on TAN. The third graph shows how
wrapper works on the neural nets.
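The wrapper loop itself can be sketched as follows. The scores and the evaluation function below are made-up stand-ins: in our system the features are ranked by information content, and `evaluate` would be a full 5-fold cross-validation run of the classifier on the reduced feature set:

```python
def wrapper_select(feature_scores, removal_fracs, evaluate):
    """Try removing increasing fractions of the lowest-ranked features,
    keeping the subset that scores the best validation accuracy."""
    ranked = sorted(feature_scores, key=feature_scores.get, reverse=True)
    best_acc, best_subset = -1.0, ranked
    for frac in removal_fracs:
        keep = ranked[:max(1, round(len(ranked) * (1 - frac)))]
        acc = evaluate(keep)           # e.g. 5-fold CV on the reduced set
        if acc > best_acc:
            best_acc, best_subset = acc, keep
    return best_subset, best_acc

# Stand-in scores and evaluation (made up): accuracy peaks when only the
# two highest-scoring features are kept, mimicking the 75%-85% sweet spot.
scores = {'f%d' % i: i for i in range(8)}
evaluate = lambda keep: 0.80 if len(keep) == 2 else 0.65
subset, acc = wrapper_select(scores, [0.0, 0.50, 0.75], evaluate)
print(subset, acc)  # ['f7', 'f6'] 0.8
```

Because every candidate subset triggers a full retrain-and-validate cycle, the wrapper multiplies the training cost by the number of removal fractions tried, which matters most for the slower classifiers in Section 5.6.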
5.5 Miscellaneous Learning Algorithms
We experiment with four other approaches using the existing tools WEKA and
Libsvm, and record the 5-fold cross-validation accuracy. Additionally, since
WEKA includes a Naïve Bayes classifier, we compare their implementation
with ours.
In the following table, some entries are empty. There are two reasons for this.
One is that the training time is too long. For example, the Rule Learner
classifier takes nearly 6 hours to train on Ecoli. Since the Yeast dataset
has more records and more features, it becomes impractical to continue.
Technique       Ecoli      Yeast      Fly
Decision Tree   81.9%      79.4%      --
Rule Learner    82.66%     --         --
Naïve Bayes     67.85%     69.16%     --
SVM             85.3165%   82.4734%   78.0016%
Table 5.5.1: The validation accuracy using some other techniques
The second reason for a blank entry is that when using the WEKA code on
Fly, we run out of memory. Since we cannot modify the WEKA code, we
omit those experiments.
Although we cannot complete some of the tests using existing tools, we can
still gain some useful insight from the results we did get.
The accuracy of WEKA's Naïve Bayes not only validates the correctness of
our implementation, but also illustrates its strength: our version can handle
all three data sets without running out of memory.
Naïve Bayes is the worst classifier if we consider only accuracy. The other
three techniques all achieve accuracy close to 80%, which is about 10%
higher than NB. However, their execution times are much higher, as shown
in the next section.
The SVM (Support Vector Machine) technique not only handles all three
data sets, but is also the winner among these techniques with respect to
accuracy. For Ecoli, it achieves the highest accuracy at 85.3%, which is
about 20% better than Naïve Bayes. This makes it a potential alternative to
the Naïve Bayes classifier, though it still takes more execution time than
Naïve Bayes.
5.6 Computational Efficiency
As we saw before, the Naïve Bayes classifier is not as accurate as the other
methods, but we believe it is the most practical classifier for our task. The
reason can be easily seen from the following table:
Classifier   Naïve Bayes   TAN       Neural Nets   Decision Tree   Rule Learner   SVM
Time         5 mins        15 mins   30 mins       1 hr            6 hrs          12 mins
Table 5.6.1: The approximate execution times of different techniques on Ecoli
We concluded in the last section that nearly all the other classifiers
outperform Naïve Bayes with respect to accuracy. The table above suggests
an interesting tradeoff: more accuracy, longer training time. Classifiers that
take more than half an hour, like the Decision Tree, can hardly be considered
a practical approach for our task. For the others, if our goal is classification
accuracy, then our study shows that both TAN and SVM are good choices.
The SVM in particular, as we saw before, is about 20% more accurate than
Naïve Bayes, but it also takes about twice as long to train. Overall, when we
consider both of our criteria, Naïve Bayes still seems to be the best
classifier for our task at present. However, TANs and SVMs look to be
excellent areas of future research, especially research aimed at improving
their training speed.
6. Conclusions and Future Work
6.1 Conclusions
In this course project, we have explored several machine learning
techniques for classification in a specific application domain, PENCE.
Though our main focus is on Bayesian network classifiers (Naïve Bayes,
TAN), we have also tried several other approaches (Decision Tree, Neural
Network, SVM, etc.). In addition, discriminative parameter learning for Naïve
Bayes is also tested. Comparisons of both classification accuracy and
efficiency in terms of execution time are drawn from various combinations
of experiments and from different angles.
Based on our experimental results, we found that the harder a learner works
(in terms of execution time), the better the results it achieves (in terms of
classification accuracy); this is the trade-off between efficiency and
accuracy. Taking all factors into account, we think NB+wrapper is a suitable
solution for this application. However, we are impressed by the accuracy
the SVM achieves.
6.2 Future Work
One possible direction for future work is the feature selection part, since
the wrapper works quite effectively. There are many other algorithms for
scaling up supervised learning. Several algorithms introduced in the last
class this term could be tried, such as RELIEF-F, which draws samples at
random and then adjusts the weights of features that discriminate an
instance from neighbours of different classes, and VSM, which integrates
feature weighting into the learning algorithm. Another possible way to
reduce the feature dimensionality is to use statistical metrics and clustering
techniques to cluster the feature set first, and then do the learning task.
Considering the long execution time of almost all the algorithms except the
Naïve Bayesian network, speeding up the learning phase of the various
algorithms is another aspect of future work.
Acknowledgments
The authors are grateful to Dr. Russ Greiner for his valuable comments on our
project and useful discussions relating to this work. Jie Cheng and Wei
Zhou's previous work on Bayesian networks and discriminative learning
helped our work a lot. We also thank Dr. Duane Szafron and Dr. Paul Lu for their
support with regard to the PENCE code and data. We also would like to
thank Roman Eisner for helping us on some detailed problems. And perhaps
most of all, we would like to thank the good people at Wendy’s for providing
us with tasty hamburgers at a reasonable price during the ungodly hours of
the night while we worked late.
7. References
1. N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian Network Classifiers.
Machine Learning, 29:131-163, 1997.
2. G. Brassard and P. Bratley. Fundamentals of Algorithmics. Prentice Hall,
1996.
3. T. Mitchell. Machine Learning. McGraw Hill, 1997.
4. Jie Cheng and Russell Greiner. Comparing Bayesian Network Classifiers.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial
Intelligence (UAI-99), Sweden, Aug 1999.
5. Russ Greiner and Wei Zhou. Structural Extension to Logistic Regression:
Discriminative Parameter Learning of Belief Net Classifiers. AAAI'02,
Canada, 2002.
6. David E. Heckerman. A Tutorial on Learning with Bayesian Networks. In
Learning in Graphical Models, 1998.