Learning Bayesian Belief Network
 Classifiers for Proteome Analyst

          CMPUT551 Term Project

                         Project
                         Report


                         Zhiyong Lu
                         James Redford
                         Xiaomeng Wu

                 April 26, 2002
Table of Contents

1. ABSTRACT
2. INTRODUCTION
      2.1 Description of the Task
      2.2 Motivation
      2.3 The Proteome Analyst
      2.4 Our Solutions
      2.5 Problems and Challenges
3. RELATED WORK
      3.1 Proteome Analyst
      3.2 NB vs. TAN
      3.3 Discriminative Learning
4. APPROACHES
      4.1 Overview
      4.2 NB (generative vs. discriminative)
      4.3 TAN
      4.4 Neural Networks
      4.5 Wrapper (Information Content)
      4.6 Other approaches
5. EMPIRICAL ANALYSIS
      5.1 Experimental Setup
             5.1.1 Background on the Data Set
             5.1.2 Training and Testing
      5.2 Comparison of NB, TAN, and NN
      5.3 Generative vs. Discriminative
      5.4 Feature Selection—Wrapper
      5.5 Miscellaneous Learning algorithms
      5.6 Computational Efficiency
6. CONCLUSIONS and FUTURE WORK
7. REFERENCES
8. APPENDIX
1. Abstract

In this course project, we investigate several machine learning techniques for
a specific task: protein function classification in Proteome Analyst. Naïve
Bayes has been applied to this problem with considerable success. However, it
makes assumptions about the data distribution that are clearly not true of
real-world proteome data. We empirically evaluate several variants of Naïve
Bayes, varying both the method by which parameters are learned (generative
vs. discriminative learning) and the BN structure (Naïve Bayes vs. TAN). We
also implement a Neural Network algorithm and use other existing tools such
as the WEKA data mining system to perform an empirical analysis of these
systems on the proteome function prediction problem.

This report is organized as follows. In Section 2 we introduce the task of
our project, our motivation, and the challenges we face. In Section 3 we
review previous work on Proteome Analyst and discuss some alternative
solutions to the classification problem. In Section 4, we present the
machine learning techniques in detail together with our implementations. In
Section 5, we examine the proteome classification application in detail and
show the comparative results of the different techniques. We conclude and
point out some future research directions in Section 6. Finally, the
Appendix contains all the experimental data we used in the report.
2. Introduction

2.1 Description of the Task
Recently, more than 60 bacterial genomes and 5 eukaryotic genomes have
been completely sequenced. This explosion of DNA sequence data is leading to
a concomitant explosion in protein sequence data. Unfortunately, the function
of over half of these protein sequences is unknown. Therefore, the protein
function prediction problem has emerged as an interesting research topic in
bioinformatics. In our project, we are given protein sequences with known
classes; our goal is to apply several machine learning techniques to predict
the classes of unknown protein sequences. This is a typical machine learning
classification problem — learn from existing experience to perform the task
better.


2.2 Motivation
Typically it takes months or even years to determine the function of even a
single protein using standard biochemical approaches. A much quicker
alternative is to use computational techniques to predict protein functions.
Although many algorithms, such as Naïve Bayes, are available for proteome
function prediction, they often make assumptions about the data distribution
that are clearly not true of real-world proteome data. The challenge is that
we need more general algorithms that both avoid these assumptions and
achieve high-throughput performance, in terms of classification accuracy as
well as execution time.


2.3 The Proteome Analyst
Proteome Analyst is an application designed by the PENCE group at the
University of Alberta that carries out protein classification. The input to the
Proteome Analyst is a protein sequence, and the output is the predicted
classification. Figure 2.5.1 shows the architecture of the Proteome
Analyst.

The input protein sequence is initially fed through PsiBlast, which is a tool
that performs sequence alignment against a database, in this case SwissProt.
The three best alignment matches, called homologues, returned by PsiBlast are
in turn passed to a tokenizer. The tokenizer retrieves text descriptions of
the homologues from the SwissProt database and then extracts a number of text
tokens from these descriptions. These tokens are used as input to the
classifier. Currently, the PENCE classifier is implemented as a Naïve
Bayesian network (NB). The features used by the NB are binary and correspond
to the tokens: if a token occurs in the input sequence’s description then the
value of the corresponding feature is 1; otherwise it is 0. The output of the
NB is the classification of the input sequence.

[Figure: Protein Sequence → PsiBlast → Homologues → Tokenizer → Tokens →
Classifier → Classification, with SwissProt supplying both the alignment
database and the homologue descriptions.]

Figure 2.5.1: Data flow architecture of the Proteome Analyst. Boxes, ovals, and arrows represent data,
filters, and data flow respectively. SwissProt is a database.


For our project, we are only concerned with the classifier portion of the
Proteome Analyst. We used data files that were already tokenized and the
data records already converted into classified vectors of binary features. See
Table 2.5.1 for an example.
                       Class       F1         F2         F3         F4
                         A          1          0          0          0
                         A          0          1          0          1
                         B          1          0          1          0
                         B          0          1          1          0
                         B          0          0          1          1
                         B          0          0          1          1
                         B          1          1          1          1
                         B          0          0          1          0

           Table 2.5.1: An example of the format of the data files used in our project.
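The conversion from homologue-description tokens to such binary records can be sketched as follows (the vocabulary and the example tokens are hypothetical; our actual data files arrived already tokenized):

```python
# Sketch: turn the token set extracted from a sequence's homologue
# descriptions into a binary feature vector like the rows of Table 2.5.1.
# The vocabulary below is made up for illustration.
def to_binary_vector(tokens, vocabulary):
    """Feature i is 1 iff the i-th vocabulary token appears among the tokens."""
    present = set(tokens)
    return [1 if tok in present else 0 for tok in vocabulary]

vocabulary = ["kinase", "membrane", "transport", "binding"]   # F1..F4
record = to_binary_vector(["membrane", "binding"], vocabulary)
```

Here `record` comes out as [0, 1, 0, 1], matching the second row of Table 2.5.1.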



2.4 Our Solutions
Naïve Bayes has been applied to proteome function prediction with
considerable success by the PENCE group at the University of Alberta, so we
focus on two areas: the method by which parameters are learned, and the
structure of the BN. We also explore other machine-learning techniques, such
as Neural Networks and Support Vector Machines (SVM), on this specific
problem. Our goal is to find the classifier with the best performance in
terms of both classification accuracy and execution time in our empirical
analysis. The following is a summary of the machine learning techniques we
applied in our project:
      Naïve Bayes (Generative vs. Discriminative Learning)
      TAN (Tree-augmented Naïve Bayes)
      Neural Networks
      Decision Tree, Rule Learner… (Using WEKA data mining system)
      Support Vector Machine


2.5 Problems and Challenges
Our evaluation of the different machine-learning techniques mainly involves
two criteria:
    Classification Accuracy
    Execution Time

During our experiments on real data we found that, overall, the Naïve
Bayesian classifier outperforms the other techniques, even though it
achieves neither the best classification accuracy nor the shortest execution
time in our empirical study. Most of the other techniques perform better
than Naïve Bayes in one respect but lose significantly in the other. For
example, the Decision Tree classifier consistently achieves 5 to 10
percentage points higher accuracy than Naïve Bayes, but takes more than 5
times as long to train. On the other hand, OneR in WEKA, another classifier,
is easily trained but has an accuracy of only 30%, which makes it unsuitable
for our task. Interestingly, we found an alternative approach, the SVM
(Support Vector Machine), that achieves better classification accuracy with
execution time comparable to Naïve Bayes.
3. Related Work
3.1 Proteome Analyst
PA (Proteome Analyst) is an application designed by the PENCE group at the
University of Alberta that performs protein classification. Currently, a PA
user can upload a proteome that consists of an arbitrary number of protein
sequences in FastA format. A PA user can configure PA to perform several
function prediction operations and can set up a workflow that applies these
operations in various orders, under various conditions.

PA can be configured to use homology sequence comparison to compare
each protein against a database of sequences with known functions. Any
sequence with high sequence identity can then be assigned the function of its
homologues and removed from further analysis (or not). One or more
classification-based function predictors (built using machine learning
techniques) can also be applied to any sequence.

More importantly, PA users can easily train their own custom classification-
based predictors and apply them to their sequences. Many other function
prediction operations are currently being developed and will be added to PA.


3.2 NB vs. TAN
The NB and TAN components of this project were primarily based on work
done by Friedman, Geiger, and Goldszmidt as described in their 1997 paper
“Bayesian Network Classifiers” [1]. Friedman et al. compare NBs to TANs on
a variety of data sets. They found that in most cases TAN methods were
more accurate than Naïve Bayesian methods. Our goal is to determine whether
TANs are more accurate than NBs on the PENCE data sets.

Jia You and Russ Greiner, from the University of Alberta, have also done work
on comparing different Bayesian classifiers, including NB and TAN classifiers
[4].


3.3 Discriminative Learning
Naïve Bayes and TAN are two different types of belief net structure; both
first learn a network structure and then fill in the CP table attached to
each node. Essentially, these learners use the parameters that maximize the
likelihood of the training samples [6]. Their goal is to produce a model as
close as possible to the distribution of the data, which is the core idea of
“generative classification”.
In general, there are two ways to make classification decisions, generative
learning and discriminative learning, respectively. Generative learning is to
build a model over the input examples in each class and classify based on
how well the resulting class conditional models explain any new input
example. The other method, discriminative learning, views the classification
problem from a quite different angle from generative learning. It aims to
maximize the classification accuracy instead of building the most accurate
model closest to the underlying distribution.

Thus, after obtaining a fixed structure, the effort goes into seeking the
parameters that maximize the conditional likelihood of the class label ci
given the instance ei. R. Greiner and W. Zhou have done related research
on discriminative parameter learning of belief net classifiers in general
cases and found that this kind of learning works effectively over a wide
variety of situations [5].
4. APPROACHES
4.1 Overview
In our project, several machine learning techniques have been adopted, each
with its own advantages and its own angle on the problem.
Among probabilistic learners, we have implemented Naïve Bayesian networks
(generative and discriminative) and TAN. (Note: our implementation differs
from the existing system used by the PENCE group; all the code is built from
scratch, and the id3 file format is adopted.) For these two classifiers,
various experiments have been carried out by tuning different parameters.

One important characteristic of our project is that we deal with datasets
having thousands of features. All three datasets we investigate (ecoli,
yeast, and fly) have more than 1500 features. Many features cause problems:
irrelevant features provide little information, noisy features can make
results worse, and standard algorithms do not scale well. We adopted a
“wrapper” approach (with information content) to handle the feature
selection problem.

For Neural Networks, we extended existing code for our project. We have also
tried several other techniques using available implementations, such as the
WEKA suite and an SVM (for multiple classes).

For all experiments with the above algorithms, we use cross-validation to
obtain a reliable estimate of test accuracy. We also take running efficiency
into account in terms of execution time. Finally, comparisons of these
different algorithms are presented.


4.2 NB (generative vs. discriminative)
4.2.1 Overview
A Naïve Bayesian network is one of the most practical learning methods for
classification problems. It is applicable to large training data sets,
particularly when the attributes that describe instances are conditionally
independent given the classification label.

The structure of a Naïve Bayesian network is simple and elegant, based on
the assumption that attributes are independent given the class label. Nodes
are variables, and links between nodes represent causal dependencies. In an
NB, the class node serves as the root of the tree, all the features are
children of the root, and no feature node has children of its own. Each node
has an attached CP table, which holds the parameters to learn for this
structure. Every entry in a CP table has the form P(child|parent).

In generative learning, CP table entries are populated with empirical
frequency counts. In discriminative learning for a given fixed structure (NB
here), the CP tables are updated after each training query so as to optimize
the classification error score.

Inference in NB is based on Bayes’ theorem and is carried out by picking
argmax_vj [ P(vj) Π_i P(ai|vj) ], where the ai are the attribute values and
vj is a class label.

4.2.2 Learning structure
The Naïve Bayes learning algorithm proceeds as follows:

         For each target value Vj
               P’(Vj) ← estimate P(Vj)
               For each attribute value ai of each attribute A
                      P’(ai|Vj) ← estimate P(ai|Vj)

Indeed, at this point we have also filled in each CP table for generative
learning.
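A minimal sketch of this generative learning loop plus the inference rule from 4.2.1, applied to the Table 2.5.1 data (the add-one smoothing of the counts is our assumption here; it mirrors the initialize-to-1 trick used for TAN in Section 4.3):

```python
from collections import Counter

def train_nb(records):
    """records: list of (class_label, [binary features]).
    Returns class priors and P(Fi=1 | class) tables with add-one smoothing."""
    class_counts = Counter(c for c, _ in records)
    n_feats = len(records[0][1])
    ones = {c: [1] * n_feats for c in class_counts}   # counts start at 1
    for c, feats in records:
        for i, v in enumerate(feats):
            ones[c][i] += v
    priors = {c: class_counts[c] / len(records) for c in class_counts}
    cpts = {c: [ones[c][i] / (class_counts[c] + 2) for i in range(n_feats)]
            for c in class_counts}
    return priors, cpts

def classify_nb(priors, cpts, feats):
    """argmax_vj  P(vj) * prod_i P(ai | vj)  over the class labels."""
    def score(c):
        s = priors[c]
        for i, v in enumerate(feats):
            p1 = cpts[c][i]
            s *= p1 if v == 1 else 1 - p1
        return s
    return max(priors, key=score)

# The data from Table 2.5.1:
data = [("A", [1,0,0,0]), ("A", [0,1,0,1]), ("B", [1,0,1,0]), ("B", [0,1,1,0]),
        ("B", [0,0,1,1]), ("B", [0,0,1,1]), ("B", [1,1,1,1]), ("B", [0,0,1,0])]
priors, cpts = train_nb(data)
```

For instance, the record (1, 1, 1, 0) is classified as B, driven mostly by the strong P(F3=1 | B) estimate.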

4.2.3 Discriminative learning
As noted above, the parameters set by generative learning need not maximize
the classification accuracy. However, a good classifier is one that produces
the appropriate answers to unlabeled instances as often as possible [5].
“Classification error” is usually defined as:
       Err = P( class(e) != c )            over labeled instances <e,c>
This can be approximated by the empirical score (I(·) is the indicator
function):
       Err’ = (1/|S|) Σ_{<e,c> in S} I( class(e) != c )
In discriminative learning for a Naïve Bayesian network, the goal is to
learn the CP table entries for the given NB structure that produce the
smallest empirical error score above.

Therefore, the “log conditional likelihood” of the given NB over the
distribution of labeled instances is used:
       LCL = Σ_{<e,c>} P(e, c) log P(c|e)
Similarly, this log conditional likelihood can be approximated by:
       LCL’ = (1/|S|) Σ_{<e,c> in S} log P(c|e)
To get the CP table entries with the optimal conditional likelihood, a
simple gradient-descent algorithm is used [5]. The CP tables are initialized
in the usual way (with frequency counts), and the empirical score is then
improved by changing the value of each CP table entry.

In the implementation, “softmax” parameters are adopted. Their advantage is
that they preserve the probability properties: each entry stays between 0
and 1, and each conditional distribution sums to one.

Therefore, similarly to weight updates in a neural net, given a set of
labeled queries in the training phase, the learning algorithm moves in the
direction of the total derivative, which is the sum of the individual
derivatives. For a single labeled instance <e,c>, the partial derivative
with respect to the softmax parameter for CP table entry θ_{r|f} is:
        P(r, f | e, c) − P(r, f | e) − θ_{r|f} [ P(f | e, c) − P(f | e) ]

In the specific NB structure, computing this derivative is relatively cheap,
because for each CP table investigated the parent is the class label node;
this special belief network greatly reduces the computational complexity of
the implementation.

There are also other speed-up techniques, such as “line search” to determine
the learning rate, and conjugate gradient. We did not try these, but we did
exploit the observation that when R is independent of C given E, the
derivative is zero [5].
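As a concrete sketch of discriminative parameter learning: for an NB structure over binary features, maximizing LCL' yields the same conditional model family as multiclass logistic regression, so a plain gradient-ascent version can be written as below. This is not the report's softmax-CP-table implementation; the equivalence is a standard observation, and the learning rate and iteration count here are arbitrary choices.

```python
import math

def train_discriminative(records, classes, lr=0.5, iters=500):
    """Gradient ascent on LCL' = (1/|S|) * sum log P(c|e) for an NB-structured
    classifier over binary features (equivalently, multiclass logistic
    regression with one bias and one weight per feature per class)."""
    n_feats = len(records[0][1])
    w = {c: [0.0] * (n_feats + 1) for c in classes}   # [bias, w_1..w_m]

    def posteriors(feats):
        scores = {c: w[c][0] + sum(w[c][i + 1] * v for i, v in enumerate(feats))
                  for c in classes}
        m = max(scores.values())
        exps = {c: math.exp(s - m) for c, s in scores.items()}   # softmax
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    for _ in range(iters):
        grad = {c: [0.0] * (n_feats + 1) for c in classes}
        for c_true, feats in records:
            post = posteriors(feats)
            for c in classes:
                err = (1.0 if c == c_true else 0.0) - post[c]
                grad[c][0] += err                       # d LCL' / d bias
                for i, v in enumerate(feats):
                    grad[c][i + 1] += err * v           # d LCL' / d w_i
        for c in classes:
            for j in range(n_feats + 1):
                w[c][j] += lr * grad[c][j] / len(records)
    return posteriors

# The data from Table 2.5.1 (separable on F3):
data = [("A", [1,0,0,0]), ("A", [0,1,0,1]), ("B", [1,0,1,0]), ("B", [0,1,1,0]),
        ("B", [0,0,1,1]), ("B", [0,0,1,1]), ("B", [1,1,1,1]), ("B", [0,0,1,0])]
posteriors = train_discriminative(data, ["A", "B"])
```

After training, the returned `posteriors` function gives P(c|e) for a query, and the training records themselves are classified correctly.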


4.3 TAN
4.3.1 Overview
Tree Augmented Naïve Bayesian networks (TAN [1]) are one approach we took
in this project. TANs are similar to regular NBs, but the features of a TAN
are organized into a tree structure. An example is given in Figure 4.3.1.


[Figure: the Class node is the parent of features F1–F5, which are
additionally linked in a tree among themselves.]

             Figure 4.3.1: An example of a tree augmented naïve Bayesian network
The CP tables of a TAN are also similar to those of a NB. The difference is
that all of the CP tables for the feature nodes have an extra column to account
for the extra parent, except for the root node of the tree. Figure 4.3.2 shows
an example.

                           Class      Parent       Fi = 0      Fi = 1
                           C1         0            .556        .444
                           C1         1            .200        .800
                           C2         0            .101        .899
                           C2         1            .750        .250
                        Figure 4.3.2: An example of a CP table for a TAN



4.3.2 Learning Structure

Below is the algorithm we used to learn the TAN structure, taken from [1].

   1. Calculate the conditional mutual information Ip between any two
      features F1 and F2, given the classification C:

        Ip(F1; F2 | C) = Σ_{f1,f2,c} P(f1, f2 | c) log [ P(f1, f2 | c) / ( P(f1 | c) P(f2 | c) ) ]

        In our case we avoid zero counts by initializing the count buckets for
        the P(f1, f2 | c) to 1 instead of 0.
   2.   Construct a complete undirected graph, where every feature is a node
        in the graph. Set the weights of the edges in the graph to the
        corresponding Ip values between features.
   3.   Extract the maximum weighted spanning tree from the graph. In our
        case, we used Kruskal’s minimum weighted spanning tree algorithm [2]
        and modified it slightly to find the maximum weighted spanning tree.
   4.   Choose a node to be the root and direct all edges in the spanning tree
        away from it, creating a tree. In our case we chose the feature with the
        highest information gain to be the root node.
   5.   Add the classification node and make it a parent of all of the feature
        nodes.
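The five steps above can be sketched as follows (our own sketch, not the project code; note that ties in the edge weights, as in this example, mean the recovered tree can differ from the one in Figure 4.3.4 while having the same total weight):

```python
import math
from itertools import combinations

def cond_mutual_info(records, i, j):
    """Ip(Fi; Fj | C), with count buckets initialized to 1 (see step 1)."""
    total = 0.0
    for c in sorted({cl for cl, _ in records}):
        joint = {(a, b): 1 for a in (0, 1) for b in (0, 1)}
        for cl, f in records:
            if cl == c:
                joint[(f[i], f[j])] += 1
        n = sum(joint.values())
        for (a, b), cnt in joint.items():
            p_ab = cnt / n
            p_a = sum(joint[(a, y)] for y in (0, 1)) / n
            p_b = sum(joint[(x, b)] for x in (0, 1)) / n
            total += p_ab * math.log10(p_ab / (p_a * p_b))
    return total

def learn_tan_structure(records):
    n_feats = len(records[0][1])
    # Steps 1-2: weighted complete graph over the features.
    edges = [(cond_mutual_info(records, i, j), i, j)
             for i, j in combinations(range(n_feats), 2)]
    # Step 3: Kruskal on edges sorted by descending weight -> max spanning tree.
    parent = list(range(n_feats))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    # Step 4: root the tree at the feature with the highest information content.
    def info_content(i):
        p1 = sum(f[i] for _, f in records) / len(records)
        return 0.0 if p1 in (0.0, 1.0) else (
            -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1))
    root = max(range(n_feats), key=info_content)
    adj = {i: [] for i in range(n_feats)}
    for i, j in tree:
        adj[i].append(j)
        adj[j].append(i)
    feat_parent, frontier = {root: root}, [root]
    while frontier:                      # direct all edges away from the root
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in feat_parent:
                    feat_parent[v] = u
                    nxt.append(v)
        frontier = nxt
    return feat_parent   # step 5 (class node as extra parent) is implicit

data = [("A", [1,0,0,0]), ("A", [0,1,0,1]), ("B", [1,0,1,0]), ("B", [0,1,1,0]),
        ("B", [0,0,1,1]), ("B", [0,0,1,1]), ("B", [1,1,1,1]), ("B", [0,0,1,0])]
feat_parent = learn_tan_structure(data)
```

On the example data of Section 4.3.4, F4 becomes the root and the F1–F2 edge (the largest conditional mutual information) always enters the tree.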

4.3.3 Learning CP Parameters and Classification

Given a data record with m features f1, f2, … , fm:

        Class = argmax_c { P(c) Π_i P(fi | p(fi), c) }

where 1 <= i <= m and p(fi) is the value of feature fi’s parent. We consider
the root node to be its own parent ( p(froot) = froot ). Rewriting each
conditional probability as a ratio of counts:

        Class = argmax_c { P(c) Π_i P(fi, p(fi), c) / P(p(fi), c) }
              = argmax_c { nc Π_i ncijk / ncik }

where
       nc    is the number of records with class c
       ncijk is the number of records with class c, where fi = j and p(fi) = k
       ncik  is the number of records with class c, where p(fi) = k

We simply count up the nc, ncijk, and ncik values to learn the CP table
entries. Again, to avoid problems when these counts are 0, we simply
initialize all entries to 1.

4.3.4 Example

Let’s say we are given the data in Table 4.3.1. First we determine the
structure of the TAN. Step 1 is to calculate the conditional mutual
information between every pair of features.

                    Class         F1         F2         F3          F4
                      A           1          0          0           0
                      A           0          1          0           1
                      B           1          0          1           0
                      B           0          1          1           0
                      B           0          0          1           1
                      B           0          0          1           1
                      B           1          1          1           1
                      B           0          0          1           0

                            Table 4.3.1: Data for the TAN example


              P(F1=0, F2=0 | C=A) = 1/6                  P(F1=0, F2=1 | C=A) = 2/6
              P(F1=1, F2=0 | C=A) = 2/6                  P(F1=1, F2=1 | C=A) = 1/6

              P(F1=0, F2=0 | C=B) = 4/10                 P(F1=0, F2=1 | C=B) = 2/10
              P(F1=1, F2=0 | C=B) = 2/10                 P(F1=1, F2=1 | C=B) = 2/10

              P(F1=0 | C=A) = 3/6                        P(F1=1 | C=A) = 3/6
              P(F1=0 | C=B) = 6/10                       P(F1=1 | C=B) = 4/10

              P(F2=0 | C=A) = 3/6                        P(F2=1 | C=A) = 3/6
              P(F2=0 | C=B) = 6/10                       P(F2=1 | C=B) = 4/10

Note that we started each of the above buckets at 1 instead of 0 before
counting. That explains why the denominators are 6 and 10, instead of 2 and
6 respectively.
So using the above values we get
Ip(F1; F2 | C) = 1/6 log( 1/6 / (3/6 * 3/6) ) + 2/6 log( 2/6 / (3/6 * 3/6) ) +
                 2/6 log( 2/6 / (3/6 * 3/6) ) + 1/6 log( 1/6 / (3/6 * 3/6) ) +
                 4/10 log( 4/10 / (6/10 * 6/10) ) + 2/10 log( 2/10 / (6/10 * 4/10) ) +
                 2/10 log( 2/10 / (4/10 * 6/10) ) + 2/10 log( 2/10 / (4/10 * 4/10) )
               = -0.0293485 + 0.0416462 + 0.0416462 + -0.0293485 +
                 0.018303 + -0.0158362 + -0.0158362 + 0.019382

Ip(F1; F2 | C) = 0.0306079

And similarly we get
Ip(F1; F3 | C) = 0.0022286
Ip(F1; F4 | C) = 0.0245954
Ip(F2; F3 | C) = 0.0022286
Ip(F2; F4 | C) = 0.0245954
Ip(F3; F4 | C) = 0
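The hand computation of Ip(F1; F2 | C) can be checked mechanically from the smoothed probabilities listed above (base-10 logarithm):

```python
import math

# The smoothed joint and marginal probabilities for F1 and F2, as listed above.
p_joint = {"A": {(0, 0): 1/6,  (0, 1): 2/6,  (1, 0): 2/6,  (1, 1): 1/6},
           "B": {(0, 0): 4/10, (0, 1): 2/10, (1, 0): 2/10, (1, 1): 2/10}}
p_f1 = {"A": {0: 3/6, 1: 3/6}, "B": {0: 6/10, 1: 4/10}}
p_f2 = {"A": {0: 3/6, 1: 3/6}, "B": {0: 6/10, 1: 4/10}}

ip_f1_f2 = sum(p * math.log10(p / (p_f1[c][a] * p_f2[c][b]))
               for c in ("A", "B")
               for (a, b), p in p_joint[c].items())
```

This reproduces Ip(F1; F2 | C) = 0.0306079 term by term.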

Step 2 is to create a complete undirected graph where the features are the
nodes and the Ip values are the edge weights. A graphical representation of
this graph is shown in figure 4.3.3. In our implementation we represent the
graph as an array of triplets <n1, n2, w> where n1 and n2 are the nodes that
the edge connects and w is the weight of the edge.

Graph = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246>, <2, 3, 0.0022>,
          <2, 4, 0.0246>, <3, 4, 0> }



[Figure: the complete graph on F1…F4 with edge weights .0306 (F1–F2),
.0022 (F1–F3), .0246 (F1–F4), .0022 (F2–F3), .0246 (F2–F4), and 0 (F3–F4).]

           Figure 4.3.3: The conditional mutual information graph for the TAN example.


Step 3 is to extract a maximum weighted spanning tree from the graph. Our
algorithm generates the following max span tree, also shown in figure 4.3.4.
MaxSpanTree = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246> }

It is easy to verify that this is indeed a maximum weighted spanning tree.




[Figure: the tree with edges F1–F2 (.0306), F1–F3 (.0022), and F1–F4 (.0246).]

             Figure 4.3.4: A maximum weighted spanning tree for the TAN example


In step 4 we choose the feature with the highest information content to be the
root node. The information contents of the features are given in Table 4.3.2.
We see that feature F4 has the highest information content, so it becomes the
root node. The following formula was used to calculate the information
contents:
       Gain(F) = - P(F=0) log2( P(F=0) ) - P(F=1) log2( P(F=1) )

Step 5 involves simply adding the classification node as a parent to all
other nodes. Figure 4.3.5 shows the final TAN structure.

                            Feature      Information Content
                               F1             0.954434
                               F2             0.954434
                               F3             0.811278
                               F4             1.000000

            Table 4.3.2: The information content of the features for the TAN example
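The table entries follow directly from the Gain formula applied to the raw feature columns of Table 4.3.1:

```python
import math

def gain(values):
    """Gain(F) = -P(F=0) log2 P(F=0) - P(F=1) log2 P(F=1)."""
    p1 = sum(values) / len(values)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

# Feature columns read off Table 4.3.1.
columns = {"F1": [1,0,1,0,0,0,1,0], "F2": [0,1,0,1,0,0,1,0],
           "F3": [0,0,1,1,1,1,1,1], "F4": [0,1,0,0,1,1,1,0]}
gains = {f: gain(v) for f, v in columns.items()}
```

This reproduces all four values in Table 4.3.2, including Gain(F4) = 1.0 for the evenly split feature.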
Now that the structure is set, we need to learn the CP table entries. The
parameters required are the ncijk, ncik, and nc values described in section 4.3.3.
Tables 4.3.3, 4.3.4, and 4.3.5 show the CP tables that contain the ncijk, ncik,
and nc entries respectively, for our example. Again, remember that the ncijk
entries were initialized to 1, not 0.
[Figure: F4 is the root; F4 → F1; F1 → F2 and F1 → F3; the class node C is a
parent of every feature node.]

                Figure 4.3.5: The final TAN structure in the TAN example


 Class   Parent(F4)   F1 = 0   F1 = 1        Class   Parent(F1)   F2 = 0   F2 = 1
   A         0           1        2            A         0           1        2
   A         1           2        1            A         1           2        1
   B         0           3        2            B         0           4        2
   B         1           3        2            B         1           2        2

 Class   Parent(F1)   F3 = 0   F3 = 1        Class   Parent(F4)   F4 = 0   F4 = 1
   A         0           2        1            A         0           2        1
   A         1           2        1            A         1           1        2
   B         0           1        5            B         0           4        1
   B         1           1        3            B         1           1        4

                Table 4.3.3: The ncijk CP table entries for the TAN example (the
                second column of each table gives the value of the feature’s parent).



        Class      F1 = 0     F1 = 1             Class        F2 = 0         F2 = 1
          A          3          3                  A            3              3
          B          6          4                  B            6              4
        Class      F3 = 0     F3 = 1             Class        F4 = 0       F4 = 1
          A          4          2                  A            3            3
          B          2          8                  B            5            5
                Table 4.3.4: The ncik CP table entries for the TAN example



                              Class = A              6
                              Class = B             10
Table 4.3.5: The nc CP table entries for the TAN example


Now that both the structure and the CP table entries have been learned, we
can attempt to classify new instances. Consider the following unclassified
record:

                  Class         F1          F2          F3          F4
                    ?           1           1           1           0

P(Class = A) =       nA * (nA110 * nA211 * nA311 * nA400) / (nA40 * nA11 * nA11 * nA40)
             =       6 * (2 * 1 * 1 * 2) / (3 * 3 * 3 * 3)
             =       0.296
P(Class = B) =       nB * (nB110 * nB211 * nB311 * nB400) / (nB40 * nB11 * nB11 * nB40)
             =       10 * (2 * 2 * 3 * 4) / (5 * 4 * 4 * 5)
             =       1.200

Therefore we classify this example as ‘B’ since P(Class = B) > P(Class = A).
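The two scores can be checked mechanically from Tables 4.3.3–4.3.5 (they are unnormalized scores rather than probabilities, which is why P(Class = B) can exceed 1):

```python
# Scores for the record (F1, F2, F3, F4) = (1, 1, 1, 0), with the counts read
# straight from Tables 4.3.3-4.3.5: score(c) = n_c * prod_i n_cijk / n_cik.
score_A = 6 * (2 * 1 * 1 * 2) / (3 * 3 * 3 * 3)
score_B = 10 * (2 * 2 * 3 * 4) / (5 * 4 * 4 * 5)
```

Normalizing the two scores would give proper posterior probabilities, but the argmax is unaffected.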

4.3.5 Validation
We validated our TAN implementation by running it on the above example
and analyzing the verbose debugging output. We verified that the results
from that run were identical to the results given in the above example.


4.4 Neural Nets

4.4.1 Overview
Artificial neural network learning provides a practical method for learning real-
valued and vector-valued functions over continuous and discrete-valued
attributes, in a way that is robust to noise in the training data. The
Backpropagation algorithm [3] is the most common network learning method
and has been successfully applied to a variety of learning tasks, such as
handwriting recognition and robot control. Neural nets are also one of the
major techniques covered in the class.

4.4.2 Implementation
Unlike Naïve Bayes and TAN, which we implemented from scratch, our Neural
Net implementation is based on our assignment 4 from class. We modified the
Backpropagation code that was originally written for the face recognition
problem. The major modification was to the input: instead of input nodes
representing images, we use one input node for each feature of the protein
sequence. For the output nodes, instead of representing the user’s head
position or user id, etc., we use them to represent the different class
labels. Lastly, we changed the code for estimating the classification
accuracy, since the two problems differ completely in this respect. Each
input node is initialized as follows: if a feature appears in a particular
sequence, then the value of that input node is 1; otherwise it is set to 0.
Correspondingly, for the output, we set the one of the 14 output nodes that
represents the correct class of the current sequence to 1, and the other 13
output nodes to 0. The network weights are initialized randomly.
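The encoding convention described above can be sketched as follows (the network itself came from the assignment 4 code, which we do not reproduce; the function name here is ours):

```python
def encode_example(feature_presence, class_label, class_names):
    """Inputs: one node per token feature, 1 if the feature appears, else 0.
    Targets: one node per class, 1 for the true class and 0 for the rest."""
    inputs = [1 if present else 0 for present in feature_presence]
    targets = [1 if c == class_label else 0 for c in class_names]
    return inputs, targets

classes = list("abcdefghijklmn")                      # the 14 class labels
inputs, targets = encode_example([True, False, True], "c", classes)
```

Here the target vector has exactly one of its 14 entries set to 1, at the position of class "c".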

4.4.3 Example
For a specific protein sequence, the number of input nodes equals the number
of features. The number of hidden nodes can be specified as a parameter. The
number of output nodes is 14 in all experiments, since we have 14 different
classes across the datasets. Each output node represents one of the classes
in {a, b, c, d, e, f, g, h, i, j, k, l, m, n}.




[Figure: a feed-forward network with a layer of input nodes, a layer of
hidden nodes, and a layer of output nodes.]

                   Figure 4.4.1 Learned Hidden Layer Representation



4.5 Wrapper (Information Content)
4.5.1 Overview
For our particular task, the data set scales up to thousands of features.
Worse, some of these features are irrelevant and provide little or no
information, and the features can be noisy. Standard algorithms do not scale
well with the number of features, so the approach we use is the "wrapper":
try different subsets of features on the learner, estimate the performance of
the algorithm on each subset, and keep the subset that performs best.
Before selecting subsets, we preprocess (weight) each feature according to
its mutual information content, given by the formula below.
W_j = Σ_v Σ_c P(y = c, f_j = v) log [ P(y = c, f_j = v) / ( P(y = c) P(f_j = v) ) ]

where the sums range over the feature values v and the class labels c. We
can see that this formula treats each feature independently.


4.5.2 Implementation

Step 1: Calculate the information content of each feature
We first read in all of the training records, and then use the formula above
to compute the mutual information content of each feature. Once this
preprocessing step is finished, we can begin to train the classifier in the
next step.

Step 2: Try different subsets of features
We begin by training the classifier on all of the features. In each
subsequent round, we remove the 5% of features with the lowest information
content and retrain the classifier. After 20 rounds, no features remain. We
compare the classification accuracies of these 20 rounds and choose the
subset of features that produced the highest prediction accuracy. If two
features have the same information content, we choose one arbitrarily.
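The round structure of Step 2 can be sketched as follows. This is an illustration with hypothetical names: the `evaluate` callback stands in for "train the classifier on the active features and return its validation accuracy", and `toy_eval` is a made-up evaluator used only to exercise the loop.

```c
typedef double (*eval_fn)(const int *active, int num_features);

/* Wrapper feature selection: order[] lists feature indices sorted by
 * ascending information content; each round deactivates the next 5% of
 * features, and the best-scoring round wins. On return, active[] marks
 * the chosen subset; the return value is its size. */
int wrapper_select(const int *order, int num_features,
                   eval_fn evaluate, int *active)
{
    int step = num_features / 20;               /* remove 5% per round */
    if (step < 1) step = 1;
    for (int j = 0; j < num_features; j++) active[j] = 1;

    double best_acc = evaluate(active, num_features);
    int best_removed = 0, removed = 0;
    while (removed + step <= num_features) {
        for (int k = 0; k < step; k++)
            active[order[removed + k]] = 0;     /* drop lowest-info features */
        removed += step;
        double acc = evaluate(active, num_features);
        if (acc > best_acc) { best_acc = acc; best_removed = removed; }
    }
    for (int j = 0; j < num_features; j++)      /* restore the best subset */
        active[j] = 1;
    for (int k = 0; k < best_removed; k++)
        active[order[k]] = 0;
    return num_features - best_removed;
}

/* Toy evaluator for illustration only: features 0 and 1 are "useful",
 * features 2 and 3 are "noise" that hurts accuracy. */
double toy_eval(const int *active, int num_features)
{
    (void)num_features;
    return 50.0 + 10.0 * (active[0] + active[1])
                -  5.0 * (active[2] + active[3]);
}
```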

4.5.3 Example

Let us consider the following problem:
Suppose we have eight protein sequences in total, each with exactly eight
features, spanning four classes {C, P, R, M}. In the following table, an entry
of 1 for feature i means that feature i appears in that sequence, and 0
means it does not. For example, features I, II, and III appear in the first
protein sequence, and the others do not.


Seq.   Class     I      II     III    IV     V      VI     VII    VIII
 1       C       1      1      1      0      0      0      0       0
 2       C       1      1      1      1      0      0      0       0
 3       P       0      0      1      0      0      1      0       1
 4       P       0      0      1      0      0      0      1       1
 5       R       1      1      1      0      0      0      0       0
 6       R       0      1      1      0      0      1      0       0
 7       M       0      0      1      0      1      1      0       0
 8       M       0      1      1      1      0      0      0       0
Info.          0.352  0.352    0    0.156  0.147  0.102  0.147   0.406

         Table 4.5.1: The information content of the eight features over the eight sequences


The last row shows the information content of each feature, computed by
the formula given above. As we can see, feature #3's information content is
0, which shows that it carries the least information about the data; this is
expected, since it appears in all eight sequences. On the other hand,
whenever feature #8 appears, the corresponding class is P in our example.
It is therefore a strongly discriminating feature in the data, and accordingly
its information content is the highest in this case.

The wrapper trains a classifier using all the features in the first round. In
each following round, it removes a fixed number of features, starting with
those of lowest information content. For example, if we decide to remove
one feature at a time in our example, we iterate through eight rounds: we
start by removing feature #3, since its information content is 0, then
remove feature #6, since 0.102 is the smallest among the remaining
features, and so on until only feature #8 remains in the last round. We then
choose the subset of features that achieved the highest accuracy during
the eight rounds.


4.6 Other approaches
4.6.1 Overview

Besides the primary techniques we implemented (Naïve Bayes, TAN, and
Neural Nets), we also apply some others, including both traditional
techniques, such as decision trees and rule learners, and a more recent
approach, SVMs.

 A decision tree is a class discriminator that recursively partitions the
   training set until each partition consists entirely, or dominantly, of
   examples from one class. Each non-leaf node of the tree contains a split
   point, a test on one or more features that determines how the data is
   partitioned. It is the first classifier we learned in our class.

 A rule learner is an alternative classifier that can be built directly by
   reading off a decision tree: a rule is generated for each leaf by taking the
   conjunction of all the tests encountered on the path from the root to that
   leaf. The advantage of a rule learner is that it is easy to understand, but
   it sometimes becomes more complex than necessary.

 An SVM (Support Vector Machine) is a method for learning functions
   from a set of labeled training data; the function can be a classification
   function or a general regression function. For classification, an SVM
   operates by finding a hyper-surface in the space of possible inputs that
   attempts to separate the positive examples from the negative examples.
   The split is chosen to maximize the distance from the hyper-surface to
   the nearest positive and negative examples. Intuitively, this makes the
   classification correct for test data that is near, but not identical, to the
   training data. SVMs are widely used in NLP (Natural Language
   Processing) problems such as text categorization.


4.6.2 Existing Tools

Instead of implementing all of the classifiers by ourselves, we chose to use
some existing machine learning tools to make life easier.

 WEKA
Both the Decision Tree and the Rule Learner classifiers are used through
WEKA. WEKA is a collection of machine learning algorithms for solving
real-world data mining problems. It is written in Java, runs on almost any
platform, and includes many standard classification schemes, among them
decision trees, rule learners, and naïve Bayes. However, we will show in the
next section that WEKA does not seem capable of dealing with our
datasets very well.

 Libsvm
Libsvm is simple, easy-to-use, and efficient software for SVM classification
and regression. Although WEKA has an SVM classifier, it only handles
binary classification, which is inappropriate for our task, since our datasets
have 14 classes. The most appealing feature of Libsvm is that it supports
multi-class classification. In addition, it can solve C-SVM classification,
nu-SVM classification, one-class SVM, epsilon-SVM regression, and
nu-SVM regression.
5. Empirical Analysis

5.1 Experimental Setup
5.1.1 Background on the Data Set

Our three data sets were provided by the PENCE group at the University of
Alberta. Each data set contains thousands of protein sequences with
known classes, and each sequence has more than a thousand features.
For example, the Ecoli data set has more than two thousand sequences
with about 1500 features. See Table 5.1.1.

 Data Set            # of classes                # of sequences           # of features
 Ecoli               14                          2370                     1504
 Yeast               14                          2539                     1555
 Fly                 14                          3823                     1906
                     Table 5.1.1: The three data sets: Ecoli, Yeast, and Fly


5.1.2 Training and Testing

We train the classifiers on each of the three datasets separately with the
different techniques, and use 5-fold cross validation to compute the
validation accuracy. We implemented Naïve Bayes, TAN, and the Neural
Nets in C; the WEKA code is written in Java. Libsvm has both C and Java
versions; we simply use the C version in our experiments. All experiments
were run on a machine in our graduate office, an i686 machine running
Linux 7.0 with 415MB of swap memory.
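The cross-validation bookkeeping can be sketched as follows. The round-robin fold assignment is our own assumption (the project may split the data differently); the reported validation accuracy is the mean of the per-fold accuracies.

```c
#define KFOLD 5

/* Example i belongs to the validation set of fold (i % KFOLD); the
 * other four folds form its training set. */
int validation_fold(int i) { return i % KFOLD; }

/* The reported validation accuracy is the mean over the k folds. */
double cv_accuracy(const double *fold_acc, int k)
{
    double sum = 0.0;
    for (int f = 0; f < k; f++) sum += fold_acc[f];
    return sum / k;
}
```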


5.2 Comparison of NB, TAN, and NN
Figure 5.1.1 shows a comparison of Naïve Bayesian, Tree Augmented Naïve
Bayesian, and Neural Net classifiers. The accuracies given were obtained
using 5-fold cross validation.
[Three bar charts omitted: "Comparison of diff. classifiers without wrapper"
(validation accuracy of NB, TAN, and NN on Ecoli, Yeast, and Fly),
"Comparison of diff. classifiers with wrapper" (best validation accuracy),
and "Effect of feature selection on different classifiers" (percentage of
accuracy improvement).]

Figure 5.1.1: A comparison of the accuracy of NB, TAN, and NN. The first
graph shows the comparative accuracies without using the wrapper. The
second graph shows the maximum accuracy of each method using the
wrapper. The third graph shows the increase in accuracy when the wrapper
is used.


We see that the accuracies of the NB and TAN classifiers are roughly
equal, both with and without the wrapper, for all three data sets. Given that
TANs are more complicated to implement and take longer to train than
NBs [1], it is likely more practical to use NBs for the PENCE data rather
than TANs.

Neural networks perform noticeably better than both NBs and TANs in
terms of accuracy on all three datasets. This suggests that neural network
classifiers could be a promising area for future research for the Proteome
Analyst tool.

The third graph shows the percentage improvement in accuracy obtained
by using the wrapper. We note that the wrapper has a similar effect on the
NB and TAN classifiers, while it does not help the NN at all on the Yeast
and Fly datasets.


5.3 Generative vs. Discriminative
The first observation is that discriminative learning improves the
classification accuracy. R. Greiner and W. Zhou showed that discriminative
learning is more robust to incorrect independence assumptions than
generative learning [5].

The second observation is that discriminative learning is more
computationally intensive than generative learning, since it updates every
entry of every CPTable on each iteration and must cope with the high
dimensionality of our data.


5.4 Feature Selection—Wrapper
As we observed before, each protein sequence in our data set has more
than a thousand features; we therefore use the "wrapper" feature selection
technique to remove less relevant features. The following graphs show how
the wrapper behaves with our three implementations: Naïve Bayes, TAN,
and Neural Nets. From Figure 5.4.1, we can see that:
 For both Naïve Bayes and TAN, the wrapper helps a great deal. Both
   reach their best classification accuracy when 75%-85% of the features
   are removed. With only 25% of the features remaining, the Naïve Bayes
   classifier achieves an accuracy close to 80%, about 15% higher than
   when all the features are used to train the classifier.
 For the Neural Nets, the wrapper helps only on Ecoli, and the gain there
   is not as significant as for Naïve Bayes or TAN. For the other two data
   sets (Yeast and Fly), the wrapper does not help at all: the accuracy
   decreases consistently as the number of features goes down.
[Three line charts omitted: "NB classifier with kfold = 5", "TAN classifier
with kfold = 5", and "NN classifier with kfold = 5", each plotting validation
accuracy against the percentage of tokens removed (0-100%) for Ecoli,
Yeast, and Fly.]

Figure 5.4.1: The effect of using the wrapper for feature selection. The first
graph shows how the wrapper works on NB, the second on TAN, and the
third on the neural nets.
5.5 Miscellaneous Learning Algorithms
We experiment with four other approaches using the existing tools WEKA
and Libsvm, and record the 5-fold cross-validation accuracy. Additionally,
since WEKA includes a Naïve Bayes classifier, we compare their
implementation with our own.

In the following table, some entries are empty, for two reasons. One is that
the training time is too long: for the rule learner classifier, for example, it
takes nearly 6 hours to train the Ecoli classifier, and since the Yeast
dataset has more records and more features, it was impractical to
continue.


  Technique          Ecoli          Yeast          Fly
  Decision Tree      81.9%          79.4%          --
  Rule Learner       82.66%         --             --
  Naïve Bayes        67.85%         69.16%         --
  SVM                85.3165%       82.4734%       78.0016%
               Table 5.5.1: The validation accuracy using some other techniques


The second reason for a blank entry is that the WEKA code ran out of
memory on the Fly dataset. Since we cannot modify the WEKA code, we
skipped those experiments.

Although we could not complete some of the tests using the existing tools,
we can still gain useful insight from the results we did obtain.
 The accuracy of WEKA's Naïve Bayes not only validates the correctness
     of our implementation, but also illustrates its strength: our
     implementation can deal with all three data sets without running out of
     memory.
 Naïve Bayes is the worst classifier if we consider only accuracy. The
     other three techniques all achieve close to 80% accuracy, about 10%
     higher than NB. However, their execution times are much higher, as
     shown in the following section.
 The SVM (Support Vector Machine) not only deals with all three data
     sets, but is also the winner among these techniques with respect to
     accuracy. For Ecoli, it achieves the highest accuracy, 85.3%, about 17
     percentage points better than Naïve Bayes. This makes it a potential
     alternative to the Naïve Bayes classifier, though it consumes more
     execution time than Naïve Bayes.
5.6 Computational Efficiency
As we saw before, the Naïve Bayes classifier is not as accurate as the other
methods, but we believe it is the most practical classifier for our task. The
reason can be easily seen from the following table:

Classifier    Naïve Bayes   TAN       Neural Nets   Decision Tree   Rule Learner   SVM
Time          5 mins        15 mins   30 mins       1 hr            6 hrs          12 mins
       Table 5.6.1: The approximate execution times of the different techniques on Ecoli


We concluded in the last section that nearly all the other classifiers
outperform Naïve Bayes with respect to accuracy. The table above
suggests an interesting tradeoff: more accuracy, longer training time.
Classifiers that take more than half an hour, such as the Decision Tree,
cannot be considered practical for our task. Among the rest, if our goal is
classification accuracy, our study shows that both TAN and SVM are good
choices; the SVM in particular is about 17 percentage points more accurate
than Naïve Bayes, but takes more than twice as long to train. Overall,
considering both of our criteria, Naïve Bayes currently still seems the most
practical classifier for our task. However, TANs and SVMs look to be
excellent areas for future research, especially research aimed at improving
their training speed.
6. Conclusions and Future Work
6.1 Conclusions
In this course project, we have explored several machine learning
techniques for classification on a specific application domain, PENCE.
Though our main focus is on Bayesian network classifiers (Naïve Bayes,
TAN), we have also tried other approaches (Decision Tree, Neural
Network, SVM, etc.), and we have tested discriminative learning of Naïve
Bayes. Comparisons of both classification accuracy and running efficiency,
in terms of execution time, are drawn from many different combinations of
experiments and from different angles.

Based on our experimental results, we found that the harder a learner
works (in terms of execution time), the better the results it obtains (in
terms of classification accuracy); this is the trade-off between efficiency
and accuracy. Taking all factors into account, we think NB plus the wrapper
is a suitable solution for this application, though we were impressed by the
accuracy the SVM achieved.


6.2 Future Work
One possible direction for future work is the feature selection component,
since the wrapper works quite effectively. There are many other algorithms
for scaling up supervised learning; several were introduced in the last class
this term and could be tried, such as the RELIEF-F algorithm, which draws
samples at random and then adjusts the weights of the features that
discriminate an instance from neighbors of different classes, and VSM,
which integrates feature weighting into the learning algorithm. Another
possible way to reduce the feature dimensionality is to first cluster the
feature set using statistical metrics and clustering techniques, and then
perform the learning task on the clusters.

Considering the long execution time of almost all the algorithms except
Naïve Bayes, speeding up the learning phase of the various algorithms is
another avenue for future work.
Acknowledgments
The authors are grateful to Dr. Russ Greiner for his valuable comments on
our project and for useful discussions relating to this work. Jie Cheng and
Wei Zhou's previous work on Bayesian networks and discriminative
learning helped our work greatly. We also thank Dr. Duane Szafron and
Dr. Paul Lu for their support with regard to the PENCE code and data. We
would also like to
thank Roman Eisner for helping us on some detailed problems. And perhaps
most of all, we would like to thank the good people at Wendy’s for providing
us with tasty hamburgers at a reasonable price during the ungodly hours of
the night while we worked late.
7. References

 1. N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian Network
    Classifiers. Machine Learning, 29:131-163, 1997.
 2. G. Brassard and P. Bratley. Fundamentals of Algorithmics. Prentice
    Hall, 1996.
 3. T. Mitchell. Machine Learning. McGraw Hill, 1997.
 4. J. Cheng and R. Greiner. Comparing Bayesian Network Classifiers. In
    Proceedings of the Fifteenth Conference on Uncertainty in Artificial
    Intelligence (UAI-99), Sweden, August 1999.
 5. R. Greiner and W. Zhou. Structural Extension to Logistic Regression:
    Discriminative Parameter Learning of Belief Net Classifiers. In
    Proceedings of AAAI-02, Canada, 2002.
 6. D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
    Learning in Graphical Models, 1998.
8. Appendix




                   Naïve Bayes (Generative)           Naïve Bayes (Discriminative)
 Percentage of
tokens removed   Ecoli      Yeast        Fly          Ecoli     Yeast         Fly
     0           67.8       69.1         68.3
     5           68.5       69.3         69.0
    10           69.0       69.5         69.2
    15           69.4       69.3         69.4
    20           70.1       70.0         69.8
    25           70.7       70.3         70.1
    30           71.3       70.7         70.4
    35           71.8       70.9         70.9
    40           72.4       70.8         70.9
    45           73.9       71.1         71.4
    50           74.5       71.2         71.2
    55           75.0       71.4         71.1
    60           75.6       71.6         71.1
    65           76.1       71.5         70.9
    70           76.6       71.3         70.0
    75           77.3       71.6         69.5
    80           77.1       71.1         69.1
    85           77.0       71.2         68.1
    90           75.9       69.2         65.7
    95           71.7       66.1         61.7
    99           40.4       60.2        39.78
   100            0           0           0




Table 1: Empirical accuracy of two approaches (generative and
discriminative Naïve Bayes) to learning the classifier with the wrapper, over
the three datasets.
TAN                                Neural Nets
 Percentage of
tokens removed   Ecoli      Yeast        Fly          Ecoli      Yeast             Fly
     0           67.8       69.4         68.7         85.7        87.1             76.3
     5           68.3       69.7         69.0         89.4        86.7             73.9
    10           69.0       70.1         69.3         88.8        84.5             72.0
    15           69.4       70.3         69.8         86.6        83.4             74.9
    20           70.0       70.7         70.2         82.7        82.4             68.3
    25           70.6       70.8         70.4         84.5        78.8             69.0
    30           71.3       71.0         70.6         78.9        78.4             67.7
    35           71.7       71.3         71.0         78.7        77.4             64.3
    40           72.4       71.3         71.4         68.3        75.1             58.5
    45           73.8       71.6         71.7         72.1        73.7             56.7
    50           74.4       71.6         71.5         68.2        71.8             54.8
    55           74.8       71.9         71.4         64.7        69.2             51.2
    60           75.5       72.1         71.2         68.5        63.1             45.7
    65           76.1       72.2         71.2         65.3        63.3             50.8
    70           76.6       71.7         70.3         63.4        58.8             43.1
    75           77.2       72.0         70.0         59.4        51.8             42.7
    80           77.0       71.4         69.3         55.8        47.5             37.5
    85           77.0       71.1         68.8         49.3        42.6             28.0
    90           75.9       69.4         66.3         33.2        30.2             25.6
    95           71.7       66.6          62.3        20.1        26.6             24.3
    99           40.9       60.3         40.1         13.8        16.3             11.0
   100            0          0             0           0           0                0


Table 2: Empirical accuracy of two approaches (TAN and Neural Nets) to
learning the classifier with the wrapper, over the three datasets.

Mais conteúdo relacionado

Destaque

Bayesian scoring functions for Bayesian Belief Networks
Bayesian scoring functions for Bayesian Belief NetworksBayesian scoring functions for Bayesian Belief Networks
Bayesian scoring functions for Bayesian Belief NetworksJee Vang, Ph.D.
 
Bayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopBayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopOfer Mendelevitch
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classificationKrish_ver2
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classificationManu Chandel
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionAdnan Masood
 

Destaque (7)

Bayesian scoring functions for Bayesian Belief Networks
Bayesian scoring functions for Bayesian Belief NetworksBayesian scoring functions for Bayesian Belief Networks
Bayesian scoring functions for Bayesian Belief Networks
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Bayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopBayesian Networks with R and Hadoop
Bayesian Networks with R and Hadoop
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classification
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 

project, our motivation, and the challenges we face. In Section 2 we review the previous work on Proteome Analyst and discuss some alternative solutions to the classification problem. In Section 3 we present the detailed concepts of those machine learning techniques and our implementations. In Section 4 we examine the proteome application for classification in detail and show the comparative results of the different techniques. We conclude in Section 5 and point out some future research directions in Section 6. Finally, Appendix A contains all the experimental data used in this report.
2. Introduction

2.1 Description of the Task

Recently, more than 60 bacterial genomes and 5 eukaryotic genomes have been completed. This explosion of DNA sequence data is leading to a concomitant explosion in protein sequence data. Unfortunately, the function of over half of these protein sequences is unknown. Therefore, the protein function prediction problem has emerged as an interesting research topic in bioinformatics. In our project, we are given protein sequences with known classes; our goal is to apply several machine learning techniques to predict the classes of unknown protein sequences. This is a typical machine learning problem in the classification domain: learn from existing experience to perform the task better.

2.2 Motivation

Typically it takes months or even years to determine the function of even a single protein using standard biochemical approaches. A much quicker alternative is to use computational techniques to predict protein function. Although there are many existing algorithms, such as Naïve Bayes, available for protein function prediction, they often make assumptions about data distributions that are clearly not true of real-world proteomes. The challenge is that we need more general algorithms that not only avoid these assumptions but also achieve high-throughput performance, in terms of both classification accuracy and execution time.

2.3 The Proteome Analyst

Proteome Analyst is an application designed by the PENCE group at the University of Alberta that carries out protein classification. The input to the Proteome Analyst is a protein sequence, and the output is a predicted classification. Figure 2.5.1 shows the architecture of the Proteome Analyst. The input protein sequence is initially fed through PsiBlast, a tool that performs sequence alignment against a database, in this case SwissProt.
The three best alignment matches, called homologues, returned by PsiBlast are in turn passed into a tokenizer. The tokenizer retrieves text descriptions of the homologues from the SwissProt database and then extracts a number of text tokens from these descriptions. These tokens are used as input to the classifier. Currently, the PENCE classifier is implemented as a Naïve Bayesian network (NB). The features used by the NB are binary and
correspond to the tokens. If a token exists in the input sequence's description then the value of the corresponding feature is 1; otherwise the value is 0. The output of the NB is the classification of the input sequence.

Figure 2.5.1: Data flow architecture of the Proteome Analyst. Boxes, ovals, and arrows represent data, filters, and data flow respectively. SwissProt is a database.

For our project, we are only concerned with the classifier portion of the Proteome Analyst. We used data files that were already tokenized, with the data records already converted into classified vectors of binary features. See Table 2.5.1 for an example.
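The token-to-binary-feature conversion described above can be sketched in a few lines. This is our illustration only: the vocabulary and the token list below are invented for the example, not taken from SwissProt.

```python
# Sketch: convert homologue description tokens into a binary feature vector,
# in the format of Table 2.5.1. Vocabulary and tokens are made up here.

def to_binary_vector(tokens, vocabulary):
    """Feature i is 1 iff vocabulary[i] occurs among the input tokens."""
    token_set = set(tokens)
    return [1 if word in token_set else 0 for word in vocabulary]

vocabulary = ["membrane", "kinase", "transport", "binding"]  # F1..F4 (invented)
record = to_binary_vector(["kinase", "binding"], vocabulary)
print(record)  # -> [0, 1, 0, 1]
```

One such vector, together with its class label, corresponds to one row of the data files described above.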
Class F1 F2 F3 F4
A     1  0  0  0
A     0  1  0  1
B     1  0  1  0
B     0  1  1  0
B     0  0  1  1
B     0  0  1  1
B     1  1  1  1
B     0  0  1  0

Table 2.5.1: An example of the format of the data files used in our project.

2.4 Our Solutions

Naïve Bayes has already been applied to proteome function prediction with considerable success by the PENCE group at the University of Alberta, so we focus on two areas: the method by which parameters are learned, and the structure of the BN. However, we also explore some other machine learning techniques, such as Neural Networks and Support Vector Machines (SVMs), on this specific problem. Our goal is to find the classifier with the best performance in terms of both classification accuracy and execution time during our empirical analysis. The following is a summary of the machine learning techniques we applied in our project:

• Naïve Bayes (Generative vs. Discriminative Learning)
• TAN (Tree-Augmented Naïve Bayes)
• Neural Networks
• Decision Tree, Rule Learner, ... (using the WEKA data mining system)
• Support Vector Machine

2.5 Problems and Challenges

Our evaluation of these machine learning techniques mainly involves two criteria:

• Classification Accuracy
• Execution Time

During our experiments on the real data we found that, overall, the Naïve Bayesian classifier outperforms the other techniques, though it achieves neither the best classification accuracy nor the shortest execution time in our empirical study. Most of the other techniques perform better than Naïve Bayes in one respect but lose significantly in the other. For example, the Decision Tree classifier consistently achieves 5 to 10 percentage points higher accuracy than Naïve Bayes, but takes more than 5 times as long to train. On the other hand, OneR in WEKA, another classifier, is easily
trained but has an accuracy of only 30%, which makes it unsuitable for our task. Interestingly, we found an alternative approach, the SVM (Support Vector Machine), that achieves better classification accuracy with execution time comparable to that of Naïve Bayes.
3. Related Work

3.1 Proteome Analyst

PA (Proteome Analyst) is an application designed by the PENCE group at the University of Alberta that performs protein classification. Currently, a PA user can upload a proteome that consists of an arbitrary number of protein sequences in FastA format. A PA user can configure PA to perform several function prediction operations and can set up a workflow that will apply these operations in various orders, under various conditions. PA can be configured to use homology sequence comparison to compare each protein against a database of sequences with known functions. Any sequence with high sequence identity can then be assigned the function of its homologues and removed from further analysis (or not). One or more classification-based function predictors (built using machine learning techniques) can also be applied to any sequence. More importantly, PA users can easily train their own custom classification-based predictors and apply them to their sequences. Many other function prediction operations are currently being developed and will be added to PA.

3.2 NB vs. TAN

The NB and TAN components of this project were primarily based on work done by Friedman, Geiger, and Goldszmidt as described in their 1997 paper "Bayesian Network Classifiers" [1]. Friedman et al. compare NBs to TANs on a variety of data sets. They found that in most cases TAN methods were more accurate than Naïve Bayesian methods. Our goal is to determine whether TANs are more accurate than NBs for the PENCE data sets. Jia You and Russ Greiner, from the University of Alberta, have also done work on comparing different Bayesian classifiers, including NB and TAN classifiers [4].

3.3 Discriminative Learning

Naïve Bayes and TAN are two different types of belief net structure; both first learn a network structure and then fill in the corresponding CP table attached to each node.
Essentially, all of these learners use the parameters that maximize the likelihood of the training samples [6]. Their goal is to produce a model as close as possible to the distribution of the data, which is the core idea of "generative classification".
In general, there are two ways to make classification decisions: generative learning and discriminative learning. Generative learning builds a model over the input examples in each class and classifies based on how well the resulting class-conditional models explain any new input example. The other method, discriminative learning, views the classification problem from a quite different angle. It aims to maximize classification accuracy directly, instead of building the model closest to the underlying distribution. Thus, after fixing a structure, much more effort is put into seeking the parameters that maximize the conditional likelihood of the class label c_i given the instance e_i. R. Greiner and W. Zhou have done related research on discriminative parameter learning of belief net classifiers in the general case, and found that this kind of learning works effectively over a wide variety of situations [5].
4. APPROACHES

4.1 Overview

In our project, several machine learning techniques have been adopted, each with its own advantages and each approaching the problem from a different angle. Among the probabilistic learners, we have implemented the Naïve Bayesian network (generative and discriminative) and TAN. (Note: our implementation differs from the existing system used by the PENCE group; all of the code is built from scratch, and the id3 file format is adopted.) For these two classifiers, various experiments have been carried out by tuning different parameters.

One important characteristic of our project is dealing with data sets that have thousands of features. All three of the data sets we investigate (ecoli, yeast, and fly) have more than 1500 features. Problems arise in applications with this many features: irrelevant features provide little information, noisy features can make results worse, and standard algorithms do not scale well. We adopted the "Wrapper" approach (with Information Content) to handle feature selection.

For the Neural Network, we extended existing code to fit our project. We have also tried several other techniques using available implementations, such as the WEKA suite and an SVM (for multiple classes). For all of the experiments with the above algorithms, cross-validation is used to obtain a precise estimate of testing accuracy. We also take running efficiency into account, in terms of execution time. Finally, comparisons of these different algorithms will be presented.

4.2 NB (generative vs. discriminative)

4.2.1 Overview

The Naïve Bayesian network is one of the most practical learning methods for classification problems. It is applicable when dealing with large training data sets, under the assumption that the attributes describing the instances are conditionally independent given the classification labels.
The structure of the Naïve Bayesian network is simple and elegant, based on the assumption that attributes are independent given the class label. Nodes are variables, and links between nodes represent causal dependencies. In an NB, the class node serves as the root of the tree, all the features are child nodes of the root, and no feature node has children of its own. Each node is attached
one CP table, which holds the parameters to be learned for this structure. Every entry in a CP table has the form P(child | parent). In generative learning, CP table entries are populated with empirical frequency counts. In discriminative learning for a given fixed structure (NB here), the CP tables are updated after each incoming query, so as to optimize the classification error score. Inference in an NB is based on Bayes' theorem and is carried out by picking

    argmax_vj [ P(vj) Π_i P(ai | vj) ]

where the ai are the attribute values and vj is a class label.

4.2.2 Learning Structure

The Naïve Bayes learning algorithm proceeds as follows:

    For each target value vj
        P'(vj) <- estimate of P(vj)
        For each value ai of each attribute A
            P'(ai | vj) <- estimate of P(ai | vj)

Indeed, at this point we have also filled in each CP table for generative learning.

4.2.3 Discriminative Learning

As noted above, the parameters set in generative learning need not maximize classification accuracy. However, a good classifier is one that produces the appropriate answers to unlabeled instances as often as possible [5]. The "classification error" is usually defined over the true distribution of labeled instances <e, c> as

    Err = P( class(e) != c )

This can be approximated by the empirical score over a sample S:

    Err' = (1/|S|) Σ_{<e,c> in S} I( class(e) != c )

In discriminative learning for a Naïve Bayesian network, the goal is to learn the CP table entries, for the given NB structure, that produce the smallest empirical error score above. Because that score is hard to optimize directly, the "log conditional likelihood" of the given NB over the distribution of labeled instances is used instead:

    LCL = Σ_{<e,c>} P(e, c) log P(c | e)

Similarly, this log conditional likelihood can be approximated by

    LCL' = (1/|S|) Σ_{<e,c> in S} log P(c | e)
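To make Sections 4.2.1 through 4.2.3 concrete, here is a small sketch of ours (not the PENCE implementation): an NB over the data of Table 2.5.1 whose CP tables are first filled generatively from smoothed frequency counts, then improved by gradient ascent on LCL'. The CP table entries are held in "softmax" form. The per-instance derivative used below, (1[c = c*] - P(c|e)) * (1[fi = v] - theta_{Fi=v|c}), is our own specialization of the general belief-net formula in [5] to the NB structure with fully observed features; all variable names are ours.

```python
import math
from collections import Counter

# Data from Table 2.5.1: (class label, [binary features F1..F4]).
data = [("A", [1, 0, 0, 0]), ("A", [0, 1, 0, 1]),
        ("B", [1, 0, 1, 0]), ("B", [0, 1, 1, 0]), ("B", [0, 0, 1, 1]),
        ("B", [0, 0, 1, 1]), ("B", [1, 1, 1, 1]), ("B", [0, 0, 1, 0])]
classes, n_feats = ["A", "B"], 4

prior = {c: n / len(data) for c, n in Counter(c for c, _ in data).items()}

# Generative learning: softmax parameters beta for theta_{Fi=v|c} = P(Fi=v|c),
# initialized so that theta equals the add-one-smoothed frequency count.
beta = {}
for c in classes:
    rows = [f for cc, f in data if cc == c]
    beta[c] = [[math.log(1 + sum(1 for f in rows if f[i] == v)) for v in (0, 1)]
               for i in range(n_feats)]

def theta(c, i, v):
    e = [math.exp(b) for b in beta[c][i]]
    return e[v] / sum(e)

def posterior(feats):                      # P(c | e) under current CP tables
    joint = {c: prior[c] * math.prod(theta(c, i, v) for i, v in enumerate(feats))
             for c in classes}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def classify(feats):                       # argmax_v P(v) * prod_i P(a_i | v)
    post = posterior(feats)
    return max(post, key=post.get)

def lcl(sample):                           # empirical LCL'
    return sum(math.log(posterior(f)[c]) for c, f in sample) / len(sample)

gen_pred = classify([1, 1, 1, 0])
before = lcl(data)
print(gen_pred, before)                    # generative prediction is "B"

# Discriminative learning: gradient ascent on LCL' over the betas.
rate = 0.5
for _ in range(200):
    grad = {c: [[0.0, 0.0] for _ in range(n_feats)] for c in classes}
    for c_star, feats in data:
        post = posterior(feats)
        for c in classes:
            coef = (1.0 if c == c_star else 0.0) - post[c]
            for i, fv in enumerate(feats):
                for v in (0, 1):
                    grad[c][i][v] += coef * ((1.0 if v == fv else 0.0) - theta(c, i, v))
    for c in classes:
        for i in range(n_feats):
            for v in (0, 1):
                beta[c][i][v] += rate * grad[c][i][v] / len(data)

print(lcl(data))  # higher than the generative LCL' above
```

The softmax form keeps each theta in (0, 1) with rows summing to one, so the ascent never leaves the space of valid CP tables.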
To get the CP table entries that have the optimal conditional likelihood, a simple gradient-descent algorithm is used [5]. The CP tables are initialized in the usual (generative) way, and the empirical error score is then improved by changing the value of each CP table entry. In the implementation, "softmax" parameters are adopted. Their advantage is that they preserve the probability properties: each entry stays in the range between 0 and 1, and each row sums to one. Therefore, similarly to the weight updates in a Neural Network, given a set of labeled queries in the training phase, the learning algorithm updates the parameters along the direction of the total derivative, which is the sum of the individual derivatives. For a single labeled instance <e, c>, the partial derivative associated with CP table entry θ_{r|f} is

    P(r, f | e, c) − P(r, f | e) − θ_{r|f} [ P(f | e, c) − P(f | e) ]

In the specific NB structure, the computation of this derivative is relatively inexpensive, because for each CP table investigated, the parent is the class label node; this special belief network structure greatly reduces the computational complexity of the implementation. There are also other speed-up techniques, such as "line search" to determine the learning rate, and conjugate gradient. We did not try these, but we did take advantage of the observation that when R is independent of C given E, the derivative is zero [5].

4.3 TAN

4.3.1 Overview

Tree-Augmented Naïve Bayesian networks (TANs [1]) are one of the approaches we took in this project. TANs are similar to regular NBs, but the features of a TAN are organized into a tree structure. An example is given in Figure 4.3.1.

Figure 4.3.1: An example of a tree-augmented naïve Bayesian network.
The CP tables of a TAN are also similar to those of an NB. The difference is that the CP tables for the feature nodes have an extra column to account for the extra parent, except for the root node of the tree. Figure 4.3.2 shows an example.

Class  Parent  Fi = 0  Fi = 1
C1     0       .566    .444
C1     1       .200    .800
C2     0       .101    .899
C2     1       .750    .250

Figure 4.3.2: An example of a CP table for a TAN

4.3.2 Learning Structure

Below is the algorithm we used to learn the TAN structure, taken from [1].

1. Calculate the conditional mutual information Ip between every pair of features F1 and F2, given the classification C:

   Ip(F1; F2 | C) = Σ_{f1,f2,c} P(F1=f1, F2=f2 | C=c) log [ P(F1=f1, F2=f2 | C=c) / ( P(F1=f1 | C=c) P(F2=f2 | C=c) ) ]

   In our case we avoid zero probabilities by initializing the count buckets for P(f1, f2 | c) to 1 instead of 0.

2. Construct a complete undirected graph, where every feature is a node in the graph. Set the weights of the edges in the graph to the corresponding Ip values between features.

3. Extract the maximum weighted spanning tree from the graph. In our case, we used Kruskal's minimum weighted spanning tree algorithm [2], modified slightly to find the maximum weighted spanning tree.

4. Choose a node to be the root and direct all edges in the spanning tree away from it, creating a tree. In our case we chose the feature with the highest information gain to be the root node.

5. Add the classification node and make it a parent of all of the feature nodes.

4.3.3 Learning CP Parameters and Classification

Given a data record with m features f1, ..., fm:

    Class = argmax_c { P(c) Π_i P(fi | p(fi), c) }

where 1 <= i <= m, and p(fi) is the value of feature fi's parent. We consider the root node to be its own parent ( p(f_root) = f_root ).
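Steps 1 through 3 of the structure-learning algorithm can be sketched as follows. This is our sketch, with the count smoothing and base-10 logarithms chosen to match the worked example of Section 4.3.4, and ties in edge weight broken in enumeration order.

```python
import math
from itertools import combinations

# Data from Table 2.5.1 / 4.3.1: (class, [F1..F4]).
data = [("A", [1, 0, 0, 0]), ("A", [0, 1, 0, 1]),
        ("B", [1, 0, 1, 0]), ("B", [0, 1, 1, 0]), ("B", [0, 0, 1, 1]),
        ("B", [0, 0, 1, 1]), ("B", [1, 1, 1, 1]), ("B", [0, 0, 1, 0])]
classes, N_FEATS = ["A", "B"], 4

def cond_mutual_info(i, j):
    """Step 1: Ip(Fi; Fj | C), with joint count buckets started at 1 and the
    marginals obtained by summing the smoothed joints (base-10 logs, as in
    the worked example)."""
    total = 0.0
    for c in classes:
        rows = [f for cc, f in data if cc == c]
        denom = len(rows) + 4  # 4 joint buckets, each started at 1
        joint = {(a, b): (1 + sum(1 for f in rows if (f[i], f[j]) == (a, b))) / denom
                 for a in (0, 1) for b in (0, 1)}
        for (a, b), p in joint.items():
            pa = joint[(a, 0)] + joint[(a, 1)]
            pb = joint[(0, b)] + joint[(1, b)]
            total += p * math.log10(p / (pa * pb))
    return total

# Step 2: complete weighted graph. Step 3: Kruskal adapted for a MAXIMUM
# weighted spanning tree (edges sorted by descending weight, union-find).
edges = sorted(((cond_mutual_info(i, j), i, j)
                for i, j in combinations(range(N_FEATS), 2)),
               key=lambda e: -e[0])
uf = list(range(N_FEATS))
def find(x):
    while uf[x] != x:
        x = uf[x]
    return x
tree = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        uf[ri] = rj
        tree.append((i, j))

print(sorted(tree))  # -> [(0, 1), (0, 2), (0, 3)], i.e. F1-F2, F1-F3, F1-F4
```

On the example data this recovers both the Ip values of Section 4.3.4 (e.g. Ip(F1; F2 | C) = 0.0306079) and the spanning tree of Figure 4.3.4; steps 4 and 5 then only orient the edges and attach the class node.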
It follows that

    Class = argmax_c { P(c) Π_i P(fi, p(fi), c) / P(p(fi), c) }
          = argmax_c { nc Π_i ncijk / ncik }

where

    nc    is the number of records with class c,
    ncijk is the number of records with class c, fi = j, and p(fi) = k,
    ncik  is the number of records with class c and p(fi) = k.

We simply count up the nc, ncijk, and ncik values to learn the CP table entries. Again, to avoid problems when these counts are 0, we simply initialize all entries to 1.

4.3.4 Example

Let's say we are given the data in Table 4.3.1. First we determine the structure of the TAN. Step 1 is to calculate the conditional mutual information between every pair of features.

Class F1 F2 F3 F4
A     1  0  0  0
A     0  1  0  1
B     1  0  1  0
B     0  1  1  0
B     0  0  1  1
B     0  0  1  1
B     1  1  1  1
B     0  0  1  0

Table 4.3.1: Data for the TAN example

P(F1=0, F2=0 | C=A) = 1/6     P(F1=0, F2=1 | C=A) = 2/6
P(F1=1, F2=0 | C=A) = 2/6     P(F1=1, F2=1 | C=A) = 1/6
P(F1=0, F2=0 | C=B) = 4/10    P(F1=0, F2=1 | C=B) = 2/10
P(F1=1, F2=0 | C=B) = 2/10    P(F1=1, F2=1 | C=B) = 2/10

P(F1=0 | C=A) = 3/6     P(F1=1 | C=A) = 3/6
P(F1=0 | C=B) = 6/10    P(F1=1 | C=B) = 4/10
P(F2=0 | C=A) = 3/6     P(F2=1 | C=A) = 3/6
P(F2=0 | C=B) = 6/10    P(F2=1 | C=B) = 4/10

Note that we started each of the above count buckets at 1 instead of 0 before counting. That explains why the denominators are 6 and 10, instead of 2 and 6 respectively.
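The count-based classification rule above can be exercised directly. The sketch below is ours; it uses the data of Table 4.3.1, the parent structure that this example ultimately derives (F4 as root, F1 its child, and F2 and F3 children of F1), and the nc values of Table 4.3.5 taken as given. Exact fractions are used so the scores can be compared without rounding.

```python
from fractions import Fraction

# Data from Table 4.3.1: (class, [F1..F4]).
data = [("A", [1, 0, 0, 0]), ("A", [0, 1, 0, 1]),
        ("B", [1, 0, 1, 0]), ("B", [0, 1, 1, 0]), ("B", [0, 0, 1, 1]),
        ("B", [0, 0, 1, 1]), ("B", [1, 1, 1, 1]), ("B", [0, 0, 1, 0])]
parent = {0: 3, 1: 0, 2: 0, 3: 3}  # feature index -> parent index (F4 = root)
n_c = {"A": 6, "B": 10}            # nc values as given in Table 4.3.5

def n_cijk(c, i, j, k):
    """Count of class-c records with f_i = j and parent(f_i) = k, started at 1."""
    return 1 + sum(1 for cc, f in data
                   if cc == c and f[i] == j and f[parent[i]] == k)

def score(c, record):
    """nc * prod_i ncijk / ncik, with ncik = sum over j of ncijk."""
    s = Fraction(n_c[c])
    for i, j in enumerate(record):
        k = record[parent[i]]
        denom = n_cijk(c, i, 0, k) + n_cijk(c, i, 1, k)
        s *= Fraction(n_cijk(c, i, j, k), denom)
    return s

query = [1, 1, 1, 0]
print(float(score("A", query)), float(score("B", query)))  # -> 0.296... 1.2
```

The two scores reproduce the 0.296 and 1.200 obtained by hand at the end of this example, so the record is classified as 'B'.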
So using the above values we get

    Ip(F1; F2 | C) = 1/6 log( (1/6) / (3/6 * 3/6) )  + 2/6 log( (2/6) / (3/6 * 3/6) )
                   + 2/6 log( (2/6) / (3/6 * 3/6) )  + 1/6 log( (1/6) / (3/6 * 3/6) )
                   + 4/10 log( (4/10) / (6/10 * 6/10) ) + 2/10 log( (2/10) / (6/10 * 4/10) )
                   + 2/10 log( (2/10) / (4/10 * 6/10) ) + 2/10 log( (2/10) / (4/10 * 4/10) )
                   = -0.0293485 + 0.0416462 + 0.0416462 - 0.0293485
                   + 0.0183030 - 0.0158362 - 0.0158362 + 0.0193820
                   = 0.0306079

And similarly we get

    Ip(F1; F3 | C) = 0.0022286
    Ip(F1; F4 | C) = 0.0245954
    Ip(F2; F3 | C) = 0.0022286
    Ip(F2; F4 | C) = 0.0245954
    Ip(F3; F4 | C) = 0

Step 2 is to create a complete undirected graph where the features are the nodes and the Ip values are the edge weights. This graph is shown in Figure 4.3.3. In our implementation we represent the graph as an array of triplets <n1, n2, w>, where n1 and n2 are the nodes that the edge connects and w is the weight of the edge.

    Graph = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246>,
              <2, 3, 0.0022>, <2, 4, 0.0246>, <3, 4, 0> }

Figure 4.3.3: The conditional mutual information graph for the TAN example.

Step 3 is to extract a maximum weighted spanning tree from the graph. Our algorithm generates the following max span tree, also shown in Figure 4.3.4.
    MaxSpanTree = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246> }

It is easy to verify that this is indeed a maximum weighted spanning tree.

Figure 4.3.4: A maximum weighted spanning tree for the TAN example

In step 4 we choose the feature with the highest information content to be the root node. The information contents of the features are given in Table 4.3.2. We see that feature F4 has the highest information content, so it becomes the root node. The following formula was used to calculate the information contents:

    Gain(F) = - P(F=0) log2( P(F=0) ) - P(F=1) log2( P(F=1) )

Feature  Information Content
F1       0.954434
F2       0.954434
F3       0.811278
F4       1.000000

Table 4.3.2: The information content of the features for the TAN example

Step 5 involves simply adding the classification node as a parent to all other nodes. Figure 4.3.5 shows the final TAN structure.

Now that the structure is set, we need to learn the CP table entries. The parameters required are the ncijk, ncik, and nc values described in Section 4.3.3. Tables 4.3.3, 4.3.4, and 4.3.5 show the CP tables that contain the ncijk, ncik, and nc entries respectively, for our example. Again, remember that the ncijk entries were initialized to 1, not 0.
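The information-content values of Table 4.3.2 can be checked in a few lines (our sketch; the feature columns are read off Table 4.3.1):

```python
import math

def info_content(values):
    """Binary entropy of a feature column: -p0*log2(p0) - p1*log2(p1)."""
    p1 = sum(values) / len(values)
    return sum(-p * math.log2(p) for p in (1.0 - p1, p1) if p > 0)

# Feature columns from Table 4.3.1 (8 records).
f1 = [1, 0, 1, 0, 0, 0, 1, 0]
f2 = [0, 1, 0, 1, 0, 0, 1, 0]
f3 = [0, 0, 1, 1, 1, 1, 1, 1]
f4 = [0, 1, 0, 0, 1, 1, 1, 0]
for name, col in [("F1", f1), ("F2", f2), ("F3", f3), ("F4", f4)]:
    print(name, round(info_content(col), 6))
# F1, F2 -> 0.954434; F3 -> 0.811278; F4 -> 1.0, matching Table 4.3.2
```

F4 splits the records exactly in half, which is why its entropy reaches the maximum of 1 bit and it is chosen as the root.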
Figure 4.3.5: The final TAN structure in the TAN example

P(F1):
Class   F4   F1 = 0   F1 = 1
A       0    1        2
A       1    2        1
B       0    3        2
B       1    3        2

P(F2):
Class   F1   F2 = 0   F2 = 1
A       0    1        2
A       1    2        1
B       0    4        2
B       1    2        2

P(F3):
Class   F1   F3 = 0   F3 = 1
A       0    2        1
A       1    2        1
B       0    1        5
B       1    1        3

P(F4):
Class   F4   F4 = 0   F4 = 1
A       0    2        1
A       1    1        2
B       0    4        1
B       1    1        4

Table 4.3.3: The ncijk CP table entries for the TAN example

Class   F1 = 0   F1 = 1
A       3        3
B       6        4

Class   F2 = 0   F2 = 1
A       3        3
B       6        4

Class   F3 = 0   F3 = 1
A       4        2
B       2        8

Class   F4 = 0   F4 = 1
A       3        3
B       5        5

Table 4.3.4: The ncik CP table entries for the TAN example

Class = A   6
Class = B   10
Table 4.3.5: The nc CP table entries for the TAN example

Now that both the structure and the CP table entries have been learned, we can attempt to classify new instances. Consider the following unclassified record:

Class   F1   F2   F3   F4
?       1    1    1    0

P(Class = A) ∝ nA * (nA110 * nA211 * nA311 * nA400) / (nA40 * nA11 * nA11 * nA40)
             = 6 * (2 * 1 * 1 * 2) / (3 * 3 * 3 * 3) = 0.296

P(Class = B) ∝ nB * (nB110 * nB211 * nB311 * nB400) / (nB40 * nB11 * nB11 * nB40)
             = 10 * (2 * 2 * 3 * 4) / (5 * 4 * 4 * 5) = 1.200

Therefore we classify this example as 'B', since the (unnormalized) score for B is greater than the score for A.

4.3.4 Validation

We validated our TAN implementation by running it on the above example and analyzing the verbose debugging output. We verified that the results from that run were identical to the results given in the above example.

4.4 Neural Nets

4.4.1 Overview

Artificial neural network learning provides a practical method for learning real-valued and vector-valued functions over continuous and discrete-valued attributes, in a way that is robust to noise in the training data. The Backpropagation algorithm [3] is the most common network learning method and has been successfully applied to a variety of learning tasks, such as handwriting recognition and robot control. Neural nets are also one of the major techniques covered in the class.

4.4.2 Implementation

Unlike Naïve Bayes and TAN, which we implemented from scratch, our implementation of neural nets is based on our assignment 4 from class. We modified the Backpropagation code that was originally written for the problem of face recognition. The major modification was to the input: instead of input nodes representing images, each input node now represents a distinct feature of our protein sequences. For the output nodes, instead of representing the user's head position or user id, they now represent the different class labels. Lastly, we changed the code for
estimating the classification accuracy, since the two problems differ in this respect. Our strategy for setting the initial value of each input node is: if a feature appears in the given sequence, the corresponding input node is set to 1; otherwise it is set to 0. Correspondingly, for the output, we set the one of the 14 output nodes that represents the correct class of the current sequence to 1, and the other 13 output nodes to 0. The network weights are initialized randomly.

4.4.3 Example

For a specific protein sequence, the number of input nodes is the number of features. The number of hidden nodes can be specified as a parameter. The number of output nodes in all experiments is 14, since all of our data sets have 14 classes. Each output node represents one of the classes in {a, b, c, d, e, f, g, h, i, j, k, l, m, n}.

Figure 4.4.1: Learned hidden layer representation (inputs, hidden layer, outputs)

4.5 Wrapper (Information Content)

4.5.1 Overview

For our particular task, the data set scales up to thousands of features. Even worse, some of these features are irrelevant and provide little to no information, and the features can be noisy. Standard algorithms do not scale well with the number of features, so the approach we use is the "wrapper": try different subsets of features on the learner, estimate the performance of the algorithm with respect to each subset, and keep the subset that performs best. Before selecting the subset, we preprocess (weight) each feature according to its mutual information content, given by the formula below.
Wj = Σv Σc P(y = c, fj = v) log( P(y = c, fj = v) / ( P(y = c) P(fj = v) ) )

We can see that this formula treats each feature independently.

4.5.2 Implementation

Step 1: Calculate the information content of each feature. We read in all of the training records first and then use the above formula to compute the mutual information content of each feature. When this preprocessing step is finished, we can begin to train the classifier in the next step.

Step 2: Try different subsets of features. We begin by using all of the features to train the classifier. In each subsequent round, we remove the 5% of features with the lowest information content and retrain the classifier. After 20 rounds, there are no features remaining. We compare the classification accuracies of these 20 rounds and choose the subset of features that produced the highest prediction accuracy. If two features have the same information content, we choose one arbitrarily.

4.5.3 Example

Consider the following problem: suppose we have eight protein sequences in total, each with exactly eight features, belonging to four classes: {C, P, R, M}. In the following table, for a particular protein sequence, an entry of 1 for feature i means that feature i appears in the sequence, and 0 means it does not. For example, in the first protein sequence, features I, II and III appear and the others do not.

Seq.   Class   I   II   III   IV   V   VI   VII   VIII
1      C       1   1    1     0    0   0    0     0
2      C       1   1    1     1    0   0    0     0
3      P       0   0    1     0    0   1    0     1
4      P       0   0    1     0    0   0    1     1
5      R       1   1    1     0    0   0    0     0
6      R       0   1    1     0    0   1    0     0
7      M       0   0    1     0    1   1    0     0
8      M       0   1    1     1    0   0    0     0
Info.          0.352  0.352  0  0.156  0.147  0.102  0.147  0.406

Table 4.5.1: The information content of the eight features in the eight sequences

The last row shows the information content of each feature, computed by the formula given above. As we can see, feature #3's information content is 0, which shows that this feature contains the least information about the data. This is expected, since it appears in all eight sequences. On the other hand, whenever feature #8 appears, the corresponding class is P. It is therefore a significant discriminating feature in the data, and accordingly its information content is the highest in this example.

The wrapper works by training a classifier using all the features in the first round. In each following round, it removes a fixed number of features, starting with those of lowest information content. For example, if we decide to remove one feature at a time in our example, then we iterate through eight rounds, first removing feature #3 (information content 0), then feature #6 (0.102 is the smallest among the remaining features), and so on. In the last round, only feature #8 remains. We then choose the subset of features with the highest accuracy over the eight rounds.

4.6 Other approaches

4.6.1 Overview

Besides the primary techniques we implemented (Naïve Bayes, TAN, and Neural Nets), we also applied some others, including both traditional techniques such as decision trees and rule learners, and a more recent approach, SVMs.

- A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class. Each non-leaf node of the tree contains a split point that is a test on one or more features and determines how the data is partitioned. It is the first classifier we learned in our class.
- A rule learner is an alternative classifier that can be built directly by reading off a decision tree: generate a rule for each leaf by taking the conjunction of all the tests encountered on the path from the root to that leaf. The advantage of a rule learner is that its output is easy to understand, but it sometimes becomes more complex than necessary.
- SVM (Support Vector Machine) is a method for creating functions from a set of labeled training data. The function can be a classification function or a general regression function. For classification, an SVM operates by finding a hyper-surface in the space of possible inputs that attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hyper-surface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for test data that is near, but not identical to, the training data. SVMs are widely used in NLP (Natural Language Processing) problems such as text categorization.

4.6.2 Existing Tools

Instead of implementing all of the classifiers ourselves, we chose to use some existing machine learning tools to make life easier.

- WEKA: Both the decision tree and rule learner classifiers are used through WEKA. WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. It includes many of the standard classification schemes, among them decision trees, rule learners and naïve Bayes. However, we will show in the next section that WEKA does not seem capable of dealing with our data sets very well.

- Libsvm: Libsvm is simple, easy-to-use, and efficient software for SVM classification and regression. Although WEKA has an SVM classifier, it only handles binary classification, which is inappropriate for our task since we have 14 classes in our data sets. The most appealing feature of Libsvm is that it supports multi-class classification. In addition, it can solve C-SVM classification, nu-SVM classification, one-class SVM, epsilon-SVM regression, and nu-SVM regression.
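As a cross-check on the worked TAN example of Section 4.3, the maximum weighted spanning tree of step 3 and the two class scores for the test record can be reproduced with a short script. This is an illustrative sketch in Python, not our actual C implementation; note that when several edges have equal weight the maximum spanning tree is not unique, and this greedy (Kruskal-style) sketch happens to recover the same tree as the example.

```python
def max_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm on descending weights: greedily accept the
    heaviest edge that does not close a cycle (tracked with union-find)."""
    parent = list(range(n_nodes + 1))  # nodes are numbered from 1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        ra, rb = find(a), find(b)
        if ra != rb:                       # no cycle: accept the edge
            parent[ra] = rb
            tree.append((a, b, w))
    return tree

# Edge weights are the Ip(Fi; Fj | C) values from the example.
edges = [(1, 2, 0.0306), (1, 3, 0.0022), (1, 4, 0.0246),
         (2, 3, 0.0022), (2, 4, 0.0246), (3, 4, 0.0)]
tree = max_spanning_tree(4, edges)
print(sorted((a, b) for a, b, _ in tree))    # [(1, 2), (1, 3), (1, 4)]

# Unnormalized class scores for the record (F1=1, F2=1, F3=1, F4=0),
# using the CP-table counts from Tables 4.3.3-4.3.5.
score_A = 6 * (2 * 1 * 1 * 2) / (3 * 3 * 3 * 3)
score_B = 10 * (2 * 2 * 3 * 4) / (5 * 4 * 4 * 5)
print(round(score_A, 3), round(score_B, 3))  # 0.296 1.2
```

The scores are the unnormalized quantities from the classification step in Section 4.3; only their relative order matters, so the record is classified as 'B'.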
5. Empirical Analysis

5.1 Experimental Setup

5.1.1 Background on the Data Set

Our three data sets were provided by the PENCE group at the University of Alberta. Each data set contains thousands of protein sequences with known classes, and each sequence has more than a thousand features. For example, the Ecoli data set has more than two thousand sequences with about 1500 features. See Table 5.1.1.

Data Set   # of classes   # of sequences   # of features
Ecoli      14             2370             1504
Yeast      14             2539             1555
Fly        14             3823             1906

Table 5.1.1: The three data sets: Ecoli, Yeast, Fly

5.1.2 Training and Testing

We train the classifiers on each of the three data sets separately with the different techniques, and use 5-fold cross validation to compute the validation accuracy. We implemented Naïve Bayes, TAN and Neural Nets in C. The WEKA code is implemented in Java. Libsvm has both C and Java versions; we use the C version in our experiments. All experiments were run on the machine at our graduate office, an i686 machine running Linux 7.0 with 415 MB of swap memory.

5.2 Comparison of NB, TAN, and NN

Figure 5.1.1 shows a comparison of the Naïve Bayes, Tree Augmented Naïve Bayes, and neural net classifiers. The accuracies given were obtained using 5-fold cross validation.
Figure 5.1.1: A comparison of the accuracy of NB, TAN, and NN on the Ecoli, Yeast, and Fly data sets. The first graph shows the comparative validation accuracies without using the wrapper. The second graph shows the maximum accuracy of each method using the wrapper. The third graph shows the percentage increase in accuracy when the wrapper is used.

We see that the accuracies of the NB and TAN classifiers are roughly equal, both with and without the wrapper, for all three data sets. Given that TANs are more complicated to implement and take longer to train than NBs
[1], it is likely more practical to use NBs for the PENCE data than TANs.

Neural networks perform noticeably better than both NBs and TANs in terms of accuracy on all three data sets. This suggests that neural network classifiers could be a promising area of future research for the Proteome Analyst tool.

The third graph shows the percentage improvement in accuracy obtained by using the wrapper. We note that the wrapper has a similar effect on the NB and TAN classifiers, while it does not help the NN at all on the Yeast and Fly data sets.

5.3 Generative vs. Discriminative

The first observation is that discriminative learning enhances the classification accuracy. R. Greiner and W. Zhou have shown that discriminative learning is more robust to incorrect assumptions than generative learning [5]. The second observation is that discriminative learning is more computationally intensive than generative learning, since it updates every entry in the CP table on each iteration and must deal with the high dimensionality of our data.

5.4 Feature Selection—Wrapper

As observed before, each protein sequence in our data sets has more than a thousand features; therefore, we use the "wrapper" feature selection technique to remove less relevant features. Figure 5.4.1 shows how the wrapper works with our three implementations: Naïve Bayes, TAN and Neural Nets. From Figure 5.4.1, we can see that:

- For both Naïve Bayes and TAN, the wrapper helps a lot. Removing 75%-85% of the features gives both their best classification accuracy. With only 25% of the features remaining, the Naïve Bayes classifier achieves an accuracy close to 80%, about 15% higher than when all the features are used to train the classifier.

- For the neural nets, the wrapper only helps on Ecoli, and the improvement is not as significant as for Naïve Bayes or TAN. For the other two data sets (Yeast and Fly), the wrapper does not help at all.
- For these two data sets, the NN accuracy consistently decreases as the number of features goes down.
Figure 5.4.1: The effect of using the wrapper for feature selection. Each graph plots 5-fold cross-validation accuracy against the percentage of tokens removed, for the Ecoli, Yeast, and Fly data sets. The first graph shows how the wrapper works with NB, the second with TAN, and the third with the neural nets.
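For reference, the information-content weights that drive the wrapper can be recomputed for the worked example of Section 4.5.3 with a short script (a Python sketch, not our C implementation). The values in Table 4.5.1 are reproduced when the logarithm is taken base 4; the choice of base only rescales the weights and does not change the feature ranking, which is all the wrapper uses.

```python
from math import log

# The eight example sequences from Section 4.5.3: (class, feature vector).
data = [
    ('C', [1, 1, 1, 0, 0, 0, 0, 0]),
    ('C', [1, 1, 1, 1, 0, 0, 0, 0]),
    ('P', [0, 0, 1, 0, 0, 1, 0, 1]),
    ('P', [0, 0, 1, 0, 0, 0, 1, 1]),
    ('R', [1, 1, 1, 0, 0, 0, 0, 0]),
    ('R', [0, 1, 1, 0, 0, 1, 0, 0]),
    ('M', [0, 0, 1, 0, 1, 1, 0, 0]),
    ('M', [0, 1, 1, 1, 0, 0, 0, 0]),
]

def weight(j, base=4):
    """Wj = sum_v sum_c P(c, fj=v) log( P(c, fj=v) / (P(c) P(fj=v)) )."""
    n = len(data)
    w = 0.0
    for c in {cl for cl, _ in data}:
        for v in (0, 1):
            joint = sum(1 for cl, f in data if cl == c and f[j] == v) / n
            pc = sum(1 for cl, _ in data if cl == c) / n
            pv = sum(1 for _, f in data if f[j] == v) / n
            if joint > 0:                       # 0 * log(0) is taken as 0
                w += joint * log(joint / (pc * pv), base)
    return w

weights = [round(weight(j), 3) for j in range(8)]
print(weights)  # [0.352, 0.352, 0.0, 0.156, 0.147, 0.102, 0.147, 0.406]
```

As expected, feature #3 (which appears in every sequence) gets weight 0 and would be removed first, while feature #8 gets the highest weight and survives to the last round.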
5.5 Miscellaneous Learning Algorithms

We experimented with four other approaches using the existing tools WEKA and Libsvm, and recorded the 5-fold cross-validation accuracy. Additionally, since WEKA includes a Naïve Bayes classifier, we compare their tool with our implementation.

In the following table, some entries are empty. There are two reasons for this. One is that the training time is too long. For example, the rule learner classifier takes nearly 6 hours to train on Ecoli; since the Yeast data set has more records and more features, it was impractical to continue.

Tech. \ Data    Ecoli      Yeast      Fly
Decision Tree   81.9%      79.4%      --
Rule Learner    82.66%     --         --
Naïve Bayes     67.85%     69.16%     --
SVM             85.3165%   82.4734%   78.0016%

Table 5.5.1: The validation accuracy using some other techniques

The second reason for a blank entry is that when using the WEKA code on Fly, we ran out of memory. Since we cannot modify the WEKA code, we abandoned those experiments. Although we could not complete some of the tests using the existing tools, we can still gain some useful insight from the results we did get.

- The accuracy of WEKA's Naïve Bayes not only validates the correctness of our implementation, but also illustrates a strength of our own implementation, which can deal with all three data sets without running out of memory.

- Naïve Bayes is the worst classifier if we consider only accuracy. The other three techniques all achieve close to 80% accuracy, about 10% higher than NB. However, their execution times are much higher, as shown in the following section.

- The SVM technique not only handles all three data sets, but is also the winner among these techniques with respect to accuracy. For Ecoli, it achieves the highest accuracy, 85.3%, which is about 20% better than Naïve Bayes.
This makes SVM a potential alternative to the Naïve Bayes classifier, though it consumes more execution time than Naïve Bayes.
5.6 Computational Efficiency

As we saw before, the Naïve Bayes classifier is not as accurate as the other methods, but we believe it is the most practical classifier for our task. The reason can be seen from the following table:

Classifier   Naïve Bayes   TAN       Neural Nets   Decision Tree   Rule Learner   SVM
Time         5 mins        15 mins   30 mins       1 hr            6 hrs          12 mins

Table 5.6.1: The approximate execution times of the different techniques on Ecoli

We concluded in the last section that nearly all the other classifiers outperform Naïve Bayes with respect to accuracy. The table above suggests an interesting tradeoff: more accuracy, longer time. Classifiers that take more than half an hour, like the decision tree, cannot be considered practical for our task. Among the others, if our goal is classification accuracy, then our study shows that both TAN and SVM are good choices. In particular, SVM achieves about 20% higher accuracy than Naïve Bayes, but takes more than twice as long to train the classifier. Overall, considering both of our criteria, Naïve Bayes currently still seems to be the optimal classifier for our task. However, TANs and SVMs look to be excellent areas of future research, especially research to improve their training speed.
6. Conclusions and Future Work

6.1 Conclusions

In this course project, we explored several machine learning techniques for classification in a specific application domain, PENCE. Though our main focus was on Bayesian network classifiers (Naïve Bayes, TAN), we also tried several other approaches (decision trees, neural networks, SVMs, etc.), and tested discriminative parameter learning for Naïve Bayes. We compared the methods on both classification accuracy and efficiency (execution time) across a variety of experimental settings.

Based on our experimental results, we found that the harder a learner works (in terms of execution time), the better the results it obtains (in terms of classification accuracy); this is the trade-off between efficiency and accuracy. Taking all factors into account, we think NB + wrapper is a suitable solution for this application, although we are impressed by the accuracy that SVM achieves.

6.2 Future Work

One possible line of future work is feature selection, since the wrapper works quite effectively. There are many other algorithms for scaling up supervised learning. Several introduced in the last class this term could be tried, such as RELIEF-F, which draws samples at random and then adjusts the weights of features that discriminate instances from neighbors of different classes, and VSM, which integrates feature weighting into the learning algorithm.

Another possible way to reduce the feature dimensionality is to first cluster the feature set using statistical metrics and clustering techniques, and then perform the learning task on the clusters.

Considering the long execution time of almost all the algorithms except Naïve Bayes, speeding up the learning phase of the various algorithms is another aspect of future work.
Acknowledgments

The authors are grateful to Dr. Russ Greiner for his valuable comments on our project and useful discussions relating to this work. Jie Cheng and Wei Zhou's previous work on Bayesian networks and discriminative learning helped our work greatly. We also thank Dr. Duane Szafron and Dr. Paul Lu for their support with regard to the PENCE code and data, and Roman Eisner for helping us with some detailed problems. And perhaps most of all, we would like to thank the good people at Wendy's for providing us with tasty hamburgers at a reasonable price during the ungodly hours of the night while we worked late.
7. References

1. N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29:131-163, 1997.
2. G. Brassard and P. Bratley. Fundamentals of Algorithmics. Prentice Hall, 1996.
3. T. Mitchell. Machine Learning. McGraw Hill, 1997.
4. J. Cheng and R. Greiner. Comparing Bayesian Network Classifiers. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-99), Sweden, August 1999.
5. R. Greiner and W. Zhou. Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. In AAAI-02, Canada, 2002.
6. D. E. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, 1998.
8. Appendix

Percentage of      Naïve Bayes (Generative)      Naïve Bayes (Discriminative)
tokens removed     Ecoli    Yeast    Fly         Ecoli    Yeast    Fly
0                  67.8     69.1     68.3        --       --       --
5                  68.5     69.3     69.0        --       --       --
10                 69.0     69.5     69.2        --       --       --
15                 69.4     69.3     69.4        --       --       --
20                 70.1     70.0     69.8        --       --       --
25                 70.7     70.3     70.1        --       --       --
30                 71.3     70.7     70.4        --       --       --
35                 71.8     70.9     70.9        --       --       --
40                 72.4     70.8     70.9        --       --       --
45                 73.9     71.1     71.4        --       --       --
50                 74.5     71.2     71.2        --       --       --
55                 75.0     71.4     71.1        --       --       --
60                 75.6     71.6     71.1        --       --       --
65                 76.1     71.5     70.9        --       --       --
70                 76.6     71.3     70.0        --       --       --
75                 77.3     71.6     69.5        --       --       --
80                 77.1     71.1     69.1        --       --       --
85                 77.0     71.2     68.1        --       --       --
90                 75.9     69.2     65.7        --       --       --
95                 71.7     66.1     61.7        --       --       --
99                 40.4     60.2     39.78       --       --       --
100                0        0        0           --       --       --

Table 1: Empirical results (accuracy) of two approaches to learning the classifier with the wrapper, over the three data sets.
Percentage of      TAN                           Neural Nets
tokens removed     Ecoli    Yeast    Fly         Ecoli    Yeast    Fly
0                  67.8     69.4     68.7        85.7     87.1     76.3
5                  68.3     69.7     69.0        89.4     86.7     73.9
10                 69.0     70.1     69.3        88.8     84.5     72.0
15                 69.4     70.3     69.8        86.6     83.4     74.9
20                 70.0     70.7     70.2        82.7     82.4     68.3
25                 70.6     70.8     70.4        84.5     78.8     69.0
30                 71.3     71.0     70.6        78.9     78.4     67.7
35                 71.7     71.3     71.0        78.7     77.4     64.3
40                 72.4     71.3     71.4        68.3     75.1     58.5
45                 73.8     71.6     71.7        72.1     73.7     56.7
50                 74.4     71.6     71.5        68.2     71.8     54.8
55                 74.8     71.9     71.4        64.7     69.2     51.2
60                 75.5     72.1     71.2        68.5     63.1     45.7
65                 76.1     72.2     71.2        65.3     63.3     50.8
70                 76.6     71.7     70.3        63.4     58.8     43.1
75                 77.2     72.0     70.0        59.4     51.8     42.7
80                 77.0     71.4     69.3        55.8     47.5     37.5
85                 77.0     71.1     68.8        49.3     42.6     28.0
90                 75.9     69.4     66.3        33.2     30.2     25.6
95                 71.7     66.6     62.3        20.1     26.6     24.3
99                 40.9     60.3     40.1        13.8     16.3     11.0
100                0        0        0           0        0        0

Table 2: Empirical results (accuracy) of two approaches to learning the classifier with the wrapper, over the three data sets.