The Florida State University
     College of Arts and Sciences




   A Family of Global Protein Shape
   Descriptors Using Gauss Integrals


                  By

   Christian Edgar Laing Celestino


A proposal submitted to the Department
of Mathematics in partial fulfillment of
 the doctoral preliminary examination



             April 30, 2004
Table of Contents



        Abstract

1       Background and Significance

    1.1 CATH Protein Structure Classification
    1.2 Current Methods and Importance of a New Approach
    1.3 The Writhing Number
        1.3.1 Directional Writhing Number
        1.3.2 Natural Notion of the Writhing Number for Polygonal Curves
    1.4 Representing Proteins in R^20
        1.4.1 Results of the SGM when Tested for CATH 2.4

2       The Experimental Plan

    2.1 Purpose and Objectives
    2.2 Procedures

        References

Abstract

        Within the field of biology, the comparison, description and prediction of biological
structures are important tasks. In the case of proteins, it is of great interest to characterize
and thereby classify these three-dimensional structures. Protein structures can be
classified in a variety of interrelated ways, such as by functional similarity, evolutionary
similarity, and fold similarity. Two similar proteins can have different sequences,
but comparison of their structures can reveal distant evolutionary
relationships that would not be evident from sequence information alone. Proteins also have
three-dimensional structures that provide clues to their function in living organisms.

        Protein classification focuses on identifying proteins that have similar chemical
architectures and topology. Because it is not practical to study in detail all the protein
structures in every genome, the functional role of a new protein in the cell can be inferred
from an already classified protein with a similar structure. This is why it is important to
develop new methods for classifying the 3D structures of proteins.

         Today, a great amount of protein information is obtained from experimental
methods such as X-ray crystallography (1) and NMR spectroscopy (2). The data are
deposited in public-domain resources such as the Protein Data Bank (3). Structural
classifications of proteins, such as CATH (Class, Architecture, Topology and
Homologous Superfamily, see 4-5) and SCOP (Structural Classification of Proteins, see
6), are also available as databases. However, some of the classification steps are done
by manual inspection. Because of the rapid increase in the number of known protein structures
(25,004 as of April 2004, growing by more than 450 per month (3)), a fully automatic method
(using solely computer algorithms) is required.

            Currently there are several computer methods for the structural comparison of
proteins (7). Examples are CE (8), DALI (9), KENOBI (10), and STRUCTAL
(11). Such methods are also in the public domain, and in some cases the program itself is
available for download. These structural comparison methods are based on computing
pairwise distances between the alpha carbon atoms of the proteins, but they
present several complications. First, they are computationally expensive
because they require an alignment between two molecules in order to compare the proteins.
Additionally, the measures that are used violate the triangle inequality
(d(x, z) ≤ d(x, y) + d(y, z)). Consequently, these computations have little meaning for
proteins at large distance, that is, when their structures are far from similar.
Because of these complications, a better and different approach is needed.

       Peter Rogen and Boris Fain, in the group of Michael Levitt at Stanford University,
have developed a new automatic classification of proteins using Gauss integrals. Each
protein is described by a vector of 20 numbers, inspired by Vassiliev knot invariants, that
captures the topology of the protein (12), (13). These 20 numbers are obtained as
combinations of a geometric quantity called the “writhing number”.

        This work is still in progress, and it has shown good results when tested on
the protein database CATH 2.4, correctly classifying 98.6% of the protein crystal
structure data used.

        The authors leave an interesting point open (12): “While we have geometric
interpretation of the writhing number we would like to understand the other generalized
Gauss integrals used in this work”. We intend to investigate and answer this question.




1.   Background and Significance

        Proteomics is the study of the full set of proteins encoded by a genome, and
Structural Proteomics is a sub-area of Proteomics that studies the structure of proteins. So
far, many genomes have been fully sequenced, including those of yeast, Drosophila melanogaster
and Homo sapiens. The full value of the sequence data will be realized when we can assign
the role of each protein in the cell, and this requires a full set of tools for the classification of
proteins, such as computer databases like CATH and structure comparison methods like DALI.


       1.1 CATH Protein Structure Classification

      CATH is a hierarchical classification of protein domain structures in the Protein
Data Bank (3) which clusters proteins at four levels: Class (C), Architecture (A),
Topology (T) and Homologous Superfamily (H).

        Such classification operates at the level of structural domains, as these domains
are likely to be the fundamental evolutionary building blocks or units (5). When a protein
has a similarity to another protein already in the database, then the new protein inherits
the domain boundaries of the existing entry. If the new protein has no relative in the
CATH database, three different algorithms (DETECTIVE, PUU and DOMAK) are used
to identify the structural domain automatically. If all the programs agree, the domain
boundaries are assigned. If not, then the domain boundaries are assigned manually based
on the rules below (see also 14). The four levels of CATH are described next, and Figure 1
shows the hierarchy for the C, A, and T levels. References for CATH can be found in (4),
(5) and (14).

   •   Class C level is assigned by considering the secondary structure and packing
       within the structure. Four classes are recognized: mainly alpha, mainly beta,
       alpha-beta and a fourth class, which contains protein domains that have low
       secondary structure content. More than 90% of protein structures are assigned
       to their class automatically; the rest are assigned by hand.

   •   Architecture A level describes the overall shape of secondary structures in three-
       dimensional space but ignores their connectivity. Although an automatic
       procedure is being developed, it is currently assigned manually using a basic
       description of the secondary arrangements (e.g. roll, sandwich).

   •   Topology T level groups structures into fold families depending on the shape and
       connectivity of the secondary structures. This fold group is also related to protein
       domains that show a similarity in structure but have no sequence similarity. The
       assignments are made by sequence and structure comparison (an SSAP score
       greater than 70 is required) (5).



   •   Homologous Superfamily H level groups domains that are thought to share
       a common ancestor (homologous families), on the basis of either sequence similarity
       (35%) or high structural similarity (20%). Structural similarity is assessed by an
       automatic method (SSAP>80).




               Figure 1. Hierarchy of CATH at C, A and T levels. From reference (4).




       1.2 Current Methods and Importance of a New Approach


        In order to find similarity between 3D protein structures in the crystal state,
scientists have built a wide variety of protein structure alignment methods and techniques
such as distance matrix alignment (9), genetic algorithms (10) and double dynamic
programming (11). The general idea is to consider the backbones of two proteins
as two chains, A and B, in three-dimensional space, and to find sub-chains α and β
of A and B respectively, such that the lengths of the sub-chains α and β are equal and
maximal with the property that α and β are similar (see figure 2).



The most common parameter that expresses the difference between two proteins
is RMSD or root mean square deviation. RMSD can be computed using the position of
the alpha carbon atoms of the protein backbone and is a function of the distance between
atoms in one structure and the same atoms in another structure.
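
As an illustration of the RMSD just described, the following is a minimal sketch in Python with NumPy. It assumes the two structures have already been reduced to matched lists of alpha-carbon coordinates of equal length. The function names `rmsd` and `superposed_rmsd` are chosen here for illustration, and the superposition step uses the standard Kabsch algorithm, which is an assumption: the text does not say how the cited methods superpose structures.

```python
import numpy as np

def rmsd(a, b):
    """RMSD between two (n, 3) arrays of matched alpha-carbon coordinates."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def superposed_rmsd(a, b):
    """RMSD after rotating and translating b onto a (Kabsch algorithm)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a0 = a - a.mean(axis=0)            # remove the translation
    b0 = b - b.mean(axis=0)
    h = b0.T @ a0                      # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (a reflection)
    d = 1.0 if np.linalg.det(vt.T @ u.T) > 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rmsd(a0, b0 @ rot.T)
```

Note that even this simple measure presupposes a one-to-one matching of atoms, which is exactly the alignment step that makes the methods above expensive.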

       Because of the nature of these methods, we encounter some complications:

   •   A protein structure can contain several hundred atoms, so finding such
       alignments may be computationally expensive. A structural comparison method
       needs to be fast.

   •   As discussed in the introduction, these methods fail to satisfy the triangle
       inequality. Consider three proteins made of the following sequences:
       protein A=DEF-LMN, protein B=GHI-LMN and protein C=GHI-OPQ.
       There is a similarity between proteins A and B in the LMN region, and a
       similarity between proteins B and C in the GHI region. However, we cannot infer
       a similarity between A and C (see figure 3). The triangle inequality
       d(A, C) ≤ d(A, B) + d(B, C) is violated: A is close to B and B is close to C, yet
       A is far from C. When this occurs, we are unable to judge dissimilarity, and the
       problem worsens with increasing distance.

   •   In order to compute such measures, the methods require a series of adjustable
       parameters such as gap and insertion penalties, weights, etc.
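
The failure of the triangle inequality described in the list above can be reproduced with a toy score. The sketch below is illustrative only: `toy_dissimilarity` is a hypothetical measure based on the longest shared block of letters, and it is not the measure used by any of the cited alignment methods. It only mimics their behaviour of rewarding the best local match.

```python
def longest_common_block(a, b):
    """Length of the longest contiguous block shared by strings a and b."""
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best

def toy_dissimilarity(a, b):
    """Hypothetical score: small when the best local match is long."""
    return 1.0 / (1.0 + longest_common_block(a, b))

A, B, C = "DEFLMN", "GHILMN", "GHIOPQ"
d_ab = toy_dissimilarity(A, B)      # A and B share LMN
d_bc = toy_dissimilarity(B, C)      # B and C share GHI
d_ac = toy_dissimilarity(A, C)      # A and C share nothing
violates = d_ac > d_ab + d_bc       # True: the triangle inequality fails
```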




       Figure 2. Two chains in three-dimensional space.
       Figure 3. Failure of the triangle inequality. From reference (12).




       These complications motivate the search for a better, more efficient and fully
automated method. The protein backbone is a space curve, and mathematicians study
such curves in areas such as Knot Theory and Differential Geometry. We wish to apply
these mathematical techniques to the protein classification problem.




1.3 The Writhing Number

       We start with the concepts of the linking number and the twist. These two numbers,
together with the writhing number, are related by a simple formula. The definitions below
follow (15) and (16).

       A strip (C,U) is a smooth¹ curve C together with a smoothly varying unit vector
U(t) perpendicular to C at each point.

Definition 1. If C1 (t1 ) and C2 (t2 ) are two disjoint oriented closed curves in space
parametrized by [0,1], the linking number is defined by the integral


$$
Lk(C_1, C_2) = \frac{1}{4\pi} \int_{C_1} \int_{C_2}
\frac{(C_1(t_1) - C_2(t_2)) \cdot (\partial C_1/\partial t_1 \times \partial C_2/\partial t_2)}
     {|C_1(t_1) - C_2(t_2)|^3} \, dt_1 \, dt_2
$$


       The linking number is an integer that measures the entanglement between two
curves. Examples of the linking number are shown in figure 4 below. Notice that figure
4c shows two curves that are entangled even though their linking number is
zero.
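
The integral in Definition 1 can be checked numerically. The following is a minimal sketch in Python with NumPy that approximates the double integral by the midpoint rule and estimates tangents by central differences. The function name `gauss_linking_number`, the number of sample points and the three example circles are assumptions made for illustration. The overall sign depends on the orientations chosen for the two curves.

```python
import numpy as np

def gauss_linking_number(curve1, curve2, n=400):
    """Midpoint-rule estimate of the Gauss linking integral of Definition 1.

    curve1 and curve2 map an array of parameters in [0, 1] to an (n, 3)
    array of points on two disjoint closed curves.
    """
    t = (np.arange(n) + 0.5) / n
    h = 1e-6                                          # step for tangents
    p1, p2 = curve1(t), curve2(t)
    d1 = (curve1(t + h) - curve1(t - h)) / (2 * h)
    d2 = (curve2(t + h) - curve2(t - h)) / (2 * h)
    diff = p1[:, None, :] - p2[None, :, :]            # C1(t1) - C2(t2)
    cross = np.cross(d1[:, None, :], d2[None, :, :])  # C1' x C2'
    integrand = np.sum(diff * cross, axis=2) / np.linalg.norm(diff, axis=2) ** 3
    return float(integrand.sum() / (n * n) / (4 * np.pi))

def circle_xy(t):
    """Unit circle in the xy-plane centered at the origin."""
    a = 2 * np.pi * np.asarray(t)
    return np.stack([np.cos(a), np.sin(a), np.zeros_like(a)], axis=1)

def circle_xz_shifted(t):
    """Unit circle in the xz-plane centered at (1, 0, 0); it threads circle_xy."""
    a = 2 * np.pi * np.asarray(t)
    return np.stack([1 + np.cos(a), np.zeros_like(a), np.sin(a)], axis=1)

def circle_far(t):
    """Unit circle in the xy-plane centered at (5, 0, 0); unlinked from circle_xy."""
    return circle_xy(t) + np.array([5.0, 0.0, 0.0])

lk_linked = gauss_linking_number(circle_xy, circle_xz_shifted)
lk_unlinked = gauss_linking_number(circle_xy, circle_far)
```

For the two linked circles the estimate should be close to +1 or -1, and for the two separated circles it should be close to 0.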




                                              Figure 4. From reference (16).


         For any simple closed strip, the curves C + εU, given parametrically by C(t) + εU(t),
are, for sufficiently small ε > 0, simple closed curves disjoint from C, and the linking
number Lk(C, C + εU) is defined and independent of ε. The vectors C'(t), U(t) and
V(t) = C'(t) × U(t) define a moving frame (C', U, V) along C. Let Ω denote the angular
velocity vector describing the rate of rotation of the frame with respect to the arclength t,
so that C'' = Ω × C', U' = Ω × U and V' = Ω × V. Let ω1, ω2 and ω3 be the components of
Ω referred to the moving frame, i.e., Ω = ω1 C' + ω2 U + ω3 V. Then ω1 represents the
angular rate at which U revolves around C; ω1 is called the twist of the strip at each point
of the curve.

¹ A curve C is smooth if it is infinitely differentiable.


Definition 2. The total twist number Tw(C,U) is defined as the integral of ω1 with
respect to the arclength t over the curve C, divided by 2π. That is,

$$
Tw(C, U) = \frac{1}{2\pi} \int_C \omega_1 \, dt .
$$

The total twist number need not be an integer. If the curve C is a simple plane curve,
then the linking number Lk(C, C + εU) and the total twist number Tw(C,U) are equal.

Definition 3. The difference Wr(C) = Lk(C, C + εU) − Tw(C,U) is a geometric invariant
of the curve C and is called the writhing number.



1.3.1 Directional Writhing Number

Definition 4. A smooth simple closed curve C and a fixed unit vector σ are said to be in
general position if the tangents to C are never parallel to σ. In this case the curves
C + εσ are disjoint from C for all sufficiently small ε > 0, hence for such ε we
can define the directional writhing number of C in the direction of σ by
Wr(C, σ) = Lk(C, C + εσ).

       If C and σ are in general position, the orthogonal projection of C onto a plane
with normal σ defines a smooth closed plane curve Cσ for which undercrossings and
overcrossings can be distinguished at each crossing point (see figure 5 below). At a
crossing point c of an oriented regular diagram for a curve, there are two possible
configurations: either sign(c) = +1 or sign(c) = −1, as shown in figure 5. The sign of a
crossing is based on the right hand rule convention.




                                          Figure 5.
       If one adds the signs of all the crossings in a fixed regular projection of a
curve along a direction σ, one obtains the directional writhing number Wr(C, σ). The
writhing number Wr of a curve C is equal to the average of the directional writhing
number over all projection directions, where the average is taken with respect to area on
the unit sphere.

        Figure 6 shows examples of regular projections of two knots. For the oriented
projection of the trefoil knot (left) the projected writhing number is 3, while for
the oriented projection of the figure eight knot (right) it is 0.




                                                            Figure 6
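
The signed crossing count described above can be sketched for a closed polygon projected along the z axis. This is a minimal illustration in Python with NumPy, not the procedure of any cited reference. The function name `directional_writhe_z` and the polygonal trefoil used as input are assumptions. The sign rule coded here (a crossing is positive when rotating the over-strand direction counterclockwise brings it onto the under-strand direction) is the usual right hand rule; the opposite convention flips every sign.

```python
import numpy as np

def directional_writhe_z(vertices):
    """Sum of signed crossings of a closed polygon projected along the z axis.

    vertices is an (n, 3) array; the polygon closes from the last vertex back
    to the first. Assumes general position: no crossing falls on a vertex.
    """
    p = np.asarray(vertices, dtype=float)
    n = len(p)
    total = 0
    for i in range(n):
        a0, a1 = p[i], p[(i + 1) % n]
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue                  # these two edges share vertex 0
            b0, b1 = p[j], p[(j + 1) % n]
            da, db = (a1 - a0)[:2], (b1 - b0)[:2]
            denom = da[0] * db[1] - da[1] * db[0]
            if denom == 0.0:
                continue                  # parallel projections do not cross
            r = (b0 - a0)[:2]
            s = (r[0] * db[1] - r[1] * db[0]) / denom   # position along edge a
            u = (r[0] * da[1] - r[1] * da[0]) / denom   # position along edge b
            if 0.0 < s < 1.0 and 0.0 < u < 1.0:
                za = a0[2] + s * (a1[2] - a0[2])
                zb = b0[2] + u * (b1[2] - b0[2])
                # denom is cross(da, db); use cross(over, under) for the sign
                sign = np.sign(denom) if za > zb else -np.sign(denom)
                total += int(sign)
    return total

# A polygonal trefoil with 150 vertices (a standard parametrization)
t = np.linspace(0.0, 2.0 * np.pi, 150, endpoint=False)
trefoil = np.stack([np.sin(t) + 2 * np.sin(2 * t),
                    np.cos(t) - 2 * np.cos(2 * t),
                    -np.sin(3 * t)], axis=1)
w = directional_writhe_z(trefoil)
```

The three crossings of this projection all have the same sign, so `w` has absolute value 3. Reflecting the curve in the plane z = 0 reverses the sign.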


       The writhing number Wr of a closed space curve γ can be calculated using
generalized Gauss integrals:

$$
Wr(\gamma) = \frac{1}{4\pi} \iint_{\gamma \times \gamma \setminus D} w(t_1, t_2) \, dt_1 \, dt_2 ,
$$

where

$$
w(t_1, t_2) = \frac{[\gamma'(t_1), \gamma(t_1) - \gamma(t_2), \gamma'(t_2)]}{|\gamma(t_1) - \gamma(t_2)|^3}
$$

and D is the diagonal of γ × γ. The numerator of w(t1, t2) is the triple scalar product

$$
[\gamma'(t_1), \gamma(t_1) - \gamma(t_2), \gamma'(t_2)]
  = \gamma'(t_1) \cdot \{[\gamma(t_1) - \gamma(t_2)] \times \gamma'(t_2)\} ,
$$

which is also equal to the oriented volume of the parallelepiped spanned by γ'(t1),
γ(t1) − γ(t2) and γ'(t2). Exchanging t1 and t2 swaps the first and third vectors and
negates the second, so the triple product is unchanged and w(t1, t2) = w(t2, t1).
Assuming that γ is parametrized by [0,1], it therefore suffices to calculate the integral
on the domain Δ² = {(t1, t2) : 0 < t1 < t2 < 1}. If

$$
I_{(1,2)}(\gamma) = \iint_{\Delta^2} w(t_1, t_2) \, dt_1 \, dt_2 ,
$$

then the integral over γ × γ \ D is twice I_(1,2)(γ), and

$$
Wr(\gamma) = \frac{1}{2\pi} I_{(1,2)}(\gamma) .
$$

        Another measure for curves is the average crossing number, defined by
taking the absolute value of the integrand:

$$
I_{|1,2|}(\gamma) = \iint_{\Delta^2} |w(t_1, t_2)| \, dt_1 \, dt_2
$$




        The main difference between knots and the space curves representing protein
backbones is that for knots we deal with simple closed curves,
while for protein backbones we have polygonal curves which are not closed.



           1.3.2 Natural Notion of the Writhing Number for Polygonal Curves


           For a polygonal curve the natural definition of writhing number is

$$
I_{(1,2)}(\gamma) = Wr(\gamma) = \sum_{0 < i_1 < i_2 < N} W(i_1, i_2),
$$

with

$$
W(i_1, i_2) = \frac{1}{2\pi} \int_{t_1 = i_1}^{i_1 + 1} \int_{t_2 = i_2}^{i_2 + 1} w(t_1, t_2) \, dt_1 \, dt_2
$$

and w(t1, t2) = [γ'(t1), γ(t1) − γ(t2), γ'(t2)] / |γ(t1) − γ(t2)|³.


        Here W(i1, i2) is the contribution to the writhing number coming from the i1-th
and the i2-th line segments. W(i1, i2) is equal to the probability that, seen from an
arbitrary direction, the i1-th and the i2-th line segments cross, multiplied by the sign
of this crossing. Thus, geometrically, this notion of writhing number is still the projected
writhing number averaged over all projections.
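
The contribution W(i1, i2) of one pair of segments has a closed form: its magnitude is the area of a spherical quadrilateral divided by 2π. The following is a minimal sketch in Python with NumPy. The arcsine area formula is the one commonly used for polygonal writhe computations, and it is swapped in here as an assumption; it is not taken from the text. The names `segment_pair_writhe` and `polygon_writhe` are illustrative. The sign is taken from the triple product exactly as written in the text; ordering the triple product differently flips the overall sign.

```python
import numpy as np

def segment_pair_writhe(p1, p2, p3, p4):
    """W(i1, i2) for the segments p1->p2 and p3->p4.

    The magnitude is the fraction of viewing directions from which the two
    segments appear to cross. The sign is that of the triple product
    [gamma'(t1), gamma(t1) - gamma(t2), gamma'(t2)], which is constant
    over a pair of straight segments.
    """
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    triple = np.dot(p2 - p1, np.cross(p1 - p3, p4 - p3))
    if triple == 0.0:
        return 0.0                     # coplanar segments contribute nothing
    r13, r14, r23, r24 = p3 - p1, p4 - p1, p3 - p2, p4 - p2
    normals = [np.cross(r13, r14), np.cross(r14, r24),
               np.cross(r24, r23), np.cross(r23, r13)]
    normals = [v / np.linalg.norm(v) for v in normals]
    # Area of the spherical quadrilateral swept by the chord directions
    area = sum(np.arcsin(np.clip(np.dot(normals[k], normals[(k + 1) % 4]), -1.0, 1.0))
               for k in range(4))
    return float(np.sign(triple) * area / (2.0 * np.pi))

def polygon_writhe(vertices):
    """Writhe of an open polygonal curve: the sum of W over all segment pairs."""
    p = np.asarray(vertices, dtype=float)
    n_seg = len(p) - 1
    total = 0.0
    for i in range(n_seg):
        for j in range(i + 2, n_seg):  # adjacent segments are coplanar
            total += segment_pair_writhe(p[i], p[i + 1], p[j], p[j + 1])
    return total
```

As a check, two perpendicular segments of length 2 whose midpoints are one unit apart, such as (−1,0,0)→(1,0,0) and (0,−1,1)→(0,1,1), give |W| = 1/3: the relevant solid angle is that of one face of a cube seen from its center, 4π/6, and dividing by 2π gives 1/3.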

           By combining these numbers we can build a whole set of structural measures, e.g.

$$
I_{|1,2|}(\gamma) = \sum_{0 < i_1 < i_2 < N} |W(i_1, i_2)| ,
$$

$$
I_{|1,3|(2,4)}(\gamma) = \sum_{0 < i_1 < i_2 < i_3 < i_4 < N} |W(i_1, i_3)| \, W(i_2, i_4) ,
$$

$$
I_{|1,5|(2,4)(3,6)}(\gamma) = \sum_{0 < i_1 < i_2 < i_3 < i_4 < i_5 < i_6 < N}
  |W(i_1, i_5)| \, W(i_2, i_4) \, W(i_3, i_6) ,
$$

where N is the number of vertices of the polygonal curve.

Numbers like the ones just mentioned will constitute the building blocks for our protein
domain descriptors, which are described in the next section.
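
The sums above can be evaluated directly once the matrix of pair contributions W(i, j) is known. The following is a naive sketch in plain Python. The function name `gauss_measure` and its `pairing` and `absolute` arguments are hypothetical conventions introduced here. The enumeration visits every increasing index tuple, so its cost grows like N^(2k) for a measure of order k, and it is only practical for short chains.

```python
from itertools import combinations

def gauss_measure(W, pairing, absolute=()):
    """Evaluate a measure such as I_(1,3)(2,4) from the pair contributions W.

    W is a square nested list or array with W[i][j] the contribution of
    segments i and j (0-based). pairing is a tuple of index pairs using the
    labels from the text, e.g. ((1, 3), (2, 4)). Pairs also listed in
    absolute enter the product through their absolute value.
    """
    order = 2 * len(pairing)
    total = 0.0
    for idx in combinations(range(len(W)), order):   # idx is increasing
        term = 1.0
        for pair in pairing:
            value = W[idx[pair[0] - 1]][idx[pair[1] - 1]]
            term *= abs(value) if pair in absolute else value
        total += term
    return total
```

With this convention, I_(1,2) is `gauss_measure(W, ((1, 2),))`, I_|1,2| is `gauss_measure(W, ((1, 2),), absolute=((1, 2),))`, and I_|1,3|(2,4) is `gauss_measure(W, ((1, 3), (2, 4)), absolute=((1, 3),))`.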




1.4 Representing Proteins in R^20

        As mentioned before, the protein backbone is a space curve (see figure 7 below).
We are interested in the absolute measures of the geometry of these curves by studying
the self-crossings seen in a planar projection. These measures are inspired by generalized
Gauss integrals involved in formulas for the Vassiliev knot invariants.




                           Figure 7. Backbone curve of Lysozyme from Gallus Gallus, from (3).



        For each protein domain in CATH 2.4, we compute geometric invariants of the polygonal
curve connecting the α-carbon atoms. Each domain is assigned a 20-dimensional vector
containing the following measures:

I_(1,2), I_|1,2|, I_(1,3)(2,4), I_(1,2)(3,4), I_(1,4)(2,3), I_(1,2)(3,4)(5,6), I_(1,2)(3,5)(4,6),
I_(1,2)(3,6)(4,5), I_(1,3)(2,4)(5,6), I_(1,3)(2,5)(4,6), I_(1,3)(2,6)(4,5), I_(1,4)(2,3)(5,6),
I_(1,4)(2,5)(3,6), I_(1,4)(2,6)(3,5), I_(1,5)(2,3)(4,6), I_(1,5)(2,4)(3,6), I_(1,5)(2,6)(3,4),
I_(1,6)(2,3)(4,5), I_(1,6)(2,4)(3,5), and I_(1,6)(2,5)(3,4).

        The measures are normalized such that each value is between –1 and 1. The
normalization factors are one over 146, 1277, 119, 101 023, 1206, 477 989, 6612, 23 946,
6448, 203, 1884, 54 581, 172, 258, 1246, 293, 1396, 36 143, 442, and 2468 respectively
for the measures in the order above.

       Once each protein chain is mapped onto a point in the 20-dimensional space, the
usual Euclidean metric is used to compare the protein chains:

$$
d(x, y) = \sqrt{\sum_{i=1}^{20} (x_i - y_i)^2}
$$

      Because of the scaling factors given above, this metric is called the Scaled
Gauss Metric (SGM).
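
The scaling and the distance can be sketched in a few lines of plain Python. The list `SGM_SCALES` holds the 20 denominators quoted above, in the same order as the measures. The function names and the input convention (two lists of 20 unscaled measures) are assumptions made for illustration.

```python
import math

# Scaling denominators quoted in the text, in the order of the 20 measures
SGM_SCALES = [146, 1277, 119, 101023, 1206, 477989, 6612, 23946, 6448, 203,
              1884, 54581, 172, 258, 1246, 293, 1396, 36143, 442, 2468]

def scale_measures(raw):
    """Divide each of the 20 raw Gauss measures by its scaling denominator."""
    if len(raw) != len(SGM_SCALES):
        raise ValueError("expected 20 measures")
    return [value / scale for value, scale in zip(raw, SGM_SCALES)]

def sgm_distance(raw_x, raw_y):
    """Euclidean distance between two scaled 20-dimensional descriptors."""
    x, y = scale_measures(raw_x), scale_measures(raw_y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Because the descriptor has a fixed length, two chains of different sizes can be compared with a single call, with no alignment step.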


         1.4.1 Results of the SGM when Tested for CATH 2.4

       Let x, y and z be points in R^20. Then the Scaled Gauss Metric satisfies the three
properties of a pseudometric:

i) d(x, y) = 0 if x = y
ii) d(x, y) = d(y, x) (symmetry)
iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

        The fact that SGM satisfies the triangle inequality is important because it allows
us to judge dissimilarity between proteins.

       A computer algorithm (12,13,17) based on this metric was built to classify
all 20,937 domains of CATH 2.4 (as of September 2002). The total success rate
was 98.6%. The remaining 1.4% of the chains are unknown; of these, 0.9% are actually
new folds. The algorithm made no misclassifications, since unknown structures were flagged
instead of being misclassified. Also, proteins of different sizes can be compared directly,
without the use of alignment or gap penalties. Figure 8 shows a projection from R^20 to R^2
that displays the CATH hierarchy. Here, every point represents a protein domain in CATH.

        As described by the authors (12), the rectangle in the upper left contains all the
chains in CATH, colored according to their class (α, β, αβ and few secondary
structures). Notice that the αβ group resides between the α and the β groups. This
observation shows the agreement between the automatic classification created
by the SGM and the assignments currently given in the CATH database.

       Figure 9 shows the usefulness of the second order invariants. In this example the
curves A and B possess the same crossing number and average crossing number.
However, the second order invariants can differentiate between the two curves.




Figure 8. From reference (12).




Figure 9. From reference (12).




2. The Experimental Plan

         2.1 Purpose and Objectives

      The SGM is elegant, fast and computationally viable, and its excellent results,
shown in the previous section, motivate one to understand the true geometric meaning of
such measures.

        As mentioned before, the geometric meaning of these measures is still not
fully understood (12-13). While there is a geometric interpretation of the writhing
number (I_(1,2)) and the average crossing number (I_|1,2|), the meaning of the higher order
measures is still a mystery. Another important question worth investigating is
whether it is possible to classify protein structure domains with fewer Gauss measures
(described in 1.4): whether some of the measures are strongly correlated, whether some
provide more information than others, and whether the combinations used can be improved.
Finally, it might be possible to apply this method to the classification of RNA secondary
structures.

      In this research I intend, with the support of my advisor, De Witt
Sumners, to complete the following objectives:

        I)      Determine the geometric meaning of the higher order invariants obtained
                from the Gauss integral measures. Such work will validate the importance
                of the role of these numbers and corroborate the excellent results obtained
                from experimental evidence.

        II)     Optimize the choice of the invariant numbers used to classify the protein
                structures. This will allow an increase of the speed and efficiency of the
                computer algorithms to classify the protein structures by selecting the best
                shape descriptors, and the minimum quantity necessary of such
                descriptors.

        III)    Study the mathematical idea involved in these numbers and the possible
                applications to branches of mathematics such as Knot Theory and
                Differential Geometry.

        IV)     Explore the possibility of application of these methods to the classification
                of RNA secondary structures. Since an RNA secondary structure can be
                seen as a chain or a polygonal curve, an approach to this unexplored topic
                could result in promising and new applications of mathematics in biology.


    The research questions are as follows:



Are the numbers obtained by using the higher order writhe calculations truly shape
descriptors of space curves? Or, are they just numbers chosen by chance, that work only
for very particular curves?

    The answer to these questions will unveil the true geometric meaning of these higher
order invariants. This is fundamental to validate the automatic classification computer
method for novel protein structure domains.



       2.2 Procedures

    The research will be based on mathematics and on biology as described below.

    To begin with, we will review the classical literature on the writhing
number, such as the work by J. H. White (18), G. Călugăreanu (19), and Brock Fuller
(15-16), as well as the more recent literature that also focuses on the concept of writhing
number for open and closed curves (20-28). A study of the proofs and the methods for solving
the primary cases would provide clues for solving the general case for the higher order
invariants.

    Another fundamental step is to review current computer algorithms
designed to calculate the writhing number, particularly those applied to fields such as
biology and physics (27). Some of these computer algorithms are in the public domain and
can be downloaded (28).

    An algorithm to compute the writhing number is essential to understand and to verify
the geometric ideas. Using Monte Carlo simulations, we intend to estimate the writhing
number of a polygonal curve of length n in the simple cubic lattice. The advantage of using a
simple cubic lattice is that, for a closed curve, the writhing number
computation reduces to the average of the linking numbers of the given curve with four of its
pushoffs (24). The next step would be to study the higher order invariants on this simple
cubic lattice.
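
The pushoff computation mentioned above can be sketched as follows in Python with NumPy. This is a hedged illustration, not the algorithm of reference (24). The choice of the four pushoff vectors (one in each of four mutually non-antipodal octants) is an assumption about how that result is set up. The linking number of two polygons is evaluated with the same arcsine area formula commonly used for polygonal Gauss integrals, with the sign convention of Definition 1. All function names are illustrative.

```python
import numpy as np

def _pair_gauss(p1, p2, p3, p4):
    """Gauss integral over segments p1->p2 and p3->p4, divided by 4*pi."""
    r13, r14, r23, r24 = p3 - p1, p4 - p1, p3 - p2, p4 - p2
    triple = np.dot(np.cross(p4 - p3, p2 - p1), r13)
    if abs(triple) < 1e-12:
        return 0.0                     # coplanar segments contribute nothing
    n = [np.cross(r13, r14), np.cross(r14, r24),
         np.cross(r24, r23), np.cross(r23, r13)]
    n = [v / np.linalg.norm(v) for v in n]
    area = sum(np.arcsin(np.clip(np.dot(n[k], n[(k + 1) % 4]), -1.0, 1.0))
               for k in range(4))
    return np.sign(triple) * area / (4.0 * np.pi)

def linking_number(a, b):
    """Linking number of two disjoint closed polygons given as vertex arrays."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    total = 0.0
    for i in range(len(a)):
        for j in range(len(b)):
            total += _pair_gauss(a[i], a[(i + 1) % len(a)],
                                 b[j], b[(j + 1) % len(b)])
    return float(total)

def lattice_writhe(polygon):
    """Average linking number of a closed lattice polygon with four pushoffs."""
    p = np.asarray(polygon, float)
    # Assumed pushoff directions: four mutually non-antipodal octants
    pushes = [(0.5, 0.5, 0.5), (-0.5, 0.5, 0.5),
              (0.5, -0.5, 0.5), (-0.5, -0.5, 0.5)]
    return sum(linking_number(p, p + np.array(v)) for v in pushes) / 4.0
```

Each pushoff lies off the lattice, so it is disjoint from the original polygon, and each linking number is an integer. Four times the returned value is therefore an integer, and a planar lattice polygon such as the unit square gives 0.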

    To validate the simulation results we would like to consider some examples.
We will first consider simple cases where we know the answer, and then we will apply
these methods to polygonal curves describing the backbones of some protein crystal
structures. Such data can be obtained from the Protein Data Bank (3).

    Finally, we would like to apply this method to RNA secondary structures. A
ribonucleic acid (RNA) molecule consists of a chain of ribonucleotides linked together by
covalent chemical bonds (29). Figure 10 shows a model of an RNA structure obtained
from the Protein Data Bank. We notice that RNA structures, such as the one in figure 10,
can be seen as a chain that bends and twines about itself. Such self-crossings are of
particular interest because the Gauss measures, designed to describe the shape of proteins,
can be applied to these chains.



With these approaches we expect to understand the geometric meaning of these higher
order invariants.




                    Figure 10. Pseudoknot within the gene 32 messenger RNA
                    of Bacteriophage T2. Image obtained from the Protein Data Bank (3).




References

 (1) Gale Rhodes. Crystallography Made Crystal Clear. Academic Press, 2000,
     Second Edition.

 (2) Joseph P. Hornak. The Basics of NMR. <http://www.cis.rit.edu/htbooks/nmr/>.

 (3) Protein Databank, Available from <http://beta.rcsb.org/pdb/>.

 (4) CATH Protein Structure Classification <http://www.biochem.ucl.ac.uk/bsm/cath/>.

 (5) Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P.,
     Thornton, J. M. and Orengo, C. A. Assigning Genomic Sequences to CATH.
     Nucleic Acids Research. 2000, Vol 28. No 1. 277-282.

 (6) Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP : A Structural
     Classification of Proteins Database for the Investigation of Sequences and
     Structures. J. Mol. Biol. 1995, 247:536-540.

 (7) Patrice Koehl, Protein Structure Similarities. Curr. Opin. Struct. Biol. 2001,
     11:348-353.

 (8) CE Combinatorial Extension <http://cl.sdsc.edu>, available to download from
     <ftp://ftp.sdsc.edu/pub/sdsc/biology/CE/src>.

 (9) DALI Distance Matrix Alignment <http://www2.ebi.ac.uk/dali>, available to
     download from <http://jura.ebi.ac.uk:8765/~holm/DaliLite>.

(10) KENOBI Alignment Using a Genetic Algorithm
     <http://sullivan.bu.edu/kenobi>, available to download from
     <http://www.columbia.edu/~ay1>.

(11) STRUCTAL Double Dynamic Programming
     <http://bioinfo.mbb.yale.edu/align/server.cgi>.

(12) Peter Rogen, Boris Fain. Automatic Classification of Protein Structure by Using
     Gauss Integrals. PNAS, Vol 100 (2003), no.1, 119-124.

(13) Peter Rogen, Henrik Bohr. A New Family of Global Protein Shape Descriptors.
     Math. Biosci. 182 (2003), 167-181.

(14) Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and
     Thornton, J. M. CATH: A Hierarchic Classification of Protein Domain Structures.
     Structure. Vol 5 (1997), No 8. 1093-1108.




(15) F. Brock Fuller, The Writhing Number of a Space Curve. Proc. Nat. Acad. Sci.
     USA, Vol. 68, No. 4 (1971), 815-819.

(16) F. Brock Fuller, Mathematical Problems in the Biological Sciences, Proceedings of
     Symposia in Applied Mathematics, ed. R. E. Bellman (American Mathematical
     Society, Providence) Vol. 14 (1962), 64-68.

(17) Peter Rogen, Robert Sinclair. Computing a New Family of Shape Descriptors for
     Protein Structures. J. Chem. Inf. Comput. Sci. 43 (2003), 1740-1747.

(18) White J. H., Self-Linking and the Gauss Integral in Higher Dimensions. Am. J.
     Math. 91 (1969), 693-727.

(19) G. Călugăreanu, Sur les Classes d'Isotopie des Noeuds Tridimensionnels et Leurs
     Invariants, Czechoslovak Mathematical Journal 11 (1961), 588-625.

(20) Lin, X-S, Wang, Z. Integral Geometry of Plane Curves and Knot Invariants. J.
     Differ. Geom. 44 (1996), 74-95.

(21) Yu. Aminov, Differential Geometry and Topology of Curves, Gordon and Breach
     Science Publishers (2000).

(22) Eric S. Lander, Michael Waterman, Calculating the Secrets of Life, National
     Research Council (1995).

(23) Levitt group Server, <http://www.stanford.edu/~bfain/>.

(24) E. Orlandini, M. C. Tesi, E. J. Janse van Rensburg, D. W. Sumners, S. G.
     Whittington, The Writhe of a Self-avoiding Polygon, J. Phys. A: Math. Gen. 26
     (1993), 981-986.

(25) E. Orlandini, S. G. Whittington, D. W. Sumners, M. C. Tesi, E. J. Janse van
     Rensburg, The Writhe of a Self-avoiding Path, J. Phys. A: Math. Gen. 27 (1994),
     333-338.

(26) Meivys Garcia, Emmanuel Ilangko, Stuart G. Whittington, The Writhe of Polygons
     on the Face-centered Cubic Lattice, J. Phys. A: Math. Gen. 32 (1999),
     4593-4600.

(27) Corinne Cerf, Andrzej Stasiak, A Topological Invariant to Predict the three-
     dimensional Writhe of Ideal Configurations of Knots and Links, PNAS Vol. 97
     (2000), 3795-3798.

(28) Pankaj K. Agarwal, Herbert Edelsbrunner, Yusu Wang, Computing the Writhing
     Number of a Polygonal Knot, SODA, (2002), 791-799.




(29) RNA World at IMB Jena: <http://www.imb-jena.de/RNA.html>.





A family of global protein shape descriptors using gauss integrals, christian laing

  • 1. The Florida State University College of Arts and Sciences A Family of Global Protein Shape Descriptors Using Gauss Integrals By Christian Edgar Laing Celestino A proposal submitted to the Department of Mathematics in partial fulfillment of the doctoral preliminary examination April 30, 2004
  • 2. Table of Contents Abstract …………………………………………… 2 1 Background and Significance …………………….. 4 1.1 CATH Protein Structure Classification …………………… 4 1.2 Current Methods and Importance of a New Approach ……. 5 1.3 The Writhing Number ….………………………………….. 7 1.3.1 Directional Writhing Number ….…………………….. 8 1.3.2 Natural Notion of the Writhing Number for Polygonal Curves …………………………………………………….. 10 1.4 Representing Proteins in R 20 ……………………………… 11 1.4.1 Results of the SGM when Tested for CATH 2.4 ……. 12 2 The Experimental Plan ……………………………. 14 2.1 Purpose and Objectives …………………………………… 14 2.2 Procedures ………………………………………………… 15 References ………………………………………… 17 1
  • 3. Abstract Within the field of biology, comparison, description and prediction of biological structures is an important task. In the case of proteins, it is of great interest to characterize and therefore classify these three dimensional structures. Protein structures can be classified in a variety of interrelated ways such as functional similarity, evolutionary similarity, and fold similarity. Two similar proteins can have different sequence information, but comparison of protein structures can show their distant evolutionary relationships that would not be evident by sequence information alone. Proteins also have three-dimensional structures that provide clues to their function in living organisms. Protein classification focuses on identifying proteins that have similar chemical architectures and topology. Because it is not practical to study in detail all the protein structures in every genome, the functional role of a new protein in the cell can be inferred from an already classified protein with similar structure. This is why it is important to develop new methods for 3D structures classification of proteins. Today, there is a great amount of protein information obtained from experimental methods such as X-ray Crystallography (1) and NMR Spectroscopy (2). The data is deposited into a resource of public domain, such as the Protein Data Bank (3). Structural information about proteins such as CATH (Class, Architecture, Topology and Homologous Superfamily, see 4-5) and SCOP (Structural Classification of Proteins, see 6) is also available in databases. However, some of the methods of classification are done by manual inspection. Because of the rapid increase in the number of known proteins (as of April 2004, 25,004 and growing by >450 per month (3)), a fully automatic method (using solely computer algorithms) is required. Currently there are several computer methods for structural comparison of proteins (7). 
Examples of these are CE (8), DALI (9), KENOBI (10), and STRUCTAL (11). Such methods are also in the public domain and in some cases the program itself is available for download. These structural comparison methods are based on computing a pairwise distance between the alpha carbon atoms of the protein, but such methods present several complications. First, these methods are high in computational cost because they require alignment between two molecules in order to compare poteins. Additionally, the measures that are used violate the triangle inequality ( d ( x, z ) ≤ d ( x, y ) + d ( y, z ) ). Consequently these computations have little meaning for proteins with large distance, that is, when their structures similarities are far apart. Because of these complications, the need of a better and different approach is required. Peter Rogen and Boris Fain in the group of Michael Levitt at Stanford University, have developed a new automatic classification of proteins using Gauss integrals. A vector 2
  • 4. of 20 numbers inspired by Vassiliev knot invariants to capture the topology of a protein (12), (13). Multiple combinations of a geometrical tool called “writhing number” gives these 20 numbers. This work is still in progress and it has shown good results when it was tested on a protein database known as CATH 2.4, correctly classifying 98.6% of the protein crystal structure data used. The authors leave an interesting point open (12): “While we have geometric interpretation of the writhing number we would like to understand the other generalized Gauss integrals used in this work”. We intend to investigate and answer this question. 3
  • 5. 1. Background and Significance Proteomics is the study of the full set of proteins encoded by a genome, and Structural Proteomics is a sub-area of Proteomics that studies the structure of proteins. So far, many genomes have been fully sequenced, including Yeast, Drosophila Melanogaster and Homo Sapiens. The full value of the sequence data will be realized when we assign the role of each protein in the cell, and this require a full set of tools for classification of proteins, computer databases like CATH, and sequence methods like DALI for example. 1.1 CATH Protein Structure Classification CATH is a hierarchical classification of protein domain structures in the Protein Data Bank (3) which clusters proteins at four levels: Class (C), Architecture (A), Topology (T) and Homologous Superfamily (H). Such classification operates at the level of structural domains, as these domains are likely to be the fundamental evolutionary building blocks or units (5). When a protein has a similarity to another protein already in the database, then the new protein inherits the domain boundaries of the existing entry. If the new protein has no relative in the CATH database, three different algorithms (DETECTIVE, PUU and DOMAK) are used to identify the structural domain automatically. If all the programs agree, the domain boundaries are assigned. If not, then the domain boundaries are assigned manually based on the rules below (see also 14). The four levels of CATH are described and figure 1 shows the hierarchy for the C, A, and T levels. References for CATH can be found in (4), (5) and (14). • Class C level is assigned by considering the secondary structure and packing within the structure. Four classes are recognized: mainly alpha, mainly beta, alpha-beta and the fourth class, which contains protein domains that have low secondary structure and content. 
The correspondence of a protein to its class is of more than 90% of protein structures are classified automatically, the rest are determined by hand. • Architecture A level describes the overall shape of secondary structures in three- dimensional space but ignores their connectivity. Although an automatic procedure is being developed, it is currently assigned manually using a basic description of the secondary arrangements (e.g. roll, sandwich). • Topology T level groups structures into fold families depending on the shape and connectivity of the secondary structures. This fold group is also related to protein domains that show a similarity in structure but have no sequence similarity. The assignments are made by sequence and structure comparison (a SSAP score greater than 70 is required) (5). 4
  • 6. Homologous Superfamily H level groups into domains that are thought to share a common ancestor (homologous families) for either having sequence similarity (35%) or high structural similarity (20%). Structural similarity is done by an automatic method (SSAP>80). Figure 1. Hierarchy of CATH at C, A and T levels. From reference (4). 1.2 Current Methods and Importance of a New Approach In order to find similarity between 3D protein structures in the crystal state, scientists have built a wide variety of protein structure alignment methods and techniques such as distance matrix alignment (9), genetic algorithms (10) and double dynamic programming (11). The general idea is to consider the protein backbone of two proteins as two chains, A and B in the three dimensional space, and to find sub-chains α and β of A and B respectively, such that the lengths of the sub chains α and β are equal and maximal with the property that α and β are similar (see figure 2). 5
  • 7. The most common parameter that expresses the difference between two proteins is RMSD or root mean square deviation. RMSD can be computed using the position of the alpha carbon atoms of the protein backbone and is a function of the distance between atoms in one structure and the same atoms in another structure. Because of the nature of these methods, we encounter some complications: • A protein structure can contain several hundreds of atoms, therefore finding such alignments may be high in computational cost. A structural comparison method needs to be fast. • As discussed in the introduction, these methods fail to satisfy the triangle inequality. Indeed, if we consider three proteins made of the following sequences: protein A=DEF-LMN, protein B=GHI-LMN and protein C=GHI-OPQ. Then there is a similarity between protein A and B in the LMN region, and also there is a similarity between protein B and C in the GHI region. However, we cannot infer a similarity between A and C (see figure 3). The triangle inequality is violated because it does not satisfy d ( A, C ) ≤ d ( A, B) + d ( B, C ) . When this occurs, we are unable to judge dissimilarity and the problem worsens with increasing distance. • In order to compute such measures, the methods require a series of adjustable parameters such as gap and insertion penalties, weights, etc. Figure 2. Two chains in three Figure 3. Failure of triangle inequality dimensional space. From reference (12). These complications lead to the search of a better, more efficient and fully automated method. The protein backbone is a space curve, and mathematicians study such curves in areas such as Knot Theory and Differential Geometry, we wish to apply these mathematical techniques to the protein classification problem. 6
  • 8. 1.3 The Writhing Number We start with the concepts of linking number and the twist. These two numbers, together with the writhing number are all related in a simple formula. These concepts were obtained from (15) and (16). A strip (C,U) is a smooth1 curve C together with a smoothly varying unit vector U(t) perpendicular to C at each point. Definition 1. If C1 (t1 ) and C2 (t2 ) are two disjoint oriented closed curves in space parametrized by [0,1], the linking number is defined by the integral 1 (C1 (t1 ) − C 2 (t 2 )) ⋅ (∂C1 / ∂t1 × ∂C 2 / ∂t 2 ) Lk (C1 , C 2 ) = 4π ∫∫ C1 C2 | C1 (t1 ) − C 2 (t 2 ) |3 dt1 dt 2 The linking number is an integer that measures the entanglement between two curves. Examples of the linking number are shown on figure 4 below, notice that figure 4c shows an example of two curves that are entangled, however the linking number is zero. Figure 4. From reference (16). For any simple closed strip, the curves C + εU given parametrically C (t ) + εU (t ) are, for sufficiently small ε > 0 , simple closed curves disjoint from C, and the linking number Lk(C, C + εU ) is defined and independent of ε . The vectors C ' (t), U(t) and V (t ) = C ' (t ) × U (t ) define a moving frame (C ' ,U ,V ) along C. Let Ω denote the angular 1 A curve C is smooth if is infinitely differentiable. 7
  • 9. velocity vector describing the rate of rotation of the frame with respect to the arclength t, so that c' = Ω × C ' , µ = Ω × U and ν = Ω × V . Let ω1 , ω 2 and ω 3 be the components of Ω referred to the moving frame, i. e., Ω = ω1C '+ω 2U + ω 3V . Then ω1 represents the angular rate at which U revolves around C. ω1 is called the twist of the strip at each point of the curve. Definition 2. The total twist number Tw(C,U), is defined by the integral of ω1 with respect to the arclength t over the curve C and divided by 2π . That is 1 2π ∫ Tw(C ,U ) = ω1 dt . The total twist number need not be an integer and if the curve C is a simple plane curve then the linking number Lk (C , C + U ) and the total twist number Tw(C,U) are equal. Definition 3. The difference Wr (C ) = Lk (C , C + U ) − Tw(C ,U ) is a geometric invariant of the curve C and is called the writhing number. 1.3.1 Directional Writhing Number Definition 4. A smooth simple closed curve C and a fixed unit vector σ are said to be in general position if the tangents to C are never parallel to σ . In this case the curves C + εσ are disjoint from C for all sufficiently small ε > 0 , hence for such ε we can may define the directional writhing number of C in the direction of σ by Wr (C , σ ) = Lk (C , C + εσ ) . If C and σ are in general position, the orthogonal projection of C onto a plane with normal σ defines a smooth closed plane curve Cσ for which undercrossings and overcrossings can be distinguished at each crossing point (see figure 5 below). At a crossing point c of an oriented regular diagram for a curve, we have two possible configurations. Either sign(c)=+1 or sign(c)= – 1 as shown on figure 5. The sign of a crossing number is based on the right hand rule convention. Figure 5. If one adds all the signed crossing numbers for a fixed regular projection of a curve for a direction σ , one obtains the directional writhing number Wr (C , σ ) . The 8
  • 10. writhing number Wr of a curve C is equal to the average of the directional writhing number over all projections, the average is taken with respect to the area on the unit sphere. Figure 6 shows examples of regular projections of two knots, for the oriented projection of the trefoil knot (left) we have the projected writhing number is 3 while for the oriented projection of the figure eight knot (right), is 0. Figure 6 The writhing number Wr of a closed space curve γ can be calculated using generalized Gauss integrals. 1 4π γ ×∫∫D Wr (γ ) = w(t1 , t 2 )dt1 dt 2 , γ where [γ ' (t1 ), γ (t1 ) − γ (t 2 ), γ ' (t 2 )] w(t1 , t 2 ) = | γ (t1 ) − γ (t 2 ) |3 and D is the diagonal of γ × γ . The numerator of w(t1 , t 2 ) is the triple scalar product, [γ ' (t1 ), γ (t1 ) − γ (t 2 ), γ ' (t 2 )] = γ ' (t1 ) ⋅ {[γ (t1 ) − γ (t 2 )] × γ ' (t 2 )} . The triple scalar product is also equal to the oriented volume of the parallelepiped spanned by γ ' (t1 ), γ (t1 ) − γ (t 2 ) , and γ ' (t 2 ) . Thus w(t1 , t 2 ) = w(t 2 , t1 ) . Assuming that γ is parametrized by [0,1] it suffices to calculate the integral on the domain ∆2 = {(t1 , t 2 );0 < t1 < t 2 < 1} . If I (1, 2 ) = ∫ w(t1 , t 2 )dt1 dt 2 then: ∆2 1 Wr (γ ) = I (1, 2 ) 2π Another measure for curves is the average crossing number and is defined by taking the absolute value of the integrand: I |1, 2| (γ ) = ∫ | w(t1 , t 2 ) | dt1 dt 2 ∆2 9
  • 11. The main difference between the projection of a knot and space curves (representing protein backbones) is that for knots we deal with simple closed curves, while for protein backbones we have polygonal curves which are not closed. 1.3.2 Natural Notion of the Writhing Number for Polygonal Curves For a polygonal curve the natural definition of writhing number is: I (1, 2) (γ ) = Wr (γ ) = ∑W (i , i 0< i1 1 2 ), < i2 < N with i1 +1 i2 +1 1 W (i1 , i2 ) = 2π ∫ ∫ w(t , t t1 =i1 t 2 =i2 1 2 )dt1 dt 2 . and w(t1 , t 2 ) = [γ ' (t1 ), γ (t1 ) − γ (t 2 ), γ ' (t 2 )] / | γ (t1 ) − γ (t 2 ) |3 . Here W (i1 , i2 ) is the contribution to the writhing number coming from the i1 th and the i2 th line segments. W (i1 , i2 ) is equal to the probability from an arbitrary direction to see the i1 th and the i2 th line segment cross, multiplied by the sign of this crossing. Thus, geometrically this notion of writhe number is still the projected writhing number averaged over all projections. By combining this number we can make a whole set of structural measures, e.g. I |1, 2| (γ ) = ∑ | W (i , i 0<i1 1 2 ) |, < i2 < N I |1,3|( 2, 4 ) (γ ) = ∑ | W (i , i ) | W (i , i 0<i1 <i2 1 3 2 4 ), <i3 <i4 < N I |1,5|( 2, 4 )(3,6 ) (γ ) = ∑ | W (i , i ) | W (i , i 0<i1 <i2 < i3 1 5 2 4 )W (i3 , i6 ) <i4 <i5 <i6 < N where N is the number of vertices of the polygonal curve. Numbers like the ones just mentioned will constitute the building blocks for our protein domain descriptors, which described in the next section. 10
  • 12. 1.4 Representing Proteins in R^20

As mentioned before, the protein backbone is a space curve (see Figure 7 below). We are interested in absolute measures of the geometry of these curves, obtained by studying the self-crossings seen in a planar projection. These measures are inspired by the generalized Gauss integrals involved in formulas for the Vassiliev knot invariants.

Figure 7. Backbone curve of Lysozyme from Gallus gallus, from (3).

For each protein domain in CATH 2.4, we have a geometric invariant of the polygonal curve connecting the α-carbon atoms. Each domain is assigned a 20-dimensional vector containing the following measures:

I_(1,2), I_|1,2|, I_(1,3)(2,4), I_(1,2)(3,4), I_(1,4)(2,3), I_(1,2)(3,4)(5,6), I_(1,2)(3,5)(4,6), I_(1,2)(3,6)(4,5), I_(1,3)(2,4)(5,6), I_(1,3)(2,5)(4,6), I_(1,3)(2,6)(4,5), I_(1,4)(2,3)(5,6), I_(1,4)(2,5)(3,6), I_(1,4)(2,6)(3,5), I_(1,5)(2,3)(4,6), I_(1,5)(2,4)(3,6), I_(1,5)(2,6)(3,4), I_(1,6)(2,3)(4,5), I_(1,6)(2,4)(3,5), and I_(1,6)(2,5)(3,4).

The measures are normalized such that each value is between -1 and 1. The normalization factors are one over 146, 1277, 119, 101 023, 1206, 477 989, 6612, 23 946, 6448, 203, 1884, 54 581, 172, 258, 1246, 293, 1396, 36 143, 442, and 2468, respectively, for the measures in the order above. Once each protein chain is mapped onto a point in the 20-dimensional space, the usual Euclidean metric is used to compare the protein chains.
  • 13.

\[ d(x, y) = \sqrt{\sum_{i=1}^{20} (x_i - y_i)^2} \]

Because it is built on the scaling factors given above, this metric is called the Scaled Gauss Metric (SGM).

1.4.1 Results of the SGM when Tested for CATH 2.4

Let x, y and z be points in R^20. Then the Scaled Gauss Metric satisfies the three properties of a pseudometric:

i) d(x, y) = 0 if x = y
ii) d(x, y) = d(y, x) (symmetry)
iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

The fact that the SGM satisfies the triangle inequality is important because it allows us to judge dissimilarity between proteins. A computer algorithm (12,13,17) based on this metric was written to classify all 20,937 domains of CATH 2.4 as of September 2002. The total success rate was 98.6%. The remaining 1.4% of the chains were reported as unknown; of these, 0.9% are actually new folds. The algorithm made no mistakes, since unknown structures were flagged instead of being misclassified. Also, proteins of different sizes can be compared directly without the use of alignment or gap penalties.

Figure 8 shows a projection map from R^20 to R^2 that displays the CATH hierarchy. Here, every point represents a protein domain in CATH. As described by the authors (12), the rectangle in the upper left contains all the chains in CATH, colored according to their class (α, β, αβ and few secondary structures). Notice that the αβ group resides between the α and the β groups. This observation shows the agreement between the automatic classification created by the SGM and the assignment currently given in the CATH database.

Figure 9 shows the usefulness of the second order invariants. In this example the curves A and B possess the same crossing number and average crossing number. However, the second order invariants can differentiate between the two curves.
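The scaling and comparison step of the SGM can be sketched as follows. This is a minimal illustration, assuming the 20 raw Gauss integral measures of each domain have already been computed in the order listed in section 1.4; the names `SGM_FACTORS`, `scale` and `sgm_distance` are hypothetical, and only the normalization constants come from the text.

```python
import math

# Normalization constants from section 1.4, in the order of the 20 measures.
# Each raw measure is multiplied by one over the corresponding constant.
SGM_FACTORS = (146, 1277, 119, 101023, 1206, 477989, 6612, 23946, 6448, 203,
               1884, 54581, 172, 258, 1246, 293, 1396, 36143, 442, 2468)


def scale(raw):
    """Map 20 raw measures to the scaled point in R^20."""
    if len(raw) != len(SGM_FACTORS):
        raise ValueError("expected 20 measures")
    return [value / factor for value, factor in zip(raw, SGM_FACTORS)]


def sgm_distance(raw_x, raw_y):
    """Euclidean distance between the scaled descriptor vectors."""
    return math.dist(scale(raw_x), scale(raw_y))
```

Because the distance depends only on the two 20-dimensional points, chains of different lengths are compared without any alignment step, which is the property highlighted above.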
  • 14. Figure 8. From reference (12).

Figure 9. From reference (12).
  • 15. 2. The Experimental Plan

2.1 Purpose and Objectives

The SGM is elegant, fast and computationally viable, and the excellent results shown in the previous section motivate one to understand the true geometric meaning of such measures. As mentioned before, the geometric idea behind all these measures is still not fully understood (12-13). While there is a geometric interpretation of the writhing number (I_(1,2)) and the average crossing number (I_|1,2|), the meaning of the higher order measures is still a mystery.

Another important question worth investigating is whether it is possible to classify protein structure domains with fewer Gauss measures (described in 1.4), whether some of the measures are strongly correlated or carry more information than others, and whether the combination of measures used can be improved. Finally, it might be plausible to apply this method to the classification of RNA secondary structures.

In this research I intend, with the support of my advisor, De Witt Sumners, to complete the following objectives:

I) Determine the geometric meaning of the higher order invariants obtained from the Gauss integral measures. Such work will validate the importance of the role of these numbers and corroborate the excellent results obtained from experimental evidence.

II) Optimize the choice of the invariant numbers used to classify the protein structures. Selecting the best shape descriptors, and the minimum number of them necessary, will increase the speed and efficiency of the computer algorithms that classify the protein structures.

III) Study the mathematical ideas involved in these numbers and their possible applications to branches of mathematics such as Knot Theory and Differential Geometry.

IV) Explore the possibility of applying these methods to the classification of RNA secondary structures. Since an RNA secondary structure can be seen as a chain or a polygonal curve, an approach to this unexplored topic could result in promising new applications of mathematics in biology.

The research questions are as follows:
  • 16. Are the numbers obtained from the higher order writhe calculations truly shape descriptors of space curves? Or are they just numbers chosen by chance that work only for very particular curves? The answer to these questions will unveil the true geometric meaning of these higher order invariants. This is fundamental to validating the automatic computer classification method for novel protein structure domains.

2.2 Procedures

The research will be based on mathematics and on biology as described below.

To begin with, we will review the older literature related to the writhing number, such as the work by J. H. White (18), G. Călugăreanu (19), and Brock Fuller (15-16), as well as the newer literature that also focuses on the concept of the writhing number for open and closed curves (20-28). A study of the proofs and the methods used to solve the primary cases should provide clues for solving the general case of the higher order invariants.

Another fundamental source of information is a review of current computer algorithms designed to calculate the writhing number, particularly those applied to fields such as biology and physics (27). Some of these computer algorithms are in the public domain and can be downloaded (28). An algorithm to compute the writhing number is essential to understand and to verify the geometric ideas.

Using Monte Carlo simulations, we intend to estimate the writhing number of a polygonal curve with n edges in the simple cubic lattice. The advantage of using a simple cubic lattice is that, for a closed curve, the writhing number computation reduces to the average of the linking numbers of the given curve with four of its pushoffs (24). The next step would be to study the higher order invariants on this simple cubic lattice.

To verify the simulation results we would like to consider some examples. We will first consider simple cases where we know the answer, and then we will apply these methods to polygonal curves describing the backbones of some protein crystals. Such data can be obtained from the Protein Data Bank (3).

Finally, we would like to apply this method to RNA secondary structures. A ribonucleic acid (RNA) molecule consists of a chain of ribonucleotides linked together by covalent chemical bonds (29). Figure 10 shows a model of an RNA structure obtained from the Protein Data Bank. We notice that RNA structures, like the one in Figure 10, can be seen as a chain that bends and twines about itself. Such self-crossings are of particular interest because the Gauss measures, designed to describe the shape of proteins, can be applied to these chains.
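As one concrete instance of a simple case where the answer is known, the probabilistic interpretation of W(i1, i2) from section 1.3.2 can be checked by direct sampling. The sketch below is an assumption-laden illustration and not the lattice pushoff method of (24): it draws random viewing directions, tests whether two segments appear to cross in each projection, and multiplies the observed frequency by the sign of the triple product in the integrand w. The function names are hypothetical.

```python
import random


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _det(u, v, w):
    """Scalar triple product u . (v x w)."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            + u[1] * (v[2] * w[0] - v[0] * w[2])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def crossing_seen(p1, p2, p3, p4, d):
    """True if segments p1->p2 and p3->p4 cross when viewed along d.

    Solves p1 + s*a + lam*d = p3 + u*c by Cramer's rule; the projections
    cross exactly when both s and u lie strictly between 0 and 1.
    """
    a, c, b = _sub(p2, p1), _sub(p4, p3), _sub(p3, p1)
    den = _det(a, c, d)
    if den == 0.0:
        return False  # degenerate direction, probability zero
    s = _det(b, c, d) / den
    u = -_det(a, b, d) / den
    return 0.0 < s < 1.0 and 0.0 < u < 1.0


def mc_pair_writhe(p1, p2, p3, p4, samples=20000, seed=0):
    """Monte Carlo estimate of W for one pair of segments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # A Gaussian vector has a uniformly distributed direction; its length
        # is irrelevant because s and u are ratios of terms linear in d.
        d = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        hits += crossing_seen(p1, p2, p3, p4, d)
    # Sign convention of the integrand w: [a, x - y, c] with x on the first
    # segment and y on the second; it is constant for two straight segments.
    t = _det(_sub(p2, p1), _sub(p1, p3), _sub(p4, p3))
    sign = (t > 0) - (t < 0)
    return sign * hits / samples
```

For the two perpendicular segments from (-1, 0, 0) to (1, 0, 0) and from (0, -1, 1) to (0, 1, 1), the directions from which a crossing is seen cover two opposite faces of a cube viewed from its centre, so the estimate should approach 1/3 as the number of samples grows.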
  • 17. With these approaches we expect to understand the geometric meaning of these higher order invariants.

Figure 10. Pseudoknot within the gene 32 messenger RNA of Bacteriophage T2. Image obtained from the Protein Data Bank (3).
  • 18. References

(1) Gale Rhodes. Crystallography Made Crystal Clear. Academic Press, 2000, Second Edition.
(2) Joseph P. Hornak. The Basics of NMR. <http://www.cis.rit.edu/htbooks/nmr/>.
(3) Protein Databank. Available from <http://beta.rcsb.org/pdb/>.
(4) CATH Protein Structure Classification. <http://www.biochem.ucl.ac.uk/bsm/cath/>.
(5) Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P., Thornton, J. M. and Orengo, C. A. Assigning Genomic Sequences to CATH. Nucleic Acids Research. Vol 28 (2000), No 1, 277-282.
(6) Murzin A. G., Brenner S. E., Hubbard T., Chothia C. SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247 (1995), 536-540.
(7) Patrice Koehl. Protein Structure Similarities. Curr. Opin. Struct. Biol. 11 (2001), 348-353.
(8) CE Combinatorial Extension <http://cl.sdsc.edu>, available to download from <ftp://ftp.sdsc.edu/pub/sdsc/biology/CE/src>.
(9) DALI Distance Matrix Alignment <http://www2.ebi.ac.uk/dali>, available to download from <http://jura.ebi.ac.uk:8765/~holm/DaliLite>.
(10) KENOBI Alignment Using a Genetic Algorithm <http://sullivan.bu.edu/kenobi>, available to download from <http://www.columbia.edu/~ay1>.
(11) STRUCTAL Double Dynamic Programming <http://bioinfo.mbb.yale.edu/align/server.cgi>.
(12) Peter Rogen, Boris Fain. Automatic Classification of Protein Structure by Using Gauss Integrals. PNAS, Vol 100 (2003), No 1, 119-124.
(13) Peter Rogen, Henrik Bohr. A New Family of Global Protein Shape Descriptors. Math Biosc 182 (2003), 167-181.
(14) Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M. CATH: A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5 (1997), No 8, 1093-1108.
  • 19.
(15) F. Brock Fuller. The Writhing Number of a Space Curve. Proc. Nat. Acad. Sci. USA, Vol. 68 (1971), No. 4, 815-819.
(16) F. Brock Fuller. Mathematical Problems in the Biological Sciences, Proceedings of Symposia in Applied Mathematics, ed. R. E. Bellman (American Mathematical Society, Providence), Vol. 14 (1962), 64-68.
(17) Peter Rogen, Robert Sinclair. Computing a New Family of Shape Descriptors for Protein Structures. J. Chem. Inf. Comput. Sci. 43 (2003), 1740-1747.
(18) J. H. White. Self-Linking and the Gauss Integral in Higher Dimensions. Am. J. Math. 91 (1969), 693-727.
(19) G. Călugăreanu. Sur les Classes d'Isotopie des Noeuds Tridimensionnels et Leurs Invariants. Czechoslovak Mathematical Journal 11 (1961), 588-625.
(20) Lin, X-S., Wang, Z. Integral Geometry of Plane Curves and Knot Invariants. J. Differ. Geom. 44 (1996), 74-95.
(21) Yu. Aminov. Differential Geometry and Topology of Curves. Gordon and Breach Science Publishers (2000).
(22) Eric S. Lander, Michael Waterman. Calculating the Secrets of Life. National Research Council (1995).
(23) Levitt Group Server, <http://www.stanford.edu/~bfain/>.
(24) E. Orlandini, M. C. Tesi, E. J. Janse van Rensburg, D. W. Sumners, S. G. Whittington. The Writhe of a Self-avoiding Polygon. J. Phys. A: Math. Gen. 26 (1993), 981-986.
(25) E. Orlandini, S. G. Whittington, D. W. Sumners, M. C. Tesi, E. J. Janse van Rensburg. The Writhe of a Self-avoiding Path. J. Phys. A: Math. Gen. 27 (1994), 333-338.
(26) Meivys Garcia, Emmanuel Ilangko, Stuart G. Whittington. The Writhe of Polygons on the Face-centered Cubic Lattice. J. Phys. A: Math. Gen. 32 (1999), 4593-4600.
(27) Corinne Cerf, Andrzej Stasiak. A Topological Invariant to Predict the Three-dimensional Writhe of Ideal Configurations of Knots and Links. PNAS Vol. 97 (2000), 3795-3798.
(28) Pankaj K. Agarwal, Herbert Edelsbrunner, Yusu Wang. Computing the Writhing Number of a Polygonal Knot. SODA (2002), 791-799.
  • 20.
(29) RNA World at IMB Jena: <http://www.imb-jena.de/RNA.html>.