This document summarizes a presentation on predicting protein functional sites using a shortest-path graph kernel method. The presentation introduces the problem of predicting functional sites on proteins, describes a graph-based approach to represent protein structures, and presents results applying a shortest-path graph kernel and nearest neighbor prediction methods to datasets of catalytic sites and phosphorylation sites. The approach achieved up to 77.1% accuracy on the catalytic site dataset. Future work could include adding more parameters to the graph representations and node labels, improving the method as a web service, and optimizing algorithms for large datasets.
Protein functional site prediction using the shortest path graph
1. PROTEIN FUNCTIONAL SITE PREDICTION USING THE SHORTEST-PATH GRAPH KERNEL METHOD
Presented by: Malinda Sanjaka
Major Advisor: Dr. Changhui Yan
Graduate Committee Members:
Dr. Juan (Jen) Li
Dr. Jun Kong
Dr. Nan Yu
Date: 04/22/2013
3. Problem Statement
Problem: prediction of functional sites on protein structures
What are functional sites?
A functional site is the small portion of a protein where substrate molecules bind and undergo a chemical reaction.
Example:
[Figure: protein 3D structure with a phosphorylation site highlighted]
4. Problem Statement(2)
Importance of Functional Site Prediction
To understand protein functions
To support structure-based drug design
To design new proteins
7. Introduction(2)
Protein Functional Sites
Catalytic active site atlas
Phosphorylation Site
DNA-binding Site
Zinc-binding Site
Phosphorylation: addition of a phosphate to an amino acid
The functional sites are the small portion of a protein where substrate molecules bind and undergo a chemical reaction.
8. Introduction(3)
Laboratory Methods for Functional Site Determination
X-ray Crystallography
Nuclear Magnetic Resonance (NMR)
Challenges
Time-consuming
High cost
Not applicable to some proteins
Require skilled professionals
9. Introduction(4)
The Need for Computational Methods
Structural Genomics (SG) projects reveal a large number of protein structures but provide little understanding of protein function.
Advantages
Low cost
Short execution time
Lower environmental impact
Results can be refined by repeated runs
Reusable
Can be run as simulations
Fewer human mistakes
Disadvantage
Accuracy is lower than that of laboratory experiments
Computational methods provide helpful guidelines for experimental approaches
10. Introduction(5)
Computational Methods for Functional Site Prediction
Template-based
Identify a structurally similar template
Align the target with the template
Predict functional groups
Microenvironment-based
Focus on a single residue or position
Use structural and physicochemical properties
Supervised machine learning approaches
Macroenvironment-based
A local structural region is involved
Protein-protein interaction
Structure-based drug design
DNA-binding sites and ligand-binding sites
11. Introduction(6)
Overview of Our Approach
We used graphs to represent each residue with its contacting neighbors in a protein structure.
[Figure: two example graphs, one with a functional (+) central residue and one with a non-functional (-) central residue, each surrounded by its contacting residues. Each residue consists of a number of atoms.]
14. Materials and Methods
Datasets
How to get the protein structures
Download: [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
How to get the protein sequences
PDB Database: [ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt].
PDB ID and Chain ID: 101m_A
FASTA Format: >101m_A mol:protein length:154 MYOGLOBIN
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKH
15. Materials and Methods(2)
Catalytic Binding Site (CSA)
[http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl]
73 Protein Chains
201 Active Catalytic Sites
20398 Non-Active Residues
Balanced Dataset
201 Active Catalytic Sites
201 Non-Active Residues
Phosphorylation Site
Section 3.3.4 of this paper
[http://www.informatics.indiana.edu/predrag/publications.htm].
679 Protein Chains
2062 Active Phosphorylation Site Residues
139795 Non-Active Residues
Balanced Dataset
2062 Active Phosphorylation Site Residues
2062 Non-Active Residues
16. Materials and Methods(3)
Graph Representation
Definition
A graph G = <V, E>
V is the set of vertices (nodes) and E is the set of edges (arcs)
A path in G is a sequence of vertices <v0, v1, v2, ..., vn>
Directed Graph
Undirected Graph
Adjacency Matrix
[Figure: example graph with labeled nodes and weighted edges]
17. Materials and Methods(4)
Graph Representation Contd.
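Node
A residue, labeled with its PSSM (position-specific scoring matrix) values, which capture the biological conservation of the amino acid
PSSMs are computed from the protein sequences with blast-2.2.25+ against the NR database
Edge
Connects two contacting residues; edge (arc) weight = 1
Contact calculation
Distance $d_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$ between Residue1.Atom1 and Residue2.Atom1, using the <x, y, z> coordinates from the PDB file
$R_1$, $R_2$: VDW radius of each atom (van der Waals, VDW.radii file)
Two residues are in contact if $d_1 \le R_1 + R_2 + 0.5$ for at least one pair of their atoms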
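The contact rule on this slide can be sketched in Python as follows. This is a minimal illustration rather than the presenter's code: the function names `atom_distance` and `residues_in_contact`, and the layout of a residue as a list of (coordinates, radius) pairs, are assumptions made for the example.

```python
import math

def atom_distance(a, b):
    """Euclidean distance between two (x, y, z) coordinate tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def residues_in_contact(res1, res2, tolerance=0.5):
    """Return True if any atom pair satisfies d <= R1 + R2 + tolerance.

    Each residue is a list of (coords, vdw_radius) pairs (hypothetical layout).
    """
    for coords1, r1 in res1:
        for coords2, r2 in res2:
            if atom_distance(coords1, coords2) <= r1 + r2 + tolerance:
                return True
    return False

# Toy residues with one atom each; radii 1.65 (N) and 1.87 (CA) as in the VDW.radii file.
res_a = [((0.0, 0.0, 0.0), 1.65)]
res_b = [((3.0, 0.0, 0.0), 1.87)]  # distance 3.0 <= 1.65 + 1.87 + 0.5
res_c = [((9.0, 0.0, 0.0), 1.87)]  # distance 9.0, too far apart
print(residues_in_contact(res_a, res_b))  # True
print(residues_in_contact(res_a, res_c))  # False
```

Stopping at the first contacting atom pair mirrors the rule that one contacting pair is enough to connect two residues with an edge.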
18. Materials and Methods(6)
Shortest-path graph Kernel
What is a kernel?
Put simply, a kernel is a matrix
A x A = <v1, ..., vn, v1, ..., vn> = matrix elements
What is a graph kernel?
Uses graphs instead of vectors
What is the shortest-path graph kernel?
Compares each pair of nodes using the shortest path between them
[Figure: kernel matrix over vectors v1 ... vn and kernel matrix over graphs g1 ... gn]
19. Materials and Methods(7)
Shortest-Path Graph Kernel Contd.
The original graphs G1 and G2 are converted into shortest-path graphs S1 (V1, E1) and S2 (V2, E2)
The Floyd-Warshall algorithm
The kernel function is used to calculate the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2.
Calculation
$K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$
where $k_{edge}(\cdot)$ is a kernel function for comparing two edges
[Figure: edge e1 between nodes v1 and w1; edge e2 between nodes v2 and w2]
20. Materials and Methods(8)
Shortest-Path Graph Kernel Contd.
Let e1 be the edge between nodes v1 and w1, and e2 be the edge between nodes v2 and w2. Then,
$k_{edge}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{weight}(e_1, e_2) \cdot k_{node}(w_1, w_2)$
where $k_{node}(\cdot)$ is a kernel function for comparing the labels of two nodes, and $k_{weight}(\cdot)$ is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt et al. (2005):
$k_{node}(v, w) = \exp\left(-\frac{\|labels(v) - labels(w)\|^2}{2\sigma^2}\right)$
where labels(v) returns the vector of attributes associated with node v. Note that $k_{node}(\cdot)$ is a Gaussian kernel function. $\frac{1}{2\sigma^2}$ was set to 72 by trying different values between 32 and 128 with increments of 2.
$k_{weight}(e_1, e_2) = \max(0, c - |weight(e_1) - weight(e_2)|)$
where weight(e) returns the weight of edge e. $k_{weight}(\cdot)$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2 as in Borgwardt et al. (2005).
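The kernel formulas on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the presenter's implementation: an edge of a shortest-path graph is represented as a hypothetical `(label_v, label_w, weight)` tuple, `gamma` stands for the factor multiplying the squared distance in the Gaussian kernel (set to 72 on the slide), and the toy labels have two dimensions instead of real PSSM rows.

```python
import math

def k_node(label_v, label_w, gamma=72.0):
    """Gaussian kernel on two node label vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(label_v, label_w))
    return math.exp(-gamma * sq_dist)

def k_weight(w1, w2, c=2.0):
    """Brownian bridge kernel on two edge weights (shortest-path lengths)."""
    return max(0.0, c - abs(w1 - w2))

def k_edge(e1, e2):
    """Compare two edges, each given as (label_v, label_w, weight)."""
    v1, w1, len1 = e1
    v2, w2, len2 = e2
    return k_node(v1, v2) * k_weight(len1, len2) * k_node(w1, w2)

def graph_kernel(edges1, edges2):
    """K(G1, G2): sum of k_edge over all pairs of shortest-path edges."""
    return sum(k_edge(e1, e2) for e1 in edges1 for e2 in edges2)

# Toy shortest-path graphs: identical labels, path lengths 1 versus 1 and 3.
s1 = [((0.0, 0.0), (1.0, 0.0), 1.0)]
s2 = [((0.0, 0.0), (1.0, 0.0), 1.0), ((0.0, 0.0), (1.0, 0.0), 3.0)]
print(graph_kernel(s1, s2))  # 2.0
```

In the toy example the first edge pair has identical labels and lengths and contributes c = 2, while the second pair differs in length by 2 and contributes 0.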
[Figure: edge e1 = 1 between nodes v1 and w1 and edge e2 = 1 between nodes v2 and w2; the nodes are labeled with the PSSM vectors Pssm1 to Pssm4]
21. Materials and Methods(9)
Prediction Methods
Nearest Neighbor Algorithm
Classify a new example x by finding the training example <xi, yi> that is nearest to x according to Euclidean distance:
NNM_MAX
NNM_AVE
NNM_TOP10AVE
[Figure: a test-set graph with unknown label (?) is compared by similarity with the positive (functional/active) and negative (non-functional/non-active) graphs of the experimentally verified training set]
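The slides name three nearest neighbor variants but do not spell out their scoring rules. The sketch below shows one plausible reading, given purely as an assumption: each variant summarizes the kernel similarities between a test graph and a group of training graphs (maximum, average, or average of the ten highest), and the test graph takes the label of the group with the higher score. The function names and the decision rule are hypothetical.

```python
def nnm_max(similarities):
    """Score a group by its single most similar training graph."""
    return max(similarities)

def nnm_ave(similarities):
    """Score a group by the average similarity over all its training graphs."""
    return sum(similarities) / len(similarities)

def nnm_top10ave(similarities, top=10):
    """Score a group by the average of its `top` highest similarities."""
    best = sorted(similarities, reverse=True)[:top]
    return sum(best) / len(best)

def predict(sim_to_pos, sim_to_neg, score=nnm_max):
    """Assumed rule: label the test graph by the higher-scoring group."""
    return "positive" if score(sim_to_pos) > score(sim_to_neg) else "negative"

# Similarities of one test graph to three positive and three negative graphs.
sim_to_pos = [0.9, 0.2, 0.4]
sim_to_neg = [0.5, 0.6, 0.7]
print(predict(sim_to_pos, sim_to_neg, nnm_max))  # positive
print(predict(sim_to_pos, sim_to_neg, nnm_ave))  # negative
```

The example shows that the variants can disagree: one very similar positive neighbor wins under the maximum rule but not under the average rule.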
22. Materials and Methods(10)
K-fold Cross-Validation
Leave-One-Out Cross-Validation
Evaluation of Predictors
26. Results and Discussion(2)
Percentile Ranking
Used full dataset
Ordered list
Position ranking
The majority of functional sites fall below the 10th percentile
NNM_MAX
NNM_AVE
NNM_TOP10AVE
30. Conclusions
We developed an innovative graph method to represent the protein surface based on how amino acid residues contact each other.
We implemented a shortest-path graph kernel method and used it to compute the similarity between graphs.
We developed three nearest neighbor variants to make predictions on both datasets based on the similarity matrix that the graph kernel method produced.
The predictors were able to predict catalytic sites with accuracy up to 77.1%.
This work showed that the proposed methods were able to capture the similarity between enzyme catalytic sites and would provide a useful tool for catalytic site prediction.
32. Future Work
Add more parameters to the labels (graphs, nodes)
Improve the program as a web service
Work with other kernel methods, such as Minimum Spanning Tree, etc.
Optimize the algorithms for large datasets
33. Acknowledgements
I would like to express my deep gratitude to my adviser, Dr. Changhui Yan, for his continuous encouragement, guidance, and support in completing this paper successfully.
My sincere thanks also go to my committee members, Dr. Juan (Jen) Li, Dr. Jun Kong, and Dr. Nan Yu, for their willingness to serve as committee members.
37. Importance of Functional Site Prediction
Understanding Protein Functionalities
Reveal the Structural Protein
Drug Design
Design New Proteins
38. Rationale for Understanding Protein Structure and
Function
Protein sequence
-large numbers of
sequences, including
whole genomes
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
?
structure determination
structure prediction
homology
rational mutagenesis
biochemical analysis
model studies
Protein structure
- three dimensional
- complicated
38
42. Graph
A graph G=<V, E>
V vertices (nodes) and E edges (arcs)
A path in G is a sequence of vertices <v0, v1, v2, ..., vn>
Directed Graph
Undirected Graph
43. Adjacency Matrix
The adjacency matrix of a simple graph is a matrix with rows and columns labeled by the graph vertices
1 = Adjacent
0 = Not Adjacent
0s on the diagonal
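As a small illustration of the adjacency matrix described above, the following Python sketch builds the matrix of an undirected simple graph from an edge list. The function name `adjacency_matrix` and the zero-based vertex numbering are assumptions made for the example.

```python
def adjacency_matrix(n, edges):
    """Build the n x n adjacency matrix of an undirected simple graph."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:            # a simple graph has no self-loops: 0s on the diagonal
            matrix[i][j] = 1
            matrix[j][i] = 1  # undirected: the matrix is symmetric
    return matrix

# Path graph 0 - 1 - 2
m = adjacency_matrix(3, [(0, 1), (1, 2)])
for row in m:
    print(row)
# [0, 1, 0]
# [1, 0, 1]
# [0, 1, 0]
```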
44. Shortest Distance Path Algorithm
Used in communications, transportation, electronics, and
bioinformatics problems.
The all-pairs shortest-path problem involves finding the
shortest path between all pairs of vertices in a graph.
$A_{ij} = 1$ if there is an edge $(v_i, v_j)$; otherwise, $A_{ij} = 0$
45. Percentile Ranking
There is no single standard definition for percentile calculation
Ordered List
Position Ranking
Max, Ave, Top10
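A position-based percentile can be sketched in Python as follows. The sorting direction is an assumption made here: scores are ordered from highest to lowest, so the best-scoring items land in the lowest percentiles. The function name `percentile_ranks` is hypothetical.

```python
def percentile_ranks(scores):
    """Map each item index to its position in the ordered list, as a percentage."""
    # Assumption: highest score first, so strong candidates get low percentiles.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = len(scores)
    return {idx: 100.0 * (pos + 1) / total for pos, idx in enumerate(order)}

scores = [0.2, 0.9, 0.5, 0.7]
print(percentile_ranks(scores))  # {1: 25.0, 3: 50.0, 2: 75.0, 0: 100.0}
```

The same function can be applied to the Max, Ave, or Top10 scores to obtain one ranking per nearest neighbor variant.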
46. Method And Material
Data Gathering
Identify the Active Residues
Balance Dataset
Generating a Map File
Generate Set of Graphs
Development of Graph Kernel
46
47. Data Gathering
Catalytic Binding Site (CSA)
http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl
EC1, EC2…EC6
HTML
Regular Expression
Finding Large Single Group
Selected EC 3.4
73 Protein chains
201 Active Catalytic Sites
20398 Non-Active Residues
48. Data Gathering..
Phosphorylation Site
Section 3.3.4 of This Paper
[http://www.informatics.indiana.edu/predrag/publications.htm].
679 protein chains
2062 Active Phosphorylation Site Residues
139795 Non-Active Residues
49. Identify the Active Residues
Catalytic Binding Site (CSA)
CSA Annotation –Database(CSA_2_2_12.dat)
[ http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Download.pl]
251777 Records
List of Active Residue(201)
Phosphorylation Site
[http://www.informatics.indiana.edu/predrag/publications.htm]
List of Active Residue(2062)
49
50. Balance Dataset
Computation Time
Leave-One-Out Cross-Validation
Random Selection
Catalytic Binding Site (CSA)
-Active 201 , Non Active 201
Phosphorylation Site
-Active 2062, Non Active 2062
50
51. Generating a Map File
Map with Protein PDB ID with Protein Sequences
Atomic Solvent Accessible Area Calculations (RASA)
Position-Specific Scoring Matrix Calculations (PSSM)
Active Residues
51
52. Map Protein PDB IDs to Protein Sequences
PDB ID and Chain ID
101m_A
PDB Database
[ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt].
FASTA Format
>101m_A mol:protein length:154 MYOGLOBIN
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKA
SEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GNFGADAQGAMNKALELFRKDIAAKYKELGYQG
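The mapping from a PDB ID and chain ID to a sequence can be sketched in Python as follows. This is an illustrative sketch, not the presenter's code: it assumes headers of the form `>101m_A mol:protein length:154 MYOGLOBIN`, and the function name `parse_seqres` is hypothetical.

```python
def parse_seqres(text):
    """Return {(pdb_id, chain_id): sequence} for the protein entries in the text."""
    records = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            entry = line[1:].split()[0]             # e.g. "101m_A"
            pdb_id, chain_id = entry.split("_", 1)
            # Keep protein chains only; skip other molecule types.
            key = (pdb_id, chain_id) if "mol:protein" in line else None
            if key is not None:
                records[key] = ""
        elif key is not None:
            records[key] += line                    # sequences may span several lines
    return records

sample = """>101m_A mol:protein length:154  MYOGLOBIN
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKA
"""
records = parse_seqres(sample)
print(records[("101m", "A")][:10])  # MVLSEGEWQL
```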
53. Atomic Solvent Accessible Area
Calculations (RASA)
Calculate the Solvent Accessible Area (RASA) of each
Protein
Naccess V2.11 Program
– Linux/Unix systems /Cygwin
– [http://www.bioinf.manchester.ac.uk/naccess/]
– ./naccess 1a91.pdb & ./naccess 1afo.pdb & ./naccess 1aig.pdb
PDB DATA Bank –PDB File
– [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
ncbi-blast-2.2.24+
RASA >0
53
54. Position-Specific Scoring Matrix
Calculations (PSSM)
Download PDB Files
blast-2.2.25+ Program
– Microsoft Windows
NR Database (non-redundant protein sequence)
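using System.Diagnostics;

// Run PSI-BLAST on one query sequence against the NR database and write the PSSM as ASCII.
// FileNameIN is the query file path and FileNameOUT is the output PSSM path.
Process p = new Process();
p.StartInfo.UseShellExecute = false;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.FileName = @"C:\blast-2.2.25+\bin\psiblast.exe";
p.StartInfo.Arguments = "-query " + FileNameIN +
    @" -db C:\blast-2.2.25+\db\nr -num_iterations 2 -out_ascii_pssm " + FileNameOUT;
p.Start();
Example: sample record of a .PSSM file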
1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J
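A row like the sample record above can be turned into a node label with a short Python sketch. The layout assumed here is the usual ASCII PSSM row: position, residue letter, 20 log-odds scores, then 20 percentages and two trailing values. The function name `parse_pssm_row` and the choice to keep only the 20 scores as the label are assumptions made for illustration.

```python
def parse_pssm_row(line):
    """Split one ASCII PSSM row into (position, residue, 20 log-odds scores)."""
    fields = line.split()
    position = int(fields[0])
    residue = fields[1]
    scores = [int(x) for x in fields[2:22]]  # first 20 numeric columns
    return position, residue, scores

row = ("1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 "
       "77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J")
pos, aa, scores = parse_pssm_row(row)
print(pos, aa, len(scores), scores[0])  # 1 A 20 5
```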
55. Sample Mapping File
>1neg_A
Seq :
KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLAAAWSHPQF
SUR :
11101011111111111111111111111111111111011111111011110111111111111
Site :
00000000000000000000000000000000000000000000000000000000000010000
rASA
:115.47,81.22,64.82,.00,20.59,.00,41.60,111.13,56.32,14.17,124.18,35.41,127.39,43.03,111.84,1
60.37,10.00,.71,33.57,1.82,120.20,91.83,15.89,41.40,69.81,.77,20.31,2.22,49.44,65.40,30.56,97
.39,80.11,152.72,75.17,80.10,47.20,64.49,.00,57.09,16.33,101.38,111.31,104.16,71.57,2.73,60.8
4,.00,18.67,8.04,64.07,71.08,.00,125.10,66.68,24.97,32.49,79.86,65.19,179.94,87.62,51.01,109.
35,145.21,71.53,
entropy
:0.80,0.85,0.25,0.92,0.44,1.48,1.02,2.42,1.57,2.01,0.44,0.93,0.49,0.73,0.73,0.83,1.72,1.46,0.59,
2.15,0.72,0.98,1.99,1.65,0.60,1.20,0.35,0.94,0.66,0.65,0.51,0.23,1.04,0.45,1.09,4.74,3.91,0.67,1
.38,0.61,0.45,0.75,1.43,0.49,0.36,2.32,0.72,1.63,3.17,0.46,1.53,2.78,1.61,0.38,0.45,0.26,0.15,0.
51,0.17,0.38,0.47,0.46,0.93,2.04,1.73,
pdbindex
:6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,
38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68
,69,70,
56. Generate Set of Graphs
Shortest Distance Path (Dijkstra Theory)
Adjacency Matrix Theory
Contacting Neighbor Residues
Labeled
Weighted
Various Numbers of Nodes and Edges
Graph Normalization
– Linear Normalization: X1 = (X - Min) / (Max - Min)
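The linear normalization formula above can be sketched in Python as follows. The function name `min_max_normalize` and the handling of a constant list (all zeros) are assumptions made for the example.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant list maps to zeros to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([5, -3, 1]))  # [1.0, 0.0, 0.5]
```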
57. Calculate Distance between Atoms and Check the Contacting
$d_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$
PDB File
VDW
(van der Waals-VDW.radii file)
$d_1 \le R_1 + R_2 + 0.5$
Example of a contact residue
2 A _ 3 A! : 1.33441
Example of a non-contact residue
4 A _ 2 A : 4.14432
59. Development of Graph Kernel
Original G1 and G2 graph converted into
shortest-path graphs S1 (V1, E1) and S2 (V2, E2)
The Floyd-Warshall algorithm
The kernel function is used to calculate the
similarity between G1 and G2 by comparing
all pairs of edges between S1 and S2.
59
60. The Floyd-Warshall Algorithm
for i = 1 to N
    for j = 1 to N
        if there is an edge from i to j
            dist[0][i][j] = the length of the edge from i to j
        else
            dist[0][i][j] = INFINITY
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            dist[k][i][j] = min(dist[k-1][i][j], dist[k-1][i][k] + dist[k-1][k][j])
Finds the shortest paths between all pairs of vertices $v \in V$ of a weighted graph $G = (V, E)$.
$D^{(k)}_{ij}$ = the weight of the shortest path from vertex i to vertex j for which all intermediate vertices are in the set {1, 2, ..., k}
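The pseudocode above can be written as a short runnable Python sketch. It is an illustration under stated assumptions: the graph is given as a square matrix with `math.inf` for missing edges and 0 on the diagonal, vertices are numbered from 0, and a single 2D table is updated in place instead of keeping one table per value of k, which yields the same final distances.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest-path lengths for a weighted graph given as a matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not modified
    for k in range(n):                  # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = math.inf
graph = [
    [0, 1, INF],
    [1, 0, 1],
    [INF, 1, 0],
]
print(floyd_warshall(graph))  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

With every contact edge weighted 1, as in the residue graphs, the resulting entries are the numbers of edges on the shortest paths between residues.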
72. van der Waals-VDW.radii file
RESIDUE ATOM ALA 5
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
RESIDUE ATOM ARG 11
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.87 0
ATOM NE 1.65 1
ATOM CZ 1.76 0
ATOM NH1 1.65 1
ATOM NH2 1.65 1
RESIDUE ATOM ASP 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM OD1 1.40 1
ATOM OD2 1.40 1
RESIDUE ATOM ASN 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM OD1 1.40 1
ATOM ND2 1.65 1
RESIDUE ATOM CYS 6
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM SG 1.85 0
RESIDUE ATOM GLU 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.76 0
ATOM OE1 1.40 1
ATOM OE2 1.40 1
RESIDUE ATOM GLN 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.76 0
ATOM OE1 1.40 1
ATOM NE2 1.65 1
RESIDUE ATOM GLY 4
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
RESIDUE ATOM HIS 10
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM ND1 1.65 1
ATOM CD2 1.76 0
ATOM CE1 1.76 0
ATOM NE2 1.65 1
RESIDUE ATOM ILE 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG1 1.87 0
ATOM CG2 1.87 0
ATOM CD1 1.87 0
RESIDUE ATOM LEU 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD1 1.87 0
ATOM CD2 1.87 0
RESIDUE ATOM LYS 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.87 0
ATOM CE 1.87 0
ATOM NZ 1.50 1
Hi all, good morning. My research title is protein functional site prediction using the shortest-path graph kernel.
This is my presentation outline ... (read list)
The problem addressed by our research is determining the functional sites in a protein structure. What are functional sites? A functional site is a residue or group in a protein that actively participates in a biochemical reaction with another element such as magnesium, zinc, or a phosphate group. The picture shows an example of a functional site (a phosphorylation site) in purple. I will give more details about functional sites in the rest of my presentation.
Functional site prediction is important for several reasons: 1. It helps us understand protein functions. 2. Drug design companies can use protein functional information to design new structure-based drugs. 3. Protein engineers use functional site information to design new proteins based on well-identified functions.
The next section of my presentation is the introduction.
Proteins are very important molecules in biological cells. They are involved in virtually all cell functions, and each protein within the body has a specific role. A protein consists of one or more of the 20 amino acids shown in the following table; that is, a protein is a sequence of amino acids. Each amino acid in a protein sequence is commonly called a residue; in other words, each residue in a protein represents one amino acid. A protein also has a three-dimensional structure, which is used to identify the functionally important groups, in other words the active sites.
As I mentioned in the problem statement section, a functional site is a part of a protein that is involved in various biochemical reactions. A few examples of functional sites are shown here (read list), but in our approach we considered only two of them: the phosphorylation site and the catalytic active site. This picture shows an example of a reaction that happens at a phosphorylation site; it illustrates the addition or removal of a phosphate on a protein structure. Similarly, other functional sites are involved in various biochemical reactions.
Now we have a brief idea of why protein functional site prediction is important, so we need to know how to determine these functionally important groups in a protein. One popular way is to conduct laboratory experiments such as X-ray crystallography or NMR, but laboratory methods come with some challenges, which are shown in this list. A laboratory method may be time-consuming or require expensive equipment, and some proteins are not amenable to laboratory processes. For example, NMR requires the protein to be in a liquid state, but in reality not every protein can be brought into a liquid state. On the next slide I will explain the available alternatives.
A large number of structural genomics projects are working on determining protein structures, and the protein data banks already hold a large number of structures. The problem is the lack of knowledge about the function of those structures: in brief, we already have many protein structures without knowing what they do. Moreover, the gap between known protein structures and known functions grows day by day. An alternative is needed to narrow this gap, so computing professionals try to provide highly accurate methods as a solution. We can also identify some advantages of computational methods compared with laboratory methods, such as ... (read list). They also have a disadvantage: accuracy. Most of the time laboratory methods are more accurate than computational methods, but the information that computational methods discover about the functionally important groups in a protein can serve as a good guideline for laboratory researchers.
Nowadays a few kinds of computational methods are in use: template-based, microenvironment-based, and macroenvironment-based methods. Briefly, a template-based method finds a similar, experimentally verified template in a protein database and then uses an alignment method to determine the function of the target protein structure. A microenvironment-based method basically uses the nearest neighbor method to determine the unknown function of a target protein by comparing the structural and physicochemical properties of its neighbors. I will explain more about the nearest neighbor method in the next few slides. A macroenvironment-based method uses the same process as a microenvironment-based method; the only difference is that the number of neighboring residues considered is comparatively higher. In our approach we used a macroenvironment-based method to predict functional sites from a protein structure.
We proposed a graph-kernel-based computational approach. We generate a graph for each residue, which is either positive or negative and whose functionality has been experimentally verified. The number of nodes in a graph depends on the number of residues in contact with the central node of the graph, and the number of graphs equals the number of residues in the protein sequence. This is only an overview of our approach; I will explain the process in detail in the materials and methods section.
As mentioned in the previous slides, we generated a set of graphs, one per residue, and each residue is either positive or negative. The functionality of each residue has been verified by experimental methods, so the set of graphs in the training set can be considered a knowledge base consisting of functional-site graphs and non-functional-site graphs. We used two knowledge bases, one for catalytic site prediction and the other for phosphorylation site prediction. The knowledge base is used to calculate the similarity between each residue in the training set and a target residue, and we used the nearest neighbor method as the predictor of the proposed method. I will explain more about the similarity calculation and the prediction process in the materials and methods section.
The next section is materials and methods.
Materials and methods. We used the following databases to retrieve protein structures and sequences. 1. We used this link to download the PDB file of each protein. The PDB file provides information about the protein structure: the geometric coordinates of each atom of each residue. We used this information to check whether any two given residues in a protein are in contact with each other. 2. This link is used to download all protein sequences. It provides the sequences in FASTA format, so it is easy to map them to the PDB IDs of each dataset and retrieve the relevant protein sequences.
In our research we used two datasets: one of catalytic sites and the other of phosphorylation sites. This link is used to download the PDB IDs of the catalytic site proteins, which are mapped to the PDB database to get the relevant protein sequences, as mentioned on the previous slide. We selected a dataset of 73 protein sequences, each containing at least one catalytic site according to the information provided by the CSA.DAT database. The CSA.DAT database provides literature information on experimentally discovered catalytic active residues. We then mapped each residue in the protein sequences, through the residue's index number, to the CSA.DAT database to identify catalytic active residues and non-active residues. In the end we found 201 active residues and 20398 non-active residues. Such an imbalanced dataset is unsuitable for reasonable prediction, so we randomly selected a balanced dataset based on the number of active residues; our final balanced dataset contains 201 active (in other words, functional) residues and 201 non-active residues. We used this link to download the phosphorylation site data, which itself provides the information on phosphorylation active residues. The dataset contains 679 protein sequences. We followed the same process as for the catalytic site dataset: we used the PDB database to map the phosphorylation PDB IDs, then used the active residue list to find the active phosphorylation sites in each sequence. In the end we found 2062 active residues and 139795 non-active residues. This dataset is also too imbalanced for reasonable prediction, so we randomly selected a balanced dataset based on the number of active residues; the final balanced dataset contains 2062 active residues and 2062 non-active residues.
A graph can be defined by its vertices and edges. In our research we used undirected, labeled, and weighted graphs. A simple graph can also be represented by an adjacency matrix, and we used adjacency matrices to represent the contacts between the residues in a graph.
In our proposed method, we generate a set of graphs based on the contacts between residues. Each node in a graph represents a residue of the protein structure, which may be positive or negative, in other words a functional (active) site or a non-functional site. The nodes are labeled with PSSM values, which indicate the biological conservation of each amino acid. An edge is defined between residues that are in contact with each other. Finding the contacts between residues is a little complicated because each residue consists of a number of atoms, so we need to compare every atom in one residue with every atom in the other. If at least one atom in a residue is in contact with an atom in another residue, the two residues are considered to be in contact, and we create an edge between the two nodes. Each edge is weighted by the length between the two nodes; in our approach we always assume a length of 1. The information needed to calculate the distance between two residues is provided by the PDB files and the VDW file.
Put simply, a kernel is a matrix in which each element is the result of a vector product. A graph kernel is also a matrix, but it uses graphs instead of vectors: each element of the matrix is the similarity between two graphs, calculated by comparing each pair of nodes of both graphs. The shortest-path graph kernel is a graph kernel that calculates the similarity between two graphs based on the shortest paths between each pair of nodes in each graph.
We used the Floyd-Warshall algorithm to convert the original graphs into shortest-path graphs, and the shortest-path graph kernel is used to calculate the similarity between two graphs. Graph similarity is calculated by comparing each pair of nodes in both graphs, based on the label values of the nodes and the weights of the edges between each pair of nodes. This function is used to calculate the similarity between two graphs; e1 and e2 are two edges, each between a pair of nodes.
The nearest neighbor method classifies a target dataset using a training set, based on some property such as the distance to its neighbors. In our approach we used three nearest neighbor variants to classify the test set based on the graph similarity between the training set and the test set. The training dataset contains a set of functional-site graphs and a set of non-functional-site graphs, all verified by laboratory experiments. The test set always consists of a single graph, which is either functional or non-functional and is also verified by laboratory experiments.
There are two types of cross-validation. In k-fold cross-validation the whole dataset is divided into k parts; one part is used as the test dataset and the rest are used as the training set. When k equals the number of instances in the dataset, it becomes leave-one-out cross-validation. In our approach we used leave-one-out cross-validation for a better evaluation of the predictors; in other words, each time one graph is used as the test dataset while all the remaining graphs are used as the training set. However, we eliminated graphs of the same type from the same protein when using the predictors.
The next section is results and discussion.
Results and discussion. We used two balanced datasets, one for catalytic enzyme active sites and the other for phosphorylation sites. Both datasets consist of functional sites and non-functional sites that are laboratory verified. The catalytic enzyme active site dataset consists of 201 functional sites and the same number of non-active sites. We used the nearest neighbor method for classification, with three similarity-based variants: max, average, and top-10 average. Based on the resulting classification, we calculated the accuracy of our prediction method. The method showed its best performance, 77.1%, on the catalytic site dataset. When we calculated the same value for the phosphorylation site dataset, the best performance was 63.8%. In summary, our method performs better on the catalytic enzyme dataset than on the phosphorylation dataset.
To calculate the percentile data, we first sort the list by similarity value in ascending order and then divide the position of each element by the total number of elements in the list. In our approach we used the full dataset and calculated the percentiles for all the nearest neighbor variants, in other words the max, average, and top-10 average values. The results are shown on the next slide.
The results clearly show that most of the active sites fall in the group below the 10th percentile.
Opsitely no-active s
BLAST: Basic Local Alignment Search Tool. The basic BLAST algorithm can be implemented in DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences.
1. For template-based modeling (TBM) and fold recognition methods, a prediction model can be built based on the coordinates of the appropriate template(s) [1]. These approaches generally involve four steps: 1) a representative protein structure database is searched to identify a template that is structurally similar to the protein target; 2) an alignment between the target and the template is generated that should align equivalent residues together as in the case of a structural alignment; 3) a prediction structure of the target is built based on the alignment and the selected template structure, and 4) model quality evaluation. The first two steps significantly affect the quality of the final model prediction in TBM methods.
2. The main signature of residue microenvironment-based methods is the focus on a single residue or position in the structure and its surrounding environment. Usually, a set of structural, physicochemical and evolutionary properties are collected and encoded into a fixed-length vector. Sets of functional (positive) and non-functional (negative) residues are then incorporated into supervised machine learning approaches.
3. Most methods discussed in this paper focus on the prediction of enzyme active sites, co-factor binding sites, or post-translational modification sites, where a relatively compact local structural region is involved. However, a large group of algorithms and tools have been developed to identify particular classes of larger structural neighborhoods, e.g. surface patches, pockets, cavities or clefts, which provide interfaces to ligands or macromolecular partners. These methods are highly valuable because protein-protein interactions lie at the center of almost every cellular process and protein-DNA binding is essential for genetic activities. Similarly, accurate identification of ligand-binding sites is valuable in the context of structure-based drug design. Residue macroenvironment-based methods have been reviewed recently, thus we provide only a brief summary and refer authors to relevant publications where appropriate.
4. Based on the types of structural patterns they search for, graph-theoretic approaches can be used in any of the three main methodological groups (template, residue microenvironment, residue macroenvironment). However, these approaches represent a special category based on the distinct problem formulations and algorithmic approaches. Instead of using atomic coordinates directly, graph-theoretic methods start with transforming protein structures into graphs and then exploit various motif finders and graph similarity measures, combined with machine learning, to discover functional sites. Representative graph similarity measures involve subgraph enumeration, subgraph isomorphism, or identification of frequent subgraphs, although other measures, e.g. random walk-based scoring, can be applied as well.
NR Database: non-redundant protein sequence database.