This document summarizes work on protein structure prediction using threading and context-specific alignment potentials. It introduces the problem of predicting protein structure for distant homologs using threading approaches. The work presents a solution that models protein alignment as a conditional probability using a context-specific conditional neural field (CNF) model incorporating both local and global alignment information. Evaluation on 1000 test cases showed improved accuracy over HHpred, an established threading approach, demonstrating the effectiveness of the proposed context-specific alignment potential.
Protein threading using context specific alignment potential ismb-2013
1. Protein Threading Using Context-Specific Alignment Potential
Sheng Wang
http://raptorx.uchicago.edu
Toyota Technological Institute at Chicago,
Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu
ISMB 2013
Jul 22, ICC Berlin, Germany
2. Outline
• Where we are @ template-based modeling
• What’s our work
• What’s the problem
• What’s our solution
• Welcome to our server
3. Template-based Modeling (or, Threading)
• Observation
– ~50,000 non-redundant structures in PDB
– ~ 1,200 unique structure folds (SCOP)
• Methodology
– Use known structures to predict a new one
Query sequence: DDVYILDQAEEG
Template sequences (from database):
DE-FIVD-PDEH
SPCKR---ADEG
E--IFVDQADDS
NMCVFGQWERTY
4. Template-based Modeling Procedures
Easy: similar sequences → similar structures
Sequence-based method, e.g., BLAST, FASTA
Works only for close homologs (>70% sequence identity)
Medium: similar profiles → similar structures
Protein profile is a matrix that represents a multiple sequence
alignment of the similar proteins
Profile-based method, e.g., PSI-BLAST, HMMER, HHpred
Works for relatively remote homologs (>40% sequence identity)
Challenge: dissimilar profiles → similar structures
Add structural or context-specific information to sequence/profile-based methods
Threading method, e.g., MUSTER, RAPTOR, CS-BLAST
Works for distant homologs (<40% sequence identity)
5. Our Work
• CNFpred: Transforms the template-sequence alignment problem into a machine learning problem that calculates the alignment’s probability.
• DeepAlign: Prepares high-quality structural alignments as training data.
• CNF model: A combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).
6. Protein Alignment Model
Template: L P L S - E
Sequence: S A - L R Q
States:   M M Is M It M
Match state (M)
Insertion at sequence (Is)
Insertion at template (It)
The structural alignment generated by DeepAlign is used for training data
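The labeling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the convention that a gap in the sequence row is labeled Is while a gap in the template row is labeled It is an assumption read off the example alignment on this slide.

```python
def alignment_states(template_row, sequence_row):
    """Label each alignment column as M, Is, or It.

    Assumed convention (taken from the slide's example): a gap in the
    sequence row is labeled Is, a gap in the template row is labeled It.
    """
    if len(template_row) != len(sequence_row):
        raise ValueError("alignment rows must have equal length")
    states = []
    for t, s in zip(template_row, sequence_row):
        if t == "-" and s == "-":
            raise ValueError("a column cannot contain two gaps")
        if s == "-":
            states.append("Is")
        elif t == "-":
            states.append("It")
        else:
            states.append("M")
    return states


def transitions(states):
    """Return the adjacent (previous, current) state pairs along the path."""
    return list(zip(states, states[1:]))


# The example alignment from the slide.
print(alignment_states("LPLS-E", "SA-LRQ"))
```

With three states, each adjacent pair returned by `transitions` is one of nine possible state transitions.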
7. DeepAlign for Structure Alignment
• evolutionary information
• local sub-structure similarity
• angular similarity for hydrogen bonding
BLOSUM is the local amino acid substitution matrix;
CLESUM is the local sub-structure substitution matrix;
v(i,j) measures the angular similarity for hydrogen bonding;
d(i,j) measures the spatial proximity of two aligned residues.
Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)
(slide annotations on the formula: local similarity, global similarity)
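As a small worked example, the scoring formula can be written directly as a function. The function name and the numeric inputs are hypothetical; in practice the four values would come from the BLOSUM and CLESUM matrices and from the structures being aligned.

```python
def deepalign_score(blosum_ij, clesum_ij, v_ij, d_ij):
    """Score(i,j) = (max(0, BLOSUM(i,j)) + CLESUM(i,j)) * v(i,j) * d(i,j)."""
    return (max(0.0, blosum_ij) + clesum_ij) * v_ij * d_ij


# A negative BLOSUM entry is clipped to zero before CLESUM is added.
score = deepalign_score(blosum_ij=-2, clesum_ij=3, v_ij=0.5, d_ij=0.8)
```

Because v(i,j) and d(i,j) multiply the whole sum, a residue pair with poor angular similarity or poor spatial proximity scores low even when its substitution scores are high.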
8. CNF-based Alignment Model
Given an alignment $A = \{a_1, a_2, \ldots, a_L\}$, $a_i \in \{M, I_t, I_s\}$
Define a context-specific conditional probability between Sequence S and Template T
$$P(A \mid S, T) = \exp\Big(\sum_{i} E(a_i, a_{i-1}, S, T)\Big) / Z(S, T)$$
Where,
E: a neural network estimating the log-likelihood of state transition
Z(S,T): normalization factor
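To make the role of the normalization factor concrete, here is a minimal sketch that computes the log-probability of one alignment path under a linear-chain model, using a forward recursion for Z(S,T). It is not the CNFpred implementation: the callable `E(state, prev_state, i, j)` is a hypothetical stand-in for the neural network, and the bookkeeping of which state consumes a template or a sequence residue is an assumption.

```python
import math

STATES = ("M", "Is", "It")
# Assumed bookkeeping: (template residues consumed, sequence residues consumed).
STEP = {"M": (1, 1), "Is": (1, 0), "It": (0, 1)}


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(path, E):
    """Sum of E over the transitions of one alignment path."""
    i = j = 0
    prev = "START"
    total = 0.0
    for state in path:
        di, dj = STEP[state]
        i, j = i + di, j + dj
        total += E(state, prev, i, j)
        prev = state
    return total


def log_partition(m, n, E):
    """log Z: log-sum of exp(path score) over all alignments of m x n residues."""
    forward = {(0, 0, "START"): 0.0}
    for i in range(m + 1):
        for j in range(n + 1):
            for state in STATES:
                di, dj = STEP[state]
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                terms = [forward[(pi, pj, prev)] + E(state, prev, i, j)
                         for prev in ("START",) + STATES
                         if (pi, pj, prev) in forward]
                if terms:
                    forward[(i, j, state)] = logsumexp(terms)
    return logsumexp([forward[(m, n, s)] for s in STATES if (m, n, s) in forward])


def log_probability(path, m, n, E):
    """log P(A | S, T) for a path that consumes exactly m and n residues."""
    return path_score(path, E) - log_partition(m, n, E)
```

With a constant E of zero every alignment is equally likely, so Z reduces to the number of possible alignment paths, which is a convenient sanity check.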
9. Comprehensive Features
Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--
• EAA: how similar the two residues are
• Esp, Epp: how similar the query’s sequence and profile are to the template’s profile
• Ess3, Ess8: how similar the template’s secondary structure is to the sequence’s predicted secondary structure (3-class and 8-class)
• Esa: how similar the query’s solvent accessibility is to the template’s solvent accessibility
• Ediso: for disordered regions, no structure information is used
Total scoring function is a non-linear combination of:
E( ai, ai-1, EAA , Esp , Epp , Ediso, Ess3 , Ess8 , Esa )
10. What’s the problem?
• Only the alignment probability is described, not the log-odds potential relative to a background.
• Only local information is incorporated; global information is missing.
11. Our solution
Propose a protein alignment potential
• With an elaborately designed reference state.
• Can be generalized into sequence-sequence,
sequence-structure as well as structure-structure
alignment.
Incorporate both local and global terms
• For local term, CNFpred potential is applied.
• For global term, EPAD potential is employed.
12. Protein alignment potential
Given 2 AAs a and b, their mutation potential is defined as follows.
$$u(a, b) = \log \frac{P(a, b)}{P_{ref}(a, b)} = \log \frac{P(a, b)}{P(a)\,P(b)}$$
Similarly, given one alignment A between sequence S and template T, we define the potential of A as follows.
$$u(A \mid S, T) = \log \frac{P(A \mid S, T)}{P_{ref}(A)} = \log \frac{P(A \mid S, T)}{\frac{1}{N}\sum_{i=1}^{N} P(A \mid x, y)}$$
x and y are two random proteins with the same lengths as S and T, respectively; the sum runs over N sampled pairs (x, y).
Assumption: the alignment maximizing the potential is the optimal.
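The mutation potential for two amino acids is a log-odds ratio of the joint frequency to the product of the individual frequencies, and can be illustrated with a toy calculation. The function name and input format are hypothetical, and the counts are made up for illustration; a real substitution matrix would be estimated from a large set of trusted alignments.

```python
import math
from collections import Counter


def mutation_potential(aligned_pairs):
    """Return u(a, b) = log( P(a, b) / (P(a) * P(b)) ) for observed pairs.

    aligned_pairs: list of (a, b) residue pairs taken from alignment columns.
    P(a, b) is the pair frequency; P(a) is the residue frequency over both rows.
    """
    total = len(aligned_pairs)
    joint = Counter(aligned_pairs)
    single = Counter()
    for a, b in aligned_pairs:
        single[a] += 1
        single[b] += 1
    potentials = {}
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = single[a] / (2 * total)
        p_b = single[b] / (2 * total)
        potentials[(a, b)] = math.log(p_ab / (p_a * p_b))
    return potentials


# Toy data: A always pairs with A, and B always pairs with B.
toy = mutation_potential([("A", "A"), ("B", "B")])
```

A positive value means the pair is observed more often than expected by chance under the reference state, which is the same reading that the alignment potential gives to a whole alignment.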
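13. Protein alignment potential
The alignment probability given sequence S and template T could be modeled as follows,
$$P(A \mid S, T) = \exp\big(F(A \mid S, T) + G(A \mid S, T)\big) / Z(S, T)$$
where F(A|S,T) is the local term, G(A|S,T) is the global term, and Z(S,T) is the partition function that sums the unnormalized weight over all alignments:
$$Z(S, T) = \sum_{A} \exp\big(F(A \mid S, T) + G(A \mid S, T)\big)$$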
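15. Model the local potential
From CNFpred, we use a context-specific linear chain model as,
$$F(A \mid S, T) = \sum_{i} E(a_i, a_{i-1}, S, T)$$
The local potential is defined as,
$$U_{local}(A \mid S, T) = F(A \mid S, T) - EXP_{x,y}\, F(A \mid x, y)$$
The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is
$$U_{local}(A \mid S, T) = \sum_{i} \big( E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1}) \big)$$
where E(a_i, a_{i-1}) denotes the sampled expectation of the transition score.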
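The sampling step for the local potential can be sketched as below. This is an illustration under assumptions, not the authors' code: the function names and the `(state, prev_state, score)` tuple format are hypothetical, and the background is taken to be the mean transition score per state pair over sampled random protein pairs.

```python
from collections import defaultdict
from statistics import fmean


def background_expectation(samples):
    """Mean transition score per (state, prev_state) over sampled protein pairs.

    samples: iterable of (state, prev_state, score) tuples.
    """
    buckets = defaultdict(list)
    for state, prev, score in samples:
        buckets[(state, prev)].append(score)
    return {key: fmean(scores) for key, scores in buckets.items()}


def local_potential(path_scores, background):
    """Sum over the alignment of (context-specific score - background score).

    path_scores: list of (state, prev_state, score) along one alignment.
    """
    return sum(score - background[(state, prev)]
               for state, prev, score in path_scores)


# Made-up numbers for illustration only.
bg = background_expectation([("M", "M", 1.0), ("M", "M", 3.0), ("Is", "M", -1.0)])
u_local = local_potential([("M", "M", 2.5), ("Is", "M", -0.5)], bg)
```

Subtracting the background means that a transition contributes only to the extent that it scores better in this specific context than it does on random protein pairs.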
16. What’s the difference between maximizing on probability and maximizing on potential?
Maximize on probability (template-sequence alignment)
Long but less informative, with many false positives.
Good for building models.
Maximize on potential (template-sequence alignment)
Short but relevant and highly significant.
Good for ranking templates.
17. Model the global potential
From EPAD, we use a context-specific distance-dependent model as,
$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
The global potential is defined as,
$$U_{global}(A \mid S, T) = G(A \mid S, T) - EXP_{x,y}\, G(A \mid x, y)$$
The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is
$$U_{global}(A \mid S, T) = \sum_{i,j} \big( \log P(d^{T}_{ij} \mid s_i, s_j) - \log P(d^{T}_{ij}) \big)$$
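A minimal sketch of the global term follows: for every pair of aligned sequence residues it compares the context-specific probability of the template distance with a background probability of that distance. The two probability callables are hypothetical stand-ins; in the method described here they would come from the EPAD model and from residue pairs sampled from templates.

```python
import math


def global_potential(aligned_pairs, cond_prob, background_prob):
    """Sum of log P(d | s_i, s_j) - log P(d) over aligned residue pairs.

    aligned_pairs: list of (i, j, d) where d is the distance between the
        template residues aligned to sequence residues i and j.
    cond_prob(i, j, d): context-specific probability of distance d.
    background_prob(d): background probability of distance d.
    """
    return sum(math.log(cond_prob(i, j, d)) - math.log(background_prob(d))
               for i, j, d in aligned_pairs)


# Made-up constant probabilities, for illustration only.
u_global = global_potential(
    [(0, 5, 8.0), (2, 9, 12.5)],
    cond_prob=lambda i, j, d: 0.2,
    background_prob=lambda d: 0.1,
)
```

Each pair contributes positively when the template distance is more likely for these two sequence residues than it is in general.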
18. What’s global information given an alignment?
$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
[Figure: residues s_i and s_j of Sequence S are aligned to positions i and j of Template T, whose distance is d^T_ij.]
If the alignment is good, the distance of a sequence residue pair shall match well with that of their aligned template residue pair.
20. Welcome to our server
http://raptorx.uchicago.edu/
Binding
Contact
21. Thank you
Jinbo Xu
Feng Zhao
Jianzhu Ma
National Institutes of Health (R01GM0897532)
National Science Foundation (DBI-0960390)
NSF CAREER award CCF-1149811
Alfred P. Sloan Research Fellowship
Editor's Notes
Currently, template-based modeling is the mainstream approach in protein structure prediction. It is based on the observation that although PDB holds around 50,000 non-redundant structures, SCOP contains only about 1,200 unique structure folds. Most importantly, few new unique folds have appeared since 2010, which implies that the number of naturally occurring protein folds is limited. This leads to the fundamental assumption that known structures can be used to predict the structure of an unknown query sequence. More formally, template-based modeling is defined as follows: given a query protein's one-dimensional amino acid sequence and a template database of known three-dimensional structures, we align the query to each template to find the best match and build the query model upon that template.
Here we move to the first part: how to define the labels for protein alignment data. In detail, we convert an alignment path into a series of consecutive labels drawn from three states: M, Is and It. There are therefore nine adjacent state transitions in total. Having defined the labels, we apply DeepAlign to generate the training data from structurally similar proteins.