[Talk]

Machine Learning Algorithms
for Protein Structure Prediction

Jianlin Cheng

Institute for Genomics and Bioinformatics
School of Information and Computer Sciences
University of California Irvine
2006

Outline
I. Introduction
II. 1D Prediction
III. 2D Prediction (Beta-Sheet Topology)
IV. 3D Prediction (Fold Recognition)
V. Publications and Bioinformatics Tools

Importance of Protein Structure
Prediction
AGCWY……

Cell

Sequence → Structure → Function

Four Levels of Protein Structure
Primary Structure (a directional sequence of amino acids/residues)

N C
…

Residue1 Residue2
Peptide bond
Secondary Structure (helix, strand, coil)

Alpha Helix Beta Strand / Sheet Coil

Four Levels of Protein Structure

Tertiary Structure Quaternary Structure (complex)

G Protein Complex

1D: Secondary Structure Prediction

MWLKKFGINLLIGQSV…
Helix

Neural Networks
Coil
+ Alignments

CCCCHHHHHCCCSSSSS…
Strand
Accuracy: 78%

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

1D: Solvent Accessibility Prediction
Exposed
MWLKKFGINLLIGQSV…

Neural Networks
+ Alignments

eeeeeeebbbbbbbbeeeebbb…
Buried
Accuracy: 79%


1D: Disordered Region Prediction Using Neural
Networks
MWLKKFGINLLIGQSV…
Disordered Region

1D-RNN

OOOOODDDDOOOOO…

93% TP at 5% FP
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
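
An operating point such as "93% TP at 5% FP" is obtained by thresholding per-residue scores. The sketch below shows that arithmetic, assuming each residue has a predicted disorder probability and a 0/1 label; the function name and the toy numbers are illustrative and are not taken from the talk.

```python
def tp_fp_rates(scores, labels, threshold):
    """Return (true-positive rate, false-positive rate) at a score threshold.

    scores: predicted probability of disorder per residue (toy values here)
    labels: 1 for a disordered residue, 0 for an ordered one
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Hypothetical scores for eight residues.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
tpr, fpr = tp_fp_rates(scores, labels, threshold=0.5)
print(tpr, fpr)
```

Raising the threshold lowers both rates, so a single pair of numbers describes one point on the trade-off curve.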

1D: Protein Domain Prediction Using Neural
Networks
MWLKKFGINLLIGQSV…
Boundary
+ SS and SA

1D-RNN

NNNNNNNBBBBBNNNN…
HIV capsid protein Inference/Cut
Domain 1 Domain 2 Domains
Top ab-initio domain predictor in CAFASP4

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.

1D: Predict Single-Site Mutation From Sequence
Using Support Vector Machine
Correlation = 0.76

Support
…MWLAVFILINLK… Vector
Machine

• First method to predict energy changes from sequence
accurately
• Useful for protein engineering, protein design, and
mutagenesis analysis
Cheng, Randall, and Baldi. Proteins, 2006

2D: Contact Map Prediction
3D Structure 2D Contact Map
[Figure: n × n contact map; rows i = 1 … n, columns j = 1 … n]
Distance Threshold = 8 Å
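
A contact map can be derived from 3D coordinates by thresholding pairwise distances. This is a minimal sketch, assuming one representative atom per residue (the slide does not say which atom is used) and toy coordinates; `contact_map` is an illustrative name.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary n x n map: 1 if two residues are closer than `threshold` angstroms."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy chain: four residues spaced 3.8 angstroms apart on a line (hypothetical data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
for row in cmap:
    print(row)
```

The map is symmetric, and the prediction task is the reverse direction: estimating these entries from sequence alone.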


2D: Disulfide Bond Prediction
Cysteine i
Support yes
2D-RNN
Vector
Machine

Disulfide Bond

Graph
Cysteine j Matching

[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2006
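
The graph matching step can be illustrated on a small case: given a bond score for every cysteine pair, choose the set of disjoint pairs with the largest total score. The sketch below uses exhaustive search, a stand-in that is only practical for a handful of cysteines and is not the weighted graph matching algorithm used in the cited work; the positions and scores are made up.

```python
def best_matching(nodes, weight):
    """Exhaustive maximum-weight matching; a node may stay unpaired.

    nodes:  sorted list of cysteine positions
    weight: dict mapping (i, j) with i < j to a predicted bond score
    """
    if len(nodes) < 2:
        return 0.0, []
    first, rest = nodes[0], nodes[1:]
    # Option 1: leave `first` unpaired.
    best_score, best_pairs = best_matching(rest, weight)
    # Option 2: pair `first` with each remaining node in turn.
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        score, pairs = best_matching(remaining, weight)
        score += weight[(first, partner)]
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + pairs
    return best_score, best_pairs


# Hypothetical bond scores for four cysteines.
weight = {
    (5, 12): 0.2, (5, 30): 0.9, (5, 41): 0.1,
    (12, 30): 0.3, (12, 41): 0.8, (30, 41): 0.4,
}
score, bonds = best_matching([5, 12, 30, 41], weight)
print(bonds)
```

Matching over the whole graph, rather than taking each cysteine's best partner independently, guarantees that no cysteine is assigned to two bonds.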

2D: Prediction of Beta-Sheet Topology
N terminus
Beta Sheet • Ab-Initio Structure
Prediction
• Fold Recognition
Beta
Strand
• Protein Design
• Protein Folding

Cheng and Baldi, Bioinformatics, 2005

C terminus
Beta Residue
Pair

An Example of Beta-Sheet Topology
Level 1
4 5

2 1 3 6 7

Structure of Beta Sheets
Protein 1VJG

Level 1 Level 2
4 5 Antiparallel

2 1 3 6 7
Parallel

Structure of Beta Sheets Strand
Protein 1VJG Strand Pair
Strand Alignment
Pairing Direction

Level 1 Level 2 Level 3
4 5 Antiparallel

H-bond
2 1 3 6 7
Parallel

Structure of Beta Sheets Strand Beta Residue
Protein 1VJG Strand Pair Residue Pair
Strand Alignment
Pairing Direction

Three-Stage Prediction of Beta-Sheets
• Stage 1
Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)

• Stage 2
Use beta-residue pairing probabilities to align beta-strands

• Stage 3
Predict beta-strand pairs and beta-sheet topology using graph algorithms

Stage 1: Prediction of Beta-Residue Pairings
Using 2D-Recursive Neural Networks
Input Matrix I (m×m) Output / Target Matrix (m×m)

Iij
(i,j)
2D-RNN
O = f(I)

i j Oij: Pairing Prob.
Tij: 0/1
…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….

20 for Residues 3 SS 2 SA
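
The input sizes on this slide add up to 20 + 3 + 2 = 25 values per residue position. The sketch below encodes one position that way, assuming one-hot encodings; the slide does not say whether the inputs are one-hot or profile probabilities, nor how the inputs for positions i and j are combined, so treat the details as assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types
SS_STATES = "HEC"                     # helix, strand, coil
SA_STATES = "eb"                      # exposed, buried


def encode_position(residue, ss, sa):
    """Return a 25-value vector: 20 for the residue, 3 for SS, 2 for SA."""
    vec = [1.0 if residue == a else 0.0 for a in AMINO_ACIDS]
    vec += [1.0 if ss == s else 0.0 for s in SS_STATES]
    vec += [1.0 if sa == s else 0.0 for s in SA_STATES]
    return vec


features = encode_position("A", "E", "b")
print(len(features))
```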

An Example (Target)
1 2 3 4 5 6 7

Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)

An Example (Target)
1 2 3 4 5 6 7
Antiparallel

Parallel

Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)

Stage 2: Beta-Strand Alignment
• Use output probability matrix as scoring matrix
• Dynamic programming
• Disallow gaps and use the simplified search algorithm
[Figure: antiparallel and parallel alignments of two strands of lengths m and n]

Total number of alignments = 2(m+n-1)

Strand Alignment and Pairing Matrix
• The alignment score is the
sum of the pairing
probabilities of the aligned
residues
• The best alignment is the
alignment with the
maximum score
• Strand Pairing Matrix
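
The gap-free search can be sketched directly: slide one strand along the other in both directions, sum the pairing probabilities of the aligned residues, and keep the maximum. The residue indices and probabilities below are toy values, and the function names are illustrative rather than taken from the original software.

```python
def ungapped_alignments(strand_a, strand_b):
    """Yield (direction, aligned_pairs) for every gap-free alignment."""
    m, n = len(strand_a), len(strand_b)
    for direction, b in (("parallel", strand_b), ("antiparallel", strand_b[::-1])):
        # m + n - 1 relative shifts keep at least one residue pair aligned.
        for shift in range(-(n - 1), m):
            pairs = [(strand_a[k], b[k - shift])
                     for k in range(m) if 0 <= k - shift < n]
            yield direction, pairs


def best_alignment(strand_a, strand_b, prob):
    """Return the alignment whose summed pairing probability is largest."""
    def score(pairs):
        return sum(prob.get(pair, 0.0) for pair in pairs)
    return max(ungapped_alignments(strand_a, strand_b),
               key=lambda item: score(item[1]))


# Two strands given as residue indices, with hypothetical pairing probabilities.
strand_a, strand_b = [10, 11, 12], [20, 21, 22, 23]
prob = {(10, 22): 0.9, (11, 21): 0.8, (12, 20): 0.7}
direction, pairs = best_alignment(strand_a, strand_b, prob)
print(direction, pairs)
```

For strands of lengths 3 and 4 the generator yields 2(3 + 4 - 1) = 12 candidates, so exhaustive enumeration is cheap.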

Strand Pairing Matrix of 1VJG

Stage 3: Prediction of Beta-Strand
Pairings and Beta-Sheet Topology

(a) Seven strands of protein 1VJG in sequence order

(b) Beta-sheet topology of protein 1VJG

Minimum Spanning Tree Like
Algorithm
Strand Pairing Graph (SPG)

(a) Complete SPG (b) True Weighted SPG
Strand Pairing Matrix

Goal: Find a set of connected subgraphs that maximize the
sum of the alignment scores and satisfy the constraints
Algorithm: Minimum Spanning Tree Like Algorithm

An Example of MST Like Algorithm
Strand Pairing Matrix of 1VJG

      1     2     3     4     5     6     7
1     0
2   1.3     0
3   .94   .37     0
4   .02   .02   .04     0
5   .02   .02   .03   1.9     0
6   .10   .05   .74   .04   .04     0
7   .02   .02   .03   .02   .02   .20     0

Step 1: Pair strand 4 and 5
[Figure builds: the following slides extend the topology diagram one strand at a time, showing 2 1, then 2 1 3, then strand 6, then strand 7, with the N and C termini marked]
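
A greedy procedure in the spirit of the example can be sketched as follows: visit strand pairs in order of decreasing score and accept a pair if both strands still have room for a partner. The constraints are not spelled out on the slides, so the limit of two partners per strand and the rule of stopping once every strand has a partner are assumptions made for this sketch.

```python
# Lower triangle of the strand pairing matrix of 1VJG shown above,
# keyed by (strand_i, strand_j) with i < j.
SCORES = {
    (1, 2): 1.3,
    (1, 3): 0.94, (2, 3): 0.37,
    (1, 4): 0.02, (2, 4): 0.02, (3, 4): 0.04,
    (1, 5): 0.02, (2, 5): 0.02, (3, 5): 0.03, (4, 5): 1.9,
    (1, 6): 0.10, (2, 6): 0.05, (3, 6): 0.74, (4, 6): 0.04, (5, 6): 0.04,
    (1, 7): 0.02, (2, 7): 0.02, (3, 7): 0.03, (4, 7): 0.02, (5, 7): 0.02,
    (6, 7): 0.20,
}


def greedy_strand_pairs(scores, n_strands, max_partners=2):
    """Accept pairs by decreasing score (assumed constraints, see lead-in)."""
    partners = {s: set() for s in range(1, n_strands + 1)}
    chosen = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _ in ranked:
        if all(partners.values()):  # every strand already has a partner
            break
        if len(partners[i]) >= max_partners or len(partners[j]) >= max_partners:
            continue
        partners[i].add(j)
        partners[j].add(i)
        chosen.append((i, j))
    return chosen


def sheets(pairs, n_strands):
    """Group strands into sheets (connected components of the chosen pairs)."""
    label = {s: s for s in range(1, n_strands + 1)}
    for i, j in pairs:
        old, new = label[j], label[i]
        for s in label:
            if label[s] == old:
                label[s] = new
    groups = {}
    for s, tag in label.items():
        groups.setdefault(tag, []).append(s)
    return sorted(groups.values())


pairs = greedy_strand_pairs(SCORES, n_strands=7)
print(pairs)
print(sheets(pairs, n_strands=7))
```

Under these assumptions the first accepted pair is strands 4 and 5, matching Step 1, and the strands fall into two sheets, one with strands 4 and 5 and one with strands 1, 2, 3, 6, and 7.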

1. Beta Residue Pairing
Method                                Specificity/Sensitivity   Ratio of Improvement
BetaPairing                           41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)   27%                       11.7

2. Beta Strand Alignment
Method                                            Alignment Accuracy   Pairing Direction
BetaPairing                                       66%                  84%
Statistical Potential (Hubbard, 1994)             40%                  X
Pseudo-energy (Zhu and Braun, 1999)               35%                  X
Information Theory (Steward and Thornton, 2002)   37%                  X

3. Beta Strand Pairing
Method     Specificity   Sensitivity   % of non-local pairs
MST Like   53%           59%           20%

3D Structure Prediction
MWLKKFGINLLIGQSV…
• Ab-Initio Structure Prediction
Simulation
Physical force field – protein folding ……
Contact map – reconstruction
Select structure with minimum free energy
• Template-Based Structure Prediction

Query protein
Fold
MWLKKFGINKH…
Recognition Alignment

Template
Protein Data Bank

A Machine Learning Information Retrieval
Framework for Fold Recognition

Fold Recognition
Cheng and Baldi, Bioinformatics, 2006

Query Protein Alignment
MWLKKFGIN……

Template
Protein Data Bank

Machine Learning Ranking

Classic Fold Recognition Approaches

Sequence - Sequence Alignment
(Needleman and Wunsch, 1970. Smith and Waterman, 1981)

Query ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL

Template ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL

Alignment (similarity) score

Works for >40% sequence identity
(Close homologs in protein family)
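
The global alignment score of Needleman and Wunsch can be computed with a short dynamic program. This sketch uses a simple match/mismatch/linear-gap scheme chosen for illustration; practical tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (illustrative scoring)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ITAKPAK", "ITAKPAK"))
print(needleman_wunsch_score("AC", "AGC"))
```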

Profile - Sequence Alignment
(Altschul et al., 1997)

ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
Family ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL Average
Score

Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

More sensitive for distant homologs in superfamily.
(> 25% identity)

Profile - Sequence Alignment
(Altschul et al., 1997)

12………………………………….………………n 1 2 … n
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL A 0.4
Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL C 0.1
Family ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL …
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL W 0.5

Position Specific Scoring Matrix
Or Hidden Markov Model
Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

More sensitive for distant homologs in superfamily.
(> 25% identity)
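
The per-position frequencies pictured above (for example A 0.4, C 0.1, W 0.5 in one column) can be sketched as column counts over the query family. This is a deliberately simplified profile: real position-specific scoring matrices add sequence weighting, pseudocounts, and log-odds against background frequencies, and the alignment to the template is not gap-free in general. The four-residue family below is a toy example.

```python
from collections import Counter


def column_frequencies(alignment):
    """Residue frequencies per column of an equal-length family alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({aa: c / len(alignment) for aa, c in counts.items()})
    return profile


def profile_sequence_score(profile, sequence):
    """Average frequency of the sequence's residues under the profile (no gaps)."""
    total = sum(col.get(aa, 0.0) for col, aa in zip(profile, sequence))
    return total / len(profile)


family = ["ITAK", "ITAK", "ILAK", "ITAR"]  # toy query family
profile = column_frequencies(family)
print(profile[1])
print(profile_sequence_score(profile, "ITAK"))
```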

Profile - Profile Alignment
(Rychlewski et al., 2000)

1 2 … n
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL A 0.1
Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL C 0.4
Family ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL …
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL W 0.5

1 2 … m

Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN A 0.3
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN C 0.5
Family IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM …
W 0.2

More sensitive for very distant homologs.
(> 15% identity)

Sequence - Structure Alignment (Threading)
(Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)

Query Fit
Fitness
MWLKKFGINLLIGQS…. Score

Template Structure
Useful for recognizing similar folds without sequence similarity.
(no evolutionary relationship)

Integration of Complementary Approaches

FR Server1

Query
Meta Server FR server2
Consensus
(Lundstrom et al.,2001. Fischer, 2003)

FR server3

Internet

1. Reliability depends on the availability of external servers
2. Decisions are made on only a handful of candidates

Machine Learning Classification Approach

Support Vector Machine (SVM) Class 1

Proteins Class 2

Class m

Classify individual proteins into several or dozens of structure classes
(Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)

Problem 1: can’t scale up to thousands of protein classes
Problem 2: doesn’t provide templates for structure modeling

Machine Learning Information
Retrieval Framework
Query-Template Pair
Relevance Function (e.g., SVM) Score 1

+
Score 2 Rank
.
.
- .

Score n

• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)
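
The retrieval step reduces to scoring every query-template pair with one relevance function and sorting. In the sketch below the relevance function is a toy stand-in (fraction of identical positions), not the trained SVM described in the talk; both function names are hypothetical.

```python
def rank_templates(query, templates, relevance):
    """Sort templates by decreasing relevance score for one query."""
    scored = [(relevance(query, t), t) for t in templates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [t for _, t in scored]


def toy_relevance(query, template):
    """Stand-in relevance: fraction of identical positions (illustration only)."""
    matches = sum(1 for a, b in zip(query, template) if a == b)
    return matches / max(len(query), len(template))


ranked = rank_templates(
    "MWLKKFGIN",
    ["MWLKRFGIN", "AAAAAAAAA", "MWLKKFGIN"],
    toy_relevance,
)
print(ranked)
```

Because the function scores pairs rather than assigning class labels, adding a new template to the library requires no retraining.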

Pairwise Feature Extraction
• Sequence / Family Information Features
Cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features
Palign, ClustalW
• Sequence – Profile Alignment Features
PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features
ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features
Secondary structure, solvent accessibility, contact map, beta-
sheet topology
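
The first feature group (cosine, correlation, and Gaussian kernel) can be sketched on simple vectors. Here each protein is represented by its amino acid composition, which is an assumption made for illustration; the talk does not specify the vectors, and `gamma` is a hypothetical parameter.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 amino acids in a sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlation(u, v):
    """Pearson correlation: cosine of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def gaussian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_features(query, template):
    """Three similarity features for one query-template pair."""
    u, v = composition(query), composition(template)
    return [cosine(u, v), correlation(u, v), gaussian_kernel(u, v)]


same = pair_features("MWLKKFGIN", "MWLKKFGIN")
different = pair_features("MWLKKFGIN", "AAAAAAAAA")
print(same, different)
```

Features from the alignment tools and structural predictors listed above would be appended to the same vector for each pair.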

Relevance Function: Support Vector
Machine Learning
Feature Space
Positive Pairs
(Same Folds)

Support
Negative Pairs
Vector
(Different Folds)
Machine
Training/Learning

Hyperplane
Training Data Set

Relevance Function: Support Vector
Machine Learning
(1) (2)

Margin
Margin

f(x) = \sum_i \alpha_i y_i K(x, x_i) + b
K is Gaussian Kernel: K(x, y) = \exp(-\gamma \|x - y\|^2)
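
A kernel SVM scores a feature vector by a weighted sum of kernel values against its support vectors plus a bias. The sketch below evaluates that textbook form with a Gaussian kernel; the support vectors, weights, bias, and `gamma` are toy values, not parameters from the trained model.

```python
import math


def gaussian_kernel(x, y, gamma):
    """exp(-gamma * squared Euclidean distance between x and y)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Weighted sum of kernel values; the sign gives the predicted class."""
    return sum(
        alpha * y * gaussian_kernel(x, sv, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


# Toy model: one positive and one negative support vector.
svs = [(0.0, 0.0), (2.0, 2.0)]
labels = [+1, -1]
alphas = [1.0, 1.0]

def f(x):
    return decision_value(x, svs, labels, alphas, bias=0.0, gamma=0.5)

print(f((0.0, 0.0)), f((1.0, 1.0)), f((2.0, 2.0)))
```

For ranking, the raw decision value is used as the relevance score rather than only its sign.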

Training and Cross-Validation
• Standard benchmark (Lindahl’s dataset, 976 proteins)
• 976 x 975 query-template pairs (about 7,468 positives)
[Figure: queries 1 to 976, each with its own 975 query-template pairs]
• Train / Learn on the pairs of 90% of the queries (1 – 878)
• Test on the pairs of the remaining 10% of the queries (879 – 976)
• Rank the 975 templates for each test query
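
The split is made by query, so all 975 pairs of a given query fall on the same side. This sketch reproduces the counts for one fold using integer protein indices; the function names are illustrative.

```python
def split_queries(n_proteins=976, train_fraction=0.9):
    """Split query indices 1..n into a training and a test block."""
    n_train = int(n_proteins * train_fraction)
    queries = list(range(1, n_proteins + 1))
    return queries[:n_train], queries[n_train:]


def pairs_for(queries, n_proteins=976):
    """All (query, template) pairs, excluding a protein paired with itself."""
    return [(q, t) for q in queries
            for t in range(1, n_proteins + 1) if t != q]


train_queries, test_queries = split_queries()
train_pairs = pairs_for(train_queries)
test_pairs = pairs_for(test_queries)
print(len(train_queries), len(test_queries))
print(len(train_pairs) + len(test_pairs))
```

Splitting by query rather than by pair keeps every test query's templates unseen as a ranked list during training.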

Results for Top Five Ranked Templates (%)
Method      Family   Superfamily   Fold
PSI-BLAST   72.3     27.9          4.7
HMMER       73.5     31.3          14.6
SAM-T98     75.4     38.9          18.7
BLASTLINK   78.9     40.6          16.5
SSEARCH     75.5     32.5          15.6
SSHMM       71.7     31.6          24.0
THREADER    58.9     24.7          37.7
FUGUE       85.8     53.2          26.8
RAPTOR      77.8     50.0          45.1
SPARKS3     86.8     67.7          47.4
FOLDpro     89.9     70.0          48.3
• Family: close homologs, more identity
• Superfamily: distant homologs, less identity
• Fold: no evolutionary relation, no identity

Specificity-Sensitivity Plot (Family)

Specificity-Sensitivity Plot (Superfamily)

Specificity-Sensitivity Plot (Fold)

Advantages of MLIR Framework
• Integration
• Accuracy
• Extensibility
• Simplicity
• Reliability
• Completeness
• Potentials
Disadvantages
• Slower than some alignment methods

A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFM
VQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVV
FGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP
FOLDpro

Compare with the experimental structure:
RMSD = 1 Å

Predicted Structure
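
RMSD measures the average displacement between corresponding atoms of two structures. The sketch below assumes the two coordinate sets list the same atoms in the same order and are already superimposed; the talk does not state which atoms or which superposition procedure were used, and the coordinates are toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom displaced by exactly 1 angstrom along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, experimental))
```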

Publications and Bioinformatics Tools
1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond
Connectivity. NIPS 2004.
[DIpro 1.0]
2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide
Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks,
and Weighted Graph Matching. Proteins, 2006.
[DIpro 2.0]
3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by
Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005.
[BETApro]
4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein
Structure and Structural Feature Prediction Server. Nucleic Acids Research,
2005.
[SSpro 4/ACCpro 4/CMAPpro 2]
5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein
Disordered Regions by Mining Protein Structure Data. Data Mining and
Knowledge Discovery, 2005.
[DISpro]

Publications and Bioinformatics Tools
6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a
Generative, Scalable, Software Infrastructure for Pathway Bioinformatics
and Systems Biology. IEEE Intelligent Systems, 2005.
[Sigmoid]
7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes
for Single Site Mutations Using Support Vector Machines. Proteins, 2006.
[MUpro]
8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J.
Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H.
Lathrop. Functional Census of Mutation Sequence Spaces: The Example of
p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology
and Bioinformatics, 2006.

9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction
Using Profiles, Secondary Structure, Relative Solvent Accessibility, and
Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006.
[DOMpro]
10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach
to Protein Fold Recognition. Bioinformatics, 2006.
[FOLDpro]

Acknowledgements
• Pierre Baldi
• G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang
• Mike Sweredoski, Arlo Randall, Liza Larsen,
Sam Danziger, Trent Su, Hiroto Saigo,
Alessandro Vullo, Lucas Scharenbroich
