1. Machine Learning Algorithms for Protein Structure Prediction
Jianlin Cheng
Institute for Genomics and Bioinformatics
School of Information and Computer Sciences
University of California Irvine
2006
2. Outline
I. Introduction
II. 1D Prediction
III. 2D Prediction (Beta-Sheet Topology)
IV. 3D Prediction (Fold Recognition)
V. Publications and Bioinformatics Tools
4. Four Levels of Protein Structure
Primary Structure (a directional sequence of amino acids/residues)
[Figure: chain running from the N-terminus to the C-terminus; Residue1 and Residue2 joined by a peptide bond]
Secondary Structure (helix, strand, coil)
[Figure: Alpha Helix, Beta Strand / Sheet, Coil]
5. Four Levels of Protein Structure
Tertiary Structure
Quaternary Structure (complex)
[Figure: G Protein Complex]
8. 1D: Disordered Region Prediction Using Neural Networks
Input sequence: MWLKKFGINLLIGQSV…
Predictor: 1D-RNN
Output (D marks the disordered region): OOOOODDDDOOOOO…
93% true positives (TP) at 5% false positives (FP)
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
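9. 1D: Protein Domain Prediction Using Neural Networks
Input: sequence MWLKKFGINLLIGQSV… + SS and SA (secondary structure and solvent accessibility)
Predictor: 1D-RNN
Output (B marks the boundary): NNNNNNNBBBBBNNNN…
Inference/Cut: predicted boundaries → Domains
[Figure: HIV capsid protein cut into Domain 1 and Domain 2]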
Top ab-initio domain predictor in CAFASP4
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
10. 1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine
[Figure: sequence …MWLAVFILINLK… → Support Vector Machine]
Correlation = 0.76
• First method to predict energy changes from sequence
accurately
• Useful for protein engineering, protein design, and
mutagenesis analysis
Cheng, Randall, and Baldi. Proteins, 2006
22. Stage 2: Beta-Strand Alignment
• Use output probability matrix as scoring matrix
• Dynamic programming
• Disallow gaps and use the simplified search algorithm
[Figure: two strands of lengths m and n aligned Antiparallel (1…m against n…1) and Parallel (1…m against 1…n)]
Total number of alignments = 2(m+n-1)
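The count 2(m+n-1) follows from sliding one strand along the other without gaps: there are m+n-1 offsets with at least one aligned residue pair, in each of the two directions. A minimal sketch of that enumeration, assuming the scoring matrix is a plain m×n list of pairing probabilities; it uses brute-force enumeration rather than the deck's own search routine, and the names `enumerate_alignments`, `best_alignment`, and the toy matrix are hypothetical:

```python
def enumerate_alignments(m, n):
    """Yield (direction, offset, pairs) for every gapless alignment of
    a strand of length m against a strand of length n."""
    for direction in ("parallel", "antiparallel"):
        # m + n - 1 offsets keep at least one residue pair aligned.
        for offset in range(-(n - 1), m):
            pairs = []
            for i in range(m):
                k = i - offset
                if 0 <= k < n:
                    # Antiparallel alignments read the second strand backwards.
                    j = k if direction == "parallel" else n - 1 - k
                    pairs.append((i, j))
            yield direction, offset, pairs


def best_alignment(prob):
    """Return (score, direction, offset, pairs) with the maximum summed
    pairing probability over all gapless alignments."""
    m, n = len(prob), len(prob[0])
    best = None
    for direction, offset, pairs in enumerate_alignments(m, n):
        score = sum(prob[i][j] for i, j in pairs)
        if best is None or score > best[0]:
            best = (score, direction, offset, pairs)
    return best


# Toy 3x2 pairing-probability matrix (illustrative values only).
toy = [[0.1, 0.9],
       [0.8, 0.1],
       [0.1, 0.1]]
print(len(list(enumerate_alignments(3, 2))))  # 2 * (3 + 2 - 1) alignments
print(best_alignment(toy))
```

Because gaps are disallowed, exhaustive enumeration stays cheap: each of the 2(m+n-1) candidates is scored in at most min(m, n) additions.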
23. Strand Alignment and Pairing Matrix
• The alignment score is the
sum of the pairing
probabilities of the aligned
residues
• The best alignment is the
alignment with the
maximum score
• Strand Pairing Matrix
Strand Pairing Matrix of 1VJG
24. Stage 3: Prediction of Beta-Strand
Pairings and Beta-Sheet Topology
(a) Seven strands of protein 1VJG in sequence order
(b) Beta-sheet topology of protein 1VJG
26. Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
[Figure: Strand Pairing Matrix; (a) Complete SPG; (b) True Weighted SPG]
Goal: Find a set of connected subgraphs that maximizes the sum of the alignment scores and satisfies the constraints
Algorithm: Minimum Spanning Tree Like Algorithm
28. An Example of MST Like Algorithm
Strand Pairing Matrix of 1VJG (lower triangle; the same matrix is shown on slides 28 to 31):
      1    2    3    4    5    6    7
1     0
2   1.3    0
3   .94  .37    0
4   .02  .02  .04    0
5   .02  .02  .03  1.9    0
6   .10  .05  .74  .04  .04    0
7   .02  .02  .03  .02  .02  .20    0
Step 2: Pair strand 1 and 2
[Figure: partial topology with strands 4-5 paired and strands 2-1 paired; N-terminus marked]
29. An Example of MST Like Algorithm
Step 3: Pair strand 1 and 3
[Figure: partial topology with strands 4-5 paired and strands 2-1-3 in one sheet; N-terminus marked]
30. An Example of MST Like Algorithm
Step 4: Pair strand 3 and 6
[Figure: partial topology with strands 4-5 paired and strands 2-1-3-6 in one sheet; N-terminus marked]
31. An Example of MST Like Algorithm
Step 5: Pair strand 6 and 7
[Figure: topology with strands 4-5 paired and strands 2-1-3-6-7 in one sheet; N- and C-termini marked]
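The walk-through on slides 28 to 31 can be reproduced with a greedy, Kruskal-style sketch. The slides do not spell out the constraints, so the following are assumptions made only for this illustration: each strand has at most two partners, no pairing may close a cycle, and the search stops once every strand has a partner. The function name `mst_like_pairing` is hypothetical. Step 1 is not shown in this transcript; the sketch picks strands 4 and 5 first because 1.9 is the largest score.

```python
# Lower triangle of the strand pairing matrix of 1VJG, copied from the slides.
SCORES = {
    (1, 2): 1.3,
    (1, 3): 0.94, (2, 3): 0.37,
    (1, 4): 0.02, (2, 4): 0.02, (3, 4): 0.04,
    (1, 5): 0.02, (2, 5): 0.02, (3, 5): 0.03, (4, 5): 1.9,
    (1, 6): 0.10, (2, 6): 0.05, (3, 6): 0.74, (4, 6): 0.04, (5, 6): 0.04,
    (1, 7): 0.02, (2, 7): 0.02, (3, 7): 0.03, (4, 7): 0.02, (5, 7): 0.02,
    (6, 7): 0.20,
}


def mst_like_pairing(scores, n_strands, max_partners=2):
    """Greedily pick strand pairs in decreasing score order.

    Assumed constraints (not taken from the slides): at most
    `max_partners` partners per strand, no cycles, and stop as soon as
    every strand has at least one partner.
    """
    parent = list(range(n_strands + 1))  # union-find over strands 1..n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    partners = {s: 0 for s in range(1, n_strands + 1)}
    chosen = []
    for (a, b), _score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if all(count > 0 for count in partners.values()):
            break  # every strand is already paired
        if partners[a] >= max_partners or partners[b] >= max_partners:
            continue  # one of the strands has no free side
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # pairing would close a cycle within a sheet
        parent[root_a] = root_b
        partners[a] += 1
        partners[b] += 1
        chosen.append((a, b))
    return chosen


print(mst_like_pairing(SCORES, 7))
```

Under these assumptions the pair (2, 3) with score .37 is skipped because strand 3 already has two partners, which is consistent with the order of steps shown on the slides.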
32. 1. Beta Residue Pairing
Method                                 Specificity/Sensitivity   Ratio of Improvement
BetaPairing                            41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)    27%                       11.7
2. Beta Strand Alignment
Method                                             Alignment Accuracy   Pairing Direction
BetaPairing                                        66%                  84%
Statistical Potential (Hubbard, 1994)              40%                  X
Pseudo-energy (Zhu and Braun, 1999)                35%                  X
Information Theory (Steward and Thornton, 2002)    37%                  X
3. Beta Strand Pairing
Method      Specificity   Sensitivity   % of non-local pairs
MST Like    53%           59%           20%
33. 3D Structure Prediction
Input sequence: MWLKKFGINLLIGQSV…
• Ab-Initio Structure Prediction
Simulation: Physical force field – protein folding ……
Contact map – reconstruction
Select structure with minimum free energy
• Template-Based Structure Prediction
[Figure: Query protein (MWLKKFGINKH…) → Fold Recognition against Templates in the Protein Data Bank → Alignment]
34. A Machine Learning Information Retrieval Framework for Fold Recognition
Cheng and Baldi, Bioinformatics, 2006
[Figure: Query Protein (MWLKKFGIN……) → Fold Recognition by Machine Learning Ranking of Templates from the Protein Data Bank → Alignment]
35. Classic Fold Recognition Approaches
Sequence - Sequence Alignment
(Needleman and Wunsch, 1970. Smith and Waterman, 1981)
Query ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Works for >40% sequence identity
(Close homologs in protein family)
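As a rough illustration of how a sequence-sequence alignment score is obtained, here is a minimal global-alignment (Needleman-Wunsch) sketch. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability; practical tools use substitution matrices and affine gap penalties. The function name is hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b
    under a simple match/mismatch/linear-gap scheme (assumed values)."""
    # prev[j] is the best score for aligning the processed prefix of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ITAKP", "ITAKP"))
print(needleman_wunsch_score("ITAKP", "ITKP"))
```

Only two rows of the dynamic programming table are kept, which is enough when the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.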
36. Classic Fold Recognition Approaches
Profile - Sequence Alignment
(Altschul et al., 1997)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
Average Score (query family against the template)
More sensitive for distant homologs in superfamily.
(> 25% identity)
37. Classic Fold Recognition Approaches
Profile - Sequence Alignment
(Altschul et al., 1997)
Query Family (positions 1 … n):
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Position Specific Scoring Matrix or Hidden Markov Model (one column per position 1 … n; example column):
A 0.4
C 0.1
…
W 0.5
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in superfamily.
(> 25% identity)
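To make the idea of a position-specific profile concrete, the sketch below builds per-column residue frequencies from the four query family sequences shown on the slide and scores a sequence against them without gaps. This is a simplification: real position specific scoring matrices (for example those built by PSI-BLAST) use log-odds scores with pseudocounts and background frequencies, and the function names here are hypothetical.

```python
from collections import Counter

# The four aligned query family sequences from the slide.
FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]


def build_profile(aligned):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile


def profile_score(profile, seq):
    """Gapless score: sum of the profile frequency of each residue of
    seq at its position (residues unseen in a column contribute 0)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))


profile = build_profile(FAMILY)
print(profile[5])  # column 6 of the family alignment
print(profile_score(profile, FAMILY[0]))
```

A full profile-sequence method would combine such column scores with gapped dynamic programming rather than the fixed, gapless positions assumed here.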
38. Classic Fold Recognition Approaches
Profile - Profile Alignment
(Rychlewski et al., 2000)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Query profile (positions 1 … n; example column): A 0.1, C 0.4, …, W 0.5
Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
Template profile (positions 1 … m; example column): A 0.3, C 0.5, …, W 0.2
More sensitive for very distant homologs.
(> 15% identity)
39. Classic Fold Recognition Approaches
Sequence - Structure Alignment (Threading)
(Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)
[Figure: Query (MWLKKFGINLLIGQS….) fit onto the Template Structure, giving a Fitness Score]
Useful for recognizing similar folds without sequence similarity.
(no evolutionary relationship)
40. Integration of Complementary Approaches
Meta Server (Lundstrom et al., 2001. Fischer, 2003)
[Figure: Query → Meta Server → FR server 1, FR server 2, FR server 3 over the Internet → Consensus]
1. Reliability depends on availability of external servers
2. Makes decisions on a handful of candidates
41. Machine Learning Classification Approach
[Figure: Proteins → Support Vector Machine (SVM) → Class 1, Class 2, …, Class m]
Classify individual proteins into several or dozens of structure classes
(Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)
Problem 1: can’t scale up to thousands of protein classes
Problem 2: doesn’t provide templates for structure modeling
42. Machine Learning Information Retrieval Framework
[Figure: Query-Template Pair → Relevance Function (e.g., SVM) → +/- decision and Score 1, Score 2, …, Score n → Rank]
• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)
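The retrieval step can be sketched as follows: every query-template pair is reduced to a feature vector, a relevance function maps it to one score, and templates are ranked by that score. In the framework the relevance function would be a trained model such as an SVM; the linear function, weights, bias, template names, and feature values below are hypothetical placeholders used only to show the ranking mechanics.

```python
def relevance(features, weights, bias):
    """Linear stand-in for a trained relevance function: a positive
    score is read as 'relevant', a negative score as 'not relevant'."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def rank_templates(pair_features, weights, bias):
    """Score every query-template pair and return (template, score)
    tuples sorted from most to least relevant."""
    scored = [(name, relevance(feats, weights, bias))
              for name, feats in pair_features.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pairwise features for one query against three templates.
pair_features = {
    "template_A": [0.9, 0.8],
    "template_B": [0.1, 0.2],
    "template_C": [0.5, 0.5],
}
ranking = rank_templates(pair_features, weights=[1.0, 2.0], bias=-1.5)
for name, score in ranking:
    print(name, round(score, 3))
```

Because each pair yields a single relevant-or-not score, the number of fold classes never enters the model, which is the contrast with the multi-class approach drawn in the bullets above.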
43. Pairwise Feature Extraction
• Sequence / Family Information Features
Cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features
Palign, ClustalW
• Sequence – Profile Alignment Features
PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features
ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features
Secondary structure, solvent accessibility, contact map, beta-sheet topology
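The sequence/family information features (cosine, correlation, Gaussian kernel) compare two fixed-length vectors, one per protein. A minimal sketch, assuming for illustration that each protein is represented by its amino acid composition; the slides do not say which vectors are compared, and the `sigma` value and function names are hypothetical. Only the visible sequence prefixes from the slides are used.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """20-dimensional amino acid composition (fractions) of seq."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlation(u, v):
    """Pearson correlation: cosine of the mean-centered vectors
    (undefined for a constant vector)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def gaussian_kernel(u, v, sigma=1.0):
    dist_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-dist_sq / (2 * sigma ** 2))


query = composition("MWLKKFGINLLIGQSV")   # visible prefix only
template = composition("MWLKKFGINKH")     # visible prefix only
features = [cosine(query, template),
            correlation(query, template),
            gaussian_kernel(query, template)]
print(features)
```

Each function yields one number per query-template pair, so the three values can be concatenated with alignment and structural features into a single pairwise feature vector.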
45. Relevance Function: Support Vector Machine Learning
[Figure: Training Data Set → Support Vector Machine Training/Learning → Hyperplane in Feature Space separating Positive Pairs (Same Folds) from Negative Pairs (Different Folds)]
52. Advantages of MLIR Framework
• Integration
• Accuracy
• Extensibility
• Simplicity
• Reliability
• Completeness
• Potentials
Disadvantages
Slower than some alignment methods
53. A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFM
VQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVV
FGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP
[Figure: Predicted Structure (FOLDpro)]
Compare with the experimental structure: RMSD = 1 Å
54. Publications and Bioinformatics Tools
1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond
Connectivity. NIPS 2004.
[DIpro 1.0]
2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide
Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks,
and Weighted Graph Matching. Proteins, 2006.
[DIpro 2.0]
3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by
Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005.
[BETApro]
4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein
Structure and Structural Feature Prediction Server. Nucleic Acids Research,
2005.
[SSpro 4/ACCpro 4/CMAPpro 2]
5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein
Disordered Regions by Mining Protein Structure Data. Data Mining and
Knowledge Discovery, 2005.
[DISpro]
55. Publications and Bioinformatics Tools
6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a
Generative, Scalable, Software Infrastructure for Pathway Bioinformatics
and Systems Biology. IEEE Intelligent Systems, 2005.
[Sigmoid]
7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes
for Single Site Mutations Using Support Vector Machines. Proteins, 2006.
[MUpro]
8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J.
Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H.
Lathrop. Functional Census of Mutation Sequence Spaces: The Example of
p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology
and Bioinformatics, 2006.
9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction
Using Profiles, Secondary Structure, Relative Solvent Accessibility, and
Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006.
[DOMpro]
10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach
to Protein Fold Recognition. Bioinformatics, 2006.
[FOLDpro]
56. Acknowledgements
• Pierre Baldi
• G. Wesley Hatfield, Eric Mjolsness, Hal
Stern, Dennis Decoste, Suzanne Sandmeyer,
Richard Lathrop, Gianluca Pollastri, Chin-
Rang Yang
• Mike Sweredoski, Arlo Randall, Liza Larsen,
Sam Danziger, Trent Su, Hiroto Saigo,
Alessandro Vullo, Lucas Scharenbroich