Protein structure 2
1.
2. Protein Structural Bioinformatics
Definition
The subdiscipline of bioinformatics that focuses on the
representation, storage, retrieval, analysis, and display of
structural information at the atomic and subcellular spatial
scales.
(From Structural Bioinformatics, P.E. Bourne & H. Weissig (eds.), John Wiley &
Sons, Inc., 2003, p. 4.)
Why is STRUCTURAL bioinformatics important?
Because a protein’s function is determined by its structure.
Knowledge of a protein’s structure is necessary in order to gain
a full understanding of the biological role of a protein.
3. Bioinformatics methods can be used to analyze
protein structural data in the following ways:
• Visualization of protein structures
• Alignment of protein structures
• Classification of proteins into families, based on similarity
of their structures
• Prediction of protein structures
• Simulation of protein folding and dynamic motions
4. Protein structure determination by X-ray crystallography or
NMR is difficult (see PowerPoint slides from last module).
It takes 1-3 years to solve a protein structure by these methods. Certain
proteins, such as membrane proteins, are extremely difficult or impossible to
solve by these methods. Due to genomic sequencing efforts, the gap
between known protein sequences and known protein structures is
increasing: only about 3,000 unique protein structures have been
determined, but over 1 million unique sequences have been determined.
Therefore, it is necessary to use bioinformatics methods to predict the
structures of proteins for which a crystal structure or NMR structure has not
been determined.
Bioinformatics methods can predict:
(1) secondary structural elements in a protein sequence
(2) the tertiary structure of the entire sequence
(3) “special” structures, such as transmembrane α-helices,
transmembrane β-barrels, coiled coils, and leucine zippers
5. Protein Secondary Structure Prediction
All secondary structure prediction is based on the assumption that there
should be a correlation between amino acid sequence and secondary
structure. In other words, it is assumed that certain stretches of amino acids
are more likely to form one type of secondary structure than another.
During secondary structure prediction, the conformational state of each
residue in a protein sequence is predicted; generally each residue is
predicted as having one of three possible states:
(1) α-helical structure
(2) β-strand
(3) “other” (β-turn, loop, or random coil)
Sometimes β-turn is separated as a 4th state.
Why is prediction of secondary structure useful?
It can help guide sequence alignment or improve existing sequence
alignment of distantly related sequences. It is also an intermediate step in
some methods for tertiary structure prediction.
6. Methods of secondary structure prediction fall into
two broad classes:
Ab initio methods– predict secondary structure based solely
on protein sequence; these methods compute statistics for the
residues that occur in different secondary structural elements in
proteins with known structures, in order to identify “patterns” in
the types of residues that occur in a given type of secondary
structure.
Homology-based methods– make use of multiple sequence
alignments of homologous proteins to predict secondary
structure; these methods are able to locate conserved patterns
that are characteristic of particular secondary structural
elements across the aligned family members.
7. Certain amino acids are observed more frequently than others in
α-helices, β-strands, and β-turns in crystal structures (see Figure). This
leads to the idea that each amino acid tends to “prefer” being
constrained in a certain type of secondary structure, or has an
“intrinsic propensity” to adopt that secondary structure.
Fig. 4-10 from Lehninger Principles of Biochemistry, 4th ed.
The figure shows that:
Glu, Met, and Ala are most frequent in α-helices.
Val, Tyr, and Ile are most frequent in β-strands.
Pro, Gly, and Asn are most frequent in β-turns.
Based on these data, it is believed that Glu has a high α-helical
propensity, but a low β-strand propensity.
8. Ab initio methods of secondary structure prediction:
• These methods calculate the relative propensity (intrinsic tendency) of each
amino acid in a protein sequence to belong to a certain secondary structural
element.
• Propensity scores for the 20 amino acids are derived from known protein
structures: these propensities are calculated from the relative frequency of a
given amino acid within the proteins, its frequency in a given type of
secondary structure, and the fraction of all amino acids occurring in that type
of secondary structure.
• Stretches of a protein’s sequence that contain many residues with a high
α-helical propensity are predicted to fold into α-helices. Stretches of sequence
that contain many residues with a high β-strand propensity are predicted to
fold into β-strands.
• Two examples: Chou-Fasman method, GOR method
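As a minimal sketch of how such a propensity score might be computed, the Python below uses a tiny made-up data set of sequences with per-residue secondary structure labels (H = helix, E = strand, C = other). The data, the labels, and the function name propensities are illustrative assumptions; this is not the actual Chou-Fasman or GOR parameter set. The propensity is taken here as the fraction of an amino acid's occurrences that fall in a given state, divided by the fraction of all residues in that state, so values above 1 suggest a preference for that state.

```python
from collections import Counter

def propensities(sequences, structures, state):
    """Propensity of each amino acid for `state` in a labelled data set.

    propensity(aa) = (count of aa in state / count of aa overall)
                     / (residues in state / all residues)
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must match in length")
        for aa, label in zip(seq, ss):
            aa_total[aa] += 1
            if label == state:
                aa_in_state[aa] += 1
    n_all = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        return {}  # the state never occurs, so no propensity can be defined
    state_fraction = n_state / n_all
    return {aa: (aa_in_state[aa] / aa_total[aa]) / state_fraction
            for aa in aa_total}

# Toy, made-up data (H = helix, E = strand, C = other); not real structures.
toy_sequences = ["EAMGVIE", "AEPGVYA"]
toy_structures = ["HHHCEEH", "HHCCEEH"]
helix = propensities(toy_sequences, toy_structures, "H")
for aa in sorted(helix, key=helix.get, reverse=True):
    print(aa, round(helix[aa], 2))
```

With real data, the same calculation would be run over a large set of known structures, and a window of consecutive high-propensity residues would then be used to call a helix or strand.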
9. Accuracy of ab initio methods:
• These methods are not very accurate:
   • Chou-Fasman method: 50-60% accuracy
   • GOR method: 64% accuracy; drastically underpredicts β-strands
• These methods are only a little better than randomly assigning secondary
structure! Known proteins consist of ~31% α-helix and ~28% β-sheet, so
randomly assigning secondary structural elements to residues would result in
~30% accuracy.
• Specific problems with these methods:
   • Tend to underpredict the lengths of α-helices and β-strands; they cannot
   identify the first and last residues of helices and strands very well
   • Tend to miss β-strands completely
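The rough "~30%" baseline for random assignment can be checked with a short calculation. The sketch below assumes three states, with helix and sheet at the stated 31% and 28% and the remaining 41% counted as "other" (that remainder is an inference, not a number from the slides). It also assumes the random guesser draws labels with those same frequencies, in which case the expected per-residue accuracy is the sum of the squared frequencies.

```python
def random_assignment_accuracy(state_fractions):
    """Expected accuracy when labels are drawn at random with the same
    frequencies as the true states: sum over states of p * p."""
    total = sum(state_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state fractions must sum to 1")
    return sum(p * p for p in state_fractions.values())

# 31% helix and 28% sheet are from the text; 41% "other" is the assumed remainder.
fractions = {"helix": 0.31, "strand": 0.28, "other": 0.41}
expected = random_assignment_accuracy(fractions)
print(f"expected accuracy of random assignment: {expected:.1%}")
```

Under these assumptions the result is about one third, in line with the rough figure quoted above.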
10. A few homology-based secondary (2°) structure prediction methods:
Neural network methods:
PROFsec (an improved version of PHDsec)
http://www.predictprotein.org/
PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/
SSpro (newest version is 4.0)
http://scratch.proteomics.ics.uci.edu/
SAM-T (SAM-T08 is the newest version; SAM-T06, SAM-T02, and SAM-T99 are old versions)
http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
Nearest-neighbor methods:
NNSSP
no longer available online
PREDATOR
http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::predator
HMM methods:
HMMSTR
http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
11. A few methods for predicting transmembrane α-helices:
TMHMM
http://www.cbs.dtu.dk/services/TMHMM/
HMMTOP
http://www.enzim.hu/hmmtop/index.html
Phobius (also predicts presence of signal peptides)
http://phobius.sbc.su.se/
TopPred
http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::toppred
PRED-TMR
http://athina.biol.uoa.gr/PRED-TMR/
DAS
http://mendel.imp.ac.at/sat/DAS/DAS.html
TMpred
http://www.ch.embnet.org/software/TMPRED_form.html
MEMSAT
http://bioinf.cs.ucl.ac.uk/psipred/
Accuracies of the methods:
Levels of accuracy are reported by the developers to be in the range of 75-95%.
At least one study (2001) found TMHMM to be the best performing program.
It is best to use several methods and compare the results to arrive at a consensus
prediction. When different methods, specifically methods that are based on different
algorithms, give similar results, the reliability of the results is higher.
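One simple way to combine several predictions into a consensus is a per-residue majority vote. The sketch below assumes each method's output has already been converted to a string of the same length, with M for a predicted membrane-helix residue and - for everything else. The three example strings are hypothetical and do not come from any of the servers listed above.

```python
from collections import Counter

def consensus_prediction(predictions):
    """Per-residue majority vote over equal-length prediction strings.

    Returns the consensus string and, for each position, the fraction of
    methods that agree with the consensus label. Ties go to the label
    that was seen first at that position.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    agreement = []
    for column in zip(*predictions):
        label, votes = Counter(column).most_common(1)[0]
        consensus.append(label)
        agreement.append(votes / len(column))
    return "".join(consensus), agreement

# Hypothetical outputs of three methods for the same 8-residue stretch.
toy_predictions = ["--MMMM--", "-MMMMM--", "--MMM---"]
consensus, agreement = consensus_prediction(toy_predictions)
print(consensus)
print([round(a, 2) for a in agreement])
```

Positions where the agreement fraction is low are the ones that deserve a closer look, typically the ends of predicted helices.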
12. Tertiary structure prediction methods fall into three
classes:
(1) Homology modeling (also called comparative modeling)
A structure is built based on the known structure of another protein that is
similar in sequence (a homolog).
(2) Threading (also called structural fold recognition)
A structure is predicted for a protein by “threading” its sequence through a
variety of known structures to determine which structure the sequence best
fits.
(3) Ab initio prediction (also called de novo prediction)
A structure is predicted based only on the amino acid sequence of the
protein, using the physicochemical properties of its residues and the
principles governing protein folding.
13. Homology modeling for tertiary structure prediction:
Homology modeling is based on the idea that if two proteins share a high
degree of sequence similarity (i.e., they are close homologs), they are likely
to have very similar 3D structures. In general, proteins that share >30%
sequence identity are likely to be quite similar in structure.
Therefore, if a protein of unknown structure is similar in sequence to a
protein of known structure, the known structure can be used as a template to
which the unknown sequence is fit. The structure that is built for the
unknown sequence is then called a homology model for the structure of that
sequence.
Figure: The “safe homology modeling zone,” above the gray curve, is the region
where two proteins are likely to have the same structure.
(Fig. 5 from R. Nair & B. Rost, Protein Science (2002) 11: 2836-47.)
14. Steps in homology modeling for tertiary structure
prediction:
The protein of unknown structure for which a structural model is to be built
will be called the “target sequence.”
1. Template selection– Identify protein(s) in the PDB that are
homologous to the target sequence using BLAST or PSI-BLAST. If a close
homolog with known structure is found, its structure will serve as a template
to which the target sequence will be matched. The template should have
at least 30% sequence identity with the target. (Proteins that share less
than 30% sequence identity may not be similar enough in structure to carry
out homology modeling.) If PSI-BLAST does not identify a suitable template,
it will probably be necessary to construct a structural model by threading.
It is possible to use multiple templates if more than one good template is
identified. When multiple templates are available, it is best to use more than
one template to avoid biasing the model toward a single protein. The
template used in the next step of homology modeling will then be an
averaged structure based on all of the chosen templates.
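The 30% identity threshold can be checked directly on a pairwise alignment. The sketch below computes percent identity over the aligned (non-gap) columns of the target-template fragment that these slides use in the backbone-building step. The function name percent_identity is illustrative, and counting only non-gap columns in the denominator is one of several conventions in use, so numbers from BLAST or other tools may differ somewhat.

```python
def percent_identity(target_aln, template_aln, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Returns (percent, identical_columns, aligned_columns).
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(target_aln, template_aln):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0, 0, 0
    return 100.0 * identical / aligned, identical, aligned

# Alignment fragment from the slides (flanking "..." regions omitted).
target = "FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF"
template = "FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF"
pid, identical, aligned = percent_identity(target, template)
print(f"{identical} of {aligned} aligned positions identical ({pid:.1f}%)")
print("template usable by the 30% rule:", pid >= 30.0)
```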
15. Steps in homology modeling for tertiary structure
prediction:
2. Sequence alignment– Construct a multiple sequence alignment of
the target, the template, and other homologous sequences. It is actually the
alignment of the target and template that is of interest, but the inclusion of
other homologs provides more information, helping to ensure that the best
alignment of homologous residues is achieved. The quality of the target-
template alignment is critical for constructing an accurate structural
model for the target. If a given residue in the target is not aligned with the
proper residue in the template, the error cannot be corrected in later steps of
model building. A robust multiple sequence alignment program should be
used for this step, and the resulting alignment should be very carefully
examined and manually refined if necessary.
16. Steps in homology modeling for tertiary structure
prediction:
3. Backbone model building– Residues in the aligned regions of the
target and template are assumed to adopt the same structure. Therefore,
the backbone atoms of these residues in the target can be placed in the
same 3D location as the backbone atoms of these residues in the template.
See the alignment below as an example.
Target:   ...FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF...
Template: ...FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF...
For these residues, backbone atoms of the target are assumed
to occupy the same 3D location as those of the template.
F aligned with F. They are identical,
so all atoms of target F will overlap
the 3D positions of all atoms of
template F.
Target D aligned with template E. They are not identical, but
their backbone atoms can be assumed to
occupy the same 3D position. So backbone
atoms of target D will overlap the 3D
positions of backbone atoms of template E.
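The bookkeeping behind this step can be sketched as a mapping from target residues to template residues, derived from the alignment. The code below is an illustration only: the function name and the "all atoms" / "backbone only" labels are assumptions, residue numbers are counted within the fragment shown (the flanking "..." regions are not included), and no real coordinates are copied.

```python
def map_aligned_residues(target_aln, template_aln, gap="-"):
    """List (target_number, template_number, identical) for aligned columns.

    Numbers are 1-based positions within each ungapped fragment.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    t_num = 0
    m_num = 0
    for a, b in zip(target_aln, template_aln):
        if a != gap:
            t_num += 1
        if b != gap:
            m_num += 1
        if a != gap and b != gap:
            pairs.append((t_num, m_num, a == b))
    return pairs

target = "FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF"
template = "FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF"
pairs = map_aligned_residues(target, template)

# Identical pairs can inherit every atom position from the template;
# non-identical pairs inherit only the backbone atoms.
for t_num, m_num, same in pairs:
    what = "all atoms" if same else "backbone only"
    print(f"target {target[t_num - 1]}{t_num} <- template residue {m_num}: {what}")
```

Target residues that have no partner in the template (the gap region) do not appear in the mapping and have to be handled separately.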
17. Steps in homology modeling for tertiary structure prediction:
4. Loop building– There are likely to be regions in the alignment where
gaps appear because the target sequence does not match the template. The
target sequence residues in these gap regions are assumed to form a loop that
is not present in the template structure. The structure of this loop can be built
using several different methods. In any case, it is a difficult problem since the
template provides no information to guide the building of the loop structure.
Target:   ...FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF...
Template: ...FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF...
                                ^^^^^^^ target loop
“Extra” residues in the target sequence do not
match the template and are assumed to form a loop.
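Finding the residues that need loop building amounts to locating the runs of template gaps in the alignment. A minimal sketch, using the fragment above and an illustrative function name (residue numbers are 1-based within the fragment shown, not within the full protein):

```python
def find_target_loops(target_aln, template_aln, gap="-"):
    """Return (start, end, residues) for each run of target residues
    that is aligned against gaps in the template."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    loops = []
    t_num = 0
    run_start = None
    run = []
    for a, b in zip(target_aln, template_aln):
        if a != gap:
            t_num += 1
        if a != gap and b == gap:
            if run_start is None:
                run_start = t_num
            run.append(a)
        elif run_start is not None:
            # the gap run has ended, so record the loop
            loops.append((run_start, run_start + len(run) - 1, "".join(run)))
            run_start, run = None, []
    if run_start is not None:
        loops.append((run_start, run_start + len(run) - 1, "".join(run)))
    return loops

target = "FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF"
template = "FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF"
loops = find_target_loops(target, template)
print(loops)
```

This only identifies which residues lack template coordinates; building the loop conformation itself is the hard part and is not attempted here.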
18. Steps in homology modeling for tertiary structure
prediction:
5. Side chain addition– The side chains are added to the backbone
structure. Each side chain could potentially have many possible
conformations due to bond rotation, but steric clashes with neighboring
atoms are not allowed. Therefore, side chain conformations (rotamers) that have
the lowest interaction energy with nearby atoms are chosen.
Target:   ...FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF...
Template: ...FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF...
Target and template are both F, so
all atoms of the target side chain
can be modeled as having the same
3D positions as the template side
chain, at least initially. (Small
changes in position may be
necessary in later refinement steps.)
Target and template have different
side chains (D vs. E), so the side
chain rotamer that is chosen for the
target D must not overlap/clash with
any neighboring atoms.
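The idea of picking the rotamer that clashes least with its surroundings can be sketched with a purely steric score. Everything specific here is an assumption made for illustration: the coordinates are invented, the 3.0 Å contact cutoff is arbitrary, and the squared-overlap penalty stands in for the interaction energy that real programs compute from rotamer libraries and full energy functions.

```python
import math

def clash_score(atoms, neighbors, min_distance=3.0):
    """Sum of squared overlaps for atom pairs closer than min_distance (Å)."""
    score = 0.0
    for atom in atoms:
        for other in neighbors:
            d = math.dist(atom, other)
            if d < min_distance:
                score += (min_distance - d) ** 2
    return score

def pick_rotamer(rotamers, neighbors, min_distance=3.0):
    """Return the name of the candidate rotamer with the lowest clash score."""
    return min(rotamers,
               key=lambda name: clash_score(rotamers[name], neighbors, min_distance))

# Invented coordinates (Å): one neighboring atom and three candidate rotamers,
# each represented by a single side-chain atom for brevity.
neighbors = [(0.0, 0.0, 0.0)]
rotamers = {
    "rotamer_1": [(1.0, 0.0, 0.0)],  # 1.0 Å away: severe clash
    "rotamer_2": [(4.0, 0.0, 0.0)],  # 4.0 Å away: no clash
    "rotamer_3": [(2.5, 0.0, 0.0)],  # 2.5 Å away: mild clash
}
best = pick_rotamer(rotamers, neighbors)
print(best, {name: clash_score(atoms, neighbors) for name, atoms in rotamers.items()})
```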
19. Steps in homology modeling for tertiary structure
prediction:
6. Model refinement– Unfavorable bond angles, bond lengths, and
atom contacts are likely to exist in the preliminary model, so an energy
minimization procedure is applied to refine the model. In this procedure,
atom positions are shifted so that the overall conformation of the entire
structure has the lowest potential energy. Only limited energy minimization
should be applied (a few hundred iterations) so that major errors are
removed but residues are not moved from their correct positions.
7. Model evaluation– The model is checked for anomalies in dihedral
angles, bond lengths, and atom contacts.
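To illustrate what limited energy minimization in step 6 means, the sketch below runs a fixed, small number of steepest-descent iterations on a toy energy function. The energy here is only a harmonic penalty on the distance between consecutive points, with an ideal spacing of 3.8 Å (roughly the distance between consecutive Cα atoms). The coordinates, force constant, and step size are invented, and real refinement uses a full molecular mechanics force field rather than this single term.

```python
import math

def bond_energy(coords, ideal=3.8, k=1.0):
    """Harmonic energy over consecutive points: sum of k * (d - ideal)^2."""
    return sum(k * (math.dist(a, b) - ideal) ** 2
               for a, b in zip(coords, coords[1:]))

def minimize(coords, ideal=3.8, k=1.0, step=0.05, max_iter=200):
    """Steepest descent with a fixed iteration cap (limited minimization)."""
    coords = [list(c) for c in coords]
    for _ in range(max_iter):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i in range(len(coords) - 1):
            a, b = coords[i], coords[i + 1]
            d = math.dist(a, b)
            if d == 0.0:
                continue  # direction undefined for coincident points
            coeff = 2.0 * k * (d - ideal) / d
            for axis in range(3):
                g = coeff * (a[axis] - b[axis])
                grads[i][axis] += g       # dE/da
                grads[i + 1][axis] -= g   # dE/db
        for c, g in zip(coords, grads):
            for axis in range(3):
                c[axis] -= step * g[axis]
    return [tuple(c) for c in coords]

# Invented starting positions with distorted spacings (3.0, ~4.5, ~3.5 Å).
start = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.5, 0.5, 0.0), (11.0, 0.0, 0.0)]
refined = minimize(start, max_iter=200)
before = bond_energy(start)
after = bond_energy(refined)
print(f"energy before: {before:.3f}, after: {after:.6f}")
```

Capping max_iter at a few hundred reflects the advice above: the worst geometry is relaxed, but atoms are not given the chance to drift far from where the template placed them.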
20. Programs for homology modeling:
Many programs for automated homology modeling are now available, so
anyone can construct a homology model on a regular PC. However,
construction of a “good” homology model (at least for sequences that are not
highly similar) usually requires some expertise and usually should be done
with human intervention, rather than in a fully automated fashion.
A few of the freely available programs for homology
modeling:
SWISS-MODEL– Produces accurate models; fast; good tutorials available.
http://swissmodel.expasy.org/
I-TASSER– Produces accurate models; easy to use, but slow
http://zhanglab.ccmb.med.umich.edu/I-TASSER/
Modeller– must be downloaded and installed locally
http://salilab.org/modeller/modeller.html
WHAT IF
http://swift.cmbi.ru.nl/servers/html/index.html
http://swift.cmbi.ru.nl/whatif/
21. Is a homology model CORRECT?
Since the actual (experimentally determined) structure of the target is not
known, there is no way to say whether or not the homology model is
“correct.” Instead, the best a researcher can do is compare the homology
model to the structure of the template from which it was derived. If the atom
positions in the model do not deviate very much from those of the template,
the homology model is said to be “accurate.” The greater the deviation
between model and template, the lower the accuracy of the model.
When is a homology model definitely INCORRECT?
A homology model has regions that are incorrect if it contains structural
features that do not occur in native proteins, such as:
• Hydrophobic side chains on the surface of the model (these side
chains should be buried)
• Unreasonable bond lengths or angles
• Unfavorable noncovalent contacts between atoms (clashes)
• Unreasonable dihedral angles
22. Accuracy of homology modeling:
The template selection and alignment accuracy are crucial to the accuracy of a homology
model. The accuracy of the model depends on the percentage of sequence identity
between the target and template. The average coordinate agreement between the
modeled structure and the actual structure drops ~0.3 Å for each 10% reduction in
sequence identity.
The largest structural differences between homologous proteins are in surface loops. In
other words, the structure of the protein core is more highly conserved. Therefore, the
regions that are most likely to be in error in a homology model are the surface loops.
High-accuracy homology models can be built when the target and template have 50%
or greater sequence identity. Errors are mostly mistakes in side-chain packing, small
shifts of the core backbone regions, and occasionally larger errors in loops.
Medium-accuracy homology models can be built when the proteins share 30-50%
sequence identity. There can be alignment mistakes, and there are more frequent side-
chain packing, core distortion, and loop modeling errors.
Low-accuracy homology models are based on proteins that share <30% sequence
identity. If a model is based on an almost insignificant alignment to a known structure, the
model may have an entirely incorrect fold.
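The identity thresholds and the ~0.3 Å per 10% rule of thumb above can be written as two small helper functions. The function names are illustrative, and treating the 0.3 Å figure as a strictly linear relation is a simplification of what is stated as an average trend.

```python
def model_accuracy_class(percent_identity):
    """Classify expected model accuracy from target-template identity (%)."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 50:
        return "high"
    if percent_identity >= 30:
        return "medium"
    return "low"

def extra_deviation(identity_from, identity_to, drop_per_10_percent=0.3):
    """Approximate loss of coordinate agreement (Å) when sequence
    identity falls from identity_from to identity_to (both in %)."""
    return drop_per_10_percent * (identity_from - identity_to) / 10.0

for pid in (80, 50, 35, 25):
    print(pid, model_accuracy_class(pid))
print(f"80% -> 50% identity: about {extra_deviation(80, 50):.1f} Å worse agreement")
```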
The best model-building programs will produce models of similar accuracy, provided that
the methods are used optimally.