Poster from 3DSIG 2013 on CE-Symm. For a more recent version, see http://www.slideshare.net/sbliven/3dsig-2014-systematic-detection-of-internal-symmetry-in-proteins
Aligning Subunits of Internally Symmetric Proteins with CE-Symm
(Left) Fibroblast growth factor 1 [3JUT], colored to show internal symmetry. (Right) Dot plot
showing equivalent residues within the protein. Red lines correspond to a 120° clockwise
rotation of the protein around the 3-fold axis, and cyan to the 240° rotation. After
duplicating the matrix, each alignment forms a sequential diagonal line which can be fully
detected by CE. Gray shading indicates regions near the diagonal which are penalized by the
scoring function.
Screenshot of the CE-Symm interface, showing a
two-fold axis of EPSP synthase [1G6S].
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Background
Proteins can have quaternary symmetry and/or internal symmetry
Symmetry is widespread in proteins and can be observed at a number of levels, from
crystal symmetry within complexes to pseudo-symmetry in individual chains and
domains. Symmetry is known to play a role in protein evolution,1
allosteric regulation,2 DNA binding,3 and cooperative enzyme effects.4 Symmetry has
also been utilized to understand protein folding5 and to aid the computational design
of large proteins.6
Quaternary symmetry consists of multiple identical polypeptide chains arranged in
a symmetric fashion. Such symmetry is extremely common in proteins, occurring in
approximately 80% of structures in the Protein Data Bank (PDB). Detecting
quaternary symmetry relies on accurate assignment of the correct biological assembly
for each protein. The PDB now annotates protein structures with their quaternary symmetry (Peter Rose et al., in
preparation).
Proteins can also have internal or ternary symmetry, when a single chain contains two or more equivalent
subunits. The subunits generally will differ in the exact sequence, but have substantially similar structures. Internal
symmetry is sometimes styled as
pseudosymmetry to reflect that the
equivalence between subunits is generally at
the level of residues or secondary structure
elements rather than atoms or electron
density, as is common with quaternary
symmetry.
Internal symmetry can arise from
quaternary symmetry by gene duplication or fusion.
Thus, in addition to the many functional
implications of symmetry, identifying
protein symmetry can provide information
about the evolutionary history of a protein.
Such fission and fusion events often
preserve the overall structure and function
of the active complex.
Existing methods for finding internal symmetry
Several computational methods are available to detect symmetry. Some methods search for periodic sequences or
structure (e.g. DAVROS7). These are generally limited in their ability to handle large insertions. Methods based on
structural alignment algorithms (SymD,8 GANGSTA+9) can tolerate large insertions, but produce pairwise
alignments between adjacent symmetric subunits rather than a global alignment of all subunits. This leads to
ambiguous alignments, where a single residue could be aligned to several residues in each other subunit, depending
on the order in which rotation operations are performed.
Conclusion
CE-Symm was run over a large hand-curated benchmark, and is able to
detect symmetric proteins with a high degree of accuracy, even in the
presence of large insertions. The resulting alignment includes exactly
one residue from each subunit, as expected for a multiple alignment. It
runs quickly and is able to detect symmetry broadly across a variety of
folds.
The refinement stage can also be used as an independent tool in
conjunction with seed alignments from other tools. This allows the
circularly permuted alignments from tools such as SymD8 to be refined
into multiple alignments between individual subunits.
Because symmetry is hypothesized to derive from gene duplications and
fusions,12 aligning subunits within symmetric proteins can reveal ancient
homologies and conserved sequences. CE-Symm is useful both for
identifying symmetric proteins and for aligning the subunits for further
study.
Availability:
CE-Symm source code is available under the LGPL license from https://github.com/rcsb/symmetry
An online server is available at http://source.rcsb.org/jfatcatserver/symmetry.jsp
Spencer Bliven
Bioinformatics and Systems Biology Program
University of California San Diego
Douglas Myers-Turnbull
Dept. of Computer Science & Engineering
University of California San Diego
Philip Bourne
Skaggs School of Pharmacy and Pharmaceutical Sciences
University of California San Diego
Andreas Prlić
San Diego Supercomputer Center
University of California San Diego
(Left) Beta-carbonic anhydrase from Porphyridium purpureum [1I6O] is a tetramer with D2
quaternary symmetry. (Right) The beta-carbonic anhydrase in E. coli [1DDZ] consists of
only two chains, which each have internal C2 symmetry in addition to the C2 quaternary
symmetry. The two halves of the chain have 68% sequence identity, strongly indicating
that a duplication and fusion event has occurred in the evolution of the E. coli enzyme.
D5 quaternary symmetry of GTP
cyclohydrolase I [1A8R]. The
main 5-fold axis is shown in red;
the five 2-fold axes are in blue.
Methods
The CE-Symm program is able to detect internal symmetry in proteins. It first identifies structurally similar
regions within the protein structure. It then refines this alignment to improve the correspondence between
subunits.
1. Identify structurally similar regions
The CE-Symm algorithm starts by
identifying a non-trivial structural
alignment between a protein and itself
using Combinatorial Extension10 (CE).
This uses the dynamic programming and
progressive refinement of CE, but with
two modifications.
1. A strong penalty term is added to self-aligned
residues to prevent the trivial 0°
rotation from dominating.
2. The alignment matrix is duplicated in the
manner of Uliel et al.11 to account for the
circular permutation which is introduced
when comparing a symmetric protein
against a rotated copy of itself.
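The two modifications can be illustrated with a minimal Python sketch. It assumes the self-similarity scores are already available as an n × n list of lists. The function name, the band width `window`, and the `penalty` value are hypothetical choices for illustration and are not taken from the CE-Symm source.

```python
def duplicate_with_penalty(scores, window=2, penalty=-10.0):
    """Tile an n x n self-similarity matrix into an n x 2n matrix and
    penalize cells close to either copy of the main diagonal.

    Column j of the doubled matrix stands for residue j mod n, so the
    diagonals j == i and j == i + n both represent self-alignment.
    """
    n = len(scores)
    doubled = []
    for i in range(n):
        row = []
        for j in range(2 * n):
            value = scores[i][j % n]
            # Distance from cell (i, j) to the nearest self-alignment diagonal.
            offset = min(abs(j - i), abs(j - (i + n)))
            if offset <= window:
                value += penalty  # discourage the trivial 0-degree rotation
            row.append(value)
        doubled.append(row)
    return doubled
```

In the doubled matrix, an alignment of residue i to residue (i + s) mod n appears at column i + s for every i, so a circularly permuted alignment becomes one unbroken diagonal that a sequential aligner can follow.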
2. Refinement to ensure transitivity
The structural alignment from the first step is then refined to produce a residue-level equivalence map between
subunits. Refinement produces a consistent multiple alignment between all identified subunits.
The order, k, of rotational symmetry present in the protein (if any) is determined by successively applying the seed
alignment until the original orientation is found.
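One way to picture the order determination is the sketch below. It treats the seed alignment as a Python dict mapping each aligned residue index to its partner, applies it repeatedly, and reports the first k for which most residues return close to their starting position. The names `apply_power` and `estimate_order`, and the `max_order`, `tolerance`, and `min_fraction` parameters, are assumptions made for this illustration and may differ from how CE-Symm decides the order.

```python
def apply_power(f, i, k):
    """Apply the partial map f to residue i k times; None if undefined."""
    for _ in range(k):
        if i is None:
            return None
        i = f.get(i)
    return i


def estimate_order(f, max_order=8, tolerance=1, min_fraction=0.8):
    """Return the smallest k >= 2 such that k applications of f bring
    at least min_fraction of residues back to within `tolerance`
    positions of where they started, or None if no such k is found."""
    residues = list(f)
    if not residues:
        return None
    for k in range(2, max_order + 1):
        returned = 0
        for i in residues:
            image = apply_power(f, i, k)
            if image is not None and abs(image - i) <= tolerance:
                returned += 1
        if returned / len(residues) >= min_fraction:
            return k
    return None
```

For a toy nine-residue chain in which residue i aligns to residue (i + 3) mod 9, three applications return every residue to itself, so the sketch reports order 3.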
Let f be a function over all residues in the protein, such that $f(i) = j$ when i is aligned to j. The goal is to modify f such
that k applications of f (i.e. rotations of the protein) give a trivial alignment. Formally, $\forall i:\ f^k(i) = i$. To constrain the
modifications, we introduce a penalty function $\sigma(i)$ which goes to zero when the previous condition is met. Two
such penalty functions were considered:
1. $\sigma(i) = |f^k(i) - i|$. This measures the number of insertions or deletions which would need to be
made in order to bring residue i into alignment.
2. $\sigma(i) = |d(f^{k-1}(i), f^k(i)) - d(i, f^{k-1}(i))|$, where $d(i,j)$ gives the distance between alpha carbons of residues i and j.
This minimizes the changes in RMSD required during refinement.
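The two penalty functions can be written out as a short Python sketch. It assumes the alignment f is a dict from residue index to aligned residue index and that `coords` maps each residue index to the (x, y, z) position of its alpha carbon. Returning `None` when the k-th image is undefined is a convention chosen here, not something the poster specifies, and the function names are hypothetical.

```python
import math


def apply_power(f, i, k):
    """Apply the partial map f to residue i k times; None if undefined."""
    for _ in range(k):
        if i is None:
            return None
        i = f.get(i)
    return i


def sigma_index(f, i, k):
    """Penalty 1: |f^k(i) - i|, the index offset after k applications."""
    image = apply_power(f, i, k)
    if image is None:
        return None
    return abs(image - i)


def sigma_distance(f, coords, i, k):
    """Penalty 2: |d(f^(k-1)(i), f^k(i)) - d(i, f^(k-1)(i))|."""
    prev = apply_power(f, i, k - 1)
    last = apply_power(f, i, k)
    if prev is None or last is None:
        return None
    return abs(math.dist(coords[prev], coords[last])
               - math.dist(coords[i], coords[prev]))
```

Both functions return zero whenever k applications of f map residue i back to itself, which is the condition the refinement tries to reach.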
The algorithm works by choosing the residue with minimal score and modifying the alignment such that $f^k(i) = i$. To
ensure that the alignment remains sequential and well-formed, the selection of residue to modify is limited by the
following “eligibility criteria”:
1. $f^{k-1}(i)$ is defined ($f^k(i)$ may be undefined)
2. $\sigma(i) > 0$
3. $\sigma(f^{k-1}(i)) > 0$
4. $\forall j \text{ s.t. } \sigma(j) = 0:\ \operatorname{sign}(f^{k-1}(i) - j) = \operatorname{sign}(i - f(j))$
Eligible residues are chosen in order of increasing score, and the alignment is modified to set $f^{k-1}(i) \leftarrow i$. This
process is repeated until no eligible residues remain, at which point remaining residues are removed from the
alignment.
This algorithm terminates in a multiple alignment between the symmetric subunits with exactly one residue per
subunit in each aligned column. The process can also be interleaved with structure-based refinement to iteratively
improve the alignment RMSD while preserving the multiple alignment property.
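The multiple alignment property can be made concrete with a small Python sketch that reads the columns off a refined map. It assumes f is a dict satisfying $f^k(i) = i$ for every aligned residue and that subunits are contiguous in sequence, so sorting a column by residue index orders it by subunit. The function name `alignment_columns` is hypothetical.

```python
def alignment_columns(f, k):
    """Group aligned residues into columns of exactly k residues.

    Each column is the cycle obtained by following f from a residue
    until it returns to itself. Raises ValueError if some residue does
    not lie on a cycle of length k, i.e. if f is not fully refined.
    """
    columns = []
    seen = set()
    for start in sorted(f):
        if start in seen:
            continue
        column = []
        current = start
        for _ in range(k):
            if current is None:
                break
            column.append(current)
            current = f.get(current)
        if current != start or len(set(column)) != k:
            raise ValueError(
                f"residue {start} does not lie on a cycle of length {k}")
        seen.update(column)
        columns.append(sorted(column))  # one residue per subunit
    return columns
```

For a toy nine-residue chain with f(i) = (i + 3) mod 9 and k = 3, the columns are [0, 3, 6], [1, 4, 7], and [2, 5, 8], one residue from each of the three subunits.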
Results
Symmetry detection
Percentage of SCOP superfamilies with internal symmetry, as detected by CE-Symm. Columns: SCOP class, number of superfamilies, % symmetric.
Refinement
Trypanosoma sialidase [SCOP domain d2agsa2], a six-bladed
beta propeller. The alignment shown corresponds to a 120°
rotation, permuting the structure by two blades.
Superposition of the structure with itself (a) prior to
refinement, and (b) after one iteration of refinement. A
number of extraneous loops not shared by all blades are
marked as unaligned by the refinement procedure.
(c) Multiple alignment of the three two-blade subunits
considered here.
(c)
SSRVE---LFKRKNSTVPFEESNGTIRERVVH---SFRIPT-IVNVD----GVMVAIADARYETSFDNSFIETAVKYSVDDGA
GKPVS---LKP--LFPAEFDGI------LTKE---FIGGVGAAIVASN---GNLVYPVQIADMG----GRVFTKIMYSEDDGN
WVEALGTLSHV--WTN------------SPTSNQQDCQSS--FVAVTIEGKRVMLFTHPLNLKGRW--MRDRLHLWMTD--NQ
TWNTQIAIKNSRASSVSRVMDATVIVKGNKLYILVGSFNKTRNSWTQHRDGSDWEPLLVVGE-----VTKSAANGKTTATISW
TWKFAEGRSKF------GCSEPAVLEWEGKLIINNRVD--------------GNRRLVYESS-----DMGKT-----------
RIFDVGQISIGDE----NSGYSSVLYKDDKLYSLHEINTND-----------VYSLVFVRLIGELQLM---------------
Poster first presented at the 21st Annual International Conference on Intelligent Systems for Molecular Biology (2013).
The RCSB PDB is supported by the National Science Foundation [NSF DBI 0829586]; National Institute of General Medical Sciences; Office
of Science, Department of Energy; National Library of Medicine; National Cancer Institute; National Institute of Neurological Disorders
and Stroke; and the National Institute of Diabetes & Digestive & Kidney Diseases. The RCSB PDB is a member of the wwPDB.
SCOP class      Number of superfamilies   % symmetric
α               503                       17.4%
β               354                       17.5%
α/β             244                       17.6%
α+β             549                       12.5%
multi-domain    66                        3.0%
membrane        108                       22.0%
All classes     1,832                     16.0%

ROC curves showing the performance of CE-Symm for
detecting symmetry, on a benchmark of 1000 randomly
selected and manually annotated SCOP superfamilies. Two
scoring functions were considered for classification power:
TM-Score,13 and an alternate score incorporating the
detection of symmetry order. The TM-Score classifier has an
AUC of 0.94.
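For readers unfamiliar with the metric, the sketch below shows one standard way to compute the area under a ROC curve from per-structure scores and manual annotations: the fraction of (symmetric, asymmetric) pairs in which the symmetric structure scores higher, with ties counted as one half. The function name is hypothetical, and the scores in the usage lines are made-up toy values, not benchmark data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: one classifier score per structure (higher = more symmetric).
    labels: True for structures annotated as symmetric, False otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one symmetric and one asymmetric example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy, made-up values for illustration only.
toy_scores = [0.9, 0.4, 0.8, 0.3]
toy_labels = [True, True, False, False]
print(roc_auc(toy_scores, toy_labels))  # 0.75
```

An AUC of 1.0 means every symmetric example outscores every asymmetric one, while 0.5 corresponds to random ranking.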
Abstract
The CE-Symm algorithm has been developed to detect internal symmetry within protein chains. Symmetry is
common across protein fold space and is tied to a number of important biological functions. Using CE-Symm we
find that 16% of SCOP superfamilies contain internal symmetry.
The algorithm can produce unambiguous multiple alignments between symmetric subunits. It can also be applied
to the output of other symmetry detection algorithms to refine alignments and identify conserved regions between
all subunits.
References
1. Lee, J. & Blaber, M. PNAS 108, 126–130 (2011).
2. Monod, J. et al. J Mol Biol 12, 88–118 (1965).
3. Juo, Z. S. et al. J Mol Biol 261, 239–254 (1996).
4. Goodsell, D. S. & Olson, A. J. Annu Rev Biophys Biomol Struct 29, 105–153 (2000).
5. Gosavi, S. et al. J Mol Biol 357, 986–996 (2006).
6. Fortenberry, C. et al. J Am Chem Soc 133, 18026–18029 (2011).
7. Murray, K. B. et al. J Mol Biol 316, 341–363 (2002).
8. Kim, C. et al. BMC Bioinformatics 11, 303 (2010).
9. Guerler, A. et al. J Chem Inf Model 49, 2147–2151 (2009).
10. Shindyalov, I. N. & Bourne, P. E. Protein Eng 11, 739–747 (1998).
11. Uliel, S. et al. Bioinformatics 15, 930–936 (1999).
12. Abraham, A.-L. et al. J Mol Biol 394, 522–534 (2009).
13. Zhang, Y. & Skolnick, J. Proteins 57, 702–710 (2004).