This document summarizes a study that analyzed three major evolutionary signals in protein sequences: conservation, specificity determining positions (SDPs), and coevolution between residues. These signals result from different evolutionary mechanisms and have been used by bioinformatics methods to predict functionally important sites. The study evaluated several methods for predicting conserved residues, SDPs, and coevolving positions using a dataset of 434 protein families. It found that the methods capture different information and identify different top-scoring residues. Proximity scores based on ET rv, conservation, and mutual information performed best at detecting catalytic residues; mutual information correlates weakly with the other measures, so it is the score most likely to improve predictions when combined with them. SDP prediction remains hard to benchmark because no public SDP database is available, and some conserved positions may mask SDPs that would be detected only if more sequences became available.
Signals of Evolution: Conservation, Specificity Determining Positions and Coevolution
Elin Teppa 1, Diego Zea 2, Morten Nielsen 1,3 and Cristina Marino Buslje 1
1 Structural Bioinformatics Unit, Leloir Institute Foundation
2 Structural Bioinformatics Group, National University of Quilmes
3 Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark
INTRODUCTION

Protein sequences evolve under several constraints, and each constraint leads to a specific pattern of conservation and variation. In this study we focus on three major evolutionary signals: conservation, specificity determining positions, and coevolution between residues. These signals result from different evolutionary mechanisms and have been used by different bioinformatics methods to predict functionally important sites.

Fully conserved positions in a Multiple Sequence Alignment (MSA) are interpreted as residues important for the structure and function of the protein. Early computational methods used this information to predict functionally important sites, including catalytic residues. More recent methods take additional factors into account to improve prediction performance.

Other positions show a more subtle pattern of conservation: they are conserved within a group of sequences (sub-family) but may differ in another group. Such positions, named Specificity-Determining Positions (SDPs), are responsible for protein specificity, e.g. ligand binding or protein-protein interaction. Proteins can be classified into groups according to different criteria, e.g. sequence identity, phylogeny, or functional similarity. SDPs are suggested to lie close to the catalytic residues, where they can define substrate specificity.

Coevolution between residues is another signal that can be extracted from MSAs. Coevolution is the result of compensatory mutations: coevolving residues have undergone concerted changes to overcome a common selection pressure. Owing to the limited amino acid diversity in the proximity of an active site, the catalytic residues carry a particular signature defined by a close-proximity network of residues with high mutual information.

In summary, in this study we consider different methods that attempt to capture information from three different evolutionary signals. They share the goal of predicting functionally important sites and are capable of detecting the catalytic residues or residues near them.

Disentangling the function of different positions in an alignment will allow us to create methods that exploit the different kinds of information an alignment contains. Such methods could be used for deeper study of any protein and would help produce better and more accurate annotations of proteins of unknown function.

RESULTS AND DISCUSSION

We calculated the Spearman rank correlation between methods to find out whether they capture different pieces of information (Fig. 1). As expected, the analysis shows a strong correlation between ET rv and conservation, because ET rv includes conservation information in its score. Surprisingly, the correlation between ET iv, SDPfox and XDET is lower than expected. For the first two methods, this can be understood by the strong dependence of the results on the sequence clustering method, which is phylogenetically based for ET iv and functionally based for SDPfox. XDET, on the other hand, predicts SDPs by comparing the mutational behavior of a position with the mutational trend of the family. Such an approach may detect important positions (those that behave like the family as a whole), but this alone is not enough evidence to assign them a role in determining the specificity of the enzyme.

We also analyzed the extent to which the top-scoring residues predicted by the different methods overlap. For each method we took the best N scores, where N equals 10% of the total length of the sequence. Figure 2 shows the average overlap over the 434 families. Except for ET rv, which overlaps with conservation and with ET iv, the methods differ in which residues they rank as most important for the family.

Figure 1: Heat map of the Spearman rank correlation between methods.

Figure 2: Average percentage of residues predicted in common between methods, considering the top 10% ranked positions.
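The two comparisons described above (Spearman rank correlation between methods, and overlap of the top 10% ranked positions) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, higher scores are assumed to mean "more important", and the poster does not say how ties or the rounding of N were handled, so average ranks and `round` are assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def top_positions(scores, fraction=0.10):
    """Indices of the best N scores, N = fraction of the sequence length."""
    n_top = max(1, round(len(scores) * fraction))  # rounding is an assumption
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_top])


def top_overlap_percent(scores_a, scores_b, fraction=0.10):
    """Percentage of top-ranked positions shared by two methods."""
    top_a = top_positions(scores_a, fraction)
    top_b = top_positions(scores_b, fraction)
    return 100.0 * len(top_a & top_b) / len(top_a)
```

Averaging `top_overlap_percent` over all families for each pair of methods would give a matrix comparable in spirit to Figure 2; computing `spearman` for each pair would give one comparable to Figure 1.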
As an example, Fig. 3 shows the highest scores for the Phosphofructokinase 1 family mapped onto the 3D structure of the reference protein.

Figure 3: Mapping of the predicted functionally important sites using six different prediction scores. Plotted is the cartoon representation of PDB: 1PFK. The top 10% prediction scores are represented in green. The catalytic residues are shown as red sticks, and the experimentally known SDPs are shown as blue sticks.

Method      Predictive performance
pC          0.83899
pMI         0.80342
pET rv      0.86774
pET iv      0.63360
pSDPfox     0.63602

Table 1: Predictive performance for detecting catalytic residues in terms of AUC value on the 434 Pfam entries.

We demonstrate that the methods capture different information and assign their highest scores to different residue positions. An exception is the ET rv score, which shows a strong correlation with conservation.

The pET rv, pCons and pMI scores perform well at detecting catalytic residues. However, only pMI could be combined with other scores to improve the prediction of catalytic residues, because it has a low correlation with the other measures.

A weakness of the SDP prediction methods is that some conserved positions could mask SDPs that would be detected if more sequences became available for the family.

There is no publicly available SDP database, which hinders the direct testing of methods for SDP prediction.

MATERIALS AND METHODS

The dataset was constructed from the Catalytic Site Atlas (CSA) database [1] and the Pfam database [2]. A total of 434 protein families, containing 1212 catalytic residues, were studied.

For each family one reference PDB entry was selected, and the MSAs were prepared by removing redundant sequences at the level of 62% identity and trimming deletions and insertions across the whole alignment so as to preserve the continuity of the reference sequence. In addition, all positions with >50% gaps, as well as sequences covering <50% of the reference sequence length, were removed.

Conservation: The Kullback-Leibler conservation score was used.

Mutual Information: Mutual Information (MI) was calculated as described in [3]. MI gives a value for each pair of residues in an MSA. We calculated a cumulative Mutual Information score (cMI) for each residue as the sum of the MI values above a certain threshold over every amino acid pair in which that residue appears.

Evolutionary Tracing: The ET method identifies invariant and group-specific residues by partitioning the phylogenetic tree into subgroups of similar sequences [4]. The ET iv score represents conservation within groups in a qualitative way and predicts SDPs, whereas the ET rv score incorporates entropy as a quantitative measure of conservation, ranking positions by their relative importance.

SDPfox: This method predicts SDPs in a phylogeny-independent manner. It first identifies specificity groups by assigning each protein to a group, iterating until convergence. This classification allows the prediction of SDPs that end up separated on a phylogenetic tree [5].

XDET: This method implements the mutational behaviour algorithm, which compares the mutational behaviour of a position with the mutational behaviour of the whole alignment. The principle is that positions showing a family-dependent conservation pattern have a mutational behaviour similar to that of the whole family [6].

The proximity score for each method was calculated as the sum of the scores of the residues within a distance ≤ 6 Å of the given amino acid in the 3D structure. The predictive performance of the proximity scores in detecting catalytic residues was evaluated per family in terms of the area under the ROC curve (AUC).
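The cumulative MI score, the proximity score and the per-family AUC described in this section can be sketched as follows. This is a hedged illustration rather than the authors' implementation: all function names are hypothetical, the MI threshold value is not given in the poster, one coordinate per residue is assumed (the poster does not say which atoms define the 6 Å distance), and whether a residue's own score enters its proximity sum is not stated, so it is exposed as the `include_self` parameter.

```python
from math import dist


def cumulative_mi(mi, threshold):
    """cMI per residue: sum of MI values above `threshold` over all pairs
    in which the residue appears. `mi` is a symmetric square matrix."""
    n = len(mi)
    return [
        sum(mi[i][j] for j in range(n) if j != i and mi[i][j] > threshold)
        for i in range(n)
    ]


def proximity_scores(scores, coords, cutoff=6.0, include_self=False):
    """Sum the scores of residues within `cutoff` angstroms of each residue.

    `coords` holds one (x, y, z) point per residue (an assumption).
    """
    n = len(scores)
    result = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j == i and not include_self:
                continue
            if dist(coords[i], coords[j]) <= cutoff:
                total += scores[j]
        result.append(total)
    return result


def roc_auc(scores, is_catalytic):
    """Area under the ROC curve, computed as the probability that a
    catalytic residue outscores a non-catalytic one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, is_catalytic) if c]
    neg = [s for s, c in zip(scores, is_catalytic) if not c]
    if not pos or not neg:
        raise ValueError("need at least one residue of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy family with three residues; all values are made up for illustration.
mi = [[0.0, 1.0, 0.2],
      [1.0, 0.0, 0.7],
      [0.2, 0.7, 0.0]]
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
cmi = cumulative_mi(mi, threshold=0.5)
pmi = proximity_scores(cmi, coords)
auc = roc_auc(pmi, [True, False, False])
```

The same `proximity_scores` call could be applied to any per-residue score (conservation, ET rv, and so on), which is how a pC or pET rv value would be obtained under these assumptions.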
The SDP prediction methods, despite their different approaches, share the use of conserved amino acids as indicators of likely functional significance. In this context, coevolution is less representative of the global evolution of a whole family or subfamily; it therefore provides information on specific events that required a common adaptation of two or more residues, and it can be detected even in phylogenetically divergent families.

REFERENCES

1 Porter, C.T., et al., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl. Acids Res., 2004. 32(suppl_1): p. D129-133.
2 Finn, R.D., et al., The Pfam protein families database. Nucl. Acids Res., 2008. 36(suppl_1): p. D281-288.
3 Buslje, C.M., et al., Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, 2009. 25(9): p. 1125-1131.
4 Lichtarge, O., et al., A family of Evolution-Entropy Hybrid Methods for ranking protein residues by importance. J. Mol. Biol., 2004. 336: p. 1265-82.
5 Kalinina, O.V., et al., An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies. AMB, 2010: p. 5-29.
6 Del Sol, A., et al., Automatic Methods for Predicting Functionally Important Residues. J. Mol. Biol., 2003. 326(4): p. 1289-1302.