STRING - Prediction of a functional association network for the yeast mitochondrial system
1. STRING
Prediction of a functional association network
for the yeast mitochondrial system
Lars Juhl Jensen
EMBL Heidelberg
2. Overview
• Prediction of functional associations between proteins
– What is STRING?
– Genomic context methods
– Integration of large-scale experimental data
– Combination and cross-species transfer of evidence
• (Coffee break)
• The yeast mitochondrial system
– Prediction of mitochondrial proteins
– A functional association network for mitochondria
– Mapping and correlating features of mitochondrial proteins
3. Part 1
Prediction of functional association between proteins
4. What is STRING?
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Exp. interaction data
Microarray expression data
Literature co-mentioning
5. Let the data speak for themselves ...
• Classification schemes are obviously difficult to predict if they are not
supported by the data – there are no obvious features separating:
– Presidents vs. non-presidents
– Actors vs. non-actors
• Unsupervised methods may discover a more meaningful classification:
– Holding your pinky to your mouth is a clear sign of evil
– Wearing a bowtie is a sign of good
– So is consumption of alcoholic drinks
8. Score calibration against a common reference
• Many diverse types of evidence
– The quality of each is judged by
very different raw scores
– Quality differences exist among
data sets of the same type
• Solved by calibrating all scores
against a common reference
– Scores are directly comparable
– Probabilistic scores allow
evidence to be combined
• Requirements for the reference
– Must represent a compromise
of all the types of evidence
– Broad species coverage
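A minimal sketch of why calibrated, probabilistic scores are easy to combine. It assumes the evidence channels are independent, which is an assumption of this sketch and not a formula stated on the slide; the function name `combine_scores` is hypothetical.

```python
def combine_scores(scores):
    """Combine independent probabilities that an association is real.

    Assumes independence between evidence channels: the association is
    unsupported only if every single channel fails to support it.
    """
    p_none = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
        p_none *= 1.0 - p
    return 1.0 - p_none


# Three hypothetical channels, e.g. neighborhood, co-expression, text mining.
combined = combine_scores([0.5, 0.5, 0.2])
```

Because every channel is expressed on the same probability scale, weak evidence from several sources can add up to a confident prediction, which raw scores on different scales could not do.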
9. Integrating physical interaction screens
Make binary
representation
of complexes
Yeast two-hybrid
data sets are
inherently binary
Calculate score
from number of
(co-)occurrences
Calculate score
from non-shared
partners
Calibrate against KEGG maps
Infer associations in other species
Combine evidence from experiments
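The first two steps for complex purification data could look roughly like the sketch below. Each purification is expanded into all protein pairs it contains, and a raw score is computed from occurrence and co-occurrence counts. The Jaccard index used as the raw score is chosen only for illustration and is not taken from the slides; the protein names and the function name `cooccurrence_scores` are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(purifications):
    """Turn complex purifications into scored protein pairs.

    Every pair of proteins seen in the same purification counts as one
    co-occurrence. The raw score is a Jaccard index, an illustrative
    choice that would still need calibration against a reference.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for complex_members in purifications:
        members = sorted(set(complex_members))
        occurrences.update(members)
        co_occurrences.update(combinations(members, 2))
    scores = {}
    for (a, b), n_ab in co_occurrences.items():
        scores[(a, b)] = n_ab / (occurrences[a] + occurrences[b] - n_ab)
    return scores


# Made-up purifications with placeholder protein names.
purifications = [["A", "B", "C"], ["A", "B"], ["B", "D"]]
scores = cooccurrence_scores(purifications)
```

Pairs that are repeatedly purified together, such as A and B here, receive a higher raw score than pairs seen together once.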
10. Mining microarray expression databases
Re-normalize arrays
by modern method
to remove biases
Build
expression
matrix
Combine
similar arrays
by PCA
Construct predictor
by Gaussian kernel
density estimation
Calibrate
against
KEGG maps
Infer
associations in
other species
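The predictor step can be illustrated as follows. This is a sketch under stated assumptions, not the ArrayProspector implementation: it estimates the density of correlation coefficients for gene pairs on the same reference map and for pairs on different maps with Gaussian kernels, then applies Bayes' rule. The bandwidth, the prior, and all numbers are invented for the example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def make_predictor(pos_corr, neg_corr, prior, bandwidth=0.1):
    """Estimate P(same pathway | correlation) with Bayes' rule."""
    f_pos = gaussian_kde(pos_corr, bandwidth)
    f_neg = gaussian_kde(neg_corr, bandwidth)

    def predict(r):
        num = prior * f_pos(r)
        den = num + (1.0 - prior) * f_neg(r)
        # Fall back to the prior where both densities underflow to zero.
        return num / den if den > 0.0 else prior

    return predict


# Made-up correlation coefficients, for illustration only.
same_map = [0.9, 0.8, 0.7, 0.6]
different_maps = [0.1, 0.0, -0.1, 0.2, -0.2]
predict = make_predictor(same_map, different_maps, prior=0.1)
```

Calling `predict(r)` for a new gene pair with correlation `r` then yields a score that is already on the probability scale of the reference.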
11. Evidence transfer based on “fuzzy orthology”
(Figure labels: source species, target species)
• Orthology transfer is tricky
– Correct assignment of orthology
is difficult for distant species
– Functional equivalence cannot
be guaranteed for in-paralogs
• These problems are addressed
by our “fuzzy orthology” scheme
– Confidence scores for functional
equivalence are calculated from
all-against-all alignment
– Evidence is distributed across
possible pairs according to
confidence scores in the case of
many-to-many relationships
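One way to picture the distribution of evidence is sketched below. It assumes that each candidate ortholog carries a confidence between 0 and 1 for being functionally equivalent to the source protein, and that the transferred score is the source score scaled by both confidences. The exact formula is not given on the slide, so this weighting, the gene names, and the function name `transfer_evidence` are all assumptions.

```python
def transfer_evidence(score, orthologs_a, orthologs_b):
    """Spread one source-species score over candidate target pairs.

    orthologs_a and orthologs_b map target-species genes to a confidence
    (0 to 1) of functional equivalence with source proteins A and B.
    """
    transferred = {}
    for gene_a, conf_a in orthologs_a.items():
        for gene_b, conf_b in orthologs_b.items():
            if gene_a != gene_b:  # skip self-pairs
                transferred[(gene_a, gene_b)] = score * conf_a * conf_b
    return transferred


# Placeholder gene names and confidences, for illustration only.
result = transfer_evidence(
    0.9,
    {"a1": 0.6, "a2": 0.4},  # two in-paralogs share the confidence for A
    {"b1": 1.0},             # a clear one-to-one ortholog for B
)
```

In this example neither in-paralog inherits the full score of 0.9, which reflects the uncertainty about which of them kept the original function.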
13. Predicting and defining metabolic pathways
and other functional modules
Defined manually: cutting metabolic maps into pathways
Defined objectively: standard clustering of genome-scale data
(Figure labels: metabolism overview, purine biosynthesis, histidine biosynthesis)
Image: Molecular Biology of the Cell, 3rd edition
14. Part 2
The yeast mitochondrial system
15. Yeast mitochondria – why it should work well
• Because it is metabolism
– STRING was developed using KEGG pathways as a reference
– This may have caused STRING to function best on metabolism
• Because it is yeast
– By far the best covered organism in terms of physical interactions
– Many microarray gene expression studies
– Literature mining works well due to standardization of gene names
• Because it is prokaryotic
– Evolutionarily, mitochondria are of bacterial origin
– The genomic context methods in STRING are very powerful, but
can only provide evidence for proteins with prokaryotic orthologs
16. Strategy for extracting a functional association
network of the mitochondrial system
• Starting point:
– Reference set of proteins known to be mitochondrial
– A large, diverse set of experiments relevant for predicting
mitochondrial proteins
– The global STRING network for yeast
• Predict mitochondrial candidate genes
– Use reference set to train neural networks for predicting candidate
genes based on experimental data
– Use very high-confidence STRING links to suggest additional
candidates based on interactions with reference and candidate genes
• Extract network that includes lower confidence interactions
and identify functional modules by clustering
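The link-based candidate step might be sketched as below. It assumes that a gene becomes a candidate as soon as it has one link above a cutoff to a reference or already accepted candidate gene, and that this is repeated until nothing changes. The cutoff of 0.9, the repeated expansion, and the function name `expand_candidates` are assumptions of this sketch and are not stated on the slide.

```python
def expand_candidates(seed_genes, links, min_score=0.9):
    """Grow a gene set along links scoring at least min_score.

    links maps gene pairs to confidence scores. Returns only the newly
    added genes, not the seeds themselves.
    """
    selected = set(seed_genes)
    changed = True
    while changed:
        changed = False
        for (a, b), score in links.items():
            if score < min_score:
                continue
            # Exactly one end is already selected: pull in the other end.
            if (a in selected) != (b in selected):
                selected.update((a, b))
                changed = True
    return selected - set(seed_genes)


# Placeholder gene names and scores, for illustration only.
links = {("r1", "x"): 0.95, ("x", "y"): 0.92, ("r1", "z"): 0.5}
new_candidates = expand_candidates({"r1"}, links)
```

A strict cutoff matters here because every accepted candidate can in turn recruit further genes.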
17. Predicting mitochondrial proteins
• Training was done with
5-fold cross validation
– Reference set used as
positive examples
– All other genes used
as negative examples
• Top 800 contains more
than 90% of known
mitochondrial genes
• Surprising performance
of the linear model
– As good as NN with
250 hidden neurons
– Better than MitoP2
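The two evaluation ideas on this slide, cross-validation folds and the fraction of known genes recovered in a top-ranked list, can be sketched as follows. The helper names and the toy gene list are hypothetical; the actual models and data are not reproduced here.

```python
def k_fold_indices(n_items, k=5):
    """Split item indices 0..n_items-1 into k nearly equal test folds."""
    return [list(range(i, n_items, k)) for i in range(k)]


def recall_at_top(ranked_genes, reference_set, top_n):
    """Fraction of reference genes found among the top_n ranked genes."""
    hits = sum(1 for gene in ranked_genes[:top_n] if gene in reference_set)
    return hits / len(reference_set)


# Toy data: each fold is held out once while training on the other four.
folds = k_fold_indices(10, k=5)
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
recall = recall_at_top(ranked, {"g1", "g3", "g6"}, top_n=4)
```

The statement about the top 800 genes corresponds to `recall_at_top` with `top_n=800` on the cross-validated ranking.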
22. Composition and interconnectivity of clusters
• A network of clusters
– Most probable path
between clusters used
as score
• Interacting clusters are
preferentially within the
same compartment
• Proteobacterial clusters
typically localize to the
mitochondria
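A most probable path score can be computed with a standard shortest-path search, as in the sketch below. It assumes that edge scores are probabilities and that a path score is the product of its edge scores, so maximising the product is the same as minimising the sum of negative logarithms. The cluster names and scores are invented, and the function name is hypothetical.

```python
import heapq
import math


def most_probable_path_score(edges, source, target):
    """Highest product of edge probabilities on any path source -> target.

    Uses Dijkstra's algorithm on -log(p) edge costs. Edges are treated
    as undirected. Returns 0.0 if no path exists.
    """
    graph = {}
    for (u, v), p in edges.items():
        if p <= 0.0:
            continue  # zero-probability edges cannot be on the best path
        cost = -math.log(p)
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return 0.0


# Invented cluster names and link probabilities.
edges = {("c1", "c2"): 0.9, ("c2", "c3"): 0.8, ("c1", "c3"): 0.5}
score = most_probable_path_score(edges, "c1", "c3")
```

In the toy network the indirect route through c2 wins over the direct link, since 0.9 × 0.8 exceeds 0.5.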
23. Correlations among gene features
• Expression data agree
with data on NF specific
growth defects
• Genes with detectable
human orthologs are
more conserved among
yeasts
• Disease orthologs are
often proteobacterial
• Knockouts of disease
orthologs cause less
severe growth defects
24. Can human disease genes be predicted?
• Mitochondrial genes are already enriched in disease genes
• Previous slide showed that mitochondrial genes of proteobacterial origin
are further enriched in disease gene orthologs
• Disease gene orthologs show less severe growth defects than other
mitochondrial genes with human orthologs
25. Getting more specific – generally speaking
• Benchmarking against
one common reference
allows integration of
heterogeneous data
• The different types of
data do not all tell us
about the same kind of
functional associations
• It should be possible to
assign likely interaction
types from supporting
evidence types
Dynamic complex formation during the yeast cell cycle
Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork
to appear in Science
• An accurate model of the
yeast mitotic cell cycle
• Approach
– High confidence set of
physical interactions
– Custom analysis of cell
cycle expression data
• Observations
– Dynamic assembly of
cell cycle complexes
– Temporal regulation of
Cdk specificity
26. Conclusions
• Genomic context methods are able to infer the function of
many prokaryotic proteins from genome sequences alone
• New genomic context methods are still being developed
• Integration of large-scale experimental data allows similar
predictions to be made for eukaryotic proteins
• Successful data integration requires benchmarking and
cross-species transfer of information
• Protein networks are useful for the analysis of large,
complex biological systems
27. Acknowledgments
• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork
• New genomic context methods
– Jan Korbel
– Christian von Mering
– Peer Bork
• ArrayProspector web service
– Julien Lagarde
– Chris Workman
• NetView visualization tool
– Sean Hooper
• Study of yeast mitochondria
– Fabiana Perocchi
– Lars Steinmetz
• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak
• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms