STRING - Prediction of a functional association network for the yeast mitochondrial system

STRING
Prediction of a functional association network
for the yeast mitochondrial system
Lars Juhl Jensen
EMBL Heidelberg

Overview
• Prediction of functional associations between proteins
– What is STRING?
– Genomic context methods
– Integration of large-scale experimental data
– Combination and cross-species transfer of evidence
• (Coffee break)
• The yeast mitochondrial system
– Prediction of mitochondrial proteins
– A functional association network for mitochondria
– Mapping and correlating features of mitochondrial proteins

Part 1
Prediction of functional association between proteins
Lars Juhl Jensen
EMBL Heidelberg

What is STRING?
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Exp. interaction data
Microarray expression data
Literature co-mentioning

Let the data speak for themselves ...
• Classification schemes are obviously difficult to predict if they are not
supported by the data – there are no obvious features separating:
– Presidents vs. non-presidents
– Actors vs. non-actors
• Unsupervised methods may discover a more meaningful classification:
– Holding your pinky to your mouth is a clear sign of evil
– Wearing a bowtie is a sign of good
– So is consumption of alcoholic drinks

Inferring functional modules from
gene presence/absence patterns
[Figure: the “Cellulosome”. Labels: cell, cell wall, anchoring proteins, cellulosomes, cellulose, resting protuberances, protracted protuberance. © Trends Microbiol, 1999]
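A minimal sketch of the presence/absence idea, assuming each gene is described by the set of species in which it occurs. The gene names, species labels, the cutoff, and the choice of Jaccard similarity are illustrative assumptions, not the scoring that STRING itself uses.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two presence/absence profiles given as sets of species."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def profile_pairs(profiles, min_similarity=0.75):
    """Return gene pairs whose species profiles overlap strongly."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        similarity = jaccard(profiles[g1], profiles[g2])
        if similarity >= min_similarity:
            pairs.append((g1, g2, similarity))
    return pairs

# Hypothetical profiles: the species in which each gene is present.
profiles = {
    "geneA": {"sp1", "sp2", "sp4", "sp5"},
    "geneB": {"sp1", "sp2", "sp4", "sp5"},
    "geneC": {"sp1", "sp3"},
}
linked = profile_pairs(profiles)
```

Genes that are gained and lost together across genomes end up paired; genes with unrelated profiles do not.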

Genomic context methods
© Nature Biotechnology, 2004

Score calibration against a common reference
• Many diverse types of evidence
– The quality of each is judged by
very different raw scores
– Quality differences exist among
data sets of the same type
• Solved by calibrating all scores
against a common reference
– Scores are directly comparable
– Probabilistic scores allow
evidence to be combined
• Requirements for the reference
– Must represent a compromise
of all the types of evidence
– Broad species coverage
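The calibration idea above can be sketched as follows, under stated assumptions: raw scores are binned, and each bin is assigned the fraction of its protein pairs that share a pathway in the reference. The equal-size binning, the function names, and the independence-based `combine` rule are illustrative simplifications, not the exact procedure behind STRING.

```python
def calibrate(scored_pairs, same_pathway, n_bins=4):
    """Map raw scores to the fraction of reference-supported pairs per bin.

    scored_pairs: list of (pair, raw_score).
    same_pathway: set of pairs that share a pathway in the reference.
    Returns a list of (lowest raw score in bin, calibrated probability).
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    size = max(1, len(ranked) // n_bins)
    table = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        hits = sum(1 for pair, _ in chunk if pair in same_pathway)
        table.append((chunk[0][1], hits / len(chunk)))
    return table

def lookup(table, raw_score):
    """Calibrated probability for a raw score (0.0 below the lowest bin)."""
    probability = 0.0
    for lower, p in table:
        if raw_score >= lower:
            probability = p
    return probability

def combine(probabilities):
    """Combine calibrated scores, assuming independence: 1 - prod(1 - p)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining

# Hypothetical data: eight scored pairs, three of them in the reference.
scored_pairs = [(("g%d" % i, "h%d" % i), float(i)) for i in range(1, 9)]
same_pathway = {("g5", "h5"), ("g7", "h7"), ("g8", "h8")}
table = calibrate(scored_pairs, same_pathway)
```

Once every evidence type is expressed on this common probability scale, scores from different sources can be compared and combined directly.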

Integrating physical interaction screens
• Make binary representation of complexes
• Yeast two-hybrid data sets are inherently binary
• Calculate score from number of (co-)occurrences
• Calculate score from non-shared partners
• Calibrate against KEGG maps
• Infer associations in other species
• Combine evidence from experiments
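A sketch of the two raw-scoring steps, under assumptions: complexes are expanded into all member pairs and scored by how often the two proteins are purified together, while binary interactions are penalized for each partner the two proteins do not share. Both formulas and all protein names are illustrative choices, not necessarily those used in STRING.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(purifications):
    """Binary representation of complexes, scored by co-occurrence."""
    seen = Counter()
    together = Counter()
    for members in purifications:
        unique = sorted(set(members))
        seen.update(unique)
        together.update(combinations(unique, 2))
    # Fraction of purifications containing either protein that contain both.
    return {
        pair: n / (seen[pair[0]] + seen[pair[1]] - n)
        for pair, n in together.items()
    }

def non_shared_partner_scores(interactions):
    """Score binary interactions; fewer non-shared partners score higher."""
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    scores = {}
    for a, b in interactions:
        only_a = partners[a] - partners[b] - {b}
        only_b = partners[b] - partners[a] - {a}
        scores[(a, b)] = -math.log((len(only_a) + 1) * (len(only_b) + 1))
    return scores

# Hypothetical purifications and two-hybrid interactions.
complex_scores = cooccurrence_scores([["P1", "P2", "P3"], ["P1", "P2"], ["P3", "P4"]])
y2h_scores = non_shared_partner_scores([("A", "B"), ("A", "C"), ("A", "D")])
```

The raw scores produced here are on different scales, which is why the next step calibrates both against the same reference.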

Mining microarray expression databases
• Re-normalize arrays by modern method to remove biases
• Build expression matrix
• Combine similar arrays by PCA
• Construct predictor by Gaussian kernel density estimation
• Calibrate against KEGG maps
• Infer associations in other species
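A minimal sketch of the kernel density step, assuming Pearson correlation of expression profiles as the raw score and two reference samples of correlations: pairs in the same pathway and pairs in different pathways. The bandwidth, the prior, and the sample values are hypothetical, and the re-normalization and PCA steps are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation of two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def gaussian_kde(samples, bandwidth):
    """Return a density estimate built from one Gaussian kernel per sample."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

def association_probability(r, positives, negatives, prior, bandwidth=0.1):
    """Posterior probability of a shared pathway given correlation r."""
    f_pos = gaussian_kde(positives, bandwidth)(r)
    f_neg = gaussian_kde(negatives, bandwidth)(r)
    return prior * f_pos / (prior * f_pos + (1 - prior) * f_neg)

# Hypothetical reference correlations.
positives = [0.8, 0.9, 0.7]
negatives = [0.0, 0.1, -0.1, 0.2]
p_high = association_probability(0.85, positives, negatives, prior=0.1)
p_low = association_probability(0.05, positives, negatives, prior=0.1)
```

Comparing the two densities turns a correlation coefficient into a probability-like score without assuming any particular functional form for the relationship.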

Evidence transfer based on “fuzzy orthology”
[Figure: evidence transfer from source species to target species]
• Orthology transfer is tricky
– Correct assignment of orthology
is difficult for distant species
– Functional equivalence cannot
be guaranteed for in-paralogs
• These problems are addressed
by our “fuzzy orthology” scheme
– Confidence scores for functional
equivalence are calculated from
all-against-all alignment
– Evidence is distributed across
possible pairs according to
confidence scores in the case of
many-to-many relationships
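A sketch of how evidence might be distributed across many-to-many relationships, assuming each target-species protein already has a confidence score (between 0 and 1) for being functionally equivalent to the source protein. The normalization rule and all names are assumptions made for illustration; the slide does not give the formula.

```python
def transfer_evidence(source_score, orthologs_a, orthologs_b):
    """Distribute one source-species association over target-species pairs.

    orthologs_a / orthologs_b map target proteins to the confidence that they
    are functionally equivalent to source proteins A and B. Confidences are
    scaled down when they sum to more than 1, so that several candidate
    orthologs share the evidence instead of each receiving all of it.
    """
    scale_a = max(1.0, sum(orthologs_a.values()))
    scale_b = max(1.0, sum(orthologs_b.values()))
    transferred = {}
    for target_a, conf_a in orthologs_a.items():
        for target_b, conf_b in orthologs_b.items():
            if target_a == target_b:
                continue  # no self-associations
            share = (conf_a / scale_a) * (conf_b / scale_b)
            transferred[(target_a, target_b)] = source_score * share
    return transferred

# Hypothetical case: A has one likely ortholog, B has two in-paralogs.
transferred = transfer_evidence(0.8, {"T1": 0.9}, {"T2": 0.6, "T3": 0.6})
```

In this example the two in-paralogs each receive half of the evidence, further discounted by the uncertainty of the orthology assignment for the first protein.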

Multiple evidence types from several species

Predicting and defining metabolic pathways
and other functional modules
• Defined manually: cutting metabolic maps into pathways
• Defined objectively: standard clustering of genome-scale data
[Figure: metabolism overview. Labels: purine biosynthesis, histidine biosynthesis. Image: Molecular Biology of the Cell, 3rd edition]
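As one possible reading of "standard clustering", the sketch below groups proteins into modules by single-linkage clustering at a fixed score cutoff, that is, connected components of the network after weak links are dropped. The clustering method, the cutoff, and the protein names are assumptions; the slide does not say which algorithm was used.

```python
def cluster_network(edges, cutoff):
    """Connected components of the links scoring at least `cutoff`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in edges:
        root_a, root_b = find(a), find(b)
        if score >= cutoff and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical association scores between five proteins.
edges = [
    ("p1", "p2", 0.9),
    ("p2", "p3", 0.8),
    ("h1", "h2", 0.95),
    ("p3", "h1", 0.2),
]
modules = cluster_network(edges, cutoff=0.7)
```

The weak link between the two groups is ignored, so the data alone separate them into two modules without any manually drawn pathway boundary.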

Part 2
The yeast mitochondrial system
Lars Juhl Jensen
EMBL Heidelberg

Yeast mitochondria – why it should work well
• Because it is metabolism
– STRING was developed using KEGG pathways as a reference
– This may have caused STRING to function best on metabolism
• Because it is yeast
– By far the best covered organism in terms of physical interactions
– Many microarray gene expression studies
– Literature mining works well due to standardization of gene names
• Because it is prokaryotic
– Evolutionarily, mitochondria are of bacterial origin
– The genomic context methods in STRING are very powerful, but
can only provide evidence for proteins with prokaryotic orthologs

Strategy for extracting a functional association
network of the mitochondrial system
• Starting point:
– Reference set of proteins known to be mitochondrial
– A large, diverse set of experiments relevant for predicting
mitochondrial proteins
– The global STRING network for yeast
• Predict mitochondrial candidate genes
– Use reference set to train neural networks for predicting candidate
genes based on experimental data
– Use very high-confidence STRING links to suggest additional
candidates based on interactions with reference and candidate genes
• Extract network that includes lower confidence interactions
and identify functional modules by clustering
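The last two steps of this strategy can be sketched as below, assuming the network is a list of scored links. The two cutoffs (0.9 for suggesting candidates, 0.4 for the extracted network), the single-pass expansion, and the gene names are hypothetical choices for illustration only.

```python
def suggest_candidates(known, links, cutoff=0.9):
    """Genes outside `known` joined to it by a very high-confidence link."""
    suggested = set()
    for a, b, score in links:
        if score < cutoff:
            continue
        if a in known and b not in known:
            suggested.add(b)
        elif b in known and a not in known:
            suggested.add(a)
    return suggested

def extract_network(genes, links, cutoff=0.4):
    """Links among the selected genes, now including lower-confidence ones."""
    return [
        (a, b, score)
        for a, b, score in links
        if a in genes and b in genes and score >= cutoff
    ]

# Hypothetical reference genes (M*) and other genes (X*).
known = {"M1", "M2"}
links = [
    ("M1", "X1", 0.95),
    ("X1", "X2", 0.99),
    ("M2", "X3", 0.5),
    ("X4", "M2", 0.92),
    ("M1", "M2", 0.45),
]
suggested = suggest_candidates(known, links)
network = extract_network(known | suggested, links)
```

A strict cutoff keeps the candidate list conservative, while the looser cutoff inside the final gene set retains enough links for clustering.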

Predicting mitochondrial proteins
• Training was done with
5-fold cross validation
– Reference set used as
positive examples
– All other genes used
as negative examples
• Top 800 contains more
than 90% of known
mitochondrial genes
• Surprising performance
of the linear model
– As good as NN with
250 hidden neurons
– Better than MitoP2
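A sketch of the evaluation scheme, assuming a feature vector per gene and a 0/1 label for membership in the reference set. The mean-difference linear model is a deliberately simple stand-in, not the linear model or neural network from the study, and the toy features are invented.

```python
def fit_linear(features, labels):
    """Weights = mean of positives minus mean of negatives, per feature."""
    dim = len(features[0])
    pos = [f for f, y in zip(features, labels) if y]
    neg = [f for f, y in zip(features, labels) if not y]
    return [
        sum(f[i] for f in pos) / len(pos) - sum(f[i] for f in neg) / len(neg)
        for i in range(dim)
    ]

def cross_validated_scores(features, labels, folds=5):
    """Score every gene with a model trained on the other folds."""
    scores = [0.0] * len(features)
    for k in range(folds):
        test = set(range(k, len(features), folds))
        train = [i for i in range(len(features)) if i not in test]
        weights = fit_linear([features[i] for i in train], [labels[i] for i in train])
        for i in test:
            scores[i] = sum(w * x for w, x in zip(weights, features[i]))
    return scores

def recall_in_top(scores, labels, n):
    """Fraction of known positives found among the n best-scoring genes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:n] if labels[i]) / sum(labels)

# Invented data: ten genes, two experimental features, five known positives.
features = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.2], [0.1, 0.4], [0.7, 0.3],
            [0.3, 0.2], [0.9, 0.4], [0.2, 0.1], [0.8, 0.1], [0.1, 0.3]]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = cross_validated_scores(features, labels)
top5_recall = recall_in_top(scores, labels, 5)
```

Because every gene is scored by a model that never saw it during training, the recall in the top of the ranked list estimates how well the predictor generalizes.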

[Figure: clustered functional association network of the yeast mitochondrial system.
Cluster labels: TOM; TIM; MRPL; MRPS; Ribosome Related; Cytosolic Ribosome; Vacuolar Acidification; Fatty Acid Biosynth.; Secondary; RCCI; RCCII; RCCIII; RCCIV; RCCV; RCC_Asy (several clusters); HAP Complex; Arg Biosynth.; Leu/Val/Ile Biosynth.; PDH/KGD/GCV; TCA Cycle; Cell Wall & pH Reg.; DNA Repair; Replication/DNA Repair; Glucose sensing and CH remodelling; APC; Fission/Fusion; rRNA Processing; mRNA Processing; tRNA Splicing; TFIIIC Complex; m-AAA Complex; GARP Complex; Iron Homeostasis/Chaperone Activity; Actin; NUP]

[Figure: the same cluster network, annotated with protobacterial orthologs]

[Figure: the same cluster network, annotated with human disease orthologs]

Composition and interconnectivity of clusters
• A network of clusters
– Most probable path
between clusters used
as score
• Interacting clusters are
preferentially within the
same compartment
• Protobacterial clusters
typically localize to the
mitochondria
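A sketch of a most-probable-path score between two clusters, assuming link scores are probabilities and that the probability of a path is the product of its link scores. Whether paths may pass through proteins outside the two clusters is not stated on the slide; this version allows it, and all names are hypothetical.

```python
import heapq

def most_probable_path(links, sources, targets):
    """Highest product of link probabilities on any path from sources to targets."""
    graph = {}
    for a, b, p in links:
        graph.setdefault(a, []).append((b, p))
        graph.setdefault(b, []).append((a, p))
    best = {s: 1.0 for s in sources}
    heap = [(-1.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        neg_prob, node = heapq.heappop(heap)
        prob = -neg_prob
        if prob < best.get(node, 0.0):
            continue  # stale queue entry
        if node in targets:
            return prob
        for neighbour, p in graph.get(node, []):
            candidate = prob * p
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return 0.0  # the clusters are not connected

# Hypothetical clusters A and B joined by a direct and an indirect route.
links = [("a1", "x", 0.9), ("x", "b1", 0.8), ("a2", "b1", 0.5), ("b1", "b2", 0.99)]
cluster_score = most_probable_path(links, {"a1", "a2"}, {"b1", "b2"})
```

This is Dijkstra's algorithm with products in place of sums, which is valid because multiplying by a probability never increases the score of a path.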

Correlations among gene features
• Expression data agree
with data on growth defects
specific to non-fermentable
(NF) substrates
• Genes with detectable
human orthologs are
more conserved among
yeasts
• Disease orthologs are
often protobacterial
• Knockout of disease
orthologs causes less
severe growth defects

Can human disease genes be predicted?
• Mitochondrial genes are already enriched in disease genes
• Previous slide showed that mitochondrial genes of protobacterial origin
are further enriched in disease gene orthologs
• Disease gene orthologs show less growth defect than other
mitochondrial genes with human orthologs

Getting more specific – generally speaking
• Benchmarking against
one common reference
allows integration of
heterogeneous data
• The different types of
data do not all tell us
about the same kind of
functional associations
• It should be possible to
assign likely interaction
types from supporting
evidence types
Dynamic complex formation during the yeast cell cycle
Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork
to appear in Science
• An accurate model of the
yeast mitotic cell cycle
• Approach
– High confidence set of
physical interactions
– Custom analysis of cell
cycle expression data
• Observations
– Dynamic assembly of
cell cycle complexes
– Temporal regulation of
Cdk specificity

Conclusions
• Genomic context methods are able to infer the function of
many prokaryotic proteins from genome sequences alone
• New genomic context methods are still being developed
• Integration of large-scale experimental data allows similar
predictions to be made for eukaryotic proteins
• Successful data integration requires benchmarking and
cross-species transfer of information
• Protein networks are useful for the analysis of large,
complex biological systems

Acknowledgments
• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork
• New genomic context methods
– Jan Korbel
– Christian von Mering
– Peer Bork
• ArrayProspector web service
– Julien Lagarde
– Chris Workman
• NetView visualization tool
– Sean Hooper
• Study of yeast mitochondria
– Fabiana Perocchi
– Lars Steinmetz
• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak
• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms
