STRING
Prediction of a functional association network
for the yeast mitochondrial system
Lars Juhl Jensen
EMBL Heidelberg
Overview
• Prediction of functional associations between proteins
– What is STRING?
– Genomic context methods
– Integration of large-scale experimental data
– Combination and cross-species transfer of evidence
• (Coffee break)
• The yeast mitochondrial system
– Prediction of mitochondrial proteins
– A functional association network for mitochondria
– Mapping and correlating features of mitochondrial proteins
Part 1
Prediction of functional association between proteins
Lars Juhl Jensen
EMBL Heidelberg
What is STRING?
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Exp. interaction data
Microarray expression data
Literature co-mentioning
Let the data speak for themselves ...
• Classification schemes are obviously difficult to predict if they are not
supported by the data – there are no obvious features separating:
– Presidents vs. non-presidents
– Actors vs. non-actors
• Unsupervised methods may discover a more meaningful classification:
– Holding your pinky to your mouth is a clear sign of evil
– Wearing a bowtie is a sign of good
– So is consumption of alcoholic drinks
Inferring functional modules from gene presence/absence patterns
[Figure: the “cellulosome”. Labels: cell, cell wall, anchoring proteins, cellulosomes, resting protuberances, protracted protuberance, cellulose. © Trends Microbiol, 1999]
Genomic context methods
© Nature Biotechnology, 2004
Score calibration against a common reference
• Many diverse types of evidence
– The quality of each is judged by very different raw scores
– Quality differences exist among data sets of the same type
• Solved by calibrating all scores against a common reference
– Scores are directly comparable
– Probabilistic scores allow evidence to be combined
• Requirements for the reference
– Must represent a compromise between all the types of evidence
– Broad species coverage
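The calibration idea can be sketched as follows. This is a minimal illustration, not the STRING implementation: raw scores are binned, and each bin is assigned the fraction of its protein pairs that share a reference annotation (for example a KEGG map). The function names, the bin edges and the toy data are assumptions made for this sketch. The combination rule shown treats the evidence types as independent, which is also an assumption; the slide does not state the rule.

```python
from bisect import bisect_right

def calibrate(raw_scores, same_pathway, bin_edges):
    """Return, per bin, the fraction of pairs that share a reference pathway.

    raw_scores   -- raw score of each protein pair (any scale)
    same_pathway -- True if the pair lies in the same reference pathway
    bin_edges    -- ascending raw-score thresholds separating the bins
    """
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for score, positive in zip(raw_scores, same_pathway):
        b = bisect_right(bin_edges, score)
        totals[b] += 1
        hits[b] += int(positive)
    # Empty bins get None rather than a made-up probability.
    return [h / t if t else None for h, t in zip(hits, totals)]

def combine(probabilities):
    """Combine calibrated scores, assuming independent evidence types."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1.0 - p
    return 1.0 - p_none

# Toy data (hypothetical): six pairs with raw scores and reference labels.
raw = [0.1, 0.2, 0.6, 0.7, 0.9, 0.95]
ref = [False, False, False, True, True, True]
table = calibrate(raw, ref, bin_edges=[0.5, 0.8])
```

Once every evidence type is mapped onto the same probability scale in this way, two scores of 0.5 from different evidence types can be merged with `combine([0.5, 0.5])`.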
Integrating physical interaction screens
• Make binary representation of complexes
• Yeast two-hybrid data sets are inherently binary
• Calculate score from number of (co-)occurrences
• Calculate score from non-shared partners
• Calibrate against KEGG maps
• Infer associations in other species
• Combine evidence from experiments
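The two raw quantities named above can be counted as in the sketch below. It is illustrative only: expanding each purification into all pairs of its members is one possible binary representation (an assumption here), and the formulas that turn these counts into scores are not given on the slide, so none are shown. Function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(purifications):
    """Count how often each protein pair is seen in the same purification."""
    counts = Counter()
    for members in purifications:
        # Assumption: every pair of members counts as one binary link.
        for a, b in combinations(sorted(set(members)), 2):
            counts[(a, b)] += 1
    return counts

def non_shared_partners(interactions, a, b):
    """Count partners of a or b (ignoring each other) that the two do not share."""
    partners = {}
    for x, y in interactions:
        partners.setdefault(x, set()).add(y)
        partners.setdefault(y, set()).add(x)
    pa = partners.get(a, set()) - {b}
    pb = partners.get(b, set()) - {a}
    return len(pa ^ pb)

# Toy data (hypothetical protein names).
complexes = [["A", "B", "C"], ["A", "B"], ["B", "D"]]
two_hybrid = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
pair_counts = cooccurrences(complexes)
unshared = non_shared_partners(two_hybrid, "A", "B")
```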
Mining microarray expression databases
• Re-normalize arrays by modern method to remove biases
• Build expression matrix
• Combine similar arrays by PCA
• Construct predictor by Gaussian kernel density estimation
• Calibrate against KEGG maps
• Infer associations in other species
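A rough sketch of the predictor step, under stated assumptions: the raw score is taken to be the Pearson correlation of two expression profiles, and Gaussian kernel density estimates of that correlation for same-pathway pairs and for other pairs are combined with Bayes' rule. The slide names only the technique, so this particular construction, the bandwidth, the prior and all names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate built from the samples."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(v):
        return sum(math.exp(-0.5 * ((v - s) / bandwidth) ** 2) for s in samples) / norm

    return density

def association_probability(r, same_pathway_r, other_r, prior, bandwidth=0.1):
    """Posterior probability that a pair with correlation r shares a pathway."""
    f_same = kde(same_pathway_r, bandwidth)(r)
    f_other = kde(other_r, bandwidth)(r)
    evidence = prior * f_same + (1.0 - prior) * f_other
    # Fall back to the prior if both densities underflow to zero.
    return prior * f_same / evidence if evidence > 0 else prior

# Toy training correlations (hypothetical).
high = association_probability(0.85, [0.8, 0.9], [0.0, 0.1], prior=0.5)
low = association_probability(0.05, [0.8, 0.9], [0.0, 0.1], prior=0.5)
```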
Evidence transfer based on “fuzzy orthology”
[Figure: transfer of evidence from a source species to a target species]
• Orthology transfer is tricky
– Correct assignment of orthology is difficult for distant species
– Functional equivalence cannot be guaranteed for in-paralogs
• These problems are addressed by our “fuzzy orthology” scheme
– Confidence scores for functional equivalence are calculated from all-against-all alignment
– In the case of many-to-many relationships, evidence is distributed across possible pairs according to confidence scores
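The distribution of evidence across candidate pairs might look like the sketch below. Weighting each target pair by the product of the two equivalence confidences is an assumption made for illustration; the slide says only that evidence is distributed according to confidence scores. All names and numbers are hypothetical.

```python
def transfer_evidence(source_score, equivalents_a, equivalents_b):
    """Distribute evidence for a source pair (a, b) over target-species pairs.

    equivalents_a -- dict: target protein -> confidence that it is
                     functionally equivalent to source protein a
    equivalents_b -- the same for source protein b
    """
    transferred = {}
    for ta, conf_a in equivalents_a.items():
        for tb, conf_b in equivalents_b.items():
            if ta == tb:
                continue  # a protein cannot be associated with itself
            # Assumption: weight by the product of the two confidences.
            transferred[(ta, tb)] = source_score * conf_a * conf_b
    return transferred

# One-to-many case: b has two equally likely equivalents in the target species.
result = transfer_evidence(0.8, {"X1": 1.0}, {"Y1": 0.5, "Y2": 0.5})
```

In this toy case the source evidence of 0.8 is split evenly, so neither target pair receives the full score.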
Multiple evidence types from several species
Predicting and defining metabolic pathways and other functional modules
• Defined manually: cutting metabolic maps into pathways
• Defined objectively: standard clustering of genome-scale data
[Figure: metabolism overview with purine biosynthesis and histidine biosynthesis labeled. Image: Molecular Biology of the Cell, 3rd edition]
Part 2
The yeast mitochondrial system
Lars Juhl Jensen
EMBL Heidelberg
Yeast mitochondria – why it should work well
• Because it is metabolism
– STRING was developed using KEGG pathways as a reference
– This may have caused STRING to function best on metabolism
• Because it is yeast
– By far the best covered organism in terms of physical interactions
– Many microarray gene expression studies
– Literature mining works well due to standardization of gene names
• Because it is prokaryotic
– Evolutionarily, mitochondria are of bacterial origin
– The genomic context methods in STRING are very powerful, but
can only provide evidence for proteins with prokaryotic orthologs
Strategy for extracting a functional association
network of the mitochondrial system
• Starting point:
– Reference set of proteins known to be mitochondrial
– A large, diverse set of experiments relevant for predicting
mitochondrial proteins
– The global STRING network for yeast
• Predict mitochondrial candidate genes
– Use reference set to train neural networks for predicting candidate
genes based on experimental data
– Use very high-confidence STRING links to suggest additional
candidates based on interactions with reference and candidate genes
• Extract network that includes lower confidence interactions
and identify functional modules by clustering
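The link-based candidate step can be sketched as a one-hop expansion from the seed genes. This is an illustration under assumptions: the 0.9 cutoff, the single expansion round, the function name and the toy gene names are all hypothetical and are not taken from the slides.

```python
def suggest_candidates(seeds, links, threshold=0.9):
    """Return genes linked to any seed gene with confidence >= threshold.

    seeds -- reference and predicted candidate genes
    links -- iterable of (gene_a, gene_b, confidence) tuples
    """
    seeds = set(seeds)
    suggested = set()
    for a, b, score in links:
        if score < threshold:
            continue  # keep only very high-confidence links
        if a in seeds and b not in seeds:
            suggested.add(b)
        elif b in seeds and a not in seeds:
            suggested.add(a)
    return suggested

# Toy network (hypothetical gene names and scores).
toy_links = [
    ("ref1", "new1", 0.95),  # strong link to a seed: suggested
    ("ref2", "new2", 0.50),  # too weak: ignored
    ("new1", "new3", 0.99),  # no seed involved: ignored in a single round
]
extra = suggest_candidates({"ref1", "ref2"}, toy_links)
```

The lower-confidence interactions are brought back in only afterwards, when the network around the final gene set is extracted for clustering.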
Predicting mitochondrial proteins
• Training was done with 5-fold cross validation
– Reference set used as positive examples
– All other genes used as negative examples
• Top 800 contains more than 90% of known mitochondrial genes
• Surprising performance of the linear model
– As good as NN with 250 hidden neurons
– Better than MitoP2
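The evaluation described above can be sketched with two small helpers: one that splits the genes into five folds, and one that measures what fraction of the known mitochondrial genes falls within the top-ranked predictions. The helper names, the random seed and the toy scores are assumptions; the classifiers themselves are not reproduced.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def recall_at_top(scores, positives, top_n):
    """Fraction of known positives found among the top_n highest-scoring genes."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return len(set(positives).intersection(ranked)) / len(positives)

folds = k_fold_indices(10, k=5)

# Toy prediction scores (hypothetical); "a" and "c" are the known positives.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
recall = recall_at_top(toy_scores, {"a", "c"}, top_n=2)
```

Each fold serves once as the held-out test set while the model is trained on the other four, so every gene receives a score from a model that never saw it.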
[Figure: network of functional clusters. Cluster labels: TOM; MRPL; Ribosome Related; MRPS; Vacuolar Acidification; Fatty Acid Biosynth.; Secondary; RCC_Asy; RCCII; RCCIV; RCCV; HAP Complex; Arg Biosynth.; PDH/KGD/GCV; Cell Wall & pH Reg.; DNA Repair; Glucose sensing and CH remodelling; APC; Fission/Fusion; rRNA Processing; mRNA Processing; TFIIIC Complex; m-AAA Complex; TCA Cycle; Iron Homeostasis/Chaperone Activity; RCCI; Leu/Val/Ile Biosynth.; GARP Complex; Cytosolic Ribosome; TIM; Actin; tRNA Splicing; RCCIII; NUP; Replication/DNA Repair]
[Figure: the same network of clusters, annotated with proteobacterial orthologs]
[Figure: the same network of clusters, annotated with human disease orthologs]
Composition and interconnectivity of clusters
• A network of clusters
– Most probable path between clusters used as score
• Interacting clusters are preferentially within the same compartment
• Proteobacterial clusters typically localize to the mitochondria
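A most probable path can be found with Dijkstra's algorithm by using the negative logarithm of each link probability as the edge cost, so that the shortest path maximizes the product of probabilities. The sketch below works between two nodes; how paths between individual proteins were aggregated into a score between clusters is not stated on the slide, and the names and toy data are hypothetical.

```python
import heapq
import math

def most_probable_path(links, source, target):
    """Return the highest product of link probabilities from source to target.

    links -- iterable of (node_a, node_b, probability), with 0 < probability <= 1
    Returns 0.0 if the two nodes are not connected.
    """
    graph = {}
    for a, b, p in links:
        graph.setdefault(a, []).append((b, p))
        graph.setdefault(b, []).append((a, p))

    best = {source: 0.0}  # lowest -log(probability) found so far
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, p in graph.get(node, []):
            new_cost = cost - math.log(p)
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return 0.0

# Toy network: the two-step route (0.9 * 0.9) beats the direct link (0.5).
toy = [("A", "B", 0.9), ("B", "C", 0.9), ("A", "C", 0.5)]
score = most_probable_path(toy, "A", "C")
```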
Correlations among gene features
• Expression data agree with data on NF-specific growth defects
• Genes with detectable human orthologs are more conserved among yeasts
• Disease orthologs are often proteobacterial
• Knockout of disease orthologs causes less severe growth defects
Can human disease genes be predicted?
• Mitochondrial genes are already enriched in disease genes
• Previous slide showed that mitochondrial genes of proteobacterial origin
are further enriched in disease gene orthologs
• Disease gene orthologs show less growth defect than other
mitochondrial genes with human orthologs
Getting more specific – generally speaking
• Benchmarking against one common reference allows integration of heterogeneous data
• The different types of data do not all tell us about the same kind of functional associations
• It should be possible to assign likely interaction types from supporting evidence types
• An accurate model of the yeast mitotic cell cycle
• Approach
– High confidence set of physical interactions
– Custom analysis of cell cycle expression data
• Observations
– Dynamic assembly of cell cycle complexes
– Temporal regulation of Cdk specificity
Dynamic complex formation during the yeast cell cycle
Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork
to appear in Science
Conclusions
• Genomic context methods are able to infer the function of
many prokaryotic proteins from genome sequences alone
• New genomic context methods are still being developed
• Integration of large-scale experimental data allows similar
predictions to be made for eukaryotic proteins
• Successful data integration requires benchmarking and
cross-species transfer of information
• Protein networks are useful for the analysis of large,
complex biological systems
Acknowledgments
• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork
• New genomic context methods
– Jan Korbel
– Christian von Mering
– Peer Bork
• ArrayProspector web service
– Julien Lagarde
– Chris Workman
• NetView visualization tool
– Sean Hooper
• Study of yeast mitochondria
– Fabiana Perocchi
– Lars Steinmetz
• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak
• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms
Thank you!

STRING - Prediction of a functional association network for the yeast mitochondrial system

  • 1. STRING Prediction of a functional association network for the yeast mitochondrial system Lars Juhl Jensen EMBL Heidelberg
  • 2. Overview • Prediction of functional associations between proteins – What is STRING? – Genomic context methods – Integration of large-scale experimental data – Combination and cross-species transfer of evidence • (Coffee break) • The yeast mitochondrial system – Prediction of mitochondrial proteins – A functional association network for mitochondria – Mapping and correlating features of mitochondrial proteins
  • 3. Part 1 Prediction of functional association between proteins Lars Juhl Jensen EMBL Heidelberg
  • 4. What is STRING? Genomic neighborhood Species co-occurrence Gene fusions Database imports Exp. interaction data Microarray expression data Literature co-mentioning
  • 5. Let the data speak for themselves ... • Classification schemes are obviously difficult to predict if they are not supported by the data – there are no obvious features separating: – Presidents vs. non-presidents – Actors vs. non-actors • Unsupervised methods may discover a more meaningful classification: – Holding your pinky to your mouth is a clear sign of evil – Wearing a bowtie is a sign of good – So is consumption of alcoholic drinks
  • 6. T rends in Microb Inferring functional modules from gene presence/absence patterns Resting protuberances Protracted protuberance Cellulose © Trends Microbiol, 1999 Cell Cell wall Anchoring proteins Cellulosomes Cellulose The “Cellulosome”
  • 7. Genomic context methods © Nature Biotechnology, 2004
  • 8. Score calibration against a common reference • Many diverse types of evidence – The quality of each is judged by very different raw scores – Quality differences exist among data sets of the same type • Solved by calibrating all scores against a common reference – Scores are directly comparable – Probabilistic scores allow evidence to be combined • Requirements for the reference – Must represent a compromise of the all types of evidence – Broad species coverage
  • 9. Integrating physical interaction screens Make binary representation of complexes Yeast two-hybrid data sets are inherently binary Calculate score from number of (co-)occurrences Calculate score from non-shared partners Calibrate against KEGG maps Infer associations in other species Combine evidence from experiments
  • 10. Mining microarray expression databases Re-normalize arrays by modern method to remove biases Build expression matrix Combine similar arrays by PCA Construct predictor by Gaussian kernel density estimation Calibrate against KEGG maps Infer associations in other species
  • 11. ? Source species Target species Evidence transfer based on “fuzzy orthology” • Orthology transfer is tricky – Correct assignment of orthology is difficult for distant species – Functional equivalence cannot be guaranteed for in-paralogs • These problems are addressed by our “fuzzy orthology” scheme – Confidence scores for functional equivalence are calculated from all-against-all alignment – Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships
  • 12. Multiple evidence types from several species
  • 13. Image: Molecular Biology of the Cell, 3.rd edition Metabolism overview Defined manually: cutting metabolic maps into pathways Purine biosynthesis Histidine biosynthesis Predicting and defining metabolic pathways and other functional modules Defined objectively: standard clustering of genome-scale data
  • 14. Part 2 The yeast mitochondrial system Lars Juhl Jensen EMBL Heidelberg
  • 15. Yeast mitochondria – why it should work well • Because it is metabolism – STRING was developed using KEGG pathways as a reference – This may have caused STRING to function best on metabolism • Because it is yeast – By far the best covered organism in terms of physical interactions – Many microarray gene expression studies – Literature mining works well due to standardization of gene names • Because it is prokaryotic – Evolutionarily, mitochondria are of bacterial origin – The genomic context methods in STRING are very powerful, but can only provide evidence for proteins with prokaryotic orthologs
  • 16. Strategy for extracting a functional association network of the mitochondrial system • Starting point: – Reference set of proteins known to be mitochondrial – A large, diverse set of experiments relevant for predicting mitochondrial proteins – The global STRING network for yeast • Predict mitochondrial candidate genes – Use reference set to train neural networks for predicting candidate genes based on experimental data – Use very high-confidence STRING links to suggest additional candidates based on interactions with reference and candidate genes • Extract network that includes lower-confidence interactions and identify functional modules by clustering
  • 17. Predicting mitochondrial proteins • Training was done with 5-fold cross validation – Reference set used as positive examples – All other genes used as negative examples • Top 800 contains more than 90% of known mitochondrial genes • Surprising performance of the linear model – As good as NN with 250 hidden neurons – Better than MitoP2
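The evaluation setup can be sketched as below: a 5-fold split of the genes, plus the fraction of known mitochondrial genes recovered among the top-ranked predictions. The helper names `k_fold_indices` and `recall_at_top` are hypothetical, and the classifier itself (linear model or neural network) is left out.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def recall_at_top(scores, positives, n):
    """Fraction of known positives found among the n highest-scoring genes.

    scores: {gene: predicted score}; positives: set of reference genes.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    return len(positives.intersection(ranked)) / len(positives)
```

Each gene is scored by a model that never saw it during training, so a statement such as "the top 800 contains more than 90% of known mitochondrial genes" corresponds to `recall_at_top(scores, reference_set, 800)` computed on the held-out predictions.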
  • 18. [Figure: network of functional modules. Cluster labels: TOM, TIM, MRPL, MRPS, Ribosome Related, Cytosolic Ribosome, Vacuolar Acidification, Fatty Acid Biosynth., Secondary, RCCI, RCCII, RCCIII, RCCIV, RCCV, RCC_Asy, HAP Complex, Arg Biosynth., PDH/KGD/GCV, Cell Wall & pH Reg., DNA Repair, Replication/DNA Repair, Glucose sensing and CH remodelling, APC, Fission/Fusion, rRNA Processing, mRNA Processing, tRNA Splicing, TFIIIC Complex, m-AAA Complex, TCA Cycle, Iron Homeostasis/Chaperone Activity, Leu/Val/Ile Biosynth., GARP Complex, Actin, NUP]
  • 19. [Figure: same cluster labels as slide 18, with overlay "Protobacterial orthologs"]
  • 20. [Figure: same cluster labels as slide 18, with overlay "Human disease orthologs"]
  • 22. Composition and interconnectivity of clusters • A network of clusters – Most probable path between clusters used as score • Interacting clusters are preferentially within the same compartment • Protobacterial clusters typically localize to the mitochondria
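A most probable path can be found with a standard shortest-path search. The sketch below assumes that each edge carries a probability and that the score of a path is the product of its edge probabilities; minimizing the sum of -log(p) with Dijkstra's algorithm then maximizes that product. The graph layout and the function name are assumptions for illustration, not the authors' implementation.

```python
import heapq
import math


def most_probable_path_score(graph, source, target):
    """Highest product of edge probabilities on any path source -> target.

    graph: {node: {neighbour: probability}} with probabilities in (0, 1].
    Returns 0.0 if the target cannot be reached.
    """
    best = {source: 0.0}  # cost = -log(path probability)
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, p in graph.get(node, {}).items():
            if p <= 0.0:
                continue
            new_cost = cost - math.log(p)
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return 0.0
```

In a graph where A links to C directly with probability 0.5 but also via B with probabilities 0.9 and 0.8, the indirect route wins with a score of 0.72.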
  • 23. Correlations among gene features • Expression data agree with data on NF-specific growth defects • Genes with detectable human orthologs are more conserved among yeasts • Disease orthologs are often protobacterial • Knockout of disease orthologs causes less severe growth defects
  • 24. Can human disease genes be predicted? • Mitochondrial genes are already enriched in disease genes • Previous slide showed that mitochondrial genes of protobacterial origin are further enriched in disease gene orthologs • Disease gene orthologs show less severe growth defects than other mitochondrial genes with human orthologs
  • 25. Getting more specific – generally speaking • Benchmarking against one common reference allows integration of heterogeneous data • The different types of data do not all tell us about the same kind of functional associations • It should be possible to assign likely interaction types from supporting evidence types • Dynamic complex formation during the yeast cell cycle (Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork, to appear in Science) • An accurate model of the yeast mitotic cell cycle • Approach – High-confidence set of physical interactions – Custom analysis of cell cycle expression data • Observations – Dynamic assembly of cell cycle complexes – Temporal regulation of Cdk specificity
  • 26. Conclusions • Genomic context methods are able to infer the function of many prokaryotic proteins from genome sequences alone • New genomic context methods are still being developed • Integration of large-scale experimental data allows similar predictions to be made for eukaryotic proteins • Successful data integration requires benchmarking and cross-species transfer of information • Protein networks are useful for the analysis of large, complex biological systems
  • 27. Acknowledgments • The STRING team – Christian von Mering – Berend Snel – Martijn Huynen – Daniel Jaeggi – Steffen Schmidt – Mathilde Foglierini – Peer Bork • New genomic context methods – Jan Korbel – Christian von Mering – Peer Bork • ArrayProspector web service – Julien Lagarde – Chris Workman • NetView visualization tool – Sean Hooper • Study of yeast mitochondria – Fabiana Perocchi – Lars Steinmetz • Analysis of yeast cell cycle – Ulrik de Lichtenberg – Thomas Skøt – Anders Fausbøll – Søren Brunak • Web resources – string.embl.de – www.bork.embl.de/ArrayProspector – www.bork.embl.de/synonyms