SlideShare uma empresa Scribd logo
1 de 32
Exploratory analysis  of phenotyping screens:  enrichment, clustering, ranking Network Biology - lecture 2
High-dimensional phenotypes by microscopy or molecular profiling Low-dimensional phenotypes A- Time Size
A  challenge  for computation and statistics
Today’s lecture ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
High-throughput phenotyping Weak Strong Phenotypes of 100s to 10.000s  perturbed genes Hits
Gene Ontology (GO) www.geneontology.org A GO Term with  a gene set  annotated to it
GO over-representation Hyper-geometric test Hits Weak Strong Phenotype GO All genes All Hits Hits in GO term Genes in GO term
Hyper-geometric distribution N  genes altogether n  hits k  hits in GO term m  genes in GO term Probability to observe k  hits in GO term Number of possibilities to choose  n  hits out of  N  genes k  hits fall into the GO term of size  m The other  n-k  hits are all genes outside the GO term
Hyper-geometric test: example pvalue = phyper (  k   ,  m   ,  N-m   ,  n   ,  lower.tail = FALSE ) Hold these fixed : N = 10 000 m = 200 See what happens if we vary: k = 1,2,…, 30 n = 50, 100, 200, 300, 400 k  or more! N n k m Number  k  of hits in GO term p-value [log10] 50 100 200 300 400
Gene Set Enrichment Analysis Not restricted to GO , could be any collection of gene sets,  e.g. MSigDB at http://www.broad.mit.edu/gsea/ Subramanian et al. (2005)   Weak Strong Phenotype GO Weak Strong Phenotype No significant trend Significant   trend
 
GSEA: construction Weak Strong Phenotype Nr.  non- hits before  i Nr. all  non- hits i Nr. hits before  i Nr.  all  hits if  p = 0
GSEA: examples
PROs and CONs Gene set 1 Gene set 2 Gene set 3 Gene set 4 … e.g.  Wnt pathway DNA repair Chromosome 1 Expressed in liver p -values  (hyper-geometric or GSEA) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Result:
Map phenotypes to network Where  do the hits fall in the network and  what  are they connected to?
Sub-networks rich in hits
Sub-networks  with highly correlated phenotypes
Predicting phenotypes Guilt by association Use known  phenotypes  in the network Use  edge weights  if possible Success depends on  quality and coverage  of linkage in the network ? ? ?
Which networks? Networks from large-scale experiments Networks from analyzing the experimental literature Networks from probabilistic data integration
Cluster first, think later Perturbed Genes Phenotypic profiles Features = Expression profiles, parameters of cell shape,  protein concentration or localization Genes with similar phenotypic profiles often have similar molecular functions or act in the same pathways. Principle for function prediction: Guilt by association
From data to distances Phenotypic Profiles Perturbed genes D [i,j] = dist(  M [i,] ,  M [j,] ) M D D [j,i] =  D [i,j] D [i,i] = 0  for all i dist(. , .) What distance measure should we use? Distance or dissimilarity matrix i j i j j i
Examples of distances how is this related to correlation? a b Euclidean distance dist( a,b ) = a b dist( a,b ) = Manhattan distance  a b dist( a,b ) = Cosine distance
Linkage: distances to clusters dist(C1, C2) =  max  { dist( i , j ) :  i  in C1,  j  in C2 } complete linkage =  min   { dist( i , j ) :  i  in C1,  j  in C2 } single linkage =  mean  { dist( i , j ) :  i  in C1,  j  in C2 } average linkage D Phenotype 2 Phenotype 1 1 2 3 6 5 4 D[3,4] 1 2 3 6 5 4 Phenotype 2 Phenotype 1 D[ (2,3) ,4] = ??? Distances between  individual genes 3
Hierarchical clustering Ingredients: data matrix , distance function, linkage function Phenotype 2 Phenotype 1 1 2 3 6 5 4 1 2 3 6 5 4 Dendrogram
Phenotypic Data 332 knock-downs in 5 conditions Correlation  matrix Dendrogram Data by  Klaas Mulder, CRI
Clustering: PROs and CONs ,[object Object],[object Object],[object Object],PRO NEG ,[object Object],Brown et al (2005) Ivanova et al (2006) Bakal et al (2007)
Query-based gene ranking ,[object Object],Often addressed by clustering: Which cluster does the query gene fall into? ,[object Object],[object Object],[object Object],[object Object],[object Object],Cluster 1 Cluster 2 Cluster 3 Query   gene
Ranking by PhenoBLAST ,[object Object],[object Object],PhenoBLAST Gunsalus et al (2004) Perturbed Genes Ordered  by similarity to query gene Phenotypes
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Three take-home messages ,[object Object],[object Object],[object Object]
From clusters to pathways ,[object Object],Next lectures:   graph-based and probabilistic models to infer (causal) pathway structure from phenotypic data.
Network Biology - lecture 2 ≥ 3  Questions !

Mais conteúdo relacionado

Mais procurados

Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
Amol Kunde
 

Mais procurados (20)

markers and their role
markers and their rolemarkers and their role
markers and their role
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
 
HMM (Hidden Markov Model)
HMM (Hidden Markov Model)HMM (Hidden Markov Model)
HMM (Hidden Markov Model)
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Bioinfromatics - local alignment
Bioinfromatics - local alignmentBioinfromatics - local alignment
Bioinfromatics - local alignment
 
cisgenesis and intragenesis
cisgenesis and intragenesiscisgenesis and intragenesis
cisgenesis and intragenesis
 
Tree building 2
Tree building 2Tree building 2
Tree building 2
 
Machine Learning in Bioinformatics
Machine Learning in BioinformaticsMachine Learning in Bioinformatics
Machine Learning in Bioinformatics
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component Analysis
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
Grid search (parameter tuning)
Grid search (parameter tuning)Grid search (parameter tuning)
Grid search (parameter tuning)
 

Semelhante a Network Biology Lent 2010 - lecture 1

probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
FELIX75
 
BOSE, Debasish - Research Plan
BOSE, Debasish - Research PlanBOSE, Debasish - Research Plan
BOSE, Debasish - Research Plan
Debasish Bose
 

Semelhante a Network Biology Lent 2010 - lecture 1 (20)

Kishor Presentation
Kishor PresentationKishor Presentation
Kishor Presentation
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient Stratification
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data sets
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
 
Prediction of protein function
Prediction of protein functionPrediction of protein function
Prediction of protein function
 
STRING: Prediction of protein networks through integration of diverse large-s...
STRING: Prediction of protein networks through integration of diverse large-s...STRING: Prediction of protein networks through integration of diverse large-s...
STRING: Prediction of protein networks through integration of diverse large-s...
 
PhD midterm report
PhD midterm reportPhD midterm report
PhD midterm report
 
presentation
presentationpresentation
presentation
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
 
Omic Data Integration Strategies
Omic Data Integration StrategiesOmic Data Integration Strategies
Omic Data Integration Strategies
 
BOSE, Debasish - Research Plan
BOSE, Debasish - Research PlanBOSE, Debasish - Research Plan
BOSE, Debasish - Research Plan
 
COMPARATIVE GENOMICS.ppt
COMPARATIVE GENOMICS.pptCOMPARATIVE GENOMICS.ppt
COMPARATIVE GENOMICS.ppt
 
Kulakova sbb2014
Kulakova sbb2014Kulakova sbb2014
Kulakova sbb2014
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
Gene expression profiling i
Gene expression profiling  iGene expression profiling  i
Gene expression profiling i
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07
 
Prote-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and VisualizationProte-OMIC Data Analysis and Visualization
Prote-OMIC Data Analysis and Visualization
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 
Softwares For Phylogentic Analysis
Softwares For Phylogentic AnalysisSoftwares For Phylogentic Analysis
Softwares For Phylogentic Analysis
 
Making Protein Function and Subcellular Localization Predictions: Challenges ...
Making Protein Function and Subcellular Localization Predictions: Challenges ...Making Protein Function and Subcellular Localization Predictions: Challenges ...
Making Protein Function and Subcellular Localization Predictions: Challenges ...
 

Network Biology Lent 2010 - lecture 1

  • 1. Exploratory analysis of phenotyping screens: enrichment, clustering, ranking Network Biology - lecture 2
  • 2. High-dimensional phenotypes by microscopy or molecular profiling Low-dimensional phenotypes A- Time Size
  • 3. A challenge for computation and statistics
  • 4.
  • 5. High-throughput phenotyping Weak Strong Phenotypes of 100s to 10.000s perturbed genes Hits
  • 6. Gene Ontology (GO) www.geneontology.org A GO Term with a gene set annotated to it
  • 7. GO over-representation Hyper-geometric test Hits Weak Strong Phenotype GO All genes All Hits Hits in GO term Genes in GO term
  • 8. Hyper-geometric distribution N genes altogether n hits k hits in GO term m genes in GO term Probability to observe k hits in GO term Number of possibilities to choose n hits out of N genes k hits fall into the GO term of size m The other n-k hits are all genes outside the GO term
  • 9. Hyper-geometric test: example pvalue = phyper ( k , m , N-m , n , lower.tail = FALSE ) Hold these fixed : N = 10 000 m = 200 See what happens if we vary: k = 1,2,…, 30 n = 50, 100, 200, 300, 400 k or more! N n k m Number k of hits in GO term p-value [log10] 50 100 200 300 400
  • 10. Gene Set Enrichment Analysis Not restricted to GO , could be any collection of gene sets, e.g. MSigDB at http://www.broad.mit.edu/gsea/ Subramanian et al. (2005) Weak Strong Phenotype GO Weak Strong Phenotype No significant trend Significant trend
  • 11.  
  • 12. GSEA: construction Weak Strong Phenotype Nr. non- hits before i Nr. all non- hits i Nr. hits before i Nr. all hits if p = 0
  • 14.
  • 15. Map phenotypes to network Where do the hits fall in the network and what are they connected to?
  • 17. Sub-networks with highly correlated phenotypes
  • 18. Predicting phenotypes Guilt by association Use known phenotypes in the network Use edge weights if possible Success depends on quality and coverage of linkage in the network ? ? ?
  • 19. Which networks? Networks from large-scale experiments Networks from analyzing the experimental literature Networks from probabilistic data integration
  • 20. Cluster first, think later Perturbed Genes Phenotypic profiles Features = Expression profiles, parameters of cell shape, protein concentration or localization Genes with similar phenotypic profiles often have similar molecular functions or act in the same pathways. Principle for function prediction: Guilt by association
  • 21. From data to distances Phenotypic Profiles Perturbed genes D [i,j] = dist( M [i,] , M [j,] ) M D D [j,i] = D [i,j] D [i,i] = 0 for all i dist(. , .) What distance measure should we use? Distance or dissimilarity matrix i j i j j i
  • 22. Examples of distances how is this related to correlation? a b Euclidean distance dist( a,b ) = a b dist( a,b ) = Manhattan distance  a b dist( a,b ) = Cosine distance
  • 23. Linkage: distances to clusters dist(C1, C2) = max { dist( i , j ) : i in C1, j in C2 } complete linkage = min { dist( i , j ) : i in C1, j in C2 } single linkage = mean { dist( i , j ) : i in C1, j in C2 } average linkage D Phenotype 2 Phenotype 1 1 2 3 6 5 4 D[3,4] 1 2 3 6 5 4 Phenotype 2 Phenotype 1 D[ (2,3) ,4] = ??? Distances between individual genes 3
  • 24. Hierarchical clustering Ingredients: data matrix , distance function, linkage function Phenotype 2 Phenotype 1 1 2 3 6 5 4 1 2 3 6 5 4 Dendrogram
  • 25. Phenotypic Data 332 knock-downs in 5 conditions Correlation matrix Dendrogram Data by Klaas Mulder, CRI
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32. Network Biology - lecture 2 ≥ 3 Questions !