8. Hyper-geometric distribution N genes altogether n hits k hits in GO term m genes in GO term Probability to observe k hits in GO term Number of possibilities to choose n hits out of N genes k hits fall into the GO term of size m The other n-k hits are all genes outside the GO term
9. Hyper-geometric test: example pvalue = phyper ( k , m , N-m , n , lower.tail = FALSE ) Hold these fixed : N = 10 000 m = 200 See what happens if we vary: k = 1,2,…, 30 n = 50, 100, 200, 300, 400 k or more! N n k m Number k of hits in GO term p-value [log10] 50 100 200 300 400
10. Gene Set Enrichment Analysis Not restricted to GO , could be any collection of gene sets, e.g. MSigDB at http://www.broad.mit.edu/gsea/ Subramanian et al. (2005) Weak Strong Phenotype GO Weak Strong Phenotype No significant trend Significant trend
11.
12. GSEA: construction Weak Strong Phenotype Nr. non- hits before i Nr. all non- hits i Nr. hits before i Nr. all hits if p = 0
18. Predicting phenotypes Guilt by association Use known phenotypes in the network Use edge weights if possible Success depends on quality and coverage of linkage in the network ? ? ?
19. Which networks? Networks from large-scale experiments Networks from analyzing the experimental literature Networks from probabilistic data integration
20. Cluster first, think later Perturbed Genes Phenotypic profiles Features = Expression profiles, parameters of cell shape, protein concentration or localization Genes with similar phenotypic profiles often have similar molecular functions or act in the same pathways. Principle for function prediction: Guilt by association
21. From data to distances Phenotypic Profiles Perturbed genes D [i,j] = dist( M [i,] , M [j,] ) M D D [j,i] = D [i,j] D [i,i] = 0 for all i dist(. , .) What distance measure should we use? Distance or dissimilarity matrix i j i j j i
22. Examples of distances how is this related to correlation? a b Euclidean distance dist( a,b ) = a b dist( a,b ) = Manhattan distance a b dist( a,b ) = Cosine distance
23. Linkage: distances to clusters dist(C1, C2) = max { dist( i , j ) : i in C1, j in C2 } complete linkage = min { dist( i , j ) : i in C1, j in C2 } single linkage = mean { dist( i , j ) : i in C1, j in C2 } average linkage D Phenotype 2 Phenotype 1 1 2 3 6 5 4 D[3,4] 1 2 3 6 5 4 Phenotype 2 Phenotype 1 D[ (2,3) ,4] = ??? Distances between individual genes 3