1. Associative methods in Systems Biology
Hugh Shanahan
Department of Computer Science
Royal Holloway, University of London
September 22, 2009
2. Outline
1 Outline
2 Gene Ontologies
    Over-representation
    Semantic similarity
3 Associative Measures
    Hypotheses
    Linear Correlation
    Partial Correlation
    Non-linear measures
4 Validation
    DREAM
3. Gene Ontologies
Before finding interactions, we need to be able to:
  systematically annotate all genes,
  determine which functional groupings are over-represented,
  measure objectively the “functional similarity” of two genes.
Gene Ontology (GO) is a means to do this.
4. Ontologies
An abstract method for expressing structured data.
The annotation of any gene can be expressed as a series of increasingly specific descriptions, e.g.
  This gene is involved in transport.
  This gene is involved in vesicle-mediated transport.
  This gene is involved in vesicle fusion.
Some genes do not have a precise annotation, so the description can stop higher up in this hierarchy.
5. More complexity required
Annotation is not a simple chain.
A single gene can have a very specific annotation that descends from two (or more) more general descriptions.
There are different types of annotation as well: location, biochemistry, part of the organism in which the gene is expressed, and so on.
An ontology is a Directed Acyclic Graph (DAG), not a tree (this distinction means a lot to graph theorists).
Each node in the DAG is an annotation term.
Each “child” node can have more than one “parent” node.
6. GOs visualised
[Figure: panels a, b and c. Panel c shows part of the biological process ontology: Biological process (root); Transport; Membrane organization and biogenesis; Vesicle-mediated transport (is_a Transport); Membrane fusion (is_a Membrane organization and biogenesis); Vesicle fusion (part_of Vesicle-mediated transport, is_a Membrane fusion).]
Figure 1 | Simple trees versus directed acyclic graphs. Boxes represent nodes and arrows represent edges. a | An example of a simple tree, in which each child has only one parent and the edges are directed, that is, there is a source (parent) and a destination (child) for each edge. b | A directed acyclic graph (DAG), in which each child can have one or more parents. The node with multiple parents is coloured red and the additional edge is coloured grey. c | An example of a node, vesicle fusion, in the biological process ontology with multiple parentage. The dashed edges indicate that there are other nodes not shown between the nodes and the root node (biological process). A root is a node with no incoming edges, and at least one leaf (also called a sink). A leaf node is a node with no outgoing edges, that is, a terminal node with no children (vesicle fusion). Similar to a simple tree, a DAG has directed edges and does not have cycles, that is, no path starts and ends at the same node, and will always have at least one root node. The depth of a node is the length of the longest path from the root to that node, whereas the height is the length of the longest path from that node to a leaf (ref. 41 of the source article). is_a and part_of are types of relationships that link the terms in the GO ontology. More information about the relationships between GO terms are found online (An Introduction to the Gene Ontology).
Rhee et al., Nature Reviews Genetics (2008)
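The multiple-parentage structure in panel c can be sketched as a small data structure. This is a minimal illustration, not part of the original slides: the `PARENTS` dictionary and the `ancestors` helper are hypothetical names, and the intermediate terms hidden behind the dashed edges in the figure are simply left out.

```python
# Toy fragment of the biological process ontology from panel c.
# Each term maps to the list of its parents. Terms hidden behind the
# dashed edges of the figure are omitted (an assumption of this sketch).
PARENTS = {
    "biological process": [],
    "transport": ["biological process"],
    "membrane organization and biogenesis": ["biological process"],
    "vesicle-mediated transport": ["transport"],                   # is_a
    "membrane fusion": ["membrane organization and biogenesis"],   # is_a
    "vesicle fusion": [
        "vesicle-mediated transport",  # part_of
        "membrane fusion",             # is_a
    ],
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


# A gene annotated with "vesicle fusion" is implicitly annotated with
# every more general term above it, along both parent routes.
vesicle_fusion_terms = ancestors("vesicle fusion")
print(sorted(vesicle_fusion_terms))
```

Because "vesicle fusion" has two parents, its ancestor set covers both branches of the graph, which a simple tree could not represent.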
7. GOs visualised
http://amigo.geneontology.org/
8. Different types of Annotation
Three distinct ontologies are overwhelmingly used:
  Cellular Component
  Biological Process
  Molecular Function
Many other ontologies have been constructed, e.g. Plant Organ for Arabidopsis.
9. Caveat
The annotation of most genes (90%) has been carried out computationally. These annotations usually work well, though they tend to be less accurate than those obtained by direct assay.
Every annotation has an evidence code associated with it (e.g. IEA for electronic annotation, IDA for direct assay), which indicates how much we can rely on it.
Increasingly, co-expression is being used as a means to annotate genes, so one should be careful not to reuse co-expression-derived annotations as independent evidence (the argument becomes circular)!
11. Over-representation
One of the most useful questions to ask when analysing microarray data is which functional groupings occur more often than one would expect by chance.
Notation
  N  number of genes in the genome.
  n  number of genes found to be differentially expressed.
  m  number of genes in the genome with a specific annotation.
  k  number of differentially expressed genes with that same annotation.
12. Probabilities
One can derive the probability P_k that k genes with the annotation would be found by chance amongst the n differentially expressed genes, using the hypergeometric probability distribution and the above parameters.
For the record,
\[ P_k = \frac{{}^{m}C_{k}\;{}^{N-m}C_{n-k}}{{}^{N}C_{n}}, \tag{1} \]
\[ {}^{N}C_{n} = \frac{N!}{(N-n)!\,n!}. \tag{2} \]
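As a rough sketch of how the hypergeometric probability could be evaluated in practice, the following Python uses only the standard library. The function names are hypothetical, and the upper-tail sum (the probability of k or more annotated genes) is a common choice for an over-representation p-value rather than something stated on the slide.

```python
from math import comb


def hypergeom_pmf(N, m, n, k):
    """Probability of exactly k annotated genes among n selected genes.

    N: genes in the genome, m: genes carrying the annotation,
    n: differentially expressed genes, k: overlap between the two.
    """
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def hypergeom_upper_tail(N, m, n, k):
    """Probability of k or more annotated genes arising by chance."""
    return sum(hypergeom_pmf(N, m, n, j) for j in range(k, min(m, n) + 1))


# Small worked case: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
# Exact term: C(4,2) * C(6,1) / C(10,3) = 36 / 120.
p_exact = hypergeom_pmf(10, 4, 3, 2)
p_tail = hypergeom_upper_tail(10, 4, 3, 2)
print(p_exact, p_tail)
```

For genome-scale values of N the binomial coefficients become very large integers; Python handles these exactly, so no special care is needed in this sketch.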
13. Difficulties
There are thousands of possible GO terms, so one should adjust the probabilities to account for multiple hypothesis testing.
Applying a Bonferroni correction over all GO terms means that a p-value of 10^{-7} is equivalent to 1% significance.
Because of the structure of GO, these probabilities are highly correlated with each other.
In all these cases, focussing on as short a list of candidate biological processes as possible will minimise the above difficulties.
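The Bonferroni correction divides the chosen significance level by the number of hypotheses tested. The sketch below is illustrative only: the function names and the p-values are made up, and the figure of 100000 tests is simply the number that reproduces the 10^{-7} threshold quoted above at the 1% level.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the overall level at alpha."""
    return alpha / n_tests


def significant_terms(p_values, alpha=0.01):
    """Keep only the terms whose p-value survives the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {term: p for term, p in p_values.items() if p <= threshold}


# 1% significance spread over 100000 tests (an assumed count).
print(bonferroni_threshold(0.01, 100000))

# Hypothetical p-values for a short list of three candidate processes.
example = {"transport": 1e-4, "membrane fusion": 0.004, "vesicle fusion": 0.02}
print(significant_terms(example))
```

With only three candidate terms the per-test threshold is 0.01 / 3, which is far less severe than a correction over every GO term. This is the practical benefit of working from a short list.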
15. What genes match
In benchmarking methods to infer interactions between gene products, we expect genes that interact to have similar GO terms, though perhaps not exactly the same ones.
Semantic similarity is a means to measure how similar the annotations of two genes are (0 meaning no similarity, 1 meaning total similarity).
GO provides us with a means to do this in a quantitative fashion.
16. Simple implementation
Determine the ratio of the number of nodes two genes share to the total number of nodes they have between them:
\[ \mathrm{GOsim}_{UI} = \frac{\#\{N(G_1) \cap N(G_2)\}}{\#\{N(G_1) \cup N(G_2)\}}, \tag{3} \]
N(G_1) being the set of nodes associated with G_1’s annotation.
More elaborate methods are available.
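The ratio in equation (3) is straightforward to compute once each gene's node set is known. The sketch below assumes the node sets have already been assembled (for example, the annotated terms together with their ancestors); the function name and the two example genes are hypothetical.

```python
def gosim_ui(nodes_g1, nodes_g2):
    """Shared nodes divided by all nodes the two genes have between them."""
    union = nodes_g1 | nodes_g2
    if not union:
        return 0.0  # convention chosen here for two unannotated genes
    return len(nodes_g1 & nodes_g2) / len(union)


# Hypothetical node sets for two genes.
gene_1 = {"biological process", "transport", "vesicle-mediated transport"}
gene_2 = {
    "biological process",
    "transport",
    "vesicle-mediated transport",
    "membrane organization and biogenesis",
    "membrane fusion",
    "vesicle fusion",
}

similarity = gosim_ui(gene_1, gene_2)
print(similarity)  # 3 shared nodes out of 6 in total
```

The score is 1 only when the two node sets are identical and 0 when they share nothing, which matches the 0 to 1 scale described for semantic similarity.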
18. Motivation
Yesterday, we encountered clustering.
Hypothesis 1 (weak): coexpression implies involvement in the same process.
Expand to many different experiments.
Hypothesis 2 (strong): the greater the level of association, the greater the chance of interaction.
Hypothesis 2 is often referred to as “guilt by association”.
Association may tell us about interactions between gene products. It does not tell us anything about the regulation mechanism.
19.
http://www.arabidopsis.leeds.ac.uk/act/index.php
266841_at AT2G26150: heat shock transcription factor family protein; contains Pfam profile PF00447 (HSF-type DNA-binding domain)
260978_at AT1G53540: 17.6 kDa class I small heat shock protein
20. What do we mean by association?
Knowing something about the expression level of one gene (over many different experiments) means we know something about the expression level of the other.
Replotting the above
22. Linear Correlation
coexpression
Simplest form of association.
Assume that there is a linear relationship between the expression levels of two genes.
Formally:
\[ y_1 = a_{12} + c_{12}\, y_2 + \eta_{12}, \tag{4} \]
  y_1, y_2  are (log) expression levels.
  \eta_{12}  is a noise term.
  a_{12}, c_{12}  are parameters to be determined.
But we are not interested in the parameter values!
We are interested in how good a model this is for this pair of genes.
23. Covariance
We can estimate how good the linear model is by computing the covariance
\[ E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big), \]
where \bar{y}_1 = E(y_1) and \bar{y}_2 = E(y_2) are the means of y_1 and y_2.
E denotes the expectation value (think of it for now as taking the average over all the points in the previous figure).
One can prove (exercise) that, for given variances of y_1 and y_2, the magnitude of the covariance is largest when y_1 can be perfectly expressed as a linear function of y_2.
The covariance is zero when there is no relationship at all between y_1 and y_2.
25. Correlation
We want to compare every possible pair of genes, so using the covariance is not very practical, since the maximum covariance will vary from one pair of genes to another. However,
\[ \rho_{12} = \frac{E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big)}{\sqrt{E\big((y_1 - \bar{y}_1)^2\big)\,E\big((y_2 - \bar{y}_2)^2\big)}}, \tag{5} \]
is bounded: -1 \le \rho_{12} \le 1.
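To make the difference between covariance and correlation concrete, here is a minimal sketch that estimates both from a list of expression values, replacing the expectation E by a plain average over experiments. The function names and the toy data are assumptions for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(y1, y2):
    """Average of (y1 - mean(y1)) * (y2 - mean(y2)) over all experiments."""
    m1, m2 = mean(y1), mean(y2)
    return mean([(a - m1) * (b - m2) for a, b in zip(y1, y2)])


def correlation(y1, y2):
    """Covariance scaled by the two standard deviations, as in equation (5)."""
    return covariance(y1, y2) / sqrt(covariance(y1, y1) * covariance(y2, y2))


# Toy (log) expression levels: y1 is an exact linear function of y2.
y2 = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [2.0 * v + 1.0 for v in y2]
y1_rescaled = [10.0 * v for v in y1]

# The covariance changes with the scale of the data ...
print(covariance(y1, y2), covariance(y1_rescaled, y2))
# ... whereas the correlation does not.
print(correlation(y1, y2), correlation(y1_rescaled, y2))
```

Rescaling one gene's values multiplies the covariance by the same factor but leaves the correlation unchanged, which is why the correlation can be compared across all pairs of genes.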
26. How well does it work?
There are a number of examples of improved functional annotation.
An unannotated gene that is highly correlated with a gene in a known response is likely to be involved in the same response.
28.
Difficulty: genes correlate with many other genes, not just a few.
Why?
Suggestion: correlations do not distinguish between potential direct interactions and indirect interactions between gene products.
29. Example
[Figure: example interaction network with nodes A, B, C, D, E and F, and a label “Other interactions”.]
B directly interacts with three other genes, but could be highly correlated with others.
C and D would be highly correlated with each other even though they are not directly interacting.
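A small simulation can illustrate how an indirect link produces a high correlation. The wiring used here is an assumption, since the figure's exact topology is not recoverable: C and D are each made to depend on B plus independent noise, with no direct term linking C to D. All names and parameter values are illustrative.

```python
import random
from math import sqrt


def correlation(y1, y2):
    """Pearson correlation estimated from paired samples."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    var1 = sum((a - m1) ** 2 for a in y1) / n
    var2 = sum((b - m2) ** 2 for b in y2) / n
    return cov / sqrt(var1 * var2)


random.seed(0)          # fixed seed so the run is repeatable
n_experiments = 2000    # assumed number of expression experiments
noise = 0.3             # assumed noise level on each dependent gene

# B varies across experiments; C and D each follow B with their own noise.
b = [random.gauss(0.0, 1.0) for _ in range(n_experiments)]
c = [x + noise * random.gauss(0.0, 1.0) for x in b]
d = [x + noise * random.gauss(0.0, 1.0) for x in b]

rho_bc = correlation(b, c)
rho_cd = correlation(c, d)
print(rho_bc, rho_cd)
```

Under these assumptions C and D come out strongly correlated even though neither appears in the other's equation, so correlation alone cannot separate the direct B to C link from the indirect C to D association.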