2. Modules / communities
Cellular functions are carried out by groups of
biomolecules (e.g., proteins, RNA) acting in a
coordinated fashion.
Problem: how does this structure change under different conditions?
4. Approaches to module detection
• Many algorithms for detecting modules in a single network
– Link clustering [Shi et al. 2013], label propagation [Gregory 2010],
Tensor decomposition [Anandkumar et al. 2013], mixed-membership
stochastic blockmodels [Airoldi et al. 2008], etc.
• Not obvious how to extend to the multiple network case:
– Combine networks, then detect modules: likely to miss rare modules
– Detect modules, then combine results: inconsistent module definition
• Multi-MMSB: jointly learns modules from all networks, allowing each module to be present in only a subset of networks
8. Learning the model
Goal: optimize model likelihood
Expectation-Maximization algorithm to deal with latent variables
Need variational approximation
Random restarts to alleviate local optima issue
9. Performance metric
• Normalized mutual information (NMI)
[Diagram: a sequence of structural queries is sent through both the learned and the true community structures; mutual information is calculated between the two sets of answers]
[Esquivel and Rosvall, 2012]
18. Summary
• We developed Multi-MMSB, a flexible way of
learning community structure over multiple
networks
• Multi-MMSB outperformed naive methods on
synthetic data
• When applied to real data, Multi-MMSB identified
context-specific modules that are biologically
plausible
19. Future directions
• Extending the model:
– Directed networks
– Weighted edges
• Application to other types of biological networks:
– Regulatory networks
– Protein-protein interaction (PPI) networks
For instance, suppose we are observing an individual over the course of an environmental stress, such as a viral infection or a physical injury.
In this case we expect to see some groups of genes that are co-regulated only temporarily, in a specific situation. For example, genes involved in the immune response to injuries would be temporarily co-regulated. This would lead to something like the blue module, which is active in only one of the networks.
On the other hand, we also expect some housekeeping genes to always be turned on and highly co-regulated. This would lead to something like the red module, which is present in all of the networks.
By identifying and functionally characterizing such modules with different patterns of occurrences,
one can start to reason about the biological processes that are affected or unaffected by the given context of interest.
With this motivation in mind, the goal of this project was to develop an algorithm
that takes as input multiple networks from different contexts,
and outputs the overall community structure together with an associated activity pattern
that tells us in which subset of contexts each module appears.
So how do we go about doing this?
First, it is important to know that there are a large number of module detection algorithms that work on a single network.
These include link clustering, label propagation, spectral decomposition, stochastic blockmodels, and so on.
However, extending these methods to the multiple-network case is not trivial.
One naïve approach one might consider is to combine all the networks into a single representative network
(for example, by taking the average of the adjacency matrices)
and to run an existing module detection algorithm on it.
Once we have a global set of modules,
we can go back to the individual networks and check whether each identified module
is active or not.
While this approach is fairly simple and easy to implement, it suffers from the limitation that
modules that are active in only a small number of networks
are more difficult to identify in the combined network.
This is because the merging process dilutes the signal in the data.
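As a minimal sketch of this baseline (the function and parameter names here are hypothetical, and `detect_fn` stands in for any existing single-network module detector):

```python
import numpy as np

def combine_then_detect(adjacency_list, detect_fn):
    """Naive baseline: average the adjacency matrices into one
    representative network, then run a single-network module
    detection algorithm (detect_fn) on the result.

    Rarely active modules are diluted by the averaging step.
    """
    A_avg = np.mean(adjacency_list, axis=0)
    return detect_fn(A_avg)
```

Note how an edge present in only one of K networks ends up with weight 1/K in the averaged matrix, which is exactly why rare modules become hard to detect.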
Another naïve approach would be to apply a module detection algorithm
on each network independently to learn the modules, and then to combine the outputs by matching modules detected from different networks.
While this approach has no problem identifying modules that are rarely active,
when the detected boundaries of a module differ between networks,
it is not clear how to resolve such disagreements in a principled manner.
In this talk, I present a hierarchical Bayesian model named Multi-MMSB
that avoids both of these issues.
Our model learns a global community structure jointly from all networks,
while allowing each module to be only present in a subset of networks,
thereby increasing power to detect rare modules.
In the following section, I will describe the details of Multi-MMSB. Let’s first start with a simple Bayesian model that forms the basis of our model.
The stochastic blockmodel is a probabilistic, generative model of random graphs that originates from the social network analysis literature.
The basic idea is that the adjacency matrix of a graph with modular structure has a “blocky” pattern, where each block corresponds to a single module.
So the goal is to cluster the nodes such that we see many edges within each cluster and few edges between different clusters.
Now we can formalize this model as follows.
First, we introduce a parameter p_m that represents the connectivity level of each module m.
Next, p_0 represents the background connectivity between nodes of different modules, which can be thought of as the amount of noise in the data.
Lastly, for each node in the network, we have a latent label z_i that represents which module the node belongs to.
Given these variables, each edge is sampled independently from a Bernoulli distribution with parameter p_m if both nodes belong to module m, and p_0 otherwise.
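This generative process can be sketched in a few lines of Python (a hypothetical illustration; the function name and argument layout are mine, not from the talk):

```python
import numpy as np

def sample_sbm(z, p, p0, rng=None):
    """Sample an undirected graph from a stochastic blockmodel.

    z  : module label z_i for each node
    p  : dict mapping module m -> within-module edge probability p_m
    p0 : background edge probability between nodes of different modules
    """
    rng = np.random.default_rng(rng)
    n = len(z)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Edge probability is p_m if both endpoints share module m,
            # and the background level p_0 otherwise.
            prob = p[z[i]] if z[i] == z[j] else p0
            A[i, j] = A[j, i] = int(rng.random() < prob)
    return A
```

With p_m well above p_0, the sampled adjacency matrix shows exactly the blocky structure described above.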
A key limitation of the stochastic blockmodel is that each node can only be assigned to a single module. However, in many applications, modules often overlap with each other. This is the motivation behind the mixed-membership stochastic blockmodel, or MMSB.
In this version of the model, we allow each node to have a fractional membership in the modules rather than a hard assignment. This is represented by the vector c_i.
In addition, we introduce a latent label z_ij for every pair of i and j to represent the conditional membership of node i with respect to node j. Intuitively speaking, this allows each node to be multi-faceted: it can change its module membership based on the node it is interacting with.
In this new setup, an edge is sampled with probability p_m when the conditional memberships on both sides agree with each other.
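The edge-sampling step of MMSB can be sketched as follows (again a hypothetical illustration; `p` is the vector of within-module probabilities p_m and `p0` the background probability):

```python
import numpy as np

def sample_mmsb_edge(c_i, c_j, p, p0, rng=None):
    """Sample one edge (i, j) under MMSB.

    c_i, c_j : fractional membership vectors (each sums to 1)
    p        : within-module probabilities p_m, indexed by module
    p0       : background probability when memberships disagree
    """
    rng = np.random.default_rng(rng)
    K = len(c_i)
    # Draw the conditional membership of i w.r.t. j, and of j w.r.t. i.
    z_ij = rng.choice(K, p=c_i)
    z_ji = rng.choice(K, p=c_j)
    # The edge uses p_m only when both conditional memberships agree.
    prob = p[z_ij] if z_ij == z_ji else p0
    return int(rng.random() < prob)
```

Setting every c_i to a one-hot vector recovers the plain stochastic blockmodel as a special case.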
While this model has been shown to be effective in a variety of settings, by design it only works on a single network.
In order to extend this to the multiple-network case, we first duplicate the latent variables z_ij across the networks while keeping only a single copy of c_i, so that the fractional membership of each node remains identical in every network.
Furthermore, we introduce another layer of latent variables denoted as d_km, which represents context-specific activity of module m in network k.
Now, when we sample the edges using p_m, in addition to checking whether the conditional memberships match, we also check whether the module is active in the given network.
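Concretely, the modified edge-sampling step adds one extra check against the activity indicators d_km (a hypothetical sketch; `d_k` here is the row of activity indicators for network k):

```python
import numpy as np

def sample_multi_edge(c_i, c_j, p, p0, d_k, rng=None):
    """Sample edge (i, j) in network k under the multi-network extension.

    c_i, c_j : shared fractional membership vectors (identical in all networks)
    p, p0    : within-module and background edge probabilities
    d_k      : booleans; d_k[m] says whether module m is active in network k
    """
    rng = np.random.default_rng(rng)
    K = len(c_i)
    z_ij = rng.choice(K, p=c_i)  # conditional memberships are per-network
    z_ji = rng.choice(K, p=c_j)
    # p_m applies only if the memberships agree AND the shared module
    # is active in this particular network; otherwise fall back to p_0.
    if z_ij == z_ji and d_k[z_ij]:
        prob = p[z_ij]
    else:
        prob = p0
    return int(rng.random() < prob)
```

An inactive module thus contributes only background-level edges in that network, which is what lets the model express context-specific modules.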
Note that, in practice, we are only given the edges and none of the latent variables.
A standard approach to learning a Bayesian model is to optimize the likelihood function given the observed data.
In this case, since there are variables that are not observed, we want to optimize what’s called the marginal likelihood, which is the complete likelihood with the latent variables integrated out.
The expectation-maximization (EM) algorithm can be used to optimize this objective. But because the posterior distribution over the latent variables is intractable in this case, we need to use variational EM, which makes the simplifying assumption that the latent variables are independent of each other.
At the end of this training procedure, what we get is the optimal set of model parameters and our belief over the latent variables, from which we can extract the community structure learned by the model.
Because this approach is susceptible to local optima, we typically learn the model several times for a given setting and select the run with the highest objective for further analysis.
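The random-restart wrapper is simple enough to sketch (a hypothetical illustration; `fit_fn` stands in for one run of the variational EM procedure):

```python
def fit_with_restarts(fit_fn, n_restarts=10, seed=0):
    """Run variational EM from several random initializations and keep
    the fit with the highest final objective, to mitigate local optima.

    fit_fn(seed) -> (params, objective) is assumed to run one EM fit
    initialized from the given random seed.
    """
    best_params, best_obj = None, float("-inf")
    for r in range(n_restarts):
        params, obj = fit_fn(seed + r)
        if obj > best_obj:
            best_params, best_obj = params, obj
    return best_params, best_obj
```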
Once we learn the model, we need a way to measure its accuracy, assuming the ground truth is available, which is the case for simulated data.
To quantify the similarity between two community structures, we use a metric called normalized mutual information which was first developed in the context of network covers by Esquivel and Rosvall in 2012.
I won’t go into too much detail here, but the basic intuition is as follows. First, we randomly generate a sequence of structural queries.
[Give example]
Then we send these queries through both the learned and the true community structures to get two sets of answers. Calculating mutual information between these two answer sets gives us our similarity score.
In the limiting case, if the two structures are exactly the same then the answers we get would be identical in all cases and this leads to an NMI of 1.
Note that this procedure does not require us to know the mapping between modules across the two structures, because mutual information doesn’t change even if we relabel the modules on either side.
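As a simplified illustration, NMI for hard, non-overlapping partitions can be computed directly from label counts (the Esquivel-Rosvall metric generalizes this idea to overlapping covers; this sketch is mine, not the metric used in the talk):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two hard partitions,
    given as one module label per node. Relabeling the modules on
    either side leaves the score unchanged."""
    n = len(labels_a)
    pa = Counter(labels_a)                 # module sizes in partition A
    pb = Counter(labels_b)                 # module sizes in partition B
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information between the two labelings.
    mi = sum((c / n) * math.log((c * n) / (pa[a] * pb[b]))
             for (a, b), c in joint.items())
    # Entropies of each partition, for normalization.
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    if ha == 0 and hb == 0:
        return 1.0  # both partitions trivial: identical by convention
    return 2 * mi / (ha + hb)  # normalize by the average entropy
```

Identical partitions score 1, independent ones score 0, and permuting the labels on either side does not change the result.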
Now we’ve established everything about the model. In the following section I will present some results on synthetic data.