This document summarizes research on inferring the structure of the JAK-STAT gene network using graphical models. It describes simulating data from the JAK-STAT pathway under interferon treatment using the Gillespie algorithm. It then applies three graphical model approaches - a shrinkage covariance matrix method, a lasso-based method, and a graphical lasso with L1 penalized likelihood - to estimate the network structure from the simulated data. The researchers find the lasso approaches estimate the network structure well with appropriate penalty parameter tuning.
Inference of the JAK-STAT Gene Network via Graphical Models
1. 6th International Summer School
National University of Technology of the Ukraine
Kiev, Ukraine, August 8-20, 2011
Kiev Summer School: Appendix
Vilda Purutçuoğlu (1), Tuğba Erdem (2), Gerhard Wilhelm Weber (3)
(1, 2) Department of Statistics, Middle East Technical University, Ankara, Turkey
(3) Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey
(1) vpurutcu@metu.edu.tr, (2) terdem@metu.edu.tr, (3) gweber@metu.edu.tr
2. Outline
• Introduction
• JAK-STAT Pathway under IFN Treatment
• Simulation of the data via Gillespie Algorithm
• Graphical models:
1. Graphical model from shrinkage covariance matrix
2. Lasso-based graphical model
3. Graphical lasso with L1 penalized likelihood
• Application and Conclusion
InterSymp 2011 - Inference of the JAK-STAT Gene Network via Graphical Models 2
3. INTRODUCTION
• A biological network defines the elements of cellular metabolism and the interactions among
its biologically linked components.
• In graph theory, a network is represented by nodes, which denote genes, proteins or
species, and by edges, i.e., interactions or links, between the nodes. Graphical models
define such structures through the concept of conditional independence (Whittaker, 1990).
• Several methods have been proposed to estimate the links in a network, such as Boolean
approaches, differential equations, and stochastic modelling (Bower and Bolouri, 2001).
• Graphical models offer an alternative in which the interactions between the nodes can be
estimated and the network itself can be inferred in both static and dynamic frameworks.
• In this study, we estimate the JAK-STAT biological system with realistic complexity via three
major approaches of graphical models:
– the shrinkage covariance method (Schafer and Strimmer, 2005)
– the lasso-based graphical model (Meinshausen and Bühlmann, 2006)
– the graphical lasso with the L1-penalized likelihood method (Friedman et al., 2008)
4. JAK-STAT PATHWAY UNDER IFN TREATMENT
• The JAK-STAT (Janus kinase/signal transducer and activator of transcription) pathway is one of
the major signal transduction systems. It is activated by Type I interferons and regulates
cytokine-dependent gene expression and growth factors in mammals.
• In this study we consider the description of the system under the IFN treatment that was
developed against the hepatitis C virus (HCV).
• Maiwald et al. (2010) represent this system under IFN via 40 nodes and 66 reactions, with
stochastic reaction rate constants compiled by combining different data sources on this
pathway.
• Figure 1 shows a simple representation of the JAK-STAT system under IFN treatment
described in Maiwald et al. (2010), drawn with the simone R package.
Figure 1: Simple representation of the IFN-mediated JAK-STAT pathway
5. SIMULATION OF THE DATA VIA THE GILLESPIE ALGORITHM
• To infer the system via graphical models, we use a time-course dataset generated by the
Gillespie algorithm, also known as the Direct method (Gillespie, 1977).
• This algorithm is the most common and usually the most efficient simulator based on the
chemical master equation, which describes the stochastic behaviour of a system.
– Procedure: in each iteration the algorithm draws the time to the next reaction from an
exponential distribution whose rate is the total hazard of the system h0(Y), i.e.,
Δt ~ Exp(h0(Y)). Here Y is the state, giving the number of molecules of each species, Δt is
the change in time, and the hazard hj(Y) of the jth reaction is the number of distinct
molecular reactant combinations available in state Y multiplied by the associated reaction
rate constant; h0(Y) is the sum of hj(Y) over all reactions.
– Once the next reaction time is determined, the algorithm chooses the reaction type
randomly, with probability hj(Y) / h0(Y) for the jth reaction.
– The system is updated according to the time to the next event and the event type.
• For the JAK-STAT system, we run this algorithm until the total time reaches 100 time units,
initializing the number of molecules and the stochastic reaction rate constants as stated in
Maiwald et al. (2010). We then take the values at the integer time points t = 90, ..., 99, which
gives a measurement dataset of 10 time points for 40 nodes.
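The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the study: the three-species system (A + B -> C and C -> A + B), its rate constants, its initial counts, and the function names gillespie and sample_at are all hypothetical stand-ins for the 40-node, 66-reaction model of Maiwald et al. (2010). Only the sampling grid t = 90, ..., 99 and the end time 100 follow the text.

```python
import numpy as np

def gillespie(y0, stoich, rates, hazards, t_end, rng):
    """Direct-method simulation; returns the event times and the states."""
    t, y = 0.0, np.array(y0, dtype=int)
    times, states = [t], [y.copy()]
    while True:
        # hazard h_j(Y) = rate constant * number of reactant combinations
        h = np.array([k * f(y) for k, f in zip(rates, hazards)], dtype=float)
        h0 = h.sum()
        if h0 <= 0.0:          # no reaction can fire any more
            break
        t += rng.exponential(1.0 / h0)      # dt ~ Exp(h0(Y))
        if t > t_end:
            break
        j = rng.choice(len(h), p=h / h0)    # reaction j with prob. h_j / h0
        y = y + stoich[j]
        times.append(t)
        states.append(y.copy())
    return np.array(times), np.array(states)

def sample_at(times, states, grid):
    """State in force at each grid time (last event at or before it)."""
    idx = np.searchsorted(times, grid, side="right") - 1
    return states[idx]

# Hypothetical toy system with species [A, B, C]
stoich = np.array([[-1, -1, 1],    # A + B -> C
                   [1, 1, -1]])    # C -> A + B
rates = [0.01, 0.1]                # assumed rate constants
hazards = [lambda y: y[0] * y[1], lambda y: y[2]]

rng = np.random.default_rng(0)
times, states = gillespie([50, 40, 0], stoich, rates, hazards, 100.0, rng)
data = sample_at(times, states, np.arange(90, 100))   # 10 time points
print(data.shape)
```

Each row of data holds the molecule counts at one integer time point, so the array plays the role of the 10-by-40 measurement dataset described above, here with 3 columns instead of 40.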
6. GRAPHICAL MODELS
1. Graphical Model From Shrinkage Covariance Matrix
• The precision matrix is obtained from the covariance matrix of the nodes in a network, and
shrinking the covariance matrix improves the inference for a sparse network. In this method
the shrunken estimate of the covariance matrix is S* = λT + (1 - λ)S, where S is the unbiased
estimate of Σ, T = diag(s11, ..., spp) represents a low-dimensional target, and λ is the
shrinkage parameter, estimated by minimizing the mean squared error loss function
(Schafer and Strimmer, 2005).
• If the shrinkage parameter λ is high, the shrunken estimate S* lies close to the
low-dimensional target T, which lowers its variance but raises its bias. If λ is low, S* stays
close to the high-dimensional S, with lower bias but higher variance. The objective is therefore
to find the optimal value of λ, which is achieved by minimizing the associated loss function.
• In the application of this graphical model to the JAK-STAT pathway, we observe that the
interaction strengths obtained from the shrinkage estimates mostly agree with the current
literature under λ = 0.56. For instance, the estimated strength between IFN_influx and
IFN_free, which gives a relatively high correlation of 0.43, was checked against biological
knowledge, and these two nodes were found to interact strongly indeed.
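A minimal numerical sketch of the shrinkage estimate S* = λT + (1 - λ)S, with λ the shrinkage parameter, is given below. Several things here are assumptions made for illustration only: the data are random numbers rather than the simulated JAK-STAT counts, λ is fixed by hand at the value 0.56 reported above instead of being estimated from the data as in Schafer and Strimmer (2005), and the function names are hypothetical. Interaction strengths are read off as partial correlations from the inverse of S*.

```python
import numpy as np

def shrink_covariance(S, lam):
    """S* = lam * T + (1 - lam) * S with diagonal target T = diag(S)."""
    T = np.diag(np.diag(S))
    return lam * T + (1.0 - lam) * S

def partial_correlations(S_star):
    """Partial correlations from the precision matrix of S*."""
    theta = np.linalg.inv(S_star)
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 40))      # placeholder data: 10 time points, 40 nodes
S = np.cov(X, rowvar=False)        # unbiased estimate; singular because n < p
lam = 0.56                         # fixed by hand for this sketch
S_star = shrink_covariance(S, lam)
pcor = partial_correlations(S_star)
print(pcor.shape)
```

With 10 observations and 40 nodes the sample covariance S cannot be inverted, whereas S* is positive definite for any λ > 0 as long as all sample variances are positive. This is what makes the precision matrix available in the small-sample setting.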
7. GRAPHICAL MODELS
2. Lasso-based Graphical Model
• To infer the strength of the interactions when the absent links are already known, we
can use the lasso-based regression model, which regresses each node on the remaining ones
via Y(p) = Y(-p)β + ε, where p is the last node, -p denotes the remaining nodes, β is the
vector of regression coefficients, and ε is a normally distributed error with zero mean. β is
estimated under an L1 penalty on β, controlled by the penalty parameter λ.
• This approach also enables us to infer the whole structure, i.e., which nodes and links exist,
in sparse networks. It is computationally efficient and provides a good approximation to the
distribution of the variables, but it can produce a non-symmetric covariance matrix
(Meinshausen and Bühlmann, 2006; Wit et al., 2010).
• To implement the lasso-based approach, we check the number of correctly estimated links
for each λ, and we observe that the best solution for both the interaction strengths and the
estimated network structure is obtained under λ = 0.0001, which enforces sparsity in the
network.
Figure 2: Estimated system via the L1 penalized lasso regression under λ = 0.1.
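The node-wise regression idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the lasso is solved with a plain coordinate-descent loop written for this example, the objective is scaled as (1/2n)||y - Xb||^2 + λ||b||_1, the four-variable chain data are synthetic, and the function names are hypothetical. Because each node is regressed separately, the coefficient matrix B is generally not symmetric; the sketch symmetrises the edge set with an "or" rule, which is one of the options discussed by Meinshausen and Bühlmann (2006).

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                          # residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]           # remove the contribution of b_j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]           # add the updated contribution back
    return b

def neighbourhood_selection(X, lam):
    """Regress every node on all others; return coefficients and edges."""
    n, p = X.shape
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    B = np.zeros((p, p))
    for k in range(p):
        others = [j for j in range(p) if j != k]
        B[k, others] = lasso_cd(Xc[:, others], Xc[:, k], lam)
    adj = (B != 0) | (B != 0).T           # "or" rule for symmetrisation
    return B, adj

# Synthetic chain x0 - x1 - x2, plus an unrelated x3 (illustration only)
rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
x1 = x0 + 0.3 * rng.normal(size=200)
x2 = x1 + 0.3 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x0, x1, x2, x3])

B, adj = neighbourhood_selection(X, lam=0.1)
B_big, adj_big = neighbourhood_selection(X, lam=5.0)
print(adj.astype(int))
```

A very large penalty removes every link, while a smaller one keeps the strong links of the chain, which mirrors the role the penalty parameter plays in controlling sparsity.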
8. GRAPHICAL MODELS
3. Graphical Lasso With L1 Penalized Likelihood Approach
• Unlike the previous lasso method, this approach penalizes the entries of the precision
matrix rather than the regression parameters, which ensures a symmetric and invertible
covariance matrix. The estimation is carried out by solving
max_θ ( log|θ| - trace(Sθ) - λ||θ||1 ), where θ is the precision matrix and λ is the penalty
parameter. To find an optimal value of λ, ROC-type curves can be drawn to compare the
sensitivity and specificity values. The λ which maximizes the sensitivity is then chosen as the
optimal penalty parameter (Wit et al., 2010).
• In the application of the L1-penalized lasso regression, we compute the true positive rate
versus the false positive rate, as shown in Table 1. The results indicate that the optimum is
attained at the penalty parameter λ = 0.1.
Table 1: The true positive and false positive rates for the L1-penalized lasso regression.
λ       True positive rate   False positive rate
0.1     0.3925               0.2076
0.5     0.3738               0.2016
0.7     0.3551               0.1956
0.75    0.3551               0.1969
0.8     0.3551               0.1956
0.9     0.3551               0.1942
0.95    0.3551               0.1929
Figure 3: Estimated system via the L1 penalized lasso regression under λ = 0.1.
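The rates in Table 1 compare an estimated network with the true one. A minimal sketch of that comparison is shown below, assuming undirected networks stored as symmetric 0/1 adjacency matrices, so that each pair of nodes is counted once. The four-node matrices and the function name tpr_fpr are hypothetical and serve only to make the arithmetic visible.

```python
import numpy as np

def tpr_fpr(true_adj, est_adj):
    """True and false positive rates over the distinct node pairs."""
    iu = np.triu_indices_from(true_adj, k=1)   # upper triangle, no diagonal
    t = true_adj[iu].astype(bool)
    e = est_adj[iu].astype(bool)
    tp = np.sum(t & e)       # true links that were found
    fn = np.sum(t & ~e)      # true links that were missed
    fp = np.sum(~t & e)      # absent links that were estimated
    tn = np.sum(~t & ~e)     # absent links correctly left out
    return tp / (tp + fn), fp / (fp + tn)

def adjacency(p, edges):
    A = np.zeros((p, p), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return A

# Hypothetical example: 3 true links, of which 2 are recovered, 1 false link
true_adj = adjacency(4, [(0, 1), (1, 2), (2, 3)])
est_adj = adjacency(4, [(0, 1), (1, 2), (0, 3)])
tpr, fpr = tpr_fpr(true_adj, est_adj)
print(tpr, fpr)
```

Repeating this computation for every candidate penalty value gives one (false positive rate, true positive rate) point per value, which is how a ROC-type curve of the kind described above is traced.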
9. CONCLUSION and FUTURE WORK
We see that the graphical model is promising for inference in sparse and high-dimensional
networks. However, the performance of the estimates fluctuates strongly with the chosen
penalty parameter λ. We therefore believe that the final network structure can be inferred
under different criteria, including model selection criteria such as AIC and BIC, as proposed
in the recent study of Wit et al. (2010).
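One way such criteria could be evaluated for a candidate precision matrix is sketched below. This is an assumption about how the idea might be applied, not a procedure taken from the study or from Wit et al. (2010): the Gaussian log-likelihood is written up to an additive constant as (n/2)(log|θ| - trace(Sθ)), the number of free parameters is taken as the p diagonal entries plus the non-zero off-diagonal pairs, and the function names and the toy matrices are hypothetical.

```python
import numpy as np

def gaussian_loglik(theta, S, n):
    """Gaussian log-likelihood of precision matrix theta, up to a constant."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        raise ValueError("theta must have a positive determinant")
    return 0.5 * n * (logdet - np.trace(S @ theta))

def aic_bic(theta, S, n, tol=1e-8):
    """AIC and BIC with k = p + number of non-zero off-diagonal pairs."""
    p = theta.shape[0]
    upper = theta[np.triu_indices(p, k=1)]
    k = p + int(np.sum(np.abs(upper) > tol))
    ll = gaussian_loglik(theta, S, n)
    return -2.0 * ll + 2.0 * k, -2.0 * ll + k * np.log(n)

# Toy check: identity covariance and identity precision, n = 10 observations
S = np.eye(3)
theta = np.eye(3)
aic, bic = aic_bic(theta, S, n=10)
print(aic, bic)
```

Under these assumptions, the network estimated at each penalty value would receive one AIC and one BIC score, and the penalty with the smallest score would be preferred.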
10. REFERENCES
• Bower, J. and H. Bolouri (2001); Computational Modelling of Genetic and Biochemical
Networks; MIT, 2nd Edition
• Friedman, J., Hastie, T. and R. Tibshirani (2008); Sparse Inverse Covariance Estimation with
the Graphical Lasso; Biostatistics; Vol. 9, No. 3 (pp. 432-441)
• Gillespie, D.T. (1977); Exact Stochastic Simulation of Coupled Chemical Reactions; Journal of
Physical Chemistry; Vol. 81, No. 25 (pp. 2340-2361)
• Maiwald, T., Schneider, A., Busch, H., Sahle, S., Gretz, N., Weiss, T., Kummer, U. and U.
Klingmuller (2010); Combining Theoretical Analysis and Experimental Data Generation
Reveals IRF9 as a Crucial Factor for Accelerating Interferon Induced Early Antiviral Signalling;
FEBS Journal 277 (pp. 4741-4754)
• Meinshausen, N. and P. Bühlmann (2006); High Dimensional Graphs and Variable Selection
with the Lasso; Annals of Statistics; Vol. 34, No. 3 (pp. 1436-1462)
• Schafer, J. and K. Strimmer (2005); A Shrinkage Approach to Large-Scale Covariance Matrix
Estimation and Implications for Functional Genomics; SAGEM, Vol. 4, No. 1
• Whittaker, J. (1990); Graphical Models in Applied Multivariate Statistics; John Wiley and Sons
• Wit, E., Vinciotti, V. and V. Purutçuoğlu (2010); Statistics for Biological Networks; Short Course
Notes: 25th International Biometric Conference (IBC); Florianopolis, Brazil (pp. 1-197)