Artificial Intelligence in Systems Biology
Hai Huang, Master Student, IBBME

Manuscript received October 21, 2003. H. Huang is with the Institute of Bio-material and Bio-medical Engineering, University of Toronto, Canada (corresponding author e-mail: hai.huang@utoronto.ca).

Abstract: Systems biology extends the perspective from individual biological components to the system level. This development requires advanced modelling skills and data processing techniques. Artificial intelligence could be the answer to this demand. It has shown its power in multiple alignment modelling of genes and in phylogenetic likelihood inference. Its active learning algorithms will accelerate the evolution of systems biology.

Index Terms: Systems biology, system structure, system dynamics, artificial intelligence, knowledge and reasoning, machine learning.

I. INTRODUCTION

Systems biology is a system-level understanding of biology [1], which was first introduced about 50 years ago. Compared to traditional biology, it is still in its infancy. But as an emerging science, it has recently shown its potential to shape the direction of molecular, genomic, and pharmacological research. However, further advancement in systems biology is not free of obstacles. Technologies from other scientific fields are needed to support breakthroughs in systems biology. Artificial Intelligence (AI), as one of these supporting tools, has demonstrated its potential in overcoming the difficulties faced by systems biology. Some concepts of AI have already been applied in systems biology, while others are beginning to be utilized.

The purpose of this paper is to provide an overview of systems biology and AI. The application of AI in systems biology is introduced, and the likely future relationship between AI and systems biology is discussed.

II. CHALLENGES IN SYSTEMS BIOLOGY

The understanding of system-level biology is derived from insight into four key elements: system structure, system dynamics, control method, and design method [1]. Progress has been made in each of these areas since the emergence of systems biology, but every step of advancement was full of frustration.

System structure is not a list of isolated components of a cell or organism; it is more about the relationships between these components [1]. However, from a biological viewpoint, describing those relationships clearly is very challenging. For example, the similarity of DNA between different species has a profound impact on evolutionary biology; however, in searching for this similarity, J. P. Huelsenbeck et al. found it very frustrating because it was hidden in a flood of data [2]. Finding the clues in this voluminous data demands great experience, knowledge and patience. Human error will inevitably have a strong adverse influence on the outcome.

To understand system dynamics is to find out “how a system behaves over time and under various conditions” [1]. A biological system is far more complex than a mechanical system; sometimes the same chemical messenger can carry several signals simultaneously on different time scales [3]. This causes a lot of confusion in understanding the roles of different parallel processes and feedback mechanisms.

Based on knowledge of the structure and dynamics of a particular system, control and design methods can be used to control the state and modify the properties of the system [1]. For example, monitoring and controlling side effects are major issues in the development of new drugs, especially gene-protein target drugs [4]. Difficulties arise here because the target genes produce large amounts of proteins, some functions of which are unknown. Yet to control the therapeutic effects and design the drugs, it is essential to identify the unknown functions and eliminate the undesired ones.

III. ARTIFICIAL INTELLIGENCE

Facing all these challenges, systems biology adopts many techniques from other fields, such as systems engineering, information technology, and control theory. AI is a relatively new science incorporated in the development of systems biology.

AI emerged as a new category of science in the 1950s, at the same time as the term “systems biology” was coined. It refers to thinking and acting as a human being, or at least thinking and acting rationally, rather than just imitating what a human being does [5]. If a system can only mimic a person's actions, it is just a manipulator, not actual artificial intelligence.

The Turing Test, proposed in 1950, was the first landmark of AI. In the Turing Test, an interrogator was connected to a person or a machine via a terminal, which prevented the interrogator from seeing the counterpart. The interrogator's task was to find out whether the counterpart was a machine by only asking questions [6]. If the machine could “fool” the interrogator, it was considered an intelligent entity. The Turing Test demonstrated the possibility that a machine could act as a human being.

Another well-known milestone of AI was Deep Blue. In May 1997, IBM's Deep Blue supercomputer played a match against the World Chess Champion, Garry Kasparov, and won [7]. It revealed that machines were able to compete with human beings to some degree.

Generally speaking, the scope of AI covers all human activities, such as observing the environment, judging successful behaviour, seeking the proper method, and adjusting knowledge while interacting with the target. It can be classified into four categories: problem solving; knowledge and reasoning; machine learning; and autonomous planning, communicating, perceiving and acting [5].

Problem solving is the basis of AI. It presents a topological view. Usually, from the AI perspective, a problem like “How can one thing go from state A to state B?” can be solved by searching an existing database subject to constraints and conditions. This search can be target oriented, start-point oriented, or bidirectional.

Knowledge and reasoning is about understanding and identifying successful behaviour in a complex environment. It is the key component of AI and plays a crucial role in dealing with partially observable environments. Based on logic, probability, and statistical theory, two important theories were developed. One is the Bayesian network and the other is the Hidden Markov Model, both of which are dominant in current AI. They are discussed in detail in Section IV of this paper.

Machine learning enables a system to adjust itself to the environment. Whether supervised or unsupervised, passive or active, machine learning aims to improve the system's ability to act in the future. It is now the most important trend in the development of AI.

Autonomous planning, communicating, perceiving and acting implement the thinking part of AI in its acting part. They are the applications of problem solving, knowledge and reasoning, and machine learning.

These four aspects make AI a very good tool for reducing human error, improving efficiency, saving time, and decreasing costs, and thus allow it to be applied in overcoming the difficulties faced by systems biology.

IV. APPLICATION OF AI IN SYSTEMS BIOLOGY

Systems biology and AI developed in parallel as two distinct disciplines before the 1980s. However, in the past twenty years, rapid technological development has created opportunities for AI to be applied in systems biology. Advances in computer science and information technology give AI more powerful computer platforms as its tools. At the same time, new theoretical concepts and approaches in computer science enhance the theoretical development of AI. On the other hand, new technologies such as gene microarrays have been brought into systems biology. These technologies create opportunities to digitize experimental results and improve the repeatability of tests. They also produce an enormous amount of data. Therefore, new methods are needed to process these data. AI has been acting as a useful tool in these situations.

Of the four components of AI, problem solving is the basis of the other three. Thus, its application in systems biology is embedded in the application of the other three aspects of AI. Being the best studied element of AI, knowledge and reasoning has a relatively wide application in systems biology today. Examples of its application at two different levels of systems biology are discussed in the following paragraphs. Some pioneering studies are also being done on the application of machine learning in systems biology. However, the application of autonomous planning, communicating, perceiving and acting has not yet been seen.

A. Bayesian Inference of Phylogeny

The idea that species are related is not new. More than a century ago, Darwin became one of the pioneers in the area of evolutionary biology. These pioneers intended to reveal a systematic structure from a biological point of view. In line with the current trend in biology, phylogeny has become more like a bioinformatics science. The large volume of molecular data has transformed the question of the history of life into a statistical and computational problem. Many different inferential methods have been introduced into phylogenetic analysis to seek the relationships between different biological classes. Among them, Bayesian inference, an important AI theory and application, is relatively new in this field, but it is a powerful tool for addressing a number of long-standing and complex questions in evolutionary biology. Table 1 lists some applications of Bayesian inference in phylogeny.

Problem | Bayesian approach
Inferring phylogeny | Find tree with maximum posterior probability; evaluate features in common among the sampled trees
Evaluating uncertainty in phylogenies | Evaluate clade probabilities; form credible set containing trees whose cumulative probability sums to 0.95
Detecting selection | Model substitution process on the codon and calculate probability of being in purifying or positively selected class; sample substitutions and count number of synonymous and nonsynonymous changes
Comparative analyses | Perform analysis on many trees, and weight results by the probability that each tree is correct
Divergence times | Use fossils as a calibration. Infer divergence times by using a strict or relaxed molecular clock
Testing molecular clock | Calculate Bayes factor for the clock versus no clock branch length restrictions

Table 1 Bayesian approach to problems in phylogeny

Bayesian inference computes the posterior probability distribution for a set of query variables over a Bayesian network, which is able to represent the dependencies among variables and give a concise specification of any full joint probability distribution [5]. As part of the knowledge and reasoning category of AI, this inference aims to identify the correct relationship between different elements. The basic expression of Bayesian theory is:

$$\Pr[\mathrm{Tree} \mid \mathrm{Data}] = \frac{\Pr[\mathrm{Data} \mid \mathrm{Tree}] \times \Pr[\mathrm{Tree}]}{\Pr[\mathrm{Data}]}$$

In phylogeny, this expression is used to combine the prior probability of a phylogeny (Pr[Tree]) with the likelihood (Pr[Data | Tree]) to produce a posterior probability distribution on trees (Pr[Tree | Data]). Inferences about the history of the group are based on the posterior probability of trees. The tree with the highest posterior probability might be chosen as the best estimate of phylogeny [2].
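
The combination of prior and likelihood described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of [2]: the three candidate topologies, their log-likelihood values, and the function names `posterior_over_trees` and `credible_set` are all hypothetical. The second function mirrors the idea of a credible set whose cumulative probability reaches 0.95.

```python
from math import exp, log

def posterior_over_trees(log_likelihoods, priors):
    """Return Pr[Tree | Data] for each candidate tree via Bayes' rule."""
    # Unnormalised log posterior: log Pr[Data | Tree] + log Pr[Tree]
    log_joint = {t: log_likelihoods[t] + log(priors[t]) for t in log_likelihoods}
    # Subtract the maximum before exponentiating for numerical stability
    m = max(log_joint.values())
    weights = {t: exp(v - m) for t, v in log_joint.items()}
    total = sum(weights.values())  # plays the role of Pr[Data]
    return {t: w / total for t, w in weights.items()}

def credible_set(posterior, mass=0.95):
    """Most probable trees whose cumulative posterior first reaches `mass`."""
    chosen, cumulative = [], 0.0
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    for tree, p in ranked:
        chosen.append(tree)
        cumulative += p
        if cumulative >= mass:
            break
    return chosen

# Hypothetical log-likelihoods for the three rooted topologies of species A, B, C
log_lik = {"((A,B),C)": -1000.0, "((A,C),B)": -1002.0, "((B,C),A)": -1005.0}
prior = {t: 1.0 / 3.0 for t in log_lik}  # flat prior over topologies (assumption)

post = posterior_over_trees(log_lik, prior)
best_tree = max(post, key=post.get)
print(best_tree, round(post[best_tree], 3))
print(credible_set(post))
```

With a flat prior the ranking of trees is driven entirely by the likelihood; an informative prior would shift the posterior accordingly. For realistic numbers of species the set of trees is far too large to enumerate, which is why a sampling method is needed in practice.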
Huelsenbeck et al. implemented this approach using Markov chain Monte Carlo (MCMC), a numerical method for Bayesian inference. Two important practical problems were associated with the application of MCMC. One was the modelling assumption: a poorly fitted assumption would lead to a wrong inference. Their analyses assumed the general time reversible (GTR) model of DNA substitution, which allowed each nucleotide change to have its own rate and the nucleotide bases to have different frequencies. The model also allowed rates to vary across sites, either by treating the rate as a random variable or by partitioning the sites by codon position. The other problem was to determine how long to run a chain to obtain a good approximation of the posterior probabilities of trees. In some cases the MCMC algorithm would fail to converge. Eventually they identified convergence by trial and error.
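
To make the sampling idea and the convergence question concrete, the following is a minimal sketch of a plain Metropolis sampler. It is not the program of Huelsenbeck et al.: the four "trees" are just indices with hypothetical unnormalised posterior weights, the proposal is a uniform draw, and comparing the frequencies from two independently started chains is used here only as a crude convergence check.

```python
import random

def metropolis_chain(weights, start, n_steps, seed):
    """Visit state indices with long-run frequency proportional to `weights`."""
    rng = random.Random(seed)
    current = start
    counts = [0] * len(weights)
    for _ in range(n_steps):
        proposal = rng.randrange(len(weights))  # uniform, hence symmetric, proposal
        # Metropolis rule: accept with probability min(1, w_new / w_old)
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        counts[current] += 1
    return [c / n_steps for c in counts]

# Hypothetical unnormalised posterior weights for four candidate trees
weights = [1.0, 0.135, 0.007, 0.05]

# Two chains started from different states with different random seeds
chain_a = metropolis_chain(weights, start=0, n_steps=100_000, seed=1)
chain_b = metropolis_chain(weights, start=3, n_steps=100_000, seed=2)

# If both chains have converged, their estimated frequencies should agree
max_gap = max(abs(a - b) for a, b in zip(chain_a, chain_b))
print(chain_a)
print(chain_b)
print(max_gap)
```

Only ratios of weights are needed, so the normalising constant Pr[Data] never has to be computed. A large gap between the two chains would suggest that they have not been run long enough.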
Based on a variant of MCMC called Metropolis-coupled MCMC, Huelsenbeck et al. designed a computer program [2]. They applied this program to four large phylogenetic data sets. The smallest data set included 106 wingless sequences sampled from insects, and the largest included 357 atpB sequences sampled from plants. Figure 1 shows the posterior probability of a clade, conditional on the observed DNA sequences, for two chains, each starting from a different random tree. The posterior probabilities of individual clades found in different chains are highly correlated. No obvious correlation is found across clades. This result indicated that Bayesian inference could be a precise method in phylogenetic analysis.

Figure 1 Convergence of independent Markov Chain

Huelsenbeck and his colleagues pointed out that Bayesian inference could be used as an important method in the study of molecular evolution, especially in the field of substitution patterns. Their next step was to construct a large tree or network for better understanding of the evolution of the genome in the context of phylogeny.

Huelsenbeck's study applies an important AI theory, Bayesian inference, to find the relationships between different species. This approach demonstrates that AI is able to recognize and build the structure of a complex biological system, such as an evolutionary tree.

B. Hidden Markov Model (HMM) in Biopolymers

The Hidden Markov Model builds on the Markov chains introduced by the Russian mathematician A. A. Markov. It is a very influential modelling method in the knowledge and reasoning category of AI. An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable; the values of this variable are the possible states of the world [5]. The structure of the HMM allows simple and elegant computation of all the basic AI inference algorithms. HMMs are used to search for patterns and to detect phenomena in uncharacterized data. They were first used in speech recognition in the 1970s and 1980s. From the late 1990s, some genomic researchers started to use HMMs as an analysis tool. In 2002, M. Amitai et al. tried to use HMMs in the study of gene-finding and functional annotation [8].

A particular challenge of gene-finding and functional annotation is how to describe multiple alignments. Multiple alignments show the dynamic properties of protein sequences. Finding multiple alignments can be done in a laboratory with real “wet” experiments, which are very expensive and time consuming.

Figure 2 An example of a multiple alignment.

To save cost and time, Amitai wanted to find solutions from the existing public data (mainly genomic DNA, messenger RNA, and their corresponding protein sequences), which were available in large amounts. In GenBank alone, there were approximately 28,507,990,166 bases in 22,318,883 sequence records as of January 2003 [10]. It was impossible for a human being to find the hidden relationships in these billions of bases. The HMM, as a model of AI, was therefore used. There were three reasons for using HMMs in modelling proteins and genes. First, HMMs had the advantage of precise probabilistic modelling. Second, the experience gained from the same tools in speech recognition could be utilized [8]. Third, some computer programs for building and applying HMMs were well developed. Among these programs, a few focused on the sequence analysis of proteins, such as HMMer and SAM [12, 13].
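
As a rough illustration of the kind of temporal probabilistic model described in this subsection, the sketch below scores a short sequence with the forward algorithm of a discrete HMM. It does not reproduce HMMer or SAM: the two hidden states ("conserved" and "variable"), the restriction to the symbols P and R, and every probability in the tables are hypothetical values chosen only for illustration.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence under a discrete HMM."""
    # alpha[s] = Pr(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    # Sum over the final hidden state to obtain the sequence probability
    return sum(alpha.values())

# Hypothetical two-state model over the symbols P (proline) and R (arginine)
states = ("conserved", "variable")
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.3, "variable": 0.7},
}
emit_p = {
    "conserved": {"P": 0.9, "R": 0.1},
    "variable": {"P": 0.5, "R": 0.5},
}

for seq in ("PP", "PR", "RP", "RR"):
    print(seq, forward_likelihood(seq, states, start_p, trans_p, emit_p))
```

Sequences that fit the model receive higher probabilities, which is the basis for deciding whether a new sequence resembles the family the model was built from. Profile HMM tools use a richer state layout than this toy model.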

Figure 2 is a real example taken from the PDGF (platelet-derived growth factor) family. At position 17, half of the sequences have an amino acid, which is proline or arginine with equal probability. The other half have no amino acid at position 17 (called a deleted position) [8]. Although the statistical population is relatively small, knowledge of protein evolution suggests that a new member of the same family will behave similarly at the same position [11]. This similarity meets the assumption of the HMM. Figure 3 shows part of an HMM constructed from the multiple alignment. Here state M16 corresponds to position 16. From this state, there is a 50% probability of moving to D17 (deleted position) and a 50% probability of moving to M17. M17 is a clustered state that emits P (proline) with 50% probability and R (arginine) with 50% probability. A protein is then aligned to the HMM according to these probabilities. The model identifies the similarity with other proteins and predicts the multiple alignment for members of the same family. It draws a dynamic picture of the protein sequence.

Figure 3 Part of HMM for multiple alignment from Figure 2

In this example, the HMM, one of the most popular models in AI, describes the multiple alignment in the PDGF family. It shows AI's ability to find and understand the dynamics of a biological system.

C. Preliminary Application of Machine Learning

A very important aspect of AI is learning like a human being. This technique will be of great help in the modelling procedure. The model and its assumptions are usually crucial in systems biology research. Several possible models may fit one topic, and finding the best choice is very difficult in most cases. Because there was no proper method to detect the validity of a model, Huelsenbeck and his colleagues had to determine convergence by trial and error. In this case, a self-learning convergence module could have been built into their program to improve its efficiency.

Some pioneering studies are in progress. C. Yoo and G. F. Cooper introduced a system named GEEVE, which can automatically pick the best model to find a causal pathway in genes [9]. This system recommends a model based on previous results and adjusts the recommendation according to recent evaluations [9].

V. CONCLUSION

AI technologies have proven beneficial to the development of systems biology. Problem solving is the basis of AI, and its importance shows in the application of all the other aspects of AI in systems biology. Knowledge and reasoning is currently the most widely applied. It helps in the identification of system structure, as in the example of Bayesian inference of phylogeny. It also shows its value in understanding system dynamics, as exemplified by the Hidden Markov Model in biopolymers. Machine learning is starting to demonstrate its power in assisting control and design methods. In addition, applying AI to areas of systems biology such as building a knowledge base, choosing models, analyzing data and evaluating results will be the trend of AI implementation in systems biology in the near future. The more complex application of autonomous planning, communicating, perceiving and acting is likely to happen after machine learning is well adopted in systems biology.

REFERENCES

[1] H. Kitano, “Systems biology: a brief overview,” Science, 2002, vol. 295, pp. 1662-1664.
[2] J. P. Huelsenbeck, F. Ronquist, and R. Nielsen, “Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology,” Science, 2002, vol. 294, pp. 2310-2318.
[3] N. C. Spitzer and T. J. Sejnowski, “Biological Information Processing: Bits of Progress,” Science, 2000, vol. 277, pp. 1060-1063.
[4] A. Renner and A. Aszodi, “High-throughput Functional Annotation of Novel Gene Products using Document Clustering,” Pacific Symposium on Biocomputing 2000, pp. 54-68.
[5] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, New Jersey, USA, 2003.
[6] A. M. Turing, “Computing Machinery and Intelligence,” Mind: A Quarterly Review of Psychology and Philosophy, 1950. Available online: http://www.abelard.org/turpap/turpap.htm
[7] IBM, “Deep Blue,” 1997. Available online: http://www.research.ibm.com/deepblue/
[8] M. Amitai, “Hidden Models in Biopolymers,” Science, 2001, vol. 282, pp. 1436-1440.
[9] C. Yoo and G. F. Cooper, “An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways,” unpublished.
[10] NCBI, “What is GenBank?,” 2003. Available online: http://www.ncbi.nlm.nih.gov/Genbank/
[11] J. Sjolander et al., Comput. Appl. Biosci., 1996, vol. 12, pp. 327.
[12] S. Eddy, “HMMer: Profile HMMs for protein sequence analysis,” 2003. Available online: http://hmmer.wustl.edu/
[13] UCSC, “Sequence Alignment and Modelling System,” 2003. Available online: http://www.cse.ucsc.edu/research/compbio/sam.html