IEEE Projects 2012-2013 - Bio-Informatics

Elysium Technologies Private Limited
Approved by ISO 9001:2008 and AICTE for SKP Training
Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com

IEEE FINAL YEAR PROJECTS 2012 – 2013
BIO-INFORMATICS
Corporate Office: Madurai
227-230, Church road, Anna nagar, Madurai – 625 020.
0452 – 4390702, 4392702, +9199447933980
Email: info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
Website: www.elysiumtechnologies.com

Branch Office: Trichy
15, III Floor, SI Towers, Melapudur main road, Trichy – 620 001.
0431 – 4002234, +919790464324.
Email: trichy@elysiumtechnologies.com, elysium.trichy@gmail.com.

Branch Office: Coimbatore
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641 002.
+919677751577
Website: Elysiumtechnologies.com, Email: info@elysiumtechnologies.com

Branch Office: Kollam
Surya Complex, Vendor junction, Kollam – 691 010, Kerala.
0474 – 2723622, +919446505482.
Email: kerala@elysiumtechnologies.com.

Branch Office: Cochin
4th Floor, Anjali Complex, near south over bridge, Valanjampalam,
Cochin – 682 016, Kerala.
0484 – 6006002, +917736004002.
Email: kerala@elysiumtechnologies.com, Website: www.elysiumtechnologies.com

IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects


BIO-INFORMATICS 2012 - 2013
EGC 8201 A Biologically Inspired Validity Measure for Comparison of Clustering Methods over Metabolic Data Sets

In the biological domain, clustering is based on the assumption that genes or metabolites involved in a common
biological process are coexpressed/coaccumulated under the control of the same regulatory network. Thus, a detailed
inspection of the grouped patterns to verify their memberships to well-known metabolic pathways could be very useful
for the evaluation of clusters from a biological perspective. The aim of this work is to propose a novel approach for the
comparison of clustering methods over metabolic data sets, including prior biological knowledge about the relation
among elements that constitute the clusters. A way of measuring the biological significance of clustering solutions is
proposed. This is addressed from the perspective of the usefulness of the clusters to identify those patterns that change
in coordination and belong to common pathways of metabolic regulation. The measure summarizes in a compact way
the objective analysis of clustering methods with respect to cluster coherence and distribution. It also evaluates the
biological internal connections of such clusters considering common pathways. The proposed measure was tested in
two biological databases using three clustering methods.

EGC 8202 A Co-Clustering Approach for Mining Large Protein-Protein Interaction Networks

Several approaches have been presented in the literature to cluster Protein-Protein Interaction (PPI) networks. They can
be grouped in two main categories: those allowing a protein to participate in different clusters and those generating only
nonoverlapping clusters. In both cases, a challenging task is to find a suitable compromise between the biological
relevance of the results and a comprehensive coverage of the analyzed networks. Indeed, methods returning highly
accurate results are often able to cover only small parts of the input PPI network, especially when low-characterized
networks are considered. We present a coclustering-based technique able to generate both overlapping and
nonoverlapping clusters. The density of the clusters to search for can also be set by the user. We tested our method on
the yeast and human networks, and compared it to five other well-known techniques on the same interaction data
sets. The results showed that, for all the examples considered, our approach always reaches a good compromise
between accuracy and network coverage. Furthermore, the behavior of our algorithm is not influenced by the structure
of the input network, unlike all the other techniques in the comparison, which returned very good results
on the yeast network but rather poor ones on the human network.

EGC 8203 A Comparative Study on Filtering Protein Secondary Structure Prediction

Filtering of Protein Secondary Structure Prediction (PSSP) aims to provide physicochemically realistic results, while it
usually improves the predictive performance. We performed a comparative study on this challenging problem, utilizing
both machine learning techniques and empirical rules and we found that combinations of the two lead to the highest
improvement.

EGC 8204 A Computational Model for Predicting Protein Interactions Based on Multidomain Collaboration
Recently, several domain-based computational models for predicting protein-protein interactions (PPIs) have been
proposed. The conventional methods usually infer domain or domain combination (DC) interactions from already known
interacting sets of proteins, and then predict PPIs using the information. However, the majority of these models often
have limitations in providing detailed information on which domain pair (single domain interaction) or DC pair
(multidomain interaction) will actually interact for the predicted protein interaction. Therefore, a more comprehensive
and concrete computational model for the prediction of PPIs is needed. We developed a computational model to predict
PPIs using the information of intraprotein domain cohesion and interprotein DC coupling interaction. A method of
identifying the primary interacting DC pair was also incorporated into the model in order to infer actual participants in a
predicted interaction. Our method made an apparent improvement in the PPI prediction accuracy, and the primary
interacting DC pair identification was valid specifically in predicting multidomain protein interactions. In this paper, we
demonstrate that 1) the intraprotein domain cohesion is meaningful in improving the accuracy of domain-based PPI
prediction, 2) a prediction model incorporating the intradomain cohesion enables us to identify the primary interacting
DC pair, and 3) a hybrid approach using the intra/interdomain interaction information can lead to a more accurate
prediction.

EGC 8205 A Framework for Incorporating Functional Interrelationships into Protein Function Prediction Algorithms

The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many
computational approaches have been developed in recent years to predict protein function, most of these traditional
algorithms do not take interrelationships among functional terms into account, such as the fact that different GO terms
often coannotate common proteins. In this study, we propose a new functional similarity measure in the form of a
Jaccard coefficient to quantify these interrelationships and also develop a framework for incorporating GO term
similarity into the protein function prediction process. The experimental results of cross-validation on S. cerevisiae and
Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function
prediction. In addition, we find that small terms, associated with only a few proteins, benefit more than large
ones when functional interrelationships are considered. We also compare our similarity measure with two other
widely used measures, and the results indicate that when incorporated into function prediction algorithms, our proposed
measure is more effective. Experimental results also illustrate that our algorithms outperform two previous competing
algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our
method is robust to annotations in the database which are not complete at present. These results give new insights
about the importance of functional interrelationships in protein function prediction.
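
To illustrate the Jaccard-style similarity mentioned in the abstract above, here is a minimal sketch that scores two functional terms by the overlap of the protein sets they annotate. The annotation table, term IDs and protein names are made-up toy data, and the measure proposed in the paper may differ from the plain Jaccard coefficient shown here.

```python
def jaccard(a, b):
    """Plain Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical annotation table: functional term -> proteins it annotates.
annotations = {
    "GO:A": {"P1", "P2", "P3", "P4"},
    "GO:B": {"P3", "P4", "P5"},
    "GO:C": {"P6"},
}


def term_similarity(t1, t2, table=annotations):
    """Similarity of two terms based on the proteins they coannotate."""
    return jaccard(table[t1], table[t2])


print(term_similarity("GO:A", "GO:B"))  # 2 shared proteins out of 5 in total
print(term_similarity("GO:A", "GO:C"))  # no shared proteins
```

Terms that share many proteins score close to 1, while terms with disjoint protein sets score 0, which is the kind of interrelationship the abstract proposes to exploit.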

EGC 8206 A Hybrid Approach to Survival Model Building Using Integration of Clinical and Molecular Information in Censored Data

In the medical community, prognostic models that use clinicopathologic features to predict prognosis after a certain
treatment have been externally validated and used in practice. In recent years, most research has focused on high
dimensional genomic data and small sample sizes. Since clinically similar but molecularly heterogeneous tumors may
produce different clinical outcomes, the combination of clinical and genomic information, which may be complementary,
is crucial to improve the quality of prognostic predictions. However, there is a lack of an integrating scheme for clinic-genomic
models due to the $P \gg N$ problem, in particular, for a parsimonious model. We propose a methodology
to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which
uses several dimension reduction techniques, $L_2$ penalized maximum likelihood estimation (PMLE), and
resampling methods to tackle the problem. The predictive accuracy of the modeling approach is assessed by several
metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies on a
metastasis and death event, we show that the proposed methodology can improve prediction accuracy and build a final
model with a hybrid signature that is parsimonious when integrating both types of variables.

EGC 8207 A Hybrid EKF and Switching PSO Algorithm for Joint State and Parameter Estimation of Lateral Flow Immunoassay Models
In this paper, a hybrid extended Kalman filter (EKF) and switching particle swarm optimization (SPSO) algorithm is
proposed for jointly estimating both the parameters and states of the lateral flow immunoassay model through available
short time-series measurement. Our proposed method generalizes the well-known EKF algorithm by imposing physical
constraints on the system states. Note that state constraints are encountered very often in practice and give rise to
considerable difficulties in system analysis and design. The main purpose of this paper is to handle the dynamic
modeling problem with state constraints by combining the extended Kalman filtering and constrained optimization
algorithms via the maximization probability method. More specifically, a recently developed SPSO algorithm is used to
cope with the constrained optimization problem by converting it into an unconstrained optimization one through adding
a penalty term to the objective function. The proposed algorithm is then employed to simultaneously identify the
parameters and states of a lateral flow immunoassay model. It is shown that the proposed algorithm gives much
improved performance over the traditional EKF method.

EGC 8208 A Memory Efficient Method for Structure-Based RNA Multiple Alignment

Structure-based RNA multiple alignment is particularly challenging because covarying mutations make sequence
information alone insufficient. Existing tools for RNA multiple alignment first generate pairwise RNA structure
alignments and then build the multiple alignment using only sequence information. Here we present PMFastR, an
algorithm which iteratively uses a sequence-structure alignment procedure to build a structure-based RNA multiple
alignment from one sequence with known structure and a database of sequences from the same family. PMFastR also
has low memory consumption allowing for the alignment of large sequences such as 16S and 23S rRNA. The algorithm
also provides a method to utilize a multicore environment. We present results on benchmark data sets from BRAliBase,
which show that PMFastR performs comparably to other state-of-the-art programs. Finally, we regenerate 607 Rfam seed
alignments and show that our automated process creates multiple alignments similar to the manually curated Rfam seed
alignments. Thus, the techniques presented in this paper allow for the generation of multiple alignments using
sequence-structure guidance, while limiting memory consumption. As a result, multiple alignments of long RNA
sequences, such as 16S and 23S rRNAs, can easily be generated locally on a personal computer. The software and
supplementary data are available at http://genome.ucf.edu/PMFastR.

EGC 8209 A Metric for Phylogenetic Trees Based on Matching

Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of
such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have
been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be
shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance
is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small
changes—reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we
introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure
induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing
that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems
with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the
quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds
distance.
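
The sensitivity of the Robinson-Foulds distance described above can be reproduced with a few lines of code. The sketch below is an illustration only: it uses the rooted (clade-based) variant of the distance, trees written as nested tuples of leaf names, and a five-leaf example chosen for this note. It does not implement the matching-based metric proposed in the paper.

```python
def clades(tree):
    """Non-trivial clades (frozensets of leaf names) of a rooted tree.

    The tree is given as nested tuples, e.g. (("A", "B"), "C").
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is common to all trees
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(t1) ^ clades(t2))


caterpillar = (((("A", "B"), "C"), "D"), "E")
moved_leaf = (((("B", "C"), "D"), "E"), "A")  # leaf A reattached at the root

print(rf_distance(caterpillar, caterpillar))  # 0
print(rf_distance(caterpillar, moved_leaf))   # 6
```

A rooted binary tree on five leaves has only three non-trivial clades, so 6 is the largest value the distance can take here, even though the two trees differ only in the position of leaf A.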

EGC 8210 A New Efficient Algorithm for the Gene-Team Problem on General Sequences

Identifying conserved gene clusters is an important step toward understanding the evolution of genomes and predicting
the functions of genes. A famous model to capture the essential biological features of a conserved gene cluster is called
the gene-team model. The problem of finding the gene teams of two general sequences is the focus of this paper. For
this problem, He and Goldwasser had an efficient algorithm that requires O(mn) time using O(m + n) working space,
where m and n are, respectively, the numbers of genes in the two given sequences. In this paper, a new efficient
algorithm is presented. Assume m ≤ n. Let $C = \sum_{\alpha \in \Sigma} o_1(\alpha)\, o_2(\alpha)$, where $\Sigma$ is the set of distinct genes, and $o_1(\alpha)$ and $o_2(\alpha)$
are, respectively, the numbers of copies of $\alpha$ in the two given sequences. Our new algorithm requires O(min{C lg n, mn})
time using O(m + n) working space. As compared with He and Goldwasser's algorithm, our new algorithm is more
practical, as C is likely to be much smaller than mn in practice. In addition, our new algorithm is output sensitive. Its
running time is O(lg n) times the size of the output. Moreover, our new algorithm can be efficiently extended to find the
gene teams of k general sequences in $O(kC \lg(n_1 n_2 \cdots n_k))$ time, where $n_i$ is the number of genes in the ith input
sequence.
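
The quantity C that drives the running time above is simple to compute: for every gene, multiply its copy counts in the two sequences and add up the products. The sketch below only evaluates C and compares it with mn. It is not the gene-team algorithm itself, and the two toy sequences are invented for illustration.

```python
from collections import Counter


def copy_product_sum(seq1, seq2):
    """C = sum over genes g of (copies of g in seq1) * (copies of g in seq2)."""
    c1, c2 = Counter(seq1), Counter(seq2)
    # Genes missing from either sequence contribute zero, so only the
    # genes common to both need to be visited.
    return sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())


seq1 = list("abcab")  # gene a twice, b twice, c once
seq2 = list("bcdbb")  # gene b three times, c once, d once

C = copy_product_sum(seq1, seq2)
print(C, len(seq1) * len(seq2))  # 7 25
```

When most genes have few copies, C stays close to the number of shared genes, while mn grows with the product of the sequence lengths; this is the sense in which the abstract expects C to be much smaller than mn in practice.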

EGC 8211 A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences

Today's genome analysis applications require sequence representations allowing for fast access to their contents while
also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence
representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly
reusable or programming language-specific implementations. We present a novel, space-efficient data structure
(GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character
transformations, wildcard support, and an assortment of internal representations optimized for different distributions of
wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our
representation requires only $2 + 8 \cdot 10^{-6}$ bits per character. Implemented in C, our portable software implementation
provides a variety of methods for random and sequential access to characters and substrings (including different
reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence
descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq
is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show
that it is competitive with respect to space and time requirements.
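
A cost of roughly two bits per character is what one would expect from a 2-bit code for the four nucleotides plus a small overhead. The sketch below shows only that basic idea, packing four bases into each byte with random access to single characters. It is an assumption-laden illustration: it is not the GtEncseq layout or API, it is written in Python rather than C, and it ignores wildcards, multiple sequences and metadata.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into a bytearray, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return out


def char_at(packed, i):
    """Random access to base i without unpacking the whole sequence."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]


packed = pack("GATTACA")
print(len(packed))                                    # 2 bytes for 7 bases
print("".join(char_at(packed, i) for i in range(7)))  # GATTACA
```

The caller has to remember the sequence length separately, because padding bits in the last byte are indistinguishable from the code for A.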

EGC 8212 A New Unsupervised Feature Ranking Method for Gene Expression Data Based on Consensus Affinity

Feature selection is widely established as one of the fundamental computational techniques in mining microarray data.
Due to the lack of categorized information in practice, unsupervised feature selection is more practically important but
correspondingly more difficult. Motivated by the cluster ensemble techniques, which combine multiple clustering
solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection
proposed to use these consensus solutions as oracles. However, these methods are dependent on both the particular
cluster ensemble algorithm used and the knowledge of the true cluster number. These methods will be unsuitable when
the true cluster number is not available, which is common in practice. In view of the above problems, a new
unsupervised feature ranking method is proposed to evaluate the importance of the features based on consensus
affinity. Unlike previous work, our method compares the corresponding affinity of each feature between a pair
of instances based on the consensus matrix of clustering solutions. As a result, our method alleviates the need to know
the true number of clusters and the dependence on particular cluster ensemble approaches found in previous work.
Experiments on real gene expression data sets demonstrate significant improvement of the feature ranking results when
compared to several state-of-the-art techniques.
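
The consensus matrix that the abstract above builds on can be sketched in a few lines. In this common formulation, assumed here rather than taken from the paper, entry (i, j) is the fraction of clustering runs that place instances i and j in the same cluster. The three label vectors are invented toy data, and the feature ranking step of the paper is not shown.

```python
def consensus_matrix(labelings):
    """Co-association matrix of several clusterings of the same instances.

    labelings is a list of label lists, one per clustering run. Entry
    [i][j] of the result is the fraction of runs in which instances i
    and j received the same cluster label.
    """
    runs = len(labelings)
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]


# Three runs over four instances; label names need not agree across runs.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
M = consensus_matrix(runs)
print(M[0][1], M[2][3], M[0][3])
```

Because only co-membership is counted, the matrix is unaffected by how each run names its clusters or by how many clusters each run uses.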

EGC 8213 A Sparse Regulatory Network of Copy-Number Driven Gene Expression Reveals Putative Breast Cancer Oncogenes
The influence of DNA cis-regulatory elements on a gene's expression has been intensively studied. However, little is
known about expressions driven by trans-acting DNA hotspots. DNA hotspots harboring copy number aberrations are
recognized to be important in cancer as they influence multiple genes on a global scale. The challenge in detecting
trans-effects is mainly due to the computational difficulty in detecting weak and sparse trans-acting signals amidst
co-occurring passenger events. We propose an integrative approach to learn a sparse interaction network of DNA
copy-number regions with their downstream targets in a breast cancer dataset. Information from this network helps
distinguish copy-number driven from copy-number independent expression changes on a global scale. Our result
further delineates cis- and trans-effects in a breast cancer dataset, for which important oncogenes such as ESR1 and
ERBB2 appear to be highly copy-number dependent. Further, our model is shown to be efficient and, in terms of
goodness of fit, no worse than other state-of-the-art predictors and network reconstruction models on both simulated
and real data.

EGC 8214 A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis

A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data
of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various
application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally
accepted rule, these methods are grouped in filters, wrappers, and embedded methods. More recently, a new group of
methods has been added in the general framework of FS: ensemble techniques. The focus in this survey is on filter
feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also
known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them
in a unified framework, using standardized notations in order to reveal their technical details and to highlight their
common characteristics as well as their particularities.

EGC 8215 A Swarm Intelligence Framework for Reconstructing Gene Networks: Searching for Biologically Plausible Architectures

In this paper, we investigate the problem of reverse engineering the topology of gene regulatory networks from temporal
gene expression data. We adopt a computational intelligence approach comprising swarm intelligence techniques,
namely particle swarm optimization (PSO) and ant colony optimization (ACO). In addition, the recurrent neural network
(RNN) formalism is employed for modeling the dynamical behavior of gene regulatory systems. More specifically, ACO is
used for searching the discrete space of network architectures and PSO for searching the corresponding continuous
space of RNN model parameters. We propose a novel solution construction process in the context of ACO for
generating biologically plausible candidate architectures. The objective is to concentrate the search effort into areas of
the structure space that contain architectures which are feasible in terms of their topological resemblance to real-world
networks. The proposed framework is initially applied to the reconstruction of a small artificial network that has
previously been studied in the context of gene network reverse engineering. Subsequently, we consider an artificial data
set with added noise for reconstructing a subnetwork of the genetic interaction network of S. cerevisiae (yeast). Finally,
the framework is applied to a real-world data set for reverse engineering the SOS response system of the bacterium
Escherichia coli. Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding
biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast
space of network architectures.

EGC 8216 A Top-r Feature Selection Algorithm for Microarray Gene Expression Data

Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could
perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection.
Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample
classifications. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly
of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes
with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into
one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene
expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the
relevance of the selected genes in terms of their biological functions.
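
The divide, select and merge loop described above can be sketched as follows. This is one possible reading of the abstract, not the authors' implementation: the exhaustive subset search, the function names and the additive toy score are all assumptions made for illustration. A real use would plug in a classification-accuracy estimate as the score.

```python
from itertools import combinations


def best_subset(pool, r, score):
    """Exhaustively pick the subset of size r (or less) with the best score."""
    return max(combinations(pool, min(r, len(pool))), key=score)


def top_r_select(genes, h, r, score):
    """Process genes in blocks of size h, keeping r genes after each merge."""
    blocks = [genes[i:i + h] for i in range(0, len(genes), h)]
    chosen = ()
    for block in blocks:
        # Merge the genes kept so far with the next block, then reselect.
        chosen = best_subset(list(chosen) + block, r, score)
    return list(chosen)


# Hypothetical per-gene relevance weights used as a stand-in score.
weights = {"g1": 0.1, "g2": 0.9, "g3": 0.3, "g4": 0.2, "g5": 0.8, "g6": 0.05}


def toy_score(subset):
    return sum(weights[g] for g in subset)


selected = top_r_select(list(weights), h=3, r=2, score=toy_score)
print(selected)  # ['g2', 'g5']
```

Because the score is evaluated on whole subsets, a gene that ranks poorly on its own can still be kept when it works well in combination, which is the shortcoming of single-gene ranking the abstract points out. The exhaustive search is only practical for small h and r.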

EGC 8217 Algorithms for Reticulate Networks of Multiple Phylogenetic Trees

A reticulate network N of multiple phylogenetic trees may have nodes with two or more parents (called reticulation
nodes). There are two ways to define the reticulation number of N. One way is to define it as the number of reticulation
nodes in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate
network of the trees. The other way is to define it as the total number of parents of reticulation nodes in N minus the
number of reticulation nodes in N; in this case, a reticulate network with the smallest reticulation number is called an
optimal type-II reticulate network of the trees. In this paper, we first present a fast fixed-parameter algorithm for
constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm
together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal
type-II reticulate network of the input trees. To our knowledge, these are the first fixed-parameter algorithms for the
problems. We have implemented the algorithms in ANSI C, obtaining programs CMPT and MaafB. Our experimental data
show that CMPT can construct optimal type-I reticulate networks rapidly and MaafB can compute better lower bounds
for optimal type-II reticulate networks in less time than the previously best program, PIRN, designed by Wu.

EGC 8218 Algorithms to Detect Multi-protein Modularity Conserved during Evolution

Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that
is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a
measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein
modularity conserved during evolution that improves on the running time of previous algorithms for related problems
and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation
measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all
measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by
previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads
to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through
randomization experiments, we demonstrate that conserved modularity is a defining characteristic of interactomes.
Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila
melanogaster, combining results across algorithms, show that nearly 10 percent of current interactome proteins
participate in multiprotein modules with good evidence in the protein interaction data of being conserved between
human and Drosophila.

EGC 8219 An Efficient Algorithm for Haplotype Inference on Pedigrees with Recombinations and Mutations

Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies. Pedigrees
allow haplotypes to be inferred from genotypes more accurately than population data, since Mendelian inheritance restricts the
set of possible solutions. In this work, we define a new HI problem on pedigrees, called Minimum-Change Haplotype
Configuration (MCHC) problem, that allows two types of genetic variation events: recombinations and mutations. Our
new formulation extends the Minimum-Recombinant Haplotype Configuration (MRHC) problem, that has been proposed
in the literature to overcome the limitations of classic statistical haplotyping methods. Our contribution is twofold. First,
we prove that the MCHC problem is APX-hard under several restrictions. Second, we propose an efficient and accurate
heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used
to solve the original MRHC problem and can take advantage of additional knowledge about the input genotypes.
Moreover, the L-reduction proves for the first time that MCHC and MRHC are O(nm/log nm)-approximable on general
pedigrees, where n is the pedigree size and m is the genotype length. Finally, we present an extensive experimental
evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees.



EGC 8220 An Efficient Method for Exploring the Space of Gene Tree/Species Tree Reconciliations in a Probabilistic Framework

Background. Inferring an evolutionary scenario for a gene family is a fundamental problem with applications both in
functional and evolutionary genomics. The gene tree/species tree reconciliation approach has been widely used to
address this problem, but mostly in a discrete parsimony framework that aims at minimizing the number of gene
duplications and/or gene losses. Recently, a probabilistic approach has been developed, based on the classical birth-
and-death process, including efficient algorithms for computing posterior probabilities of reconciliations and orthology
prediction. Results. In previous work, we described an algorithm for exploring the whole space of gene tree/species tree
reconciliations, which we adapt here to efficiently compute the posterior probability of such reconciliations. These
posterior probabilities can be either computed exactly or approximated, depending on the reconciliation space size. We
use this algorithm to analyze the probabilistic landscape of the space of reconciliations for a real data set of fungal gene
families and several data sets of synthetic gene trees. Conclusion. The results of our simulations suggest that, with
exact gene trees obtained by a simple birth-and-death process and realistic gene duplication/loss rates, a very small
subset of all reconciliations needs to be explored in order to approximate very closely the posterior probability of the
most likely reconciliations. For cases where the posterior probability mass is more evenly dispersed, our method allows
efficient exploration of the required subspace of reconciliations.

EGC 8221 An Efficient Method for Modeling Kinetic Behavior of Channel Proteins in Cardiomyocytes

Characterization of the kinetic and conformational properties of channel proteins is a crucial element in the integrative
study of congenital cardiac diseases. The proteins of the ion channels of cardiomyocytes represent an important family
of biological components determining the physiology of the heart. Some computational studies aiming to understand
the mechanisms of the ion channels of cardiomyocytes have concentrated on Markovian stochastic approaches.
Mathematically, these approaches employ Chapman-Kolmogorov equations coupled with partial differential equations.
As the scale and complexity of such subcellular and cellular models increase, the balance between efficiency and
accuracy of algorithms becomes critical. We have developed a novel two-stage splitting algorithm to address efficiency
and accuracy issues arising in such modeling and simulation scenarios. Numerical experiments were performed based
on the incorporation of our newly developed conformational kinetic model for the rapid delayed rectifier potassium
channel into the dynamic models of human ventricular myocytes. Our results show that the new algorithm significantly
outperforms commonly adopted adaptive Runge-Kutta methods. Furthermore, our parallel simulations with coupled
algorithms for multicellular cardiac tissue demonstrate a high linearity in the speedup of large-scale cardiac simulations.

EGC 8222 Cluster-Oriented Ensemble Classifier: Impact of Multicluster Characterization on Ensemble Classifier Learning

All clustering methods have to assume some cluster relationship among the data objects that they are applied on.
Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel
multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional
dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the
latter utilizes many different viewpoints, which are objects assumed not to be in the same cluster as the two objects
being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical
analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are
proposed based on this new measure. We compare them with several well-known clustering algorithms that use other
popular similarity measures on various document collections to verify the advantages of our proposal.
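
The contrast between a single viewpoint (the origin) and many viewpoints can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the similarity of two vectors seen from a viewpoint h is the dot product of their offsets from h, averaged over all viewpoints. The function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def offset(u, h):
    """Vector u as seen from viewpoint h."""
    return [x - y for x, y in zip(u, h)]


def multiviewpoint_similarity(a, b, viewpoints):
    """Average, over all viewpoints h, of the dot product (a - h) . (b - h).

    Assumption: `viewpoints` holds objects believed to lie outside the
    cluster of a and b. With the origin as the only viewpoint this
    reduces to the ordinary dot product.
    """
    if not viewpoints:
        raise ValueError("at least one viewpoint is required")
    total = sum(dot(offset(a, h), offset(b, h)) for h in viewpoints)
    return total / len(viewpoints)
```

With `viewpoints=[[0, 0]]` the function returns the plain dot product, which is the single-viewpoint case described in the abstract.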

EGC 8223
An Information Theoretic Approach to Constructing Robust Boolean Gene Regulatory Networks

We introduce a class of finite systems models of gene regulatory networks exhibiting behavior of the cell cycle. The
network is an extension of a Boolean network model. The system spontaneously cycles through a finite set of internal
states, tracking the increase of an external factor such as cell mass, and also exhibits checkpoints in which errors in
gene expression levels due to cellular noise are automatically corrected. We present a 7-gene network based on
Projective Geometry codes, which can correct, at every given time, one gene expression error. The topology of a
network is highly symmetric and requires using only simple Boolean functions that can be synthesized using genes of
various organisms. The attractor structure of the Boolean network contains a single cycle attractor. It is the smallest
nontrivial network with such high robustness. The methodology allows construction of artificial gene regulatory
networks with the number of phases larger than in the natural cell cycle.

EGC 8224
Antilope: A Lagrangian Relaxation Approach to the de novo Peptide Sequencing Problem

Peptide sequencing from mass spectrometry data is a key step in proteome research. Especially de novo sequencing,
the identification of a peptide from its spectrum alone, is still a challenge even for state-of-the-art algorithmic
approaches. In this paper, we present ANTILOPE, a new fast and flexible approach based on mathematical programming. It
builds on the spectrum graph model and works with a variety of scoring schemes. ANTILOPE combines Lagrangian
relaxation for solving an integer linear programming formulation with an adaptation of Yen's k shortest paths algorithm.
It shows a significant improvement in running time compared to mixed integer optimization and performs at the same
speed as other state-of-the-art tools. We also implemented a generic probabilistic scoring scheme that can be trained
automatically for a data set of annotated spectra and is independent of the mass spectrometer type. Evaluations on
benchmark data show that ANTILOPE is competitive with the popular state-of-the-art programs PepNovo and NovoHMM both
in terms of runtime and accuracy. Furthermore, it offers increased flexibility in the number of considered ion types.
ANTILOPE will be freely available as part of the open source proteomics library OpenMS.

EGC 8225
Assortative Mixing in Directed Biological Networks



We analyze assortative mixing patterns of biological networks which are typically directed. We develop a theoretical
background for analyzing mixing patterns in directed networks before applying them to specific biological networks.
Two new quantities are introduced, namely the in-assortativity and the out-assortativity, which are shown to be useful in
quantifying assortative mixing in directed networks. We also introduce the local (node level) assortativity quantities for
in- and out-assortativity. Local assortativity profiles are the distributions of these local quantities over node degrees and
can be used to analyze both canonical and real-world directed biological networks. Many biological networks, which
have been previously classified as disassortative, are shown to be assortative with respect to these new measures.
Finally, we demonstrate the use of local assortativity profiles in analyzing the functionalities of particular nodes and
groups of nodes in real-world biological networks.

EGC 8226
BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences

Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, is able to
compute, in a fast and efficient way, the coverage of a source sequence S on a target sequence T, by taking into account
direct and reverse segments, possibly overlapping. When using BpMatch, the operator defines a priori the minimum
length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and having
a number of occurrences greater than minRep are considered to be significant. BpMatch outputs the significant
segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is a
constant, the time required by BpMatch to calculate the coverage is $O(l^2 n)$. On average, by setting $l \ge 2\log_d(n)$,
the time required to calculate the coverage is only $O(n)$. BpMatch, thanks to the minRep parameter, can
also be used to perform a self-covering: to cover a sequence using segments coming from itself, by avoiding the trivial
solution of having a single segment coincident with the whole sequence.

EGC 8227
Clustering 100,000 Protein Structure Decoys in Minutes

Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called
decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods
are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large.
In addition, the existing clustering approaches suffer from the arbitrariness in determining a distance threshold for
proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold
results in the merging of several independent clusters into one cluster. In this paper, we propose an efficient clustering
method that quickly estimates cluster centroids and efficiently prunes rotation spaces. The number of clusters is
automatically detected by information distance criteria. A package named ONION, which can be downloaded freely, is
implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than
existing tools and that, compared with SPICKER, it selects better decoys for 31 targets and worse decoys for 19 targets.
On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.

EGC 8228
Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison

The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity
when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some
formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve
these formulas by using the entropy principle, which can quantify the nonrandom occurrence of patterns in the
sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the
one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any
given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's
formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We
illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data
sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For
the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly,
where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more
accurate than Hao's and Yu's formulas even for small data sets.
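
A composition vector of the kind discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the entropy-maximizing formula of the paper: it uses what is commonly cited as Hao's formula, in which the expected frequency of a k-mer is predicted from its (k-1)-mer prefix and suffix and its (k-2)-mer core. The function names are hypothetical, and only k-mers that actually occur in the sequence receive a component.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer observed in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n))
    return {word: c / n for word, c in counts.items()}


def composition_vector(seq, k):
    """Components (p - p0) / p0 for each observed k-mer.

    p0 is the frequency predicted from shorter words:
    p0(w) = p(w[:-1]) * p(w[1:]) / p(w[1:-1])   (assumed form of Hao's formula).
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    p_k = kmer_frequencies(seq, k)
    p_k1 = kmer_frequencies(seq, k - 1)
    p_k2 = kmer_frequencies(seq, k - 2)
    vector = {}
    for word, p in p_k.items():
        core = p_k2.get(word[1:-1], 0.0)
        p0 = p_k1[word[:-1]] * p_k1[word[1:]] / core if core else 0.0
        vector[word] = (p - p0) / p0 if p0 else 0.0
    return vector


def cv_distance(cv_a, cv_b):
    """Distance (1 - cosine) / 2 between two composition vectors."""
    dot = sum(v * cv_b.get(word, 0.0) for word, v in cv_a.items())
    norm_a = sqrt(sum(v * v for v in cv_a.values()))
    norm_b = sqrt(sum(v * v for v in cv_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.5  # assumption: treat an all-zero vector as uncorrelated
    return (1 - dot / (norm_a * norm_b)) / 2
```

Because no alignment is computed, the cost is linear in the sequence length for a fixed k, which is the simplicity the abstract refers to.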

EGC 8229
Constructing and Drawing Regular Planar Split Networks

Split networks are commonly used to visualize collections of bipartitions, also called splits, of a finite set. Such
collections arise, for example, in evolutionary studies. Split networks can be viewed as a generalization of phylogenetic
trees and may be generated using the SplitsTree package. Recently, the NeighborNet method for generating split
networks has become rather popular, in part because it is guaranteed to always generate a circular split system, which
can always be displayed by a planar split network. Even so, labels must be placed on the "outside" of the network,
which might be problematic in some applications. To help circumvent this problem, it can be helpful to consider so-
called flat split systems, which can be displayed by planar split networks where labels are allowed on the inside of the
network too. Here, we present a new algorithm that is guaranteed to compute a minimal planar split network displaying a
flat split system in polynomial time, provided the split system is given in a certain format. We will also briefly discuss
two heuristics that could be useful for analyzing phylogeographic data and that allow the computation of flat split
systems in this format in polynomial time.

EGC 8230
Constructing Complex 3D Biological Environments from Medical Imaging Using High Performance Computing



Extracting information about the structure of biological tissue from static image data is a complex task requiring
computationally intensive operations. Here, we present how multicore CPUs and GPUs have been utilized to extract
information about the shape, size, and path followed by the mammalian oviduct, called the fallopian tube in humans,
from histology images, to create a unique but realistic 3D virtual organ. Histology images were processed to identify the
individual cross sections and determine the 3D path that the tube follows through the tissue. This information was then
related back to the histology images, linking the 2D cross sections with their corresponding 3D position along the
oviduct. A series of linear 2D spline cross sections, which were computationally generated for the length of the oviduct,
were bound to the 3D path of the tube using a novel particle system technique that provides smooth resolution of self-
intersections. This results in a unique 3D model of the oviduct, which is grounded in reality. The GPU is used for the
processor intensive operations of image processing and particle physics based simulations, significantly reducing the
time required to generate a complete model.

EGC 8231
CSD Homomorphisms between Phylogenetic Networks

Since Darwin, species trees have been used as a simplified description of the relationships which summarize the
complicated network N of reality. Recent evidence of hybridization and lateral gene transfer, however, suggests that there
are situations where trees are inadequate. Consequently it is important to determine properties that characterize
networks closely related to N and possibly more complicated than trees but lacking the full complexity of N. A
connected surjective digraph map (CSD) is a map f from one network N to another network M such that every arc is
either collapsed to a single vertex or is taken to an arc, such that f is surjective, and such that the inverse image of a
vertex is always connected. CSD maps are shown to behave well under composition. It is proved that if there is a CSD
map from N to M, then there is a way to lift an undirected version of M into N, often with added resolution. A CSD map
from N to M puts strong constraints on N. In general, it may be useful to study classes of networks such that, for any N,
there exists a CSD map from N to some standard member of that class.

EGC 8232
Designing Filters for Fast-Known NcRNA Identification

Detecting members of known noncoding RNA (ncRNA) families in genomic DNA is an important part of sequence
annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high
computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequences
that are unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite
recent advances, designing an efficient filter that can detect ncRNA instances lacking strong conservation while
excluding most irrelevant sequences remains challenging. In this work, we design three types of filters based on
multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e., a position weight matrix) with
secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters
combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are
extensively tested on the BRAliBase III data set, Rfam 9.0, and a published soil metagenomic data set. In addition, we
compare the SSP-based filters with several other ncRNA search tools including Infernal (with profile HMMs as filters),
ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant
speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families.

EGC 8233
Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes

Sequence-based understanding and identification of protein binding interfaces is a challenging research topic
due to the complexity in protein systems and the imbalanced distribution between interface and noninterface residues.
This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned
training data are then used for improving the prediction performance. We use three novel measures to describe the
extent to which a residue is considered an outlier in comparison to the other residues: the distance of a residue instance from
the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue
instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are
computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed.
The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM
ensemble trained on input data without outliers performs better than that with outliers. Our method is also more
accurate than many literature methods on benchmark data sets. From our empirical studies, we found that some outlier
interface residues are truly near to noninterface regions, and some outlier noninterface residues are close to interface
regions.

EGC 8234
DICLENS: Divisive Clustering Ensemble with Automatic Cluster Number

Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard
task as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some
assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in
some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods on
the same data set, or the same method with varying input parameters or both. We propose a novel method, DICLENS,
which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final
clustering automatically and does not take any input parameters, a feature missing in many existing algorithms.
Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces
very good quality clusterings in a short amount of time. The DICLENS implementation is scalable and runs on standard
personal computers, consuming very little memory and CPU.

EGC 8235
Disease Liability Prediction from Large Scale Genotyping Data Using Classifiers with a Reject Option



Many genome-wide association (GWA) studies try to identify the genetic polymorphisms associated with variation in
phenotypes. However, the most significant genetic variants may have a small predictive power to forecast the future
development of common diseases. We study the prediction of the risk of developing a disease given genome-wide
genotypic data using classifiers with a reject option, which only make a prediction when they are sufficiently certain, but
in doubtful situations may reject making a classification. To test the reliability of our proposal, we used the Wellcome
Trust Case Control Consortium (WTCCC) data set, comprising 14,000 cases of seven common human diseases and
3,000 shared controls.
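
The idea of a reject option can be illustrated with a simple confidence threshold. This is a sketch under stated assumptions, not the classifiers used in the study: it assumes some underlying model already outputs a probability that a sample is a case, and the threshold value and function names are hypothetical.

```python
def classify_with_reject(p_case, threshold=0.8):
    """Return "case", "control", or None (reject) for one sample.

    p_case is the model's estimated probability of disease. A label is
    only returned when the model is at least `threshold` confident.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    if p_case >= threshold:
        return "case"
    if p_case <= 1 - threshold:
        return "control"
    return None  # doubtful sample: no prediction is made


def evaluate(probabilities, labels, threshold=0.8):
    """Accuracy on accepted samples and the fraction of samples accepted."""
    predictions = [classify_with_reject(p, threshold) for p in probabilities]
    accepted = [(pred, y) for pred, y in zip(predictions, labels) if pred is not None]
    coverage = len(accepted) / len(predictions) if predictions else 0.0
    if not accepted:
        return None, coverage
    accuracy = sum(pred == y for pred, y in accepted) / len(accepted)
    return accuracy, coverage
```

Raising the threshold trades coverage for accuracy on the accepted samples, which is the trade-off a reject option is meant to expose.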

EGC 8236
Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning

In the studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression
patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time.
Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology
terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions,
interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become
increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the
automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to
local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which
term corresponds to which region of which image in the group. In this paper, we address this problem using a new
machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the
annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the
MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized
Drosophila gene expression pattern images) reveal that exploiting the MIML framework leads to significant
performance improvement over state-of-the-art approaches.

EGC 8237
Efficient Approaches for Retrieving Protein Tertiary Structures

The 3D conformation of a protein in space is the main factor that determines its function in living organisms. Due
to the huge amount of newly discovered proteins, there is a need for fast and accurate computational methods for
retrieving protein structures. Their purpose is to speed up the process of understanding the structure-to-function
relationship which is crucial in the development of new drugs. There are many algorithms addressing the problem of
protein structure retrieval. In this paper, we present several novel approaches for retrieving protein tertiary structures.
We present our voxel-based descriptor. Then we present our protein ray-based descriptors which are applied on the
interpolated protein backbone. We introduce five novel wavelet descriptors which perform wavelet transforms on the
protein distance matrix. We also propose an efficient algorithm for distance matrix alignment named Matrix Alignment
by Sequence Alignment within Sliding Window (MASASW), which has been shown to be much faster than DALI, CE, and
MatAlign. We compared our approaches with one another and with several existing algorithms, and they generally
prove to be fast and accurate. MASASW achieves the highest accuracy. The ray and wavelet-based descriptors as well
as MASASW are more accurate than CE.

EGC 8238
Efficient Genotype Elimination via Adaptive Allele Consolidation

We propose the technique of Adaptive Allele Consolidation, which greatly improves the performance of the Lange-
Goradia algorithm for genotype elimination in pedigrees while still producing equivalent output. Genotype elimination
consists of removing from a pedigree those genotypes that are impossible according to the Mendelian law of
inheritance. This is used to find errors in genetic data and is useful as a preprocessing step in other analyses (such as
linkage analysis or haplotype imputation). The problem of genotype elimination is intrinsically combinatorial, and Allele
Consolidation is an existing technique where several alleles are replaced by a single "lumped" allele in order to reduce
the number of combinations of genotypes that have to be considered, possibly at the expense of precision. In existing
Allele Consolidation techniques, alleles are lumped once and for all before performing genotype elimination. The idea of
Adaptive Allele Consolidation is to dynamically change the set of alleles that are lumped together during the execution
of the Lange-Goradia algorithm, so that both high performance and precision are achieved. We have implemented the
technique in a tool called Celer and evaluated it on a large set of scenarios, with good results.
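
Genotype elimination and allele lumping can be illustrated on the smallest possible pedigree, a father-mother-child trio at one locus. This is a sketch of the underlying Mendelian check only: it is neither the full Lange-Goradia algorithm nor Adaptive Allele Consolidation, and the function names and the `*` symbol for the lumped allele are hypothetical.

```python
from itertools import product


def compatible(father, mother, child):
    """True if the child can inherit one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def eliminate_trio(father_set, mother_set, child_set):
    """Keep only genotypes that take part in at least one Mendelian-consistent trio.

    Each argument is a set of candidate genotypes, written as sorted
    2-tuples of allele names such as ("A", "B").
    """
    keep_f, keep_m, keep_c = set(), set(), set()
    for f, m, c in product(father_set, mother_set, child_set):
        if compatible(f, m, c):
            keep_f.add(f)
            keep_m.add(m)
            keep_c.add(c)
    return keep_f, keep_m, keep_c


def lump(genotype, keep):
    """Replace every allele outside `keep` by a single lumped allele "*"."""
    return tuple(sorted(a if a in keep else "*" for a in genotype))
```

Lumping shrinks the candidate sets before the product is enumerated, which is where the combinatorial saving described above comes from.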

EGC 8239
Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data
compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local
repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats
captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix
tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the
best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on
finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the
Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the
space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per
base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the
timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing
massive texts such as the whole human genome, since the prior methods must use external memory. For the first time,
our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to
find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as
general-purpose open-source software for public use.
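
The two notions named above, the Burrows-Wheeler Transform and maximal repeats, can be made concrete on a toy string. This is a naive sketch for illustration only: it sorts all rotations and enumerates all substrings, so it has none of the space efficiency of the method described, and it uses no wavelet tree. The function names and the `$` sentinel are assumptions.

```python
from collections import defaultdict


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (quadratic space)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def maximal_repeats(text):
    """Substrings occurring at least twice that cannot be extended left or
    right without losing an occurrence (brute force, small inputs only)."""
    n = len(text)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts_of[text[i:j]].append(i)
    result = set()
    for word, starts in starts_of.items():
        if len(starts) < 2:
            continue
        # A text boundary (None) counts as a distinct neighbouring symbol.
        lefts = {text[i - 1] if i > 0 else None for i in starts}
        rights = {text[i + len(word)] if i + len(word) < n else None for i in starts}
        if len(lefts) > 1 and len(rights) > 1:
            result.add(word)
    return result
```

For `"banana"` the transform is `"annb$aa"` and the maximal repeats are `"a"` and `"ana"`; the repeat `"an"` is excluded because every occurrence is followed by the same character.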



EGC 8240
Eigen-Genomic System Dynamic-Pattern Analysis (ESDA): Modeling mRNA Degradation and Self-Regulation

High-throughput methods systematically measure the internal state of the entire cell, but powerful computational
tools are needed to infer dynamics from their raw data. Therefore, we have developed a new computational method,
Eigen-genomic System Dynamic-pattern Analysis (ESDA), which uses systems theory to infer dynamic parameters from
a time series of gene expression measurements. As many genes are measured at a modest number of time points,
estimation of the system matrix is underdetermined and traditional approaches for estimating dynamic parameters are
ineffective; thus, ESDA uses the principle of dimensionality reduction to overcome the data imbalance. Since
degradation rates are naturally confounded by self-regulation, our model estimates an effective degradation rate that is
the difference between self-regulation and degradation. We demonstrate that ESDA is able to recover effective
degradation rates with reasonable accuracy in simulation. We also apply ESDA to a budding yeast data set, and find that
effective degradation rates are normally slower than experimentally measured degradation rates. Our results suggest
that either self-regulation is widespread in budding yeast and that self-promotion dominates self-inhibition, or that self-
regulation may be rare and that experimental methods for measuring degradation rates based on transcription arrest
may severely overestimate true degradation rates in healthy cells.

EGC 8241
Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification

The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene
interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and
interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of
functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive
accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes
are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning.
Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering
without biological relevance. We also show that functional clustering performs comparably to gene expression
clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of
functional clustering as a feature extraction technique is evaluated and discussed.
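
The feature replacement step described above, where gene-level features give way to cluster centroids, can be sketched as follows. This is a minimal illustration assuming the centroid is the mean expression of a cluster's genes within one sample; the function name and data layout are hypothetical.

```python
def cluster_centroid_features(sample, clusters):
    """Map one sample's gene expression values to one feature per gene cluster.

    sample:   dict of gene name -> expression value
    clusters: dict of cluster name -> list of gene names
    Genes missing from the sample are skipped; a cluster with no measured
    gene produces no feature.
    """
    features = {}
    for name, genes in clusters.items():
        values = [sample[g] for g in genes if g in sample]
        if values:
            features[name] = sum(values) / len(values)
    return features
```

The resulting dictionary has one entry per cluster instead of one per gene, so a classifier trained on it sees a much smaller feature space.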

EGC 8242
Evaluating Path Queries over Frequently Updated Route Collections

The recent advances in the infrastructure of Geographic Information Systems (GIS), and the proliferation of GPS
technology, have resulted in the abundance of geodata in the form of sequences of points of interest (POIs), waypoints,
etc. We refer to sets of such sequences as route collections. In this work, we consider path queries on frequently
updated route collections: given a route collection and two points n_s and n_t, a path query returns a path, i.e., a
sequence of points, that connects n_s to n_t. We introduce two path query evaluation paradigms that enjoy the benefits
of search algorithms (i.e., fast index maintenance) while utilizing transitivity information to terminate the search sooner.
Efficient indexing schemes and appropriate updating procedures are introduced. An extensive experimental evaluation
verifies the advantages of our methods compared to conventional graph-based search.
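
A path query over a route collection can be sketched with a plain breadth-first search. This corresponds to the conventional graph-based search used as the baseline above, not to the two evaluation paradigms the paper introduces. It assumes routes are traversed only in their listed direction; the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_successors(routes):
    """For every point, the set of points that directly follow it on some route."""
    successors = defaultdict(set)
    for route in routes:
        for a, b in zip(route, route[1:]):
            successors[a].add(b)
    return successors


def path_query(routes, source, target):
    """Return a sequence of points leading from source to target, or None."""
    successors = build_successors(routes)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(successors.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

Rebuilding the successor map on every call keeps updates trivial but makes each query scan the whole collection, which is the cost that dedicated indexing schemes aim to avoid.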

EGC 8243
Exploiting Intrastructure Information for Secondary Structure Prediction with Multifaceted Pipelines

Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular, for
tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of
sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel
perspective, in which understanding how available information sources are dealt with plays a central role. After
revisiting a well-known secondary structure predictor viewed from this perspective (with the goal of identifying which
sources of information have been considered and which have not), we propose a generic software architecture designed
to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with
the proposed generic architecture has been implemented and compared with several state-of-the-art secondary
structure predictors. Experiments have been carried out on standard data sets, and the corresponding results confirm
the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/ through the corresponding web
application or as a downloadable, stand-alone, portable unpack-and-run bundle.

EGC 8244
Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling

EGC 8245
Fast Local Search for Unrooted Robinson-Foulds Supertrees

A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that
is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of
splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective
local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted.
We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction,
thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that
our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in
terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
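
The objective being minimized, total RF distance from a candidate supertree to the input trees, can be sketched as follows. This is an illustration of the distance only, not of the ECR-based local search. It assumes each unrooted tree is given as the collection of its nontrivial splits, each written as one side of the bipartition; the function names are hypothetical.

```python
def normalize_split(side, taxa):
    """Canonical form of a split: the side containing the smallest taxon."""
    side = frozenset(side)
    return side if min(taxa) in side else frozenset(taxa) - side


def rf_distance(splits_a, splits_b, taxa):
    """Number of splits present in exactly one of two trees on the same taxa."""
    a = {normalize_split(s, taxa) for s in splits_a}
    b = {normalize_split(s, taxa) for s in splits_b}
    return len(a ^ b)


def restrict_splits(splits, taxa, subset):
    """Splits of a tree after pruning it down to the taxa in `subset`."""
    subset = frozenset(subset)
    restricted = set()
    for s in splits:
        side = frozenset(s) & subset
        other = (frozenset(taxa) - frozenset(s)) & subset
        if len(side) >= 2 and len(other) >= 2:  # drop splits that became trivial
            restricted.add(normalize_split(side, subset))
    return restricted


def total_rf(super_splits, all_taxa, input_trees):
    """Sum of RF distances between the supertree and each (splits, taxa) input tree."""
    return sum(
        rf_distance(restrict_splits(super_splits, all_taxa, taxa), splits, taxa)
        for splits, taxa in input_trees
    )
```

A local search would repeatedly modify the supertree's splits and keep a change whenever `total_rf` decreases.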

EGC 8246
Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format

Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks.
However, with the vast and increasing amount of data on biological networks, performance and scalability issues are becoming
a critical limiting factor in applications. Meanwhile, GPU computing, which uses the CUDA tool for implementing a massively
parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve
substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers
latency, thus circumventing a major issue in other parallel computing environments, such as MPI. We introduce a
very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations
and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized the ELLPACK-R sparse format
to allow effective, fine-grained, massively parallel processing that copes with the sparse nature of interaction network
data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL
running on a CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, which was previously only
possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with
their data.
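
To make the two operations named above concrete (matrix-matrix multiplication and column normalization), here is a minimal CPU sketch of the generic MCL loop in plain Python. It uses dense lists for readability and is not the CUDA-MCL implementation or the ELLPACK-R format. The function names, the added self-loops, the inflation value of 2.0, and the thresholds are assumptions chosen for this example.

```python
def normalize_columns(matrix):
    """Scale every column to sum to 1 (the Markov matrix normalization)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = matrix[i][j] / total
    return out


def expand(matrix):
    """Expansion step: square the matrix (matrix-matrix product)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def mcl(adjacency, inflation=2.0, max_iterations=50, tolerance=1e-9):
    """Return clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Adding self-loops is a common MCL convention (an assumption here).
    matrix = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iterations):
        updated = inflate(expand(matrix), inflation)
        change = max(abs(updated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = updated
        if change < tolerance:
            break
    # Each row that still carries mass lists the members of one cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical network: two disconnected triangles.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(graph))  # [[0, 1, 2], [3, 4, 5]]
```

The expansion step costs O(n^3) in this dense form, which is why a sparse format and a parallel implementation matter for large interaction networks.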

EGC 5247
Faster Mass Spectrometry-Based Protein Inference: Junction Trees Are More Efficient than Sampling and Marginalization by Enumeration

The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an
inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make
use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte
Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the
cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different
statistical inference methods using a common graphical model, and we demonstrate that junction tree inference
substantially improves rates of convergence compared to existing methods.

EGC 8248
Gene Classification Using Parameter-Free Semi-Supervised Manifold Learning

EGC 8249
GSGS: A Computational Approach to Reconstruct Signaling Pathway Structures from Gene Sets

Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells.
The existing computational approaches often rely on unrealistic biological assumptions and do not explicitly consider
signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from the cell surface
to the nucleus and characterize a signaling pathway. In this paper, we propose a novel approach, Gene Set Gibbs
Sampling (GSGS), to reverse engineer signaling pathway structures from gene sets related to the pathways. We
hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events which
we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets, referred to as Information
Flow Gene Sets (IFGSs), corresponding to these events. Thus, an IFGS only reflects which genes appear in the
underlying IF but not their ordering. GSGS offers a Gibbs sampling-like procedure to reconstruct the underlying
signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway. In the
proof-of-concept studies, our approach is shown to outperform the existing state-of-the-art network inference approaches
using both continuous and discrete data generated from benchmark networks in the DREAM initiative. We perform a
comprehensive sensitivity analysis to assess the robustness of our approach. Finally, we implement GSGS to
reconstruct signaling mechanisms in breast cancer cells.

EGC 8250
Hash Subgraph Pairwise Kernel for Protein-Protein Interaction Extraction

Extracting protein-protein interaction (PPI) from biomedical literature is an important task in biomedical text mining
(BioTM). In this paper, we propose a hash subgraph pairwise (HSP) kernel-based approach for this task. The key to the
novel kernel is to use hierarchical hash labels to express the structural information of subgraphs in linear time. We
apply the graph kernel to dependency graphs that represent sentence structure for the protein-protein
interaction extraction task. The kernel efficiently makes use of full graph structural information and, in particular, captures
the contiguous topological and label information that previous approaches ignored. We evaluate the proposed approach on five publicly
available PPI corpora. The experimental results show that our approach significantly outperforms the all-path kernel
approach on all five corpora and achieves state-of-the-art performance.

EGC 8251
Identification of Essential Proteins Based on Edge Clustering Coefficient

Identification of essential proteins is key to understanding the minimal requirements for cellular life and important for
drug design. The rapid increase in available protein-protein interaction (PPI) data has made it possible to detect protein
essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based
on network topology. However, most of them focus only on the location of a single protein and ignore the
relevance between interactions and protein essentiality. In this paper, a new centrality measure for identifying essential
proteins based on the edge clustering coefficient, named NC, is proposed. Unlike previous centrality measures,
NC considers both the centrality of a node and the relationship between it and its neighbors. For each interaction in the
network, we calculate its edge clustering coefficient. A node's essentiality is determined by the sum of the edge
clustering coefficients of the interactions connecting it and its neighbors. The new centrality measure NC takes into account
the modular nature of protein essentiality. NC is applied to three different types of yeast protein-protein interaction
networks, which are obtained from the DIP database, the MIPS database, and the BioGRID database, respectively. The
experimental results on the three networks show that the number of essential proteins discovered by NC
universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC, and IC. Moreover, the
essential proteins discovered by NC show a significant cluster effect.
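
The scoring rule described above (sum the edge clustering coefficients of a node's interactions) can be sketched in a few lines of Python. The abstract does not give the coefficient's formula, so this example assumes the commonly used definition: the number of triangles that contain the edge, divided by min(deg(u) - 1, deg(v) - 1). The function names and the toy network are likewise hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def edge_clustering_coefficient(adjacency, u, v):
    """Triangles through edge (u, v) over the maximum possible number.

    Assumed definition: z(u, v) / min(deg(u) - 1, deg(v) - 1), and 0 when
    the denominator is 0 (one endpoint has no other neighbour).
    """
    triangles = len(adjacency[u] & adjacency[v])
    possible = min(len(adjacency[u]) - 1, len(adjacency[v]) - 1)
    return triangles / possible if possible > 0 else 0.0


def nc_scores(edges):
    """NC score of each node: sum of coefficients over its incident edges."""
    adjacency = build_adjacency(edges)
    return {
        node: sum(edge_clustering_coefficient(adjacency, node, neighbour)
                  for neighbour in neighbours)
        for node, neighbours in adjacency.items()
    }


# Hypothetical toy network: a triangle A-B-C with a pendant protein D on C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(nc_scores(toy_edges))
# {'A': 2.0, 'B': 2.0, 'C': 2.0, 'D': 0.0}
```

In the toy network, C has the highest degree but gains nothing from its link to D, because that interaction lies in no triangle. This is the sense in which the measure rewards membership in densely connected modules and not degree alone.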

EGC 8252
Identifying Gene Pathways Associated with Cancer Characteristics via Sparse Statistical Methods

We propose a statistical method for uncovering gene pathways that characterize cancer heterogeneity. To
incorporate knowledge of the pathways into the model, we define a set of pathway activities from microarray gene
expression data based on Sparse Probabilistic Principal Component Analysis (SPPCA). A pathway activity logistic
regression model is then formulated for the cancer phenotype. To select pathway activities related to binary cancer
phenotypes, we use the elastic net for parameter estimation and derive a model selection criterion for selecting
the tuning parameters included in the model estimation. Our proposed method can also reverse-engineer gene networks
based on the multiple identified pathways, which enables us to discover novel gene-gene associations related to the
cancer phenotypes. We illustrate the whole process of the proposed method through the analysis of breast cancer gene
expression data.

EGC 8253
Inferring Gene Regulatory Networks via Nonlinear State-Space Models and Exploiting Sparsity

This paper considers the problem of learning the structure of gene regulatory networks from gene expression time
series data. We consider a more realistic scenario in which the state-space model representing a gene network evolves
nonlinearly, while a linear model is assumed for the microarray data. To capture the nonlinearity, a particle filter-based
state estimation algorithm is used instead of the contemporary linear approximation-based approaches. The
parameters characterizing the regulatory relations among various genes are estimated online using a Kalman filter.
Since a particular gene interacts with only a few other genes, the parameter vector is expected to be sparse. The state
estimates delivered by the particle filter and the observed microarray data are then subjected to a LASSO-based least
squares regression operation, which yields a parsimonious and efficient description of the regulatory network by setting
the irrelevant coefficients to zero. The performance of the aforementioned algorithm is compared with the extended
Kalman filter (EKF) and the unscented Kalman filter (UKF), employing the mean square error (MSE) as the fidelity criterion
in recovering the parameters of gene regulatory networks from synthetic data and real biological data. Extensive
computer simulations illustrate that the proposed particle filter-based network inference algorithm outperforms EKF and
UKF and can therefore serve as a natural framework for modeling gene regulatory networks with nonlinear and
sparse structure.
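
To show how a LASSO-based least squares step sets irrelevant coefficients exactly to zero, here is a small generic sketch in plain Python. It minimizes 0.5 * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent with soft thresholding. The choice of coordinate descent, the function names, the value of lam, and the toy data are all assumptions made for illustration. The abstract does not say how the regression is solved, and the particle filter and Kalman filter stages are not modeled here.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 one coordinate at a time."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(sweeps):
        for j in range(n_features):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n_samples):
                others = sum(X[i][k] * w[k]
                             for k in range(n_features) if k != j)
                rho += X[i][j] * (y[i] - others)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return w


# Hypothetical data: the first regressor matters, the second barely does.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
print(lasso_coordinate_descent(X, y, lam=1.0))  # [1.5, 0.0]
```

The weak second coefficient is driven to exactly zero, while ordinary least squares would return 0.1 for it. In a regulatory network setting, a zero coefficient corresponds to the absence of an inferred interaction.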

EGC 8254
Inferring the Number of Contributors to Mixed DNA Profiles

Forensic samples containing DNA from two or more individuals can be difficult to interpret. Even ascertaining the
number of contributors to the sample can be challenging. These uncertainties can dramatically reduce the statistical
weight attached to evidentiary samples. A probabilistic mixture algorithm that takes into account not just the number
and magnitude of the alleles at a locus but also their frequency of occurrence allows the determination of likelihood
ratios for different hypotheses concerning the number of contributors to a specific mixture. This probabilistic mixture
algorithm can compute the probability of the alleles in a sample being present in a 2-person mixture, 3-person mixture,
etc. The ratio of any two of these probabilities then constitutes a likelihood ratio pertaining to the number of contributors
to such a mixture.

EGC 8255
Intervention in Gene Regulatory Networks via Phenotypically Constrained Control Policies Based on Long-Run Behavior

A salient purpose for studying gene regulatory networks is to derive intervention strategies to identify potential drug
targets and design gene-based therapeutic intervention. Optimal and approximate intervention strategies based on the
transition probability matrix of the underlying Markov chain have been studied extensively for probabilistic Boolean
networks. While the key goal of control is to reduce the steady-state probability mass of undesirable network states, in
practice it is important to limit collateral damage and this constraint should be taken into account when designing
intervention strategies with network models. In this paper, we propose two new phenotypically constrained stationary
control policies by directly investigating the effects on the network long-run behavior. They are derived to reduce the
risk of visiting undesirable states in conjunction with constraints on the shift of undesirable steady-state mass so that
only limited collateral damage can be introduced. We have studied the performance of the new constrained control
policies together with the previous greedy control policies on randomly generated probabilistic Boolean networks. A
preliminary example for intervening in a metastatic melanoma network is also given to show their potential application in
designing genetic therapeutics to reduce the risk of entering both aberrant phenotypes and other ambiguous states
corresponding to complications or collateral damage. Experiments on both random network ensembles and the
melanoma network demonstrate that, in general, the new proposed control policies exhibit the desired performance. As
shown by intervening in the melanoma network, these control policies can potentially serve as future practical gene
therapeutic intervention strategies.

EGC 8256
Iterative Dictionary Construction for Compression of Large DNA Data Sets

Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical
and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms,
and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-
based dictionary construction method can detect this repeated content and use it to compress collections of sequences.
We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation,
Comrad, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities
within and across the set of input sequences. Comrad compresses the data over multiple passes, which is an expensive
process, but allows Comrad to compress large data sets within reasonable time and space. Comrad allows for random
access to individual sequences and subsequences without decompressing the whole data set. Comrad has no
competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even
for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes
compressed to 0.25 bits per base.

EGC 8257
k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents

Although publicly accessible databases containing protein-protein interaction (PPI)-related information are important
resources to bench and in silico research scientists alike, the amount of time and effort required to keep them up to date
is often burdensome. In an effort to help identify relevant PPI publications, text-mining tools, from the machine learning
discipline, can be applied to help in this process. Here, we describe and evaluate two document classification algorithms
that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design
classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated them
against one another. One of our systems was the overall best-performing system submitted to the challenge task. It
utilizes a novel approach to k-nearest neighbor classification, which we describe here, and we compare its performance to
that of two support vector machine-based classification systems, one of which was also evaluated in the challenge
task.

EGC 8258
Manifold Adaptive Experimental Design for Text Categorization

In many information processing tasks, labels are usually expensive and unlabeled data points are abundant. To
reduce the cost of collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e.,
which would improve the classifier the most if they were labeled. Many active learning techniques have been proposed for text
categorization, such as SVMActive and Transductive Experimental Design. However, most previous approaches try to
discover the discriminant structure of the data space, whereas the geometrical structure is not well respected. In this
paper, we propose a novel active learning algorithm which is performed in the data manifold adaptive kernel space. The
manifold structure is incorporated into the kernel space by using graph Laplacian. This way, the manifold adaptive
kernel space reflects the underlying geometry of the data. By minimizing the expected error with respect to the optimal
classifier, we can select the most representative and discriminative data points for labeling. Experimental results on text
categorization have demonstrated the effectiveness of our proposed approach.

EGC 8259
Markov Invariants for Phylogenetic Rate Matrices Derived from Embedded Submodels

We consider novel phylogenetic models with rate matrices that arise via the embedding of a progenitor model on a small
number of character states, into a target model on a larger number of character states. Adapting representation-
theoretic results from recent investigations of Markov invariants for the general rate matrix model, we give a prescription
for identifying and counting Markov invariants for such "symmetric embedded" models, and we provide enumerations of
these for the first few cases with a small number of character states. The simplest example is a target model on three
states, constructed from a general two-state model: the "2 ↪ 3" embedding. We show that for two taxa, there
exist two invariants of quadratic degree that can be used to directly infer pairwise distances from observed sequences
under this model. A simple simulation study verifies their theoretical expected values and suggests that, given the
appropriateness of the model class, they have statistical properties superior to those of the standard (log) Det invariant (which
is of cubic degree for this case).

EGC 8260
Matching Split Distance for Unrooted Binary Phylogenetic Trees

The reconstruction of evolutionary trees is one of the primary objectives in phylogenetics. Such a tree represents the
historical evolutionary relationship between different species or organisms. Tree comparisons are used for multiple
purposes, from unveiling the history of species to deciphering evolutionary associations among organisms and
geographical areas. In this paper, we propose a new method of defining distances between unrooted binary
phylogenetic trees that is especially useful for relatively large phylogenetic trees. Next, we investigate in detail the
properties of one example of these metrics, called the Matching Split distance, and describe how the general method
can be extended to nonbinary trees.

EGC 8261
Memory Efficient Algorithms for Structural Alignment of RNAs with Pseudoknots

In this paper, we consider the problem of structural alignment of a target RNA sequence of length n and a query
RNA sequence of length m with known secondary structure that may contain simple pseudoknots or embedded simple
pseudoknots. The best known algorithm for solving this problem runs in O(mn^3) time for simple pseudoknots or O(mn^4)
time for embedded simple pseudoknots, with space complexity of O(mn^3) for both structures. This requires too much
memory, making it infeasible for comparing noncoding RNAs (ncRNAs) of length several hundred or more. We
propose memory-efficient algorithms to solve the same problem. We reduce the space complexity to O(n^3) for simple
pseudoknots and O(mn^2 + n^3) for embedded simple pseudoknots while maintaining the same time complexity. We also
show how to modify our algorithm to handle a restricted class of recursive simple pseudoknots, which is found to be abundant
in real data, with space complexity of O(mn^2 + n^3) and time complexity of O(mn^4). Experimental results show that our
algorithms are feasible for comparing ncRNAs of length more than 500.
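
A back-of-the-envelope calculation shows why the space bounds above matter at lengths around 500. The Python sketch below counts table cells for each bound. It assumes a hidden constant factor of 1 and 4 bytes per cell, both of which are illustrative assumptions and not figures from the paper, so only the ratios between the results are meaningful.

```python
BYTES_PER_CELL = 4  # assumed storage per table entry


def table_cells(m, n):
    """Cell counts implied by each space bound, taking constant factors as 1."""
    return {
        "previous algorithm, O(mn^3)": m * n ** 3,
        "simple pseudoknot, O(n^3)": n ** 3,
        "embedded simple pseudoknot, O(mn^2 + n^3)": m * n ** 2 + n ** 3,
    }


# Query and target both of length 500, the scale quoted in the abstract.
for label, cells in table_cells(500, 500).items():
    gigabytes = cells * BYTES_PER_CELL / 1e9
    print(f"{label}: {cells:.3e} cells, about {gigabytes:.1f} GB")
```

Under these assumptions the O(mn^3) table needs about 250 GB, while the reduced bounds need about 0.5 GB and 1.0 GB respectively, a saving by a factor on the order of m.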

EGC 8262
MinePhos: A Literature Mining System for Protein Phosphorylation Information Extraction

The rapid growth of scientific literature calls for automatic and efficient ways to facilitate extracting experimental data on
protein phosphorylation. Such information is of great value for biologists in studying cellular processes and diseases
such as cancer and diabetes. Existing approaches like RLIMS-P are mainly rule based. Their performance relies
heavily on the completeness of the rules. We propose an SVM-based system known as MinePhos which outperforms
RLIMS-P in both precision and recall of information extraction when tested on a set of articles randomly chosen from
PubMed.

EGC 8263
Molecular Dynamics Trajectory Compression with a Coarse-Grained Model

Molecular dynamics trajectories are very data intensive, which limits the sharing and archival of such data. One possible
solution is compression of trajectory data. Here, trajectory compression based on conversion to the coarse-grained
model PRIMO is proposed. The compressed data are about one third the size of the original data, and fast decompression is
possible with an analytical reconstruction procedure from PRIMO to all-atom representations. This protocol largely
preserves structural features and to a more limited extent also energetic features of the original trajectory.

EGC 8264
Multiobjective Optimization-Based Approach for Discovering Novel Cancer Therapies

Solid tumors must recruit new blood vessels for growth and maintenance. Discovering drugs that block tumor-induced
development of new blood vessels (angiogenesis) is an important approach in cancer treatment. The complexity of
angiogenesis presents both challenges and opportunities for cancer therapies. Intuitive approaches, such as blocking

