We apply constraints and global optimization to the problem of restricting overlap among gene predictions for prokaryotic genomes. We investigate existing heuristic methods and show how they may be expressed using Constraint Handling Rules. Furthermore, we integrate existing methods in a global optimization procedure expressed as a probabilistic model in the PRISM language. This approach yields an optimal (highest-scoring) subset of predictions that satisfies the constraints. Experimental results indicate accuracy comparable to the heuristic approaches.
Constraints and Global Optimization for Gene Prediction Overlap Resolution
1. Constraints and Global Optimization
for Gene Prediction Overlap Resolution
Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark
WCB 2011 in Perugia, September 12, 2011
Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
2. Outline
1 Motivation and background
Gene finding
2 Overlap resolution
Local heuristics
Global optimization
3 Implementation
4 Evaluation
5 Summary and future directions
3. Motivation and background
Motivation
In the LoSt project we explore the use of Probabilistic Logic
programming (with PRISM) for biological sequence analysis.
In particular, we have been concerned with applying these techniques
to gene finding for prokaryotes.
4. Motivation and background Gene finding
Gene finding
A traditional view,
Considered as a sequence classification task.
Performed without much context.
Ignores constraints between predicted genes and their placement in
the genome.
Problem with overlapping genes.
Over-prediction of overlapping genes and shadow genes.
5. Motivation and background Gene finding
Two step approach
A divide and conquer two step approach to gene finding,
1 Gene prediction: A gene finder supplies a set of candidate predictions
p1, ..., pn, called the initial set.
2 Pruning: The initial set is pruned according to certain rules or
constraints. We call the pruned set the final set.
6. Overlap resolution
Pruning step as a Constraint Satisfaction Problem
Definition
A Constraint Satisfaction Problem is a triplet ⟨X, D, C⟩. X is a set of n
variables, X = {x1, ..., xn}, with domains D = {D(x1), ..., D(xn)}. The
constraints C impose restrictions on possible assignments for sets of
variables. A solution is an assignment of a value v ∈ D(xi) to each
variable xi ∈ X, consistent with C.
We introduce variables X = x1, ..., xn corresponding to each prediction
p1, ..., pn in the initial set. All variables have boolean domains,
∀xi ∈ X, D(xi) = {true, false} and xi = true ⇔ pi ∈ final set.
7. Overlap resolution Local heuristics
Local heuristic techniques
Local heuristic pruning rules to post-process a set of gene predictions.
Deletes inconsistent predictions based on various criteria.
Pruning decisions based on the context of only a subset of the
predictions.
CSP formulation
Pruning xi just means D(xi) = D(xi) \ {true} ⇔ D(xi) = {false}.
8. Overlap resolution Local heuristics
Local heuristic techniques as Constraint Handling Rules
Pruning is conveniently expressed as CHR simplification rules.
1 Constraint store starts out as the initial set.
2 Simplification rules prune predictions from the constraint store, until
no more rules apply.
3 Final constraint store represents the final set.
9. Overlap resolution Local heuristics
Genemark frame-by-frame
Shmatkov, Anton M. and Melikyan, Arik A. and Chernousko, Felix L.
and Borodovsky, Mark
Finding prokaryotic genes by the frame-by-frame algorithm: targeting
gene starts and overlapping genes
Bioinformatics, 1999
Frame-by-frame
prediction(Left1,Right1), prediction(Left2,Right2) <=>
Left1 =< Left2,
Right1 >= Right2
|
prediction(Left1,Right1).
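The effect of the frame-by-frame rule can be sketched outside CHR. The Python fragment below is only an illustration under stated assumptions: predictions are plain `(left, right)` tuples, and the function name `prune_contained` is hypothetical. It swaps rule rewriting for a sort-and-sweep, which should reach the same final set because the rule removes exactly those predictions that are contained in another one.

```python
def prune_contained(preds):
    """Drop every prediction that lies completely inside another one.

    preds: iterable of (left, right) tuples (an assumed representation).
    Mirrors the guard Left1 =< Left2, Right1 >= Right2 of the rule above.
    """
    kept = []
    max_right = float("-inf")
    # Sort by left end ascending, and by right end descending on ties,
    # so any container of a prediction is visited before the prediction.
    for left, right in sorted(preds, key=lambda p: (p[0], -p[1])):
        if right <= max_right:
            continue  # contained in an earlier, at least as wide prediction
        kept.append((left, right))
        max_right = right
    return kept


example = [(10, 100), (20, 50), (90, 200), (10, 100), (150, 180)]
final_set = prune_contained(example)
```

Because the input is sorted first, the result does not depend on the order in which predictions are supplied.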
10. Overlap resolution Local heuristics
ECOPARSE (1)
Krogh, Anders and Mian, I. Saira and Haussler, David
A hidden Markov model that finds genes in E.coli DNA.
Nucl. Acids Res, 22, 1994.
Rule 1
pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
length_ratio((Left1,Right1),(Left2,Right2),Ratio),
length(Left1,Right1,Length1),
length(Left2,Right2,Length2),
OverlapLen > 15, Score1 > Score2,
((Length1 > 400, Length2 > 400) ; Ratio > 0.5)
|
pred(Left1,Right1,Score1).
11. Overlap resolution Local heuristics
ECOPARSE (2)
Rule 2
pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
length_ratio((Left1,Right1),(Left2,Right2),Ratio),
length(Left1,Right1,Length1),
length(Left2,Right2,Length2),
OverlapLen > 15, Ratio =< 0.5, Length1 =< Length2
|
pred(Left1,Right1,Score1).
12. Overlap resolution Local heuristics
ECOPARSE (3) - Ambiguity!
We have two predictions P1 and P2 that overlap each end of a third
prediction P3 by more than 15 bases. If P1 and P3 are considered
before P2 and P3 then P3 will be removed by the first rule. Consequently
P2 does not overlap and is kept. If they are considered in opposite order,
however, then P2 will be removed by the second rule and subsequently P3
is removed by the first rule.
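The order dependence can be reproduced with a small Python sketch. Everything concrete here is an assumption made for illustration: the coordinates and scores are invented, `length` is taken as `right - left + 1`, `length_ratio` as shorter over longer, and the two rules are transcribed literally from the slides (rule 2 keeps the prediction that is no longer than the other). The store is scanned in list order, which stands in for one possible execution strategy.

```python
from typing import NamedTuple


class Pred(NamedTuple):
    name: str
    left: int
    right: int
    score: float


def length(p):
    return p.right - p.left + 1


def overlap_len(a, b):
    return max(0, min(a.right, b.right) - max(a.left, b.left) + 1)


def length_ratio(a, b):
    # Assumed definition: shorter length divided by longer length.
    return min(length(a), length(b)) / max(length(a), length(b))


def rule1(keep, drop):
    return (overlap_len(keep, drop) > 15 and keep.score > drop.score
            and ((length(keep) > 400 and length(drop) > 400)
                 or length_ratio(keep, drop) > 0.5))


def rule2(keep, drop):
    return (overlap_len(keep, drop) > 15
            and length_ratio(keep, drop) <= 0.5
            and length(keep) <= length(drop))


def prune(store):
    """Apply the rules to pairs in list order until none applies."""
    store = list(store)
    changed = True
    while changed:
        changed = False
        for keep in store:
            for drop in store:
                if keep is not drop and (rule1(keep, drop) or rule2(keep, drop)):
                    store.remove(drop)
                    changed = True
                    break
            if changed:
                break  # restart the scan on the reduced store
    return [p.name for p in store]


# Hypothetical predictions: P1 and P2 each overlap one end of P3 by 100 bases.
p1 = Pred("P1", 0, 499, 0.9)
p2 = Pred("P2", 600, 1499, 0.7)
p3 = Pred("P3", 400, 699, 0.5)

order_a = prune([p1, p3, p2])  # P1/P3 considered first
order_b = prune([p2, p3, p1])  # P2/P3 considered first
```

With these numbers, the first order leaves P1 and P2 in the store, while the second leaves only P1.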
13. Overlap resolution Local heuristics
Local heuristics and confluence
The final set will be the same ⇔ the program is confluent, i.e. it is not
sensitive to the order of execution.
With the simple Genemark rule it does not matter in which order the
predictions are processed.
With slightly more complicated heuristics (such as ECOPARSE),
programs tend not to be confluent.
Conclusion: Execution strategy is important!
14. Overlap resolution Global optimization
Pruning step as a Constraint Optimization Problem (1)
CSP may have multiple solutions!
We want the “best” solution, meaning many correct predictions, few
incorrect.
The set of real genes is not known in advance!
Instead, optimize the probability that the predictions are correct
(confidence scores from gene finder).
This extends the problem to a constraint optimization problem.
Definition
A Constraint Optimization Problem (COP) is a CSP where each solution is
associated with a cost and the goal is to find a solution with minimal cost.
15. Overlap resolution Global optimization
COP formulation
We would like the final set to reflect the relative confidence scores in the
predictions assigned by the gene finder and at the same time be consistent
with the overlap constraints.
COP formulation
Let the scores of p1, ..., pn be s1, ..., sn with si ∈ R+.
Maximize $\sum_{i=1}^{n} s_i$, subject to C, where only predictions in the
final set (xi = true) contribute to the sum.
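For a handful of predictions the COP can be solved by plain enumeration, which makes the objective concrete. The sketch below is not the method of the talk: it reads the objective as the total score of the kept predictions, tries every boolean assignment, and uses a hypothetical stand-in constraint (`no_overlap`, no two kept predictions may share a base). The names `solve_cop` and `no_overlap` and the sample data are assumptions.

```python
from itertools import product


def solve_cop(scores, consistent):
    """Exhaustively find the consistent assignment with the highest kept score.

    scores: list of positive confidence scores s_1..s_n.
    consistent: function taking a tuple of booleans (x_1..x_n) and returning
                True when the assignment satisfies the constraints C.
    """
    best_total, best_assignment = float("-inf"), None
    for assignment in product([True, False], repeat=len(scores)):
        if not consistent(assignment):
            continue
        total = sum(s for s, kept in zip(scores, assignment) if kept)
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Hypothetical data: three predictions as (left, right) and their scores.
preds = [(0, 10), (5, 20), (15, 30)]
scores = [3.0, 4.0, 2.5]


def no_overlap(assignment):
    # Stand-in constraint for illustration: kept predictions must be disjoint.
    kept = [p for p, x in zip(preds, assignment) if x]
    return all(a[1] < b[0] or b[1] < a[0]
               for i, a in enumerate(kept) for b in kept[i + 1:])


assignment, total = solve_cop(scores, no_overlap)
```

Enumeration costs 2^n evaluations, so it only serves to check small cases against a more scalable search.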
16. Overlap resolution Global optimization
Variable ordering
Assume an ordering on the variables,
Initial set predictions p1 . . . pn are sorted by the position of their
left-most base, such that
∀pi , pj , i < j ⇒ left-most(pi ) ≤ left-most(pj ).
The variables x1 . . . xn of the CSP/COP are given the same ordering.
17. Overlap resolution Global optimization
COP implementation with Markov Chain (1)
We propose to use a (constrained) Markov chain for the COP.
The Markov chain has a begin state, an end state and two states for
each variable xi corresponding to its boolean domain D(xi ).
The state corresponding to D(xi ) = true is denoted αi and the state
corresponding to D(xi ) = false is denoted βi .
In this model, a path from the begin state to the end state
corresponds to a potential solution of the CSP.
18. Overlap resolution Global optimization
From confidence scores to transition probabilities
P(α1 |begin) = σ1
P(β1 |begin) = 1 − σ1
P(end|αn ) = P(end|βn ) = 1.
P(αi |αi−1 ) = P(αi |βi−1 ) = σi
P(βi |αi−1 ) = P(βi |βi−1 ) = 1 − σi
$\sigma_i = 0.5 + \lambda + \dfrac{(0.5 - \lambda)\,(s_i - \min(s_1, \ldots, s_n))}{\max(s_1, \ldots, s_n) - \min(s_1, \ldots, s_n)}$
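The mapping from scores to transition probabilities, and the resulting probability of a path, can be sketched in a few lines of Python. The value `lam=0.1` is an arbitrary choice for illustration (the slides do not fix λ), the function names are hypothetical, and the handling of the degenerate case where all scores are equal is an assumption, since the formula is undefined there.

```python
def transition_probs(scores, lam=0.1):
    """Map confidence scores to sigma_i = P(alpha_i | previous state)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with identical scores, fall back to the lower bound.
        return [0.5 + lam] * len(scores)
    return [0.5 + lam + (0.5 - lam) * (s - lo) / (hi - lo) for s in scores]


def path_probability(sigmas, kept):
    """Probability of a begin-to-end path; kept[i] is True for alpha_i."""
    p = 1.0
    for sigma, k in zip(sigmas, kept):
        p *= sigma if k else 1.0 - sigma
    return p


sigmas = transition_probs([1.0, 2.0, 3.0], lam=0.1)
prob = path_probability(sigmas, [True, False, True])
```

The lowest score maps to 0.5 + λ and the highest to 1, so keeping a prediction is always at least as probable as discarding it, and the constraints are what force predictions out of the final set.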
19. Overlap resolution Global optimization
Constraining the Markov chain
Constraints are defined on states that are not allowed to occur
together in a path.
CHR rules similar to those of the local heuristics.
Instead of removing predictions they define conditions for
inconsistency (inconsistency rules).
Inconsistency rules
Inconsistency rules match predictions corresponding to α and β states in
the head of the rule. The guard of the rule ensures that the additional
criteria for rule application are met and the implication of the rules is
always failure.
Such rules are necessarily confluent!
20. Overlap resolution Global optimization
Example: Glimmer 3
Delcher, Arthur L. and Bratke, Kirsten A. and Powers, Edwin C. and
Salzberg, Steven L.
Identifying bacterial genes and endosymbiont DNA with Glimmer
Bioinformatics, 23, 2007.
Glimmer's inconsistency rule
alpha(Left1,Right1), alpha(Left2,Right2) <=>
overlap_length((Left1,Right1),(Left2,Right2),OverlapLen),
OverlapLen > 110
|
fail.
21. Overlap resolution Global optimization
Inconsistency rules, Genemark and ECOPARSE
Genemark inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=>
Left1 =< Left2, Right1 >= Right2 | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=>
Left1 =< Left2, Right1 >= Right2 | fail.
ECOPARSE inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.
22. Implementation
Implementation in PRISM
Implemented as a constrained Markov chain in PRISM.
PRISM Markov chain
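% choose(P) is a random switch with outcomes alpha (keep) and beta (discard).
values(choose(_), [alpha,beta]).
% Base case: all predictions have been processed.
markov_chain(P,_) :-
last_prediction_id(LPID), P > LPID.
% Recursive case: pick a state for prediction P and check it against the store.
markov_chain(P,Store1) :-
last_prediction_id(N), P =< N,
msw(choose(P), AlphaBeta),
X =.. [ AlphaBeta, P ], % builds the term alpha(P) or beta(P)
check_constraints(X,Store1,Store2),
NextPrediction is P + 1,
markov_chain(NextPrediction,Store2).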
23. Implementation
Constraint checking and relevant recent states
Maintain a list of recent states (m_i) sorted by the right-most position
of the corresponding predictions.
Constraints are only checked for predictions corresponding to
elements of m_i.
In step i, we construct m_i as the maximal prefix of x_i + m_{i-1} such that
x_j ∈ m_i ⇔ right-most(p_j) ≥ left-most(p_i). If constraint propagation
fails, the PRISM derivation fails and the (partial) path it represents
is pruned from the solution space.
The most probable consistent path is found using PRISM's generic
adaptation of the Viterbi algorithm for PRISM programs.
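The search with a window of recent states can be sketched in Python. This is not PRISM's Viterbi implementation: it is a memoized depth-first search that picks the most probable consistent path, it tracks only kept (alpha) states in the window, and it applies a single inconsistency rule modeled on the Glimmer one (two kept predictions may not overlap by more than `max_overlap` bases). The function name `best_path` and the sample data are assumptions, and the recursion is only suitable for small inputs.

```python
import math
from functools import lru_cache


def best_path(preds, sigmas, max_overlap=110):
    """Return (log probability, path) of the most probable consistent path.

    preds: list of (left, right), sorted by left-most position.
    sigmas: sigmas[i] is the probability of keeping preds[i].
    path: tuple of booleans, True where the prediction is kept (alpha).
    """
    n = len(preds)

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    def safe_log(p):
        return math.log(p) if p > 0 else -math.inf

    @lru_cache(maxsize=None)
    def search(i, recent):
        if i == n:
            return 0.0, ()
        # Keep only recent states whose prediction still reaches p_i.
        live = tuple(j for j in recent if preds[j][1] >= preds[i][0])
        # beta_i: discard prediction i (always consistent in this sketch).
        lp, path = search(i + 1, live)
        best = (safe_log(1.0 - sigmas[i]) + lp, (False,) + path)
        # alpha_i: keep prediction i unless an inconsistency rule fires.
        if all(overlap(preds[j], preds[i]) <= max_overlap for j in live):
            lp, path = search(i + 1, live + (i,))
            cand = (safe_log(sigmas[i]) + lp, (True,) + path)
            if cand[0] > best[0]:
                best = cand
        return best

    return search(0, ())


# Hypothetical example: the first two predictions overlap by 201 bases.
preds = [(0, 300), (100, 400), (350, 700)]
sigmas = [0.6, 0.9, 0.7]
log_prob, path = best_path(preds, sigmas)
```

Here the higher-scoring second prediction wins the conflict, and the third is kept because its 51-base overlap with the second stays under the threshold.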
24. Evaluation
Are the constraints realistic assumptions?
Definition
We consider constraints safe when the constraints prune only false
positives.
None of the examined constraints is safe with respect to the “known”
genes in E. coli (NC_000913).
The Genemark constraint removes three completely overlapped known
genes.
Four overlaps longer than 110 bases are potentially removed by the Glimmer 3
constraint.
93 overlaps longer than 15 bases are potentially removed by the ECOPARSE
constraints.
25. Evaluation
Experimental results
Prediction on E. coli using a simplistic codon-frequency-based gene
finder.
Pruning using our global optimization approach (with all
inconsistency rules) versus local heuristic rules [1].
Method                #predictions  Sensitivity  Specificity  Time (seconds)
initial set                  10799       0.7625       0.2926  n/a
Genemark rules                5823       0.7558       0.5379  1.4
ECOPARSE rules                4981       0.7148       0.5947  1.7
global optimization           5222       0.7201       0.5714  75
Sensitivity = fraction of known reference genes predicted.
Specificity = fraction of predicted genes that are correct.
[1] Note that the results for the ECOPARSE heuristic may vary depending on execution
strategy; in the results above, predictions with lower left position are considered
first.
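The two measures defined above can be computed as follows. This is a minimal Python sketch with one loud assumption: a prediction counts as correct only when its coordinates match a reference gene exactly, because the slides do not state the matching criterion. The function name and the sample coordinates are hypothetical.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) as defined on the slide.

    predicted, reference: iterables of hashable gene descriptions, here
    (left, right) tuples; exact equality is the assumed match criterion.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    sensitivity = correct / len(reference) if reference else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates, not data from the experiment.
predicted = [(1, 100), (200, 300), (400, 500), (600, 700)]
reference = [(1, 100), (200, 300), (800, 900)]
sens, spec = sensitivity_specificity(predicted, reference)
```

Note that specificity in this sense is the fraction of predictions that are correct, which other fields call precision.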
26. Summary and future directions
Summary
We presented
a unified view of overlap resolution using Constraint Handling Rules, and
a novel way to post-process gene prediction results based on
constrained global optimization
that provides an optimality guarantee with respect to gene finder confidence
scores
and has accuracy similar to the heuristic methods.
27. Summary and future directions
Future directions
Better modeling of transition probabilities using assumed gene-finder
false positive rates.
Soft constraints - since constraints are unsafe.
Use technique as a combiner: Include more gene finder predictions in
initial set.
Include and experiment with other constraints.
28. Summary and future directions
Questions
Questions?