Approximating the subtree distance between phylogenies. C Semple
1. Fixed-Parameter and Approximation Algorithms for
the Subtree Distance Between Phylogenies
Charles Semple
Biomathematics Research Centre
Department of Mathematics and Statistics
University of Canterbury
New Zealand
Joint work with Magnus Bordewich, Catherine McCartin
e-Science Institute, Edinburgh 2007
2. Subtree Prune and Regraft (SPR)
Example. [Figure: rooted trees S, T1 and T2 on leaves a, b, c, d with root r; arrows labelled "1 SPR" relate the trees.]
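A minimal sketch of a single SPR move, assuming the usual description of the operation (prune a subtree, suppress the resulting degree-2 vertex, and regraft the subtree onto another edge). The nested-tuple tree encoding, the function names and the example tree are illustrative assumptions, not taken from the talk.

```python
def leaf_set(node):
    """Leaves below a node; a leaf is a string, an internal vertex a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def find(node, target):
    """Return the subtree whose leaf set equals target, or None."""
    if leaf_set(node) == target:
        return node
    if isinstance(node, str):
        return None
    for child in node:
        hit = find(child, target)
        if hit is not None:
            return hit
    return None


def prune(node, target):
    """Remove the subtree on leaf set target and suppress the degree-2 vertex."""
    if isinstance(node, str):
        return node
    kept = [prune(child, target) for child in node if leaf_set(child) != target]
    return kept[0] if len(kept) == 1 else tuple(kept)


def graft(node, subtree, where):
    """Attach subtree to the edge entering the vertex with leaf set where."""
    if leaf_set(node) == where:
        return (node, subtree)
    if isinstance(node, str):
        return node
    return tuple(graft(child, subtree, where) for child in node)


def spr(tree, pruned, where):
    """One SPR move: prune the subtree on `pruned`, regraft above `where`."""
    pruned, where = frozenset(pruned), frozenset(where)
    subtree = find(tree, pruned)
    if subtree is None or pruned == leaf_set(tree):
        raise ValueError("no proper subtree with that leaf set")
    result = graft(prune(tree, pruned), subtree, where)
    if leaf_set(result) != leaf_set(tree):
        raise ValueError("regraft location not found in the pruned tree")
    return result


# Hypothetical example tree; "r" plays the role of the root leaf.
S = ("r", ((("a", "b"), "c"), "d"))
print(spr(S, {"c"}, {"d"}))  # ('r', (('a', 'b'), ('d', 'c')))
```

Subtrees are addressed by their leaf sets, which is unambiguous here because every internal vertex has at least two children.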
3. Applications of SPR
Used
I. As a search tool for selecting the best tree in
reconstruction algorithms.
II. To quantify the dissimilarity between two phylogenetic
trees.
III. To provide a lower bound on the number of reticulation
events in the case of non-tree-like evolution.
For II and III, one wants the minimum number of SPR operations to
transform one phylogeny into another.
This number is the SPR distance between two phylogenies S and T.
4. The Mathematical Problem
MINIMUM SPR
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a minimum length sequence of single SPR operations that
transforms S into T.
Measure: The length of the sequence.
Notation: Use dSPR(S, T) to denote this minimum length.
Theorem (Bordewich, S 2004)
MINIMUM SPR is NP-hard.
Overriding goal is to find (with no restrictions) the exact solution or a
heuristic solution with a guarantee of closeness.
5. Algorithms for NP-Hard Problems
Fixed-parameter algorithms are a practical way to find optimal
solutions if the parameter measuring the hardness of the problem
is small.
Polynomial-time approximation algorithms can efficiently find feasible
solutions that are sometimes arbitrarily close to the optimal
solution.
6. Agreement Forests
A forest of T is a disjoint collection of phylogenetic subtrees
whose union of leaf sets is X ∪ {r}.
Example. [Figure: tree S on leaves a, b, c, d, e, f with root r, and a forest F1 of S.]
8. Agreement Forests
An agreement forest for S and T is a forest of both S and T.
Example. [Figure: trees S and T on leaves a, b, c, d, e, f with root r, and forests F1 and F2.]
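The definition can be checked mechanically. The sketch below assumes the usual formal reading of "a forest of both S and T": the components partition the leaf set, each component induces the same restricted tree in S and in T, and the minimal subtrees connecting the components are vertex-disjoint in each tree. The nested-tuple encoding, the function names and the two small example trees are illustrative assumptions, not taken from the slides.

```python
def clusters(tree):
    """Leaf sets below every vertex of a nested-tuple tree (leaves are strings)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def restriction(cl, block):
    """Clusters of the tree restricted to the leaves in block."""
    return {c & block for c in cl if c & block}


def spanned(cl, block):
    """Vertices (named by cluster) of the minimal subtree connecting block."""
    lca = min((c for c in cl if block <= c), key=len)
    return {c for c in cl if c & block and c <= lca}


def is_agreement_forest(s, t, blocks):
    """Check that the leaf-set partition `blocks` is an agreement forest."""
    blocks = [frozenset(b) for b in blocks if b]
    cs, ct = clusters(s), clusters(t)
    leaves = max(cs, key=len)
    covered = frozenset().union(*blocks)
    if max(ct, key=len) != leaves or covered != leaves:
        return False
    if sum(len(b) for b in blocks) != len(leaves):
        return False  # blocks overlap, so they do not partition the leaves
    for cl in (cs, ct):
        used = set()
        for b in blocks:
            verts = spanned(cl, b)
            if used & verts:
                return False  # two components share a vertex
            used |= verts
    # Same restricted clusters means the restricted trees are isomorphic.
    return all(restriction(cs, b) == restriction(ct, b) for b in blocks)


# Hypothetical trees; "r" plays the role of the root leaf.
S = ("r", ((("a", "b"), "c"), "d"))
T = ("r", (("a", "b"), ("c", "d")))
print(is_agreement_forest(S, T, [{"r", "a", "b", "d"}, {"c"}]))  # True
print(is_agreement_forest(S, T, [{"r", "a", "b", "c", "d"}]))    # False
```

Vertices are identified by the set of leaves below them, which keeps the disjointness test to plain set operations.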
11. Theorem. (Bordewich, S, 2004)
Let S and T be two binary phylogenetic trees. Then
dSPR(S,T) = size of maximum-agreement forest - 1.
o It’s fast to construct from a maximum-agreement forest for S
and T a sequence of SPR operations that transforms S into T.
13. Fixed-Parameter Algorithms
The underlying idea is to find an algorithm whose running time
separates the size of the problem instance from the parameter of
interest.
One way to obtain such an algorithm is to reduce the size of the
problem instance, while preserving the optimal value (kernelizing
the problem).
Are the subtree and chain reductions enough to kernelize the
problem?
14. Fixed-Parameter Algorithms
Lemma. If n’ denotes the size of the leaf sets of the fully reduced
trees using the subtree and chain reductions, then
n’ < 28dSPR(S,T).
Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter
tractable.
1. Repeatedly apply the subtree and chain rules.
2. Exhaustively find a maximum-agreement forest by deleting edges
in S and comparing with T.
Running time is O((56k)^k + p(n)) compared with O((2n)^k), where
k=dSPR(S,T) and p(n) is the polynomial bound for reducing the
trees using the subtree and chain reductions.
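To make the comparison concrete: a reduced tree with fewer than 28k leaves has roughly fewer than 56k edges, so choosing k edges to delete gives at most about (56k)^k candidates, whereas the unreduced tree gives about (2n)^k. The helper names and the sample values of n and k below are assumptions chosen for illustration.

```python
def kernel_search_bound(k):
    """Bound (56k)^k on edge choices after the reductions."""
    return (56 * k) ** k


def naive_search_bound(n, k):
    """Bound (2n)^k on edge choices in the unreduced tree."""
    return (2 * n) ** k


def kernel_helps(n, k):
    """True when the kernelized bound is the smaller of the two."""
    return kernel_search_bound(k) < naive_search_bound(n, k)


print(kernel_search_bound(2))       # 12544
print(naive_search_bound(1000, 2))  # 4000000
print(kernel_helps(1000, 2))        # True
print(kernel_helps(20, 2))          # False
```

The last line shows that the kernel bound only pays off once n exceeds 28k; for small trees the direct bound is already smaller.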
15. Fixed-Parameter Algorithms
A second way to obtain a fixed-parameter algorithm is using the
method of bounded search trees.
Preliminaries
o A rooted triple is a binary tree with three leaves.
Example. [Figure: the rooted triple ab|c on leaves a, b, c.]
o A tree S displays ab|c if the last common ancestor of a and b is a
proper descendant of the last common ancestor of a and c.
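The display test translates directly into code. This is a minimal sketch, assuming trees are encoded as nested tuples with string leaves; the encoding and the function names are illustrative choices.

```python
def leaf_set(node):
    """Leaves below a node; a leaf is a string, an internal vertex a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def lca_cluster(tree, targets):
    """Leaf set below the last common ancestor of the leaves in targets."""
    node = tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaf_set(child)]
        if not below:
            break  # no single child contains all targets: node is the LCA
        node = below[0]
    return leaf_set(node)


def displays(tree, a, b, c):
    """True if tree displays ab|c."""
    ab = lca_cluster(tree, frozenset([a, b]))
    ac = lca_cluster(tree, frozenset([a, c]))
    return ab < ac  # proper subset means proper descendant


triple = (("a", "b"), "c")
print(displays(triple, "a", "b", "c"))  # True
print(displays(triple, "a", "c", "b"))  # False
```

Comparing the leaf sets below the two ancestors avoids storing parent pointers: one ancestor is a proper descendant of the other exactly when its leaf set is a proper subset.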
16. Fixed-Parameter Algorithms
Observation. If F is an agreement forest for S and T, then F can
be obtained from S by deleting |F|-1 edges.
Recall. If F is maximum, then |F|-1 = dSPR(S,T).
[Figure: trees S and T on leaves a, b, c, d, e, f with root r.]
18. Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in S not in T.
2. Branch into 4 computational paths e_x, e_y, e_z, e_r, and repeat
for each path until no such triple.
3. For each path, find a pair of overlapping components t_s and t_t
and branch into 2 paths e_s, e_t. Repeat until no such components.
[Figure: trees S and T on leaves a, b, c, d, e, f with root r.]
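Step 1 needs the rooted triples displayed by S but not by T. The sketch below computes that set for trees encoded as nested tuples; the encoding, the function names and the example trees are assumptions, and it does not model which conflicting triple counts as "minimal" or the branching of steps 2 and 3.

```python
from itertools import combinations


def rooted_triples(tree):
    """All triples xy|z displayed by tree, as (frozenset({x, y}), z) pairs."""
    triples = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    everything = leaves(tree)

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        parts = [walk(child) for child in node]
        below = frozenset().union(*parts)
        # x and y meet at this vertex; z is any leaf outside it.
        for left, right in combinations(parts, 2):
            for x in left:
                for y in right:
                    for z in everything - below:
                        triples.add((frozenset((x, y)), z))
        return below

    walk(tree)
    return triples


def conflicting_triples(s, t):
    """Triples displayed by s but not by t."""
    return rooted_triples(s) - rooted_triples(t)


# Hypothetical trees; "r" plays the role of the root leaf.
S = ("r", ((("a", "b"), "c"), "d"))
T = ("r", (("a", "b"), ("c", "d")))
for pair, z in sorted(conflicting_triples(S, T), key=lambda p: sorted(p[0])):
    print("".join(sorted(pair)) + "|" + z)  # ac|d then bc|d
```

When the conflict set is empty, every triple of S is displayed by T, which is the stopping condition of step 2.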
36. Fixed-Parameter Algorithms
Key Lemma. dSPR(S,T) <= k if and only if one of the computational
paths of length at most k results in an agreement forest for S and T.
Theorem. (Bordewich, McCartin, S 2007)
The bounded search tree algorithm has running time O(4^k n^4).
Combining the methods of kernelization and bounded search gives an
FPT algorithm with running time O(4^k k^4 + p(n)).
[Figure: trees S and T on leaves a, b, c, d, e, f with root r, with the deleted edges marked.]
37. Approximation Algorithms
Polynomial-time approximation algorithms can efficiently find feasible
solutions that are sometimes arbitrarily close to the optimal
solution.
An r-approximation algorithm A for MINIMUM SPR means that the
size of an agreement forest (minus 1) for S and T returned by A is
at most r times dSPR(S,T).
Example. If r=3, then the size of any agreement forest (minus 1) for S
and T returned by A is at most 3 times dSPR(S,T).
Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta
2006 establish a 5-approximation algorithm (O(n)) for MINIMUM
SPR.
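The guarantee is a single inequality, written out below. This is a small sketch; the function name and the sample numbers are assumptions for illustration only.

```python
def meets_ratio(forest_size, d_spr, r):
    """True if (forest size - 1) is at most r times the SPR distance."""
    return forest_size - 1 <= r * d_spr


# With dSPR(S,T) = 2 and r = 3, a returned forest may have up to 7 components.
print(meets_ratio(7, 2, 3))  # True
print(meets_ratio(8, 2, 3))  # False
```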
38. Approximation Algorithms
1. Find a minimal triple xy|z in S not in T.
2. Delete each of e_x, e_y, e_z, e_r, and repeat until no such triple.
3. Find a pair of overlapping components t_s and t_t and delete
e_s, e_t. Repeat until no such components.
Theorem. (Bordewich, McCartin, S 2007)
This gives a 4-approximation algorithm for MINIMUM SPR (O(n^5)).
Deleting only e_x, e_z, e_r gives a 3-approximation algorithm.
[Figure: trees S and T on leaves a, b, c, d, e, f with root r, with the deleted edges marked.]
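39. [Figure: scale of approximation ratios r, running from r=1 (exact algorithms) to "no r, unless P=NP". Values marked for MINIMUM SPR: 1 + 1/2112 (B, S 07), 3 (B, M, S 07), 5 (B, S, M, A 06). Reference problems shown: maximum independent set in planar graphs (PTAS); 1. travelling salesman problem; 2. maximum clique.]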
40. Modelling Hybridization Events with SPR Operations
Reticulation processes cause species to be a composite of DNA
regions derived from different ancestors.
Processes include
o horizontal gene transfer,
o hybridization, and
o recombination.
"… molecular phylogeneticists will have failed to find the `true
tree’, not because their methods are inadequate or because they
have chosen the wrong genes, but because the history of life
cannot be properly represented as a tree."
Ford Doolittle, 1999 (Dalhousie University)
41. Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison
1997).
Example. [Figure: trees S and T on leaves a, b, c, d with root r, and a hybridization network H.]
44. Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison
1997).
Example. [Figure: trees S and T on leaves a, b, c, d with root r, a hybridization network H, and the forest F obtained from H by deleting hybrid edges.]
45. A Fundamental Problem for Biologists
Given an initial set of data that correctly represents the tree-like
evolution of different parts of various species' genomes,
what is the smallest number of reticulation events required that
simultaneously explains the evolution of the species?
This smallest number
o Provides a lower bound on the number of such events.
o Indicates the extent of the effect that hybridization has had on
the evolutionary history of the species under consideration.
Since the 1930s, botanists have asked the question: How significant
has the effect of hybridization been on the New Zealand flora?
46. Trees and Hybridization Networks
H explains T if T can be obtained from a rooted subtree of H by
suppressing degree-2 vertices.
Example. [Figure: trees S and T on leaves a, b, c, d, and hybridization networks H1 and H2.]
49. The Mathematical Problem
MINIMUM HYBRIDIZATION
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a hybridization network H that explains S and T, and
minimizes the number of hybridization vertices.
Measure: The number of hybridization vertices in H.
Notation: Use h(S, T) to denote this minimum number.
51. o A sequence of SPR operations that avoids creating directed cycles
can be used to make a hybridization network that explains S and T.
o If one minimizes the length of an (acyclic) sequence, does the
resulting network minimize the number of hybridization events to
explain S and T?
o YES, and such a sequence corresponds to an acyclic-agreement
forest.
52. Theorem. (Baroni, Grünewald, Moulton, S, 2005)
Let S and T be two binary phylogenetic trees. Then
h(S,T) = size of maximum-acyclic agreement forest - 1.
o It’s fast to construct from a maximum-acyclic agreement forest for
S and T a hybridization network that realizes h(S,T).
Theorem. (Bordewich, S, 2007)
MINIMUM HYBRIDIZATION is NP-hard.
54. Fixed-Parameter Algorithms
Are the subtree and chain reductions enough to kernelize the
problem?
Lemma. If n’ denotes the size of the leaf sets of the fully reduced
trees using the subtree and chain reductions, then
n’<14h(S,T).
Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is
fixed-parameter tractable.
Running time is O((28k)^k + p(n)) compared with O((2n)^k), where
k=h(S,T) and p(n) is the polynomial bound for reducing the trees
using the subtree and chain reductions.
56. A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,
Düsseldorf)
o Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous
plant hybrids)
o For each sequence, used fastDNAml to reconstruct a phylogenetic
tree (H. Schmidt).
64. Running times measured on a 2000 MHz CPU, 2GB RAM
pairwise combination   # overlapping taxa   h(S,T)        running time
ndhF   phyB            40                   14            11h
ndhF   rbcL            36                   13            11.8h
ndhF   rpoC2           34                   12            26.3h
ndhF   waxy            19                   9             320s
ndhF   ITS             46                   at least 15
phyB   rbcL            21                   4             1s
phyB   rpoC2           21                   7             180s
phyB   waxy            14                   3             1s
phyB   ITS             30                   8             19s
rbcL   rpoC2           26                   13            29.5h
rbcL   waxy            12                   7             230s
rbcL   ITS             29                   at least 9
rpoC2  waxy            10                   1             1s
rpoC2  ITS             31                   at least 10
waxy   ITS             15                   8             620s
Bordewich, Linz, St John, S, 2007
65. Computing dSPR(S,T) and h(S,T)
dSPR(S,T)
1. FPT using kernelization and bounded search tree methods
(O(4^k k^4 + p(n))).
2. No cluster-based reduction.
3. 3-approximation algorithm.
h(S,T)
1. FPT using kernelization only (O((28k)^k + p(n))). Unknown if a
bounded search tree method exists.
2. Cluster-based reduction.
3. Unknown if there even is an approximation algorithm.
66. Reducing the Size of the Instance
[Figure: Subtree Reduction and Chain Reduction. Trees S and T containing a chain a1, a2, ..., an, and the reduced trees S' and T' containing x1, x2, x3.]