Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter and Approximation Algorithms for
the Subtree Distance Between Phylogenies

Charles Semple
Biomathematics Research Centre
Department of Mathematics and Statistics
University of Canterbury
New Zealand

Joint work with Magnus Bordewich, Catherine McCartin

e-Science Institute, Edinburgh 2007

Subtree Prune and Regraft (SPR)

Example. r r

1 SPR
bc d
a
a bc d
T1
S

r
1 SPR

a d b
c
T2

Applications of SPR

Used
I. As a search tool for selecting the best tree in
reconstruction algorithms.
II. To quantify the dissimilarity between two phylogenetic
trees.
III. To provide a lower bound on the number of reticulation
events in the case of non-tree-like evolution.

For II and III, one wants the minimum number of SPR operations to
transform one phylogeny into another.

This number is the SPR distance between two phylogenies S and T.

The Mathematical Problem

MINIMUM SPR
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a minimum length sequence of single SPR operations that
transforms S into T.
Measure: The length of the sequence.

Notation: Use dSPR(S, T) to denote this minimum length.

Theorem (Bordewich, S 2004)
MINIMUM SPR is NP-hard.

Overriding goal is to find (with no restrictions) the exact solution or a
heuristic solution with a guarantee of closeness.

Algorithms for NP-Hard Problems

Fixed-parameter algorithms are a practical way to find optimal
solutions if the parameter measuring the hardness of the problem
is small.

Polynomial-time approximation algorithms can efficiently find feasible
solutions that are sometimes arbitrarily close to the optimal
solution.

Agreement Forests

A forest of T is a disjoint Example. r
collection of phylogenetic
subtrees whose union of leaf
sets is X U r.

a cd e f
b
S

r

a cd e f
b
F1

Agreement Forests

An agreement forest for S and T is a forest of both S and T.

r r
Example.

a cd e f d fa b c
b e
S T

r r

a cd e f a cd e f
b b
F1 F2

Agreement Forests

An agreement forest for S and T is a forest of both S and T.

r r
Example.

e
a cd f d fa b c
b e
S T

r r

a cd e f a cd e f
b b
F1 F2

Theorem. (Bordewich, S, 2004)
Let S and T be two binary phylogenetic trees. Then
dSPR(S,T) = size of maximum-agreement forest - 1.

o It’s fast to construct from a maximum-agreement forest for S
and T a sequence of SPR operations that transforms S into T.

Reducing the Size of the Instance

Subtree Reduction Chain Reduction

Fixed-Parameter Algorithms

The underlying idea is to find an algorithm whose running time
separates the size of the problem instance from the parameter of
interest.

One way to obtain such an algorithm is to reduce the size of the
problem instance, while preserving the optimal value (kernalizing
the problem).

Are the subtree and chain reductions enough to kernalize the
problem?


Lemma. If n’ denotes the size of the leaf sets of the fully reduced
trees using the subtree and chain reductions, then
n’ < 28dSPR(S,T).

Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter
tractable.

1. Repeatedly apply the subtree and chain rules.
2. Exhaustively find a maximum-agreement forest by deleting edges
in S and comparing with T.

Running time is O((56k)k + p(n)) compared with O((2n)k), where
k=dSPR(S,T) and p(n) is the polynomial bound for reducing the
trees using the subtree and chain reductions.


A second way to obtain a fixed-parameter algorithm is using the
method of bounded search trees.

Preliminaries
o A rooted triple is a binary tree with three leaves.
a bc
ab|c

o A tree S displays ab|c if the last common ancestor of ab is a
descendant of the last common ancestor of ac.

r


r
Observation. If F is an agreement d fa b c
e
forest for S and T, then F can T
be obtained from S by
deleting |F|-1 edges.

a cd e f
b
Recall. If F is maximum, then
S
|F|-1=dSPR(S,T).

r


r
1. Find a minimal triple xy|z in S d fa b c
e
not in T.
T
2. Branch into 4 computational
paths ex, ey, ez, er, and repeat
for each path until no such
triple. a cd e f
b
3. For each path, find a pair of S
overlapping components ts and tt
and branch into 2 paths es, et.
Repeat until no such
components.

r


r
e
not in T.
T
ea eb
triple. a cd e f
b
components.

r


r
e
not in T.
T
er
ea eb
triple. a cd e f
b
components.

r


r
e
not in T.
T
er
ed
ea eb
triple. a cd e f
b
components.

r


r
e
not in T.
T
er
ed
ea eb
triple. a cd e f
b
components. ed

ea
eb

er

r


r
e
not in T.
T
triple. a cd e f
b
components. ed

ea
eb

er

r


r
e
er
not in T.
T
2. Branch into 4 computational ee
ea eb
triple. a cd e f
b
components. ed

ea
eb

er

r

Fixed-Parameter Algorithms vst

r
e
not in T.
T
triple. a cd e f
b
components. ed

ea
eb

er

r


r
es
e
not in T.
T
et
triple. a cd e f
b
components. ed

ea
eb

er

r


r
es
e
not in T.
T
et
triple. a cd e f
b
components. ed

ea
eb

es
er

et

r


r
e
not in T.
T
triple. a cd e f
b
components. ed

ea
eb

es
er

et

r


r
Key Lemma. dSPR(S,T) <= k if and only d fa b c
e
if one of the computational
T
paths of length at most k
results in an agreement forest
for S and T.
a cd e f
b
Theorem. (Bordewich, McCartin, S S
2007)
The bounded search tree algorithm
has running time O(4kn4).
ed
Combining the methods of
kernalization and bounded ea
search gives an FPT algorithm eb
with running time O(4kk4+p(n)).
es
er

et

Approximation Algorithms

Polynomial-time approximation algorithms can efficiently find feasible
solutions that are sometimes arbitrarily close to the optimal
solution.

An r-approximation algorithm A for MINIMUM SPR means that the
size of an agreement forest (minus 1) for S and T returned by A is
at most r times dSPR(S,T).

Example. If r=3, then the size of any agreement forest (minus 1) for S
and T returned by A is at most 3 times dSPR(S,T).

Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta
2006 establish a 5-approximation algorithm (O(n)) for MINIMUM
SPR.

r

Approximation Algorithms

r
e
not in T.
T
2. Delete each of ex, ey, ez, er, and
repeat until no such triple.
3. Find a pair of overlapping
components ts and tt and delete a cd e f
b
es, et. Repeat until no such S
components.

Theorem. (Bordewich, McCartin, S
ed
2007)
This gives a 4-approximation alg. for
MINIMUM SPR (O(n5)). ea
eb
Deleting only ex, ez, er gives a 3-
es
approximation algorithm. er

et

No r, unless
P=NP
Exact
(B, S 07) (B, M,S 07)
algorithms
1 + 1/2112 3
r=1
MINIMUM SPR
5
Maximum 1. Travelling
(B S,M, A 06)salesman
independent
set in planar problem
graphs (PTAS)
2. Maximum
clique

Modelling Hybridization Events with SPR Operations

Reticulation processes cause … molecular phylogeneticists will
species to be a composite of have failed to find the `true
DNA regions derived from tree’, not because their
different ancestors. methods are inadequate or
because they have chosen
the wrong genes, but
Processes include
because the history of life
o horizontal gene transfer,
cannot be properly
o hybridization, and
represented as a tree.
o recombination.
Ford Doolittle, 1999
(Dalhousie University)


A single SPR operation models a single hybridization event (Maddison
1997).
r r
Example.

bc d
a
a bc d
T
S
r

a bc d
H


A single SPR operation models a single hybridization event (Maddison
1997).
r r
Example.

bc d
a
a bc d
T
S
r

r
deleting
hybrid edges
a d b
a bc d c
H F

A Fundamental Problem for Biologists

Given an initial set of data that correctly repesents the tree-like
evolution of different parts of various species genomes,

what is the smallest number of reticulation events required that
simultaneously explains the evolution of the species?

This smallest number
o Provides a lower bound on the number of such events.
o Indicates the extent that hybridization has had on the
evolutionary history of the species under consideration.

Since 1930’s botantists have asked the question: How significant has
the effect of hybridization been on the New Zealand flora?

Trees and Hybridization Networks

H explains T if T can be obtained from a rooted subtree of H by
suppressing degree-2 vertices.

Example.

b d
a bc d a c
T
S

c d c d
a b a b
H1 H2

The Mathematical Problem

MINIMUM HYBRIDIZATION
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a hybridization network H that explains S and T, and
minimizes the number of hybridization vertices.
Measure: The number of hybridization vertices in H.

Notation: Use h(S, T) to denote this minimum number.

Example: Arbitrary SPR operations not sufficient.

r

a cd e f
b
F1

o A sequence of SPR operations that avoids creating directed cycles
to make a hybridization network that explains S and T.

o If one minimizes the length of an (acyclic) sequence, does the
resulting network minimize the number of hybridization events to
explain S and T?

o YES, and such a sequence corresponds to an acyclic-agreement
forest.

Theorem. (Baroni, Grünewald, Moulton, S, 2005)
Let S and T be two binary phylogenetic trees. Then
h(S,T) = size of maximum-acyclic agreement forest - 1.

o It’s fast to construct from a maximum-acyclic agreement for S
and T a hybridization network that realizes h(S,T).

Theorem. (Bordewich, S, 2007)
MINIMUM HYBRIDIZATION is NP-hard.


Are the subtree and chain reductions enough to kernalize the
problem?

Lemma. If n’ denotes the size of the leaf sets of the fully reduced
trees using the subtree and chain reductions, then
n’<14h(S,T).

Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is
fixed-parameter tractable.

Running time is O((28k)k + p(n)) compared with O((2n)k), where
k=h(S,T) and p(n) is the polynomial bound for reducing the trees
using the subtree and chain reductions.


Cluster Reduction (Baroni 2004)

+

A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,
Düsseldorf)

o Ellstrand, Whitkus, Rieseburg 1996 (Distribution of spontaneous
plant hybrids)
o For each sequence, used fastDNAml to reconstruct a phylogenetic
tree (H. Schmidt).

Chloroplast (phyB) sequences Nuclear (ITS) sequences

h=3

h=1

h=4

h=0

Chloroplast (phyB) sequences Nuclear (ITS) sequences

pairwise combination # overlapping h(S,T) running time
taxa 2000 MHz
CPU, 2GB RAM
ndhF phyB 40 14 11h
ndhF rbcL 36 13 11.8h
ndhF rpoC2 34 12 26.3h
ndhF waxy 19 9 320s
ndhF ITS 46 at least 15
phyB rbcL 21 4 1s
phyB rpoC2 21 7 180s
phyB waxy 14 3 1s
phyB ITS 30 8 19s
rbcL rpoC2 26 13 29.5h
rbcL waxy 12 7 230s
rbcL ITS 29 at least 9
rpoC2 waxy 10 1 1s
rpoC2 ITS 31 at least 10
waxy ITS 15 8 620s
Bordewich, Linz, St John, S, 2007

Computing dSPR(S,T) and h(S,T)

dSPR(S,T) h(S,T)

1. FPT using kernalization and 1. FPT using kernalization only
bounded search tree (O((28k)k +p(n))). Unknown if
methods (O(4kk4+p(n))). a bounded search tree
method exists.

2. Cluster-based reduction.
2. No cluster-based reduction.
3. Unknown if there even is an
3. 3-approximation algorithm.
approximation algorithm.


Subtree Reduction Chain Reduction

x3
an x2
an
x1
a2 a2 x3 T’
a1 a1 x2
x1
S T
S’

Approximating the subtree distance between phylogenies. C Semple

Recomendados

Recomendados

Mais conteúdo relacionado

Mais de Roderic Page

Mais de Roderic Page (20)

Último

Último (20)

Approximating the subtree distance between phylogenies. C Semple