Fixed-Parameter and Approximation Algorithms for
     the Subtree Distance Between Phylogenies


                       Charles Semple
               Biomathematics Research Centre
           Department of Mathematics and Statistics
                   University of Canterbury
                         New Zealand

Joint work with Magnus Bordewich, Catherine McCartin

e-Science Institute, Edinburgh 2007
Subtree Prune and Regraft (SPR)


Example. [Figure: rooted trees S, T1 and T2 on leaves a, b, c, d with root r; the two arrows in the figure are each labelled "1 SPR".]
Applications of SPR


Used
   I.    As a search tool for selecting the best tree in
         reconstruction algorithms.
    II. To quantify the dissimilarity between two phylogenetic
         trees.
    III. To provide a lower bound on the number of reticulation
         events in the case of non-tree-like evolution.

For II and III, one wants the minimum number of SPR operations to
      transform one phylogeny into another.

This number is the SPR distance between two phylogenies S and T.
The Mathematical Problem


MINIMUM SPR
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a minimum length sequence of single SPR operations that
   transforms S into T.
Measure: The length of the sequence.

Notation: Use dSPR(S, T) to denote this minimum length.

Theorem (Bordewich, S 2004)
MINIMUM SPR is NP-hard.

The overriding goal is to find (with no restrictions) either the exact
  solution or a heuristic solution with a guarantee of closeness.
Algorithms for NP-Hard Problems


Fixed-parameter algorithms are a practical way to find optimal
   solutions if the parameter measuring the hardness of the problem
   is small.

Polynomial-time approximation algorithms can efficiently find feasible
   solutions that are sometimes arbitrarily close to the optimal
   solution.
Agreement Forests


A forest of T is a disjoint collection of phylogenetic subtrees whose
   union of leaf sets is X ∪ {r}.

Example. [Figure: a tree S on leaves a, b, c, d, e, f with root r, and a forest F1.]
Agreement Forests


An agreement forest for S and T is a forest of both S and T.

Example. [Figure: two trees S and T on leaves a, b, c, d, e, f with root r, shown above two forests F1 and F2.]
Theorem. (Bordewich, S, 2004)
Let S and T be two binary phylogenetic trees. Then
          dSPR(S,T) = size of maximum-agreement forest - 1.

    o It’s fast to construct from a maximum-agreement forest for S
      and T a sequence of SPR operations that transforms S into T.
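
The theorem can be checked on very small inputs by brute force. The sketch below is illustrative only and is not the method from the talk: it enumerates every partition of the leaf set rather than deleting edges, so it is exponential and only usable for a handful of leaves. It assumes trees are written as nested tuples of leaf names, with the root label 'r' attached as an extra leaf next to the whole tree; all function names are hypothetical.

```python
from itertools import combinations

def clusters(tree):
    """Set of clusters (leaf sets below each vertex) of a nested-tuple tree."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            c = frozenset([node])
        else:
            c = frozenset().union(*(walk(child) for child in node))
        found.add(c)
        return c
    walk(tree)
    return found

def restriction(cl, block):
    """Clusters of the tree restricted to the leaves in `block`."""
    return {c & block for c in cl if c & block}

def spanned_vertices(cl, block):
    """Vertices (identified by cluster) of the minimal subtree connecting `block`."""
    lca = min((c for c in cl if block <= c), key=len)
    return {c for c in cl if c & block and c <= lca}

def is_agreement_forest(s, t, blocks):
    """Assumes `blocks` already partitions the common leaf set of s and t."""
    cs, ct = clusters(s), clusters(t)
    blocks = [frozenset(b) for b in blocks]
    # Each block must induce the same rooted subtree in both trees.
    if any(restriction(cs, b) != restriction(ct, b) for b in blocks):
        return False
    # The connecting subtrees must be vertex-disjoint in S and in T.
    for cl in (cs, ct):
        spans = [spanned_vertices(cl, b) for b in blocks]
        if any(x & y for x, y in combinations(spans, 2)):
            return False
    return True

def partitions(items):
    """Yield every set partition of a list (exponentially many)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        yield [[first]] + p
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def maf_size(s, t):
    """Number of components in a maximum-agreement forest (fewest blocks)."""
    leaves = sorted(max(clusters(s), key=len))
    return min(len(p) for p in partitions(leaves) if is_agreement_forest(s, t, p))

# Hypothetical example: T is S with leaf b pruned and regrafted.
S = ('r', ((('a', 'b'), 'c'), 'd'))
T = ('r', ((('a', 'c'), 'b'), 'd'))
d_spr = maf_size(S, T) - 1
```

For this pair the forest with components {r, a, c, d} and {b} is an agreement forest and no single-component forest exists, so `d_spr` evaluates to 1, matching the single SPR move that relates the two trees.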
Reducing the Size of the Instance

Subtree Reduction                   Chain Reduction
Fixed-Parameter Algorithms


The underlying idea is to find an algorithm whose running time
  separates the size of the problem instance from the parameter of
  interest.

One way to obtain such an algorithm is to reduce the size of the
  problem instance, while preserving the optimal value (kernelizing
  the problem).

Are the subtree and chain reductions enough to kernelize the
   problem?
Fixed-Parameter Algorithms


Lemma. If n’ denotes the size of the leaf sets of the fully reduced
    trees using the subtree and chain reductions, then
                           n’ < 28dSPR(S,T).

Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter
     tractable.

1.   Repeatedly apply the subtree and chain rules.
2.   Exhaustively find a maximum-agreement forest by deleting edges
     in S and comparing with T.

Running time is O((56k)^k + p(n)) compared with O((2n)^k), where
    k = dSPR(S,T) and p(n) is the polynomial bound for reducing the
    trees using the subtree and chain reductions.
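
To get a feel for the two bounds, the following sketch evaluates their dominant terms for sample values. It ignores the hidden constants and the additive p(n) term, so the numbers only indicate the trend; the function names are hypothetical.

```python
def kernel_search_bound(k):
    """Dominant term (56k)^k of the search after kernelization."""
    return (56 * k) ** k

def naive_search_bound(n, k):
    """Dominant term (2n)^k of the search on the unreduced trees."""
    return (2 * n) ** k

def kernel_is_smaller(n, k):
    """True when the kernelized term is below the naive one, i.e. when n > 28k."""
    return kernel_search_bound(k) < naive_search_bound(n, k)
```

With k = 2, the kernelized term is smaller for n = 100 but not for n = 50, which reflects the lemma: the reduction only helps once the original trees have more than about 28k leaves.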
Fixed-Parameter Algorithms


A second way to obtain a fixed-parameter algorithm is using the
       method of bounded search trees.



Preliminaries
o      A rooted triple is a rooted binary tree with three leaves. The
       triple that groups a and b together, apart from c, is written ab|c.

o      A tree S displays ab|c if the last common ancestor of a and b is a
       proper descendant of the last common ancestor of a and c.
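
The definition of "displays" translates directly into a check on last common ancestors. The sketch below assumes trees are nested tuples of leaf names and that all three leaves occur in the tree; the function names are hypothetical.

```python
def leaf_set(node):
    """Leaves below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def lca_leaf_set(tree, targets):
    """Leaf set below the last common ancestor of the leaves in `targets`."""
    node = tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaf_set(child)]
        if not below:
            break  # no single child contains all targets: node is the LCA
        node = below[0]
    return leaf_set(node)

def displays(tree, a, b, c):
    """True if the tree displays ab|c (assumes a, b, c are leaves of the tree)."""
    ab = lca_leaf_set(tree, frozenset([a, b]))
    ac = lca_leaf_set(tree, frozenset([a, c]))
    # lca(a, b) is a proper descendant of lca(a, c) exactly when its
    # leaf set is a proper subset of the other one.
    return ab < ac
```

For the tree (('a', 'b'), ('c', 'd')), `displays(tree, 'a', 'b', 'c')` is true while `displays(tree, 'a', 'c', 'b')` is false.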
Fixed-Parameter Algorithms

Observation. If F is an agreement forest for S and T, then F can
  be obtained from S by deleting |F|-1 edges.

Recall. If F is maximum, then |F|-1 = dSPR(S,T).

[Figure: the trees S and T on leaves a, b, c, d, e, f with root r.]
Fixed-Parameter Algorithms

1.   Find a minimal triple xy|z in S not in T.
2.   Branch into 4 computational paths e_x, e_y, e_z, e_r, and repeat
     for each path until no such triple.
3.   For each path, find a pair of overlapping components t_s and t_t
     and branch into 2 paths e_s, e_t. Repeat until no such
     components.

[Figure: the trees S and T on leaves a, b, c, d, e, f with root r.]
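
Step 1 can be sketched as a search for rooted triples on which the two trees disagree. This is only an illustration of that first step, not the full bounded search tree algorithm. It assumes nested-tuple trees on the same leaf set, and it interprets "minimal" as "spanning the fewest leaves in S", which is an assumption made here rather than the definition used in the talk; all names are hypothetical.

```python
from itertools import combinations

def leaf_set(node):
    """Leaves below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def lca_leaf_set(tree, targets):
    """Leaf set below the last common ancestor of the leaves in `targets`."""
    node = tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaf_set(child)]
        if not below:
            break
        node = below[0]
    return leaf_set(node)

def rooted_triple(tree, x, y, z):
    """Pair grouped together by a binary tree, e.g. {a, b} for ab|c."""
    pairs = [frozenset(p) for p in combinations((x, y, z), 2)]
    # In a binary tree exactly one pair has the strictly lowest LCA.
    return min(pairs, key=lambda p: len(lca_leaf_set(tree, p)))

def conflicting_triples(s, t):
    """Leaf triples resolved differently by s and t, smallest span in s first."""
    out = []
    for x, y, z in combinations(sorted(leaf_set(s)), 3):
        if rooted_triple(s, x, y, z) != rooted_triple(t, x, y, z):
            out.append((x, y, z))
    return sorted(out, key=lambda tr: len(lca_leaf_set(s, frozenset(tr))))

# Hypothetical example: S displays ab|c but T displays ac|b.
S = ((('a', 'b'), 'c'), 'd')
T = ((('a', 'c'), 'b'), 'd')
conflicts = conflicting_triples(S, T)
```

Here the only conflicting triple is (a, b, c); in the algorithm above, the search would then branch on the edges associated with such a triple.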
Fixed-Parameter Algorithms

Key Lemma. dSPR(S,T) <= k if and only if one of the computational
     paths of length at most k results in an agreement forest
     for S and T.

Theorem. (Bordewich, McCartin, S 2007)
The bounded search tree algorithm has running time O(4^k n^4).

Combining the methods of kernelization and bounded search gives an
    FPT algorithm with running time O(4^k k^4 + p(n)).
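
A short count indicates where the factor 4^k comes from. It assumes that every branching step deletes exactly one edge, that no step branches more than 4 ways, and that paths are cut off at length k. The polynomial factor n^4 is taken from the theorem as stated and is not derived here.

```latex
\[
  \#\{\text{computational paths}\} \;\le\; 4^{k},
  \qquad
  \#\{\text{search-tree nodes}\} \;\le\; \sum_{i=0}^{k} 4^{i}
  \;=\; \frac{4^{k+1}-1}{3} \;<\; 2\cdot 4^{k}.
\]
```

If the work at each node is bounded by a polynomial q(n), the total is O(4^k q(n)); the theorem corresponds to q(n) = n^4. Running the same search on kernelized trees, whose size is linear in k, replaces n by O(k) inside the search and adds the one-off reduction cost p(n).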
Approximation Algorithms


Polynomial-time approximation algorithms can efficiently find feasible
   solutions that are sometimes arbitrarily close to the optimal
   solution.

An r-approximation algorithm A for MINIMUM SPR means that the
   size of an agreement forest (minus 1) for S and T returned by A is
   at most r times dSPR(S,T).

Example. If r=3, then the size of any agreement forest (minus 1) for S
   and T returned by A is at most 3 times dSPR(S,T).
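
The guarantee is a one-line inequality, which the following sketch spells out. The function names are hypothetical, and the true distance is passed in as a known value purely for illustration.

```python
def max_forest_size(r, d_spr):
    """Largest forest size an r-approximation may return when dSPR(S,T) = d_spr."""
    return r * d_spr + 1

def satisfies_ratio(forest_size, d_spr, r):
    """True if (forest size - 1) is at most r times the true distance."""
    return forest_size - 1 <= r * d_spr
```

For r = 3 and a true distance of 2, a returned forest may have up to 7 components; a forest with 8 components would violate the guarantee.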

Bonet, St. John, Mahindru and Amenta (2006) established a
   5-approximation algorithm for MINIMUM SPR, based on a pairwise
   comparison and running in O(n) time.
Approximation Algorithms

1.    Find a minimal triple xy|z in S not in T.
2.    Delete each of e_x, e_y, e_z, e_r, and repeat until no such triple.
3.    Find a pair of overlapping components t_s and t_t and delete
      e_s, e_t. Repeat until no such components.

Theorem. (Bordewich, McCartin, S 2007)
This gives a 4-approximation algorithm for MINIMUM SPR (O(n^5)).

Deleting only e_x, e_z, e_r gives a 3-approximation algorithm.
[Figure: scale of approximation ratios r, with MINIMUM SPR placed on it.
   r = 1: exact algorithms.
   1 + 1/2112 (B, S 07).
   3 (B, M, S 07).
   5 (B, S, M, A 06).
   For comparison: maximum independent set in planar graphs (PTAS).
   No r, unless P = NP: 1. travelling salesman problem, 2. maximum clique.]
Modelling Hybridization Events with SPR Operations


Reticulation processes cause species to be a composite of DNA regions
     derived from different ancestors.

Processes include
    o horizontal gene transfer,
    o hybridization, and
    o recombination.

… molecular phylogeneticists will have failed to find the ‘true tree’,
     not because their methods are inadequate or because they have
     chosen the wrong genes, but because the history of life cannot be
     properly represented as a tree.
                                   Ford Doolittle, 1999 (Dalhousie University)
Modelling Hybridization Events with SPR Operations


A single SPR operation models a single hybridization event (Maddison
   1997).
Example. [Figure: trees S and T on leaves a, b, c, d with root r, and a hybridization network H.]
Deleting the hybrid edges of H gives a forest F. [Figure: the network H and the resulting forest F on leaves a, b, c, d with root r.]
A Fundamental Problem for Biologists


Given an initial set of data that correctly represents the tree-like
   evolution of different parts of various species' genomes,

what is the smallest number of reticulation events required to
  simultaneously explain the evolution of the species?

This smallest number
    o Provides a lower bound on the number of such events.
    o Indicates the extent of the effect that hybridization has had on
       the evolutionary history of the species under consideration.

Since the 1930s, botanists have asked the question: How significant has
   the effect of hybridization been on the New Zealand flora?
Trees and Hybridization Networks


H explains T if T can be obtained from a rooted subtree of H by
   suppressing degree-2 vertices.

Example.
[Figure: trees S and T on leaves a, b, c, d, and two hybridization networks H1 and H2.]
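
On small examples, the definition of "explains" can be checked by brute force. The Python sketch below is an illustration only, not the authors' method. It assumes a network stored as a hypothetical `children` dictionary and trees written as nested tuples; the names `canonical` and `explains` and the toy network `H` are made up for this example. It keeps one incoming edge per hybridization vertex in every possible way, suppresses degree-2 vertices, and compares the result with T.

```python
from itertools import product

def canonical(tree):
    """Turn a nested-tuple tree into nested frozensets (child order ignored)."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(t) for t in tree)

def explains(children, root, tree):
    """Return True if the network displays `tree` (brute force, tiny inputs only)."""
    parents = {}
    for p, kids in children.items():
        for c in kids:
            parents.setdefault(c, []).append(p)
    hybrids = [v for v, ps in parents.items() if len(ps) > 1]
    target = canonical(tree)

    def shape(v, chosen):
        kids = children.get(v, [])
        if not kids:
            return v                      # a leaf, identified by its label
        parts = []
        for c in kids:
            if c in chosen and chosen[c] != v:
                continue                  # this hybrid edge was deleted
            s = shape(c, chosen)
            if s is not None:
                parts.append(s)
        if not parts:
            return None                   # dead end: no leaf below v
        if len(parts) == 1:
            return parts[0]               # suppress a degree-2 vertex
        return frozenset(parts)

    # Try every way of keeping exactly one parent per hybridization vertex.
    for choice in product(*(parents[h] for h in hybrids)):
        if shape(root, dict(zip(hybrids, choice))) == target:
            return True
    return False

# Toy network (an assumption for this sketch): vertex "h" has parents "u" and "v".
H = {"rho": ["u", "v"], "u": ["a", "h"], "v": ["h", "w"],
     "h": ["b"], "w": ["c", "d"]}
print(explains(H, "rho", (("a", "b"), ("c", "d"))))
print(explains(H, "rho", ("a", ("b", ("c", "d")))))
```

The search is exponential in the number of hybridization vertices, so it is only meant to make the definition concrete.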
The Mathematical Problem


MINIMUM HYBRIDIZATION
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a hybridization network H that explains S and T, and
   minimizes the number of hybridization vertices.
Measure: The number of hybridization vertices in H.



Notation: Use h(S, T) to denote this minimum number.
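
The measure in MINIMUM HYBRIDIZATION is simple to compute for a given network. A minimal sketch, assuming a network stored as a hypothetical `children` dictionary and assuming that a hybridization vertex is any vertex with at least two parents (the function name and the toy network are illustrative, not from the talk):

```python
from collections import Counter

def hybridization_vertices(children):
    """Return the vertices that have two or more incoming edges."""
    indegree = Counter(c for kids in children.values() for c in kids)
    return sorted(v for v, d in indegree.items() if d >= 2)

# Toy network (an assumption for this sketch): "h" has parents "u" and "v".
network = {"rho": ["u", "v"], "u": ["a", "h"], "v": ["h", "w"],
           "h": ["b"], "w": ["c", "d"]}
print(len(hybridization_vertices(network)))
```

Counting is the easy part; the hard part of the problem is finding a network that explains both S and T with as few such vertices as possible.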
Example: Arbitrary SPR operations not sufficient.




[Figure: forest F1 on leaves a, b, c, d, e, f with root r.]
o Use a sequence of SPR operations that avoids creating directed cycles
  to make a hybridization network that explains S and T.

o If one minimizes the length of an (acyclic) sequence, does the
  resulting network minimize the number of hybridization events to
  explain S and T?

o YES, and such a sequence corresponds to an acyclic-agreement
  forest.
Theorem. (Baroni, Grünewald, Moulton, S, 2005)
Let S and T be two binary phylogenetic trees. Then
        h(S,T) = size of maximum-acyclic agreement forest - 1.

    o It’s fast to construct from a maximum-acyclic agreement forest
      for S and T a hybridization network that realizes h(S,T).

Theorem. (Bordewich, S, 2007)
MINIMUM HYBRIDIZATION is NP-hard.
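
The acyclicity condition can be made concrete with a small check. The slides do not spell out the definition, so the sketch below assumes the one commonly used in the literature: draw an arc from component Fi to component Fj whenever the root of Fi is an ancestor of the root of Fj in S or in T, and call the forest acyclic if this graph has no directed cycle. Trees are nested tuples, the root leaf r is left out for brevity, and the function names are illustrative. The code does not verify that the input really is an agreement forest.

```python
def clusters(tree):
    """Return the leaf sets (clusters) of all vertices of a nested-tuple tree."""
    found = set()
    def walk(t):
        if isinstance(t, str):
            c = frozenset([t])
        else:
            c = frozenset().union(*(walk(child) for child in t))
        found.add(c)
        return c
    walk(tree)
    return found

def lca_cluster(tree_clusters, leaves):
    """Cluster of the last common ancestor of `leaves`."""
    leaves = frozenset(leaves)
    return min((c for c in tree_clusters if leaves <= c), key=len)

def is_acyclic(S, T, forest):
    """Check the assumed acyclicity condition for a list of leaf sets."""
    cs, ct = clusters(S), clusters(T)
    roots = [(lca_cluster(cs, L), lca_cluster(ct, L)) for L in forest]
    n = len(forest)
    # u is a proper ancestor of v exactly when cluster(u) strictly contains cluster(v).
    arcs = {i: [j for j in range(n) if j != i and
                (roots[i][0] > roots[j][0] or roots[i][1] > roots[j][1])]
            for i in range(n)}
    state = [0] * n                       # 0 = new, 1 = on stack, 2 = done
    def has_cycle(i):
        state[i] = 1
        for j in arcs[i]:
            if state[j] == 1 or (state[j] == 0 and has_cycle(j)):
                return True
        state[i] = 2
        return False
    return not any(state[i] == 0 and has_cycle(i) for i in range(n))

S = (("a", ("c", "d")), "b")
T = (("c", ("a", "b")), "d")
print(is_acyclic(S, T, [{"a", "b"}, {"c", "d"}]))
print(is_acyclic(S, T, [{"a"}, {"b"}, {"c", "d"}]))
```

In this toy pair the two-component forest is an agreement forest but contains a cycle, while the three-component forest is acyclic and so, by the theorem, gives the upper bound h(S,T) <= 2.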
Reducing the Size of the Instance

Subtree Reduction                   Chain Reduction
Fixed-Parameter Algorithms


Are the subtree and chain reductions enough to kernelize the
   problem?

Lemma. If n’ denotes the size of the leaf sets of the fully reduced
   trees using the subtree and chain reductions, then
                             n’ < 14h(S,T).

Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is
   fixed-parameter tractable.

Running time is O((28k)^k + p(n)) compared with O((2n)^k), where
   k=h(S,T) and p(n) is the polynomial bound for reducing the trees
   using the subtree and chain reductions.
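
To see what the two bounds mean numerically, the sketch below evaluates the exponential terms only. It ignores the hidden constants and the polynomial term p(n), and the function names and sample values are illustrative assumptions.

```python
def kernel_search_bound(k):
    """Exponential term (28k)^k of the kernelized search."""
    return (28 * k) ** k

def naive_search_bound(n, k):
    """Exponential term (2n)^k of the search on the unreduced trees."""
    return (2 * n) ** k

for n, k in [(1000, 3), (1000, 5), (40, 14)]:
    smaller = kernel_search_bound(k) < naive_search_bound(n, k)
    print(f"n={n}, k={k}: kernel term smaller? {smaller}")
```

Since (28k)^k < (2n)^k exactly when 14k < n, the kernelized term is the smaller one only when the trees are large relative to k. Its main appeal is that it does not depend on n at all.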
Reducing the Size of the Instance

Cluster Reduction (Baroni 2004)
[Figure: illustration of the cluster reduction.]
A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,
Düsseldorf)




o Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous
  plant hybrids)
o For each sequence, used fastDNAml to reconstruct a phylogenetic
  tree (H. Schmidt).
Chloroplast (phyB) sequences   Nuclear (ITS) sequences
[Figure: the phyB and ITS trees shown side by side, with parts of the trees annotated h=3, h=1, h=4 and h=0.]
pairwise combination   # overlapping taxa   h(S,T)        running time (2000 MHz CPU, 2GB RAM)
ndhF    phyB           40                   14            11h
ndhF    rbcL           36                   13            11.8h
ndhF    rpoC2          34                   12            26.3h
ndhF    waxy           19                   9             320s
ndhF    ITS            46                   at least 15
phyB    rbcL           21                   4             1s
phyB    rpoC2          21                   7             180s
phyB    waxy           14                   3             1s
phyB    ITS            30                   8             19s
rbcL    rpoC2          26                   13            29.5h
rbcL    waxy           12                   7             230s
rbcL    ITS            29                   at least 9
rpoC2   waxy           10                   1             1s
rpoC2   ITS            31                   at least 10
waxy    ITS            15                   8             620s
                                   Bordewich, Linz, St John, S, 2007
Computing dSPR(S,T) and h(S,T)


dSPR(S,T)
1.   FPT using kernelization and bounded search tree methods
     (O(4^k k^4 + p(n))).
2.   No cluster-based reduction.
3.   3-approximation algorithm.

h(S,T)
1.   FPT using kernelization only (O((28k)^k + p(n))). Unknown if a
     bounded search tree method exists.
2.   Cluster-based reduction.
3.   Unknown if there even is an approximation algorithm.
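
The gap between the two fixed-parameter bounds can be illustrated by comparing their exponential terms on a log scale. This is a rough sketch: the bounds belong to two different problems, constants and p(n) are ignored, and the function names are illustrative.

```python
import math

def log10_search_tree(k):
    """log10 of 4^k * k^4, the term in the dSPR bound."""
    return k * math.log10(4) + 4 * math.log10(k)

def log10_kernel_only(k):
    """log10 of (28k)^k, the term in the h(S,T) bound."""
    return k * math.log10(28 * k)

for k in (1, 4, 8, 14):
    print(k, round(log10_search_tree(k), 1), round(log10_kernel_only(k), 1))
```

The first term grows like a constant to the power k, the second like k to the power k, which is one way to see why a bounded search tree method for h(S,T) would be of interest.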
Reducing the Size of the Instance

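Subtree Reduction                            Chain Reduction
[Figure: the subtree reduction shown on trees S and T with leaves a1, a2, ..., an; the chain reduction shown on trees S’ and T’ with chain leaves x1, x2, x3.]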
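
The subtree reduction can be sketched in a few lines. The code below assumes the usual formulation, in which every maximal pendant subtree that occurs identically in both trees is replaced by a single new leaf. Trees are nested tuples, the new labels `x1`, `x2`, ... are assumed not to clash with existing leaf names, and the chain reduction is not covered. The function names are illustrative.

```python
def canonical(tree):
    """Turn a nested-tuple tree into nested frozensets (child order ignored)."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(t) for t in tree)

def subtrees(form):
    """Return all non-leaf pendant subtrees of a canonical tree."""
    if isinstance(form, str):
        return set()
    found = {form}
    for child in form:
        found |= subtrees(child)
    return found

def subtree_reduce(S, T):
    """Replace maximal common pendant subtrees of S and T by new leaves."""
    s, t = canonical(S), canonical(T)
    common = subtrees(s) & subtrees(t)
    labels = {}

    def shrink(form):
        if isinstance(form, str):
            return form
        if form in common:
            # Walking top-down, the first common subtree met is a maximal one.
            return labels.setdefault(form, f"x{len(labels) + 1}")
        return frozenset(shrink(child) for child in form)

    return shrink(s), shrink(t)

S = ((("a", "b"), "c"), ("d", "e"))
T = (((("a", "b"), "c"), "d"), "e")
reduced_S, reduced_T = subtree_reduce(S, T)
print(reduced_S)
print(reduced_T)
```

Here the common pendant subtree on a, b, c collapses to the single leaf x1 in both trees, leaving two smaller trees on the leaves x1, d, e.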

Approximating the subtree distance between phylogenies. C Semple

  • 1. Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies Charles Semple Biomathematics Research Centre Department of Mathematics and Statistics University of Canterbury New Zealand Joint work with Magnus Bordewich, Catherine McCartin e-Science Institute, Edinburgh 2007
  • 2. Subtree Prune and Regraft (SPR) Example. r r 1 SPR bc d a a bc d T1 S r 1 SPR a d b c T2
  • 3. Applications of SPR Used I. As a search tool for selecting the best tree in reconstruction algorithms. II. To quantify the dissimilarity between two phylogenetic trees. III. To provide a lower bound on the number of reticulation events in the case of non-tree-like evolution. For II and III, one wants the minimum number of SPR operations to transform one phylogeny into another. This number is the SPR distance between two phylogenies S and T.
  • 4. The Mathematical Problem MINIMUM SPR Instance: Two rooted binary phylogenetic trees S and T. Goal: Find a minimum length sequence of single SPR operations that transforms S into T. Measure: The length of the sequence. Notation: Use dSPR(S, T) to denote this minimum length. Theorem (Bordewich, S 2004) MINIMUM SPR is NP-hard. Overriding goal is to find (with no restrictions) the exact solution or a heuristic solution with a guarantee of closeness.
  • 5. Algorithms for NP-Hard Problems Fixed-parameter algorithms are a practical way to find optimal solutions if the parameter measuring the hardness of the problem is small. Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.
  • 6. Agreement Forests A forest of T is a disjoint Example. r collection of phylogenetic subtrees whose union of leaf sets is X U r. a cd e f b S r a cd e f b F1
  • 7. Agreement Forests A forest of T is a disjoint Example. r collection of phylogenetic subtrees whose union of leaf sets is X U r. a cd e f b S r a cd e f b F1
  • 8. Agreement Forests An agreement forest for S and T is a forest of both S and T. r r Example. a cd e f d fa b c b e S T r r a cd e f a cd e f b b F1 F2
  • 9. Agreement Forests An agreement forest for S and T is a forest of both S and T. r r Example. a cd e f d fa b c b e S T r r a cd e f a cd e f b b F1 F2
  • 10. Agreement Forests An agreement forest for S and T is a forest of both S and T. r r Example. e a cd f d fa b c b e S T r r a cd e f a cd e f b b F1 F2
  • 11. Theorem. (Bordewich, S, 2004) Let S and T be two binary phylogenetic trees. Then dSPR(S,T) = size of maximum-agreement forest - 1. o It’s fast to construct from a maximum-agreement forest for S and T a sequence of SPR operations that transforms S into T.
  • 12. Reducing the Size of the Instance Subtree Reduction Chain Reduction
  • 13. Fixed-Parameter Algorithms The underlying idea is to find an algorithm whose running time separates the size of the problem instance from the parameter of interest. One way to obtain such an algorithm is to reduce the size of the problem instance, while preserving the optimal value (kernalizing the problem). Are the subtree and chain reductions enough to kernalize the problem?
  • 14. Fixed-Parameter Algorithms Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then n’ < 28dSPR(S,T). Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter tractable. 1. Repeatedly apply the subtree and chain rules. 2. Exhaustively find a maximum-agreement forest by deleting edges in S and comparing with T. Running time is O((56k)k + p(n)) compared with O((2n)k), where k=dSPR(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.
  • 15. Fixed-Parameter Algorithms A second way to obtain a fixed-parameter algorithm is using the method of bounded search trees. Preliminaries o A rooted triple is a binary tree with three leaves. a bc ab|c o A tree S displays ab|c if the last common ancestor of ab is a descendant of the last common ancestor of ac.
  • 16. r Fixed-Parameter Algorithms r Observation. If F is an agreement d fa b c e forest for S and T, then F can T be obtained from S by deleting |F|-1 edges. a cd e f b Recall. If F is maximum, then S |F|-1=dSPR(S,T).
  • 17. r Fixed-Parameter Algorithms r Observation. If F is an agreement d fa b c e forest for S and T, then F can T be obtained from S by deleting |F|-1 edges. a cd e f b Recall. If F is maximum, then S |F|-1=dSPR(S,T).
  • 18. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T 2. Branch into 4 computational paths ex, ey, ez, er, and repeat for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components.
  • 19. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T 2. Branch into 4 computational paths ex, ey, ez, er, and repeat for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components.
  • 20. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T 2. Branch into 4 computational paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components.
  • 21. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T er 2. Branch into 4 computational paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components.
  • 22. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T er 2. Branch into 4 computational ed paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components.
  • 23. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T er 2. Branch into 4 computational ed paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components. ed ea eb er
  • 24. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T er 2. Branch into 4 computational ed paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components. ed ea eb er
  • 25. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e not in T. T 2. Branch into 4 computational paths ex, ey, ez, er, and repeat for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components. ed ea eb er
  • 26. r Fixed-Parameter Algorithms r 1. Find a minimal triple xy|z in S d fa b c e er not in T. T 2. Branch into 4 computational ee paths ex, ey, ez, er, and repeat ea eb for each path until no such triple. a cd e f b 3. For each path, find a pair of S overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components. ed ea eb er
  • 27-35. Fixed-Parameter Algorithms 1. Find a minimal triple xy|z in S not in T. 2. Branch into 4 computational paths ex, ey, ez, er, and repeat for each path until no such triple. 3. For each path, find a pair of overlapping components ts and tt and branch into 2 paths es, et. Repeat until no such components. [Figure, built up over slides 27-35: trees S and T on leaves a, b, c, d, e, f with root r; the labels ea, eb, ed, ee, er, es, et and vst are added as the steps proceed.]
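Step 1 above can be illustrated with a small sketch that lists the rooted triples of S that are not triples of T. The nested-tuple tree encoding, the function names, and the two example trees are assumptions made for illustration; they do not come from the slides. The sketch lists all conflicting triples and does not select a minimal one.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple rooted binary tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def rooted_triples(tree):
    """Return every rooted triple xy|z of the tree as (frozenset({x, y}), z).

    xy|z holds when the last common ancestor of x and y lies strictly
    below the last common ancestor of x, y and z.
    """
    if isinstance(tree, str):
        return set()
    left, right = tree
    triples = rooted_triples(left) | rooted_triples(right)
    # Triples whose last common ancestor is the current root:
    # x and y on one side, z on the other.
    for side, other in ((left, right), (right, left)):
        for x, y in combinations(sorted(leaves(side)), 2):
            for z in leaves(other):
                triples.add((frozenset((x, y)), z))
    return triples


# Hypothetical example trees on leaves a, b, c, d (not the trees in the figure).
S = ((("a", "b"), "c"), "d")
T = (("a", ("b", "c")), "d")

# Triples of S that are not triples of T: candidates for step 1.
conflicts = rooted_triples(S) - rooted_triples(T)
```

For these two trees the only conflicting triple is ab|c, since T groups b with c instead.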
  • 36. Fixed-Parameter Algorithms Key Lemma. dSPR(S,T) <= k if and only if one of the computational paths of length at most k results in an agreement forest for S and T. Theorem. (Bordewich, McCartin, S 2007) The bounded search tree algorithm has running time O(4^k n^4). Combining the methods of kernelization and bounded search gives an FPT algorithm with running time O(4^k k^4 + p(n)). [Figure: trees S and T on leaves a, b, c, d, e, f with root r and edge labels ea, eb, ed, er, es, et.]
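A rough count behind the 4^k factor, offered as a sketch and not as the proof from the cited paper: the search tree has depth at most k, and in the worst case (the 4-way branching on a triple) each node has at most 4 children. The per-node cost c n^4 is simply read off the stated theorem and is not derived here.

```latex
\[
  \#\text{nodes} \;\le\; \sum_{i=0}^{k} 4^{i} \;=\; \frac{4^{k+1}-1}{3} \;=\; O(4^{k}),
  \qquad
  \text{total time} \;\le\; O(4^{k}) \cdot c\,n^{4} \;=\; O(4^{k} n^{4}).
\]
```

If kernelization first shrinks the trees to n' leaves with n' bounded by a constant multiple of k (an assumption in this sketch), replacing n by n' in the search step gives the O(4^k k^4) term, and the reduction itself accounts for the additive p(n).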
  • 37. Approximation Algorithms Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution. An r-approximation algorithm A for MINIMUM SPR means that the size of an agreement forest (minus 1) for S and T returned by A is at most r times dSPR(S,T). Example. If r=3, then the size of any agreement forest (minus 1) for S and T returned by A is at most 3 times dSPR(S,T). Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta (2006) establish a 5-approximation algorithm for MINIMUM SPR that runs in O(n) time.
  • 38. Approximation Algorithms 1. Find a minimal triple xy|z in S not in T. 2. Delete each of ex, ey, ez, er, and repeat until no such triple. 3. Find a pair of overlapping components ts and tt and delete es, et. Repeat until no such components. Theorem. (Bordewich, McCartin, S 2007) This gives a 4-approximation algorithm for MINIMUM SPR (O(n^5)). Deleting only ex, ez, er gives a 3-approximation algorithm. [Figure: trees S and T on leaves a, b, c, d, e, f with root r and edge labels ea, eb, ed, er, es, et.]
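An informal accounting of where the factor 4 could come from; this is intuition only, not the proof from the cited paper. Let m_1 be the number of triple rounds (4 edges deleted each) and m_2 the number of overlapping-component rounds (2 edges deleted each), and let F be the forest returned. The last inequality is an assumption of this sketch: that each round forces at least one separate edge deletion in any optimal solution, which is what the branching in the fixed-parameter algorithm suggests.

```latex
\[
  |F| - 1
  \;\le\; 4\,m_1 + 2\,m_2
  \;\le\; 4\,(m_1 + m_2)
  \;\le\; 4\, d_{\mathrm{SPR}}(S,T).
\]
```

The first inequality holds because deleting j edges from a tree creates at most j + 1 components. The 3-approximation variant on the slide needs a finer argument that is not reproduced here.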
  • 39. [Figure: scale of approximation ratios r. r=1: exact algorithms. Maximum independent set in planar graphs (PTAS). MINIMUM SPR: 1 + 1/2112 (B, S 07), 3 (B, M, S 07), 5 (B, S, M, A 06). No r, unless P=NP: 1. Travelling salesman problem, 2. Maximum clique.]
  • 40. Modelling Hybridization Events with SPR Operations Reticulation processes cause species to be a composite of DNA regions derived from different ancestors. Processes include o horizontal gene transfer, o hybridization, and o recombination. "… molecular phylogeneticists will have failed to find the 'true tree', not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot be properly represented as a tree." Ford Doolittle, 1999 (Dalhousie University)
  • 41-44. Modelling Hybridization Events with SPR Operations A single SPR operation models a single hybridization event (Maddison 1997). [Figure, built up over slides 41-44: Example with trees S and T on leaves a, b, c, d with root r, a hybridization network H, and the forest F obtained by deleting hybrid edges.]
  • 45. A Fundamental Problem for Biologists Given an initial set of data that correctly represents the tree-like evolution of different parts of various species' genomes, what is the smallest number of reticulation events required that simultaneously explains the evolution of the species? This smallest number o Provides a lower bound on the number of such events. o Indicates the extent to which hybridization has influenced the evolutionary history of the species under consideration. Since the 1930s, botanists have asked the question: How significant has the effect of hybridization been on the New Zealand flora?
  • 46-48. Trees and Hybridization Networks H explains T if T can be obtained from a rooted subtree of H by suppressing degree-2 vertices. [Figure, built up over slides 46-48: Example with trees S and T on leaves a, b, c, d and two networks H1 and H2.]
  • 49. The Mathematical Problem MINIMUM HYBRIDIZATION Instance: Two rooted binary phylogenetic trees S and T. Goal: Find a hybridization network H that explains S and T, and minimizes the number of hybridization vertices. Measure: The number of hybridization vertices in H. Notation: Use h(S, T) to denote this minimum number.
  • 50. Example: Arbitrary SPR operations are not sufficient. [Figure: forest F1 on leaves a, b, c, d, e, f with root r.]
  • 51. o Use a sequence of SPR operations that avoids creating directed cycles to build a hybridization network that explains S and T. o If one minimizes the length of such an (acyclic) sequence, does the resulting network minimize the number of hybridization events needed to explain S and T? o YES, and such a sequence corresponds to an acyclic-agreement forest.
  • 52. Theorem. (Baroni, Grünewald, Moulton, S, 2005) Let S and T be two binary phylogenetic trees. Then h(S,T) = (size of a maximum-acyclic agreement forest for S and T) - 1. o It's fast to construct, from a maximum-acyclic agreement forest for S and T, a hybridization network that realizes h(S,T). Theorem. (Bordewich, S, 2007) MINIMUM HYBRIDIZATION is NP-hard.
  • 53. Reducing the Size of the Instance Subtree Reduction Chain Reduction
  • 54. Fixed-Parameter Algorithms Are the subtree and chain reductions enough to kernelize the problem? Lemma. If n' denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then n' < 14h(S,T). Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is fixed-parameter tractable. Running time is O((28k)^k + p(n)) compared with O((2n)^k), where k=h(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.
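The stated running time follows by substituting the kernel bound into the exhaustive bound, as the short derivation below sketches. It assumes the O((2n)^k) algorithm is run unchanged on the reduced trees with n' leaves, and that k = h(S,T) >= 1.

```latex
\[
  n' < 14k
  \;\Longrightarrow\;
  2n' < 28k
  \;\Longrightarrow\;
  (2n')^{k} < (28k)^{k},
\]
\[
  \text{total time} \;=\; \underbrace{p(n)}_{\text{reductions}}
  \;+\; \underbrace{O\big((2n')^{k}\big)}_{\text{search on the kernel}}
  \;=\; O\big((28k)^{k} + p(n)\big).
\]
```

The point of the comparison is that the exponential term no longer depends on n, only on the parameter k.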
  • 55. Reducing the Size of the Instance Cluster Reduction (Baroni 2004) +
  • 56. A Grass (Poaceae) Dataset (Grass Phylogeny Working Group, Düsseldorf) o Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous plant hybrids) o For each sequence, fastDNAml was used to reconstruct a phylogenetic tree (H. Schmidt).
  • 57-63. [Figure, built up over slides 57-63: trees for Chloroplast (phyB) sequences and Nuclear (ITS) sequences; the final slide adds the labels h=3, h=1, h=4, h=0.]
  • 64. Results for each pairwise combination (running times on a 2000 MHz CPU, 2GB RAM):
        pairwise combination   # overlapping taxa   h(S,T)        running time
        ndhF   phyB            40                   14            11h
        ndhF   rbcL            36                   13            11.8h
        ndhF   rpoC2           34                   12            26.3h
        ndhF   waxy            19                   9             320s
        ndhF   ITS             46                   at least 15
        phyB   rbcL            21                   4             1s
        phyB   rpoC2           21                   7             180s
        phyB   waxy            14                   3             1s
        phyB   ITS             30                   8             19s
        rbcL   rpoC2           26                   13            29.5h
        rbcL   waxy            12                   7             230s
        rbcL   ITS             29                   at least 9
        rpoC2  waxy            10                   1             1s
        rpoC2  ITS             31                   at least 10
        waxy   ITS             15                   8             620s
    Bordewich, Linz, St John, S, 2007
  • 65. Computing dSPR(S,T) and h(S,T) dSPR(S,T): 1. FPT using kernelization and bounded search tree methods (O(4^k k^4 + p(n))). 2. No cluster-based reduction. 3. 3-approximation algorithm. h(S,T): 1. FPT using kernelization only (O((28k)^k + p(n))). Unknown if a bounded search tree method exists. 2. Cluster-based reduction. 3. Unknown if there even is an approximation algorithm.
  • 66. Reducing the Size of the Instance Subtree Reduction Chain Reduction [Figure: chain reduction; a chain a1, a2, …, an common to S and T is replaced by x1, x2, x3 in S' and T'.]
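A minimal sketch of the subtree reduction, assuming its usual definition: each maximal pendant subtree with at least two leaves that occurs in both trees is replaced by a single new leaf. The nested-tuple encoding, the function names, the "+"-joined label for the new leaf, and the example trees are all assumptions made for illustration. The chain reduction is not covered.

```python
def canon(tree):
    """Order-independent form of a subtree: a leaf label or a frozenset of children."""
    if isinstance(tree, str):
        return tree
    return frozenset(canon(child) for child in tree)


def all_subtrees(tree):
    """Canonical forms of every pendant subtree, including single leaves."""
    forms = {canon(tree)}
    if not isinstance(tree, str):
        for child in tree:
            forms |= all_subtrees(child)
    return forms


def leaf_labels(tree):
    """Sorted list of the leaf labels below this node."""
    if isinstance(tree, str):
        return [tree]
    return sorted(label for child in tree for label in leaf_labels(child))


def reduce_tree(tree, common):
    """Replace each maximal common pendant subtree by one leaf (top-down)."""
    if isinstance(tree, str):
        return tree
    if canon(tree) in common:
        # Hypothetical naming scheme: join the labels of the collapsed leaves.
        return "+".join(leaf_labels(tree))
    return tuple(reduce_tree(child, common) for child in tree)


def subtree_reduction(s, t):
    """Apply the reduction to both trees using their shared pendant subtrees."""
    common = all_subtrees(s) & all_subtrees(t)
    return reduce_tree(s, common), reduce_tree(t, common)


# Hypothetical example: the cherry (a, b) is common to both trees.
S = ((("a", "b"), "c"), ("d", "e"))
T = ((("a", "b"), "d"), ("c", "e"))
reduced_s, reduced_t = subtree_reduction(S, T)
```

Working top-down is enough to find maximal common subtrees here, because every pendant subtree of a common subtree is itself common to both trees.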