Quantifying MCMC exploration of phylogenetic tree space
1. Quantifying MCMC exploration
of phylogenetic tree space
Christopher Whidden and Frederick “Erick” A. Matsen IV
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org
@ematsen
3. Phylogenetics helps us learn how HIV-1 came to be
Etienne, Hahn, Sharp, Matsen and Emerman, Cell Host &
Microbe, 2013
4. We are fond of statistical approaches to phylogenetics
These matter whenever one needs a clear notion of uncertainty
(as in medicine, epidemiology, and biodefense!)
5. We are fond of statistical approaches to phylogenetics
In particular, Bayesian methods fall into this category and have
become quite popular.
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
We can’t solve for this posterior distribution analytically, but a
big sample from it is enough for our needs.
6. Markov chain Monte Carlo (MCMC)
Metropolis et al., 1953.
Set up a simulation such that the amount of time spent in a given
state is proportional to the posterior probability of that state.
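As a rough illustration of that idea, here is a minimal Metropolis sampler on a toy three-state target. The function name `metropolis`, the example weights, and the uniform proposal are all assumptions made for this sketch; nothing in it is specific to phylogenetics.

```python
import random
from collections import Counter

def metropolis(weights, n_steps, seed=0):
    """Sample states 0..n-1 with probability proportional to `weights`.

    `weights` are unnormalized posterior values (a toy stand-in here).
    Returns the fraction of steps spent in each state.
    """
    rng = random.Random(seed)
    states = list(range(len(weights)))
    x = 0
    visits = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: pick any other state uniformly at random.
        y = rng.choice([s for s in states if s != x])
        # Accept with probability min(1, w(y) / w(x)); otherwise stay put.
        if rng.random() < min(1.0, weights[y] / weights[x]):
            x = y
        visits[x] += 1
    return [visits[s] / n_steps for s in states]

freqs = metropolis([1.0, 2.0, 7.0], n_steps=200_000, seed=1)
print(freqs)  # fractions of time per state; compare with 0.1, 0.2, 0.7
```

The normalizing constant of the posterior is never needed: only ratios of weights enter the acceptance step.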
7. Here we want a posterior on trees
If we want to use the same strategy to get a posterior on
phylogenetic trees...
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
we need a way to move from one phylogenetic tree to another.
9. The set of trees as a graph connected by SPR moves
(Figure from Mossel and Vigoda, Science, 2005).
10. This graph is connected, and every tree has nonzero
posterior probability, so MCMC works†
We are guaranteed to converge to the posterior distribution on
trees by using Metropolis-Hastings moves built on these SPRs.
That is, by bouncing around “tree space” we can get a good idea
of a set of good trees.
† That is, it works if we run the MCMC forever.
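To make the "Metropolis-Hastings moves on a graph" idea concrete, here is a small sketch of a sampler that walks a graph by proposing a uniformly chosen neighbor. The three-node path graph, the names `mh_on_graph`, `path`, and `flat`, and the flat target are hypothetical stand-ins for tree space and the tree posterior. The point it illustrates is the Hastings correction: when nodes have different numbers of neighbors, the proposal is not symmetric and the acceptance ratio has to account for that.

```python
import random
from collections import Counter

def mh_on_graph(neighbors, weight, start, n_steps, seed=0):
    """Metropolis-Hastings on a graph with uniform-neighbor proposals.

    neighbors: dict mapping each node to a list of adjacent nodes.
    weight:    dict mapping each node to its unnormalized target value.
    Returns the fraction of steps spent at each node.
    """
    rng = random.Random(seed)
    x = start
    visits = Counter()
    for _ in range(n_steps):
        y = rng.choice(neighbors[x])
        # q(x -> y) = 1 / deg(x) and q(y -> x) = 1 / deg(y), so the
        # Hastings ratio is w(y) * deg(x) / (w(x) * deg(y)).
        ratio = (weight[y] * len(neighbors[x])) / (weight[x] * len(neighbors[y]))
        if rng.random() < min(1.0, ratio):
            x = y
        visits[x] += 1
    return {node: visits[node] / n_steps for node in neighbors}

# Toy example: a path A - B - C with a flat target. A plain random walk
# would spend half its time at B; the corrected chain should not.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
flat = {"A": 1.0, "B": 1.0, "C": 1.0}
print(mh_on_graph(path, flat, "A", n_steps=200_000, seed=1))
```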
11. We can’t run it forever.
News flash:
5 million < ∞
13. We wanted to know: does this happen in real data sets?
Lots of discussion in literature, but few clear conclusions.
In order to understand the reasons differentiating “easy” and
“difficult” data sets for phylogenetic MCMC, we wanted to make it
possible to visualize tree space with a relevant geometry.
So, what trees are close to each other in terms of SPR moves?
14. dSPR : how many SPR moves from one tree to another?
Say $T_1 \to T_2$ if there is an SPR transformation of $T_1$ to $T_2$. Then
$$d_{\mathrm{SPR}}(T, S) = \min_{T \to T_1 \to \cdots \to T_k = S} k.$$
This distance is NP-hard to compute. That’s no fun!
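For very small trees the definition can be evaluated directly by breadth-first search over SPR neighbors. The sketch below does that for rooted binary trees, which is an assumption of this example (as are all the function names); it is a brute-force illustration of the definition and not the fast exact algorithm referred to in this talk. It only scales to a handful of leaves.

```python
from collections import deque

def tree(x):
    """Build a rooted binary tree from nested pairs; leaves are distinct strings."""
    return x if isinstance(x, str) else frozenset(tree(c) for c in x)

def subtrees(t):
    """Yield t and every subtree below it."""
    yield t
    if not isinstance(t, str):
        for child in t:
            yield from subtrees(child)

def prune(t, s):
    """Remove subtree s from t, suppressing the node left with one child."""
    if isinstance(t, str):
        return t
    a, b = tuple(t)
    if a == s:
        return b
    if b == s:
        return a
    return frozenset({prune(a, s), prune(b, s)})

def regrafts(t, s):
    """Yield every tree obtained by attaching s to an edge of t (root edge included)."""
    yield frozenset({t, s})
    if not isinstance(t, str):
        a, b = tuple(t)
        for a2 in regrafts(a, s):
            yield frozenset({a2, b})
        for b2 in regrafts(b, s):
            yield frozenset({a, b2})

def spr_neighbors(t):
    """All trees one rooted SPR move away from t."""
    out = set()
    for s in subtrees(t):
        if s == t:
            continue  # the whole tree cannot be pruned
        rest = prune(t, s)
        out.update(u for u in regrafts(rest, s) if u != t)
    return out

def spr_distance(t, s):
    """Minimum number of SPR moves from t to s, by breadth-first search."""
    seen, queue = {t}, deque([(t, 0)])
    while queue:
        cur, d = queue.popleft()
        if cur == s:
            return d
        for nxt in spr_neighbors(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    raise ValueError("trees are on different leaf sets")

T = tree((("a", "b"), ("c", "d")))
S = tree((("a", "c"), ("b", "d")))
print(spr_distance(T, S))
```

Trees are stored as nested frozensets so that two drawings of the same rooted topology compare equal, which is what lets the search recognize trees it has already visited.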
15. Meet Chris Whidden, algorithms strongman
In a series of four very technical papers, Chris took exact
computation of dSPR from O(infeasible) to O(feasible).
Then he joined my group!
16. Let’s take some common data sets and see what we see
These are completely standard data sets of the sort that biologists
analyze every day: slowly evolving nuclear, mitochondrial, or
chloroplast genes.
Also used as examples in:
Lakner et al., Syst. Biol., 2008
Höhna and Drummond, Syst. Biol., 2012
Larget, Syst. Biol., 2013
18. Summarize by subsetting to high probability nodes
Node size is proportional to posterior probability, and color
shows distance to the highest posterior probability (PP) tree.
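One simple way to do this kind of subsetting is to estimate each topology's posterior probability by its sample frequency and keep the most frequent topologies until a chosen share of the posterior is covered. The sketch below assumes the MCMC output has already been reduced to a list of hashable topology labels; the function name `credible_set` and the toy labels are hypothetical.

```python
from collections import Counter

def credible_set(samples, mass=0.95):
    """Return the most frequent topologies covering at least `mass` of the samples.

    samples: list of hashable topology identifiers, one per MCMC sample.
    Returns a list of (topology, estimated posterior probability) pairs,
    ordered from most to least probable.
    """
    counts = Counter(samples)
    total = len(samples)
    kept, covered = [], 0
    for topology, count in counts.most_common():
        if covered >= mass * total:
            break
        kept.append((topology, count / total))
        covered += count
    return kept

# Toy sample: four topologies with frequencies 60%, 30%, 6%, 4%.
samples = ["A"] * 60 + ["B"] * 30 + ["C"] * 6 + ["D"] * 4
print(credible_set(samples, mass=0.95))
```

The first entry of the result is the highest PP tree, which can then serve as the reference point for coloring by distance.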
20. The top 4096 trees for a data set
What's up with this stuff?
Is it important? Is it difficult
for the MCMC to see?
21. Commute time definition
Commute time for a node y : how long to make the round trip
from y to the highest posterior probability tree and back?
Any round trip path counts!
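When the transition probabilities of a small chain are known, the expected commute time can be computed from expected hitting times, which satisfy h(target) = 0 and h(i) = 1 + sum_j P[i][j] h(j) for every other state. The sketch below solves those equations by simple fixed-point iteration. The three-state matrix `P` is a made-up example of two "peaks" joined through a rarely visited middle state, and this textbook calculation is an assumption of the example; for real tree posteriors the commute time would have to be estimated from the sampled chain instead.

```python
def hitting_times(P, target, n_iter=10_000):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Iterates h <- 1 + P h with h[target] held at 0.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(n_iter):
        h = [
            0.0 if i == target else 1.0 + sum(P[i][j] * h[j] for j in range(n))
            for i in range(n)
        ]
    return h

def commute_time(P, x, y):
    """Expected steps for the round trip y -> x -> y."""
    return hitting_times(P, x)[y] + hitting_times(P, y)[x]

# States 0 and 2 are sticky "peaks"; state 1 is the valley between them.
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.1, 0.9],
]
print(commute_time(P, 0, 2))
```

Because the expectation averages over every possible route, any round trip path counts, matching the definition above.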
24. The separation is problematic indeed
Yep, those parts of the posterior
are important and MCMC has
trouble entering them.
25. Trees with 95% of posterior probability for another data set
26. We can use our methods to identify source of bottlenecks
[Figure: two trees drawn on the same 27 taxa: Hyla_cinerea,
Bufo_valliceps, Nesomantis_thomasseti, Hypogeophis_rostratus,
Eleutherodactylus_cuneatus, Grandisonia_alternans,
Gastrophryne_carolinensis, Amphiuma_tridactylum, Ichthyophis_bannanicus,
Ambystoma_mexicanum, Siren_intermedia, Typhlonectes_natans,
Plethodon_yonhalossee, Discoglossus_pictus, Scaphiopus_holbrooki,
Xenopus_laevis, Homo_sapiens, Mus_musculus, Rattus_norvegicus,
Oryctolagus_cuniculus, Turdus_migratorius, Gallus_gallus,
Heterodon_platyrhinos, Sceloporus_undulatus, Alligator_mississippiensis,
Trachemys_scripta, Latimeria_chalumnae.]
These are the trees at the two peaks of the connected components.
Indeed, it’s very tricky to get between them!
29. Our applications: it’s party time
Automatic identification of (multiple) peaks in posteriors
Performance of Metropolis-coupled Markov chain Monte Carlo
for getting between peaks
Accuracy of new “mean-field” posterior probability
approximations
The first topological convergence diagnostic
These empirical investigations set the stage for additional
theoretical development, and suggest new ways to move around
tree space.
This will translate into better phylogenetic uncertainty estimates,
and hence better preparedness and response to biological threats.
30. Thank you
Robert Beiko (Dalhousie University)
Aaron Darling (University of Technology, Sydney)
Connor McCoy (Fred Hutchinson Cancer Research Center)
NSF award 1223057