Quantifying MCMC exploration
of phylogenetic tree space
Christopher Whidden and Frederick “Erick” A. Matsen IV
Fred Hutchinson Cancer Research Center

http://matsen.fhcrc.org

@ematsen
Phylogenetics: reconstruct evolutionary history from DNA

[Figure: DNA or RNA sequence data goes in; "phylogenetics" produces a tree relating armadillo, human, rat, and giraffe.]
Phylogenetics helps us learn how HIV-1 came to be

Etienne, Hahn, Sharp, Matsen and Emerman, Cell Host &
Microbe, 2013
We are fond of statistical approaches to phylogenetics

These are important when one would like a clear notion of
uncertainty (as in medicine, epidemiology, and biodefense!)
In particular, Bayesian methods fall into this category and have
become quite popular.
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...

...

We can’t solve for this posterior distribution, but we can satisfy
our needs by getting a big sample from it.
Markov chain Monte Carlo (MCMC)

Metropolis et al., 1953.

Set up a simulation such that the amount of time spent in a given
state is proportional to the posterior probability of that state.
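
As a minimal sketch of that idea (not the sampler used in this work), the following Metropolis chain walks over four hypothetical states arranged in a ring. The state names, weights, and step count are illustrative assumptions. The fraction of time spent in each state should approach its normalized weight.

```python
import random

def metropolis_visits(weights, neighbors, steps, seed=0):
    """Run a Metropolis chain; return the fraction of steps spent in each state.

    weights:   unnormalized posterior weight of each state (toy values).
    neighbors: state -> list of neighboring states. Every state has the same
               number of neighbors here, so the proposal is symmetric.
    """
    rng = random.Random(seed)
    state = next(iter(weights))
    visits = {s: 0 for s in weights}
    for _ in range(steps):
        proposal = rng.choice(neighbors[state])
        # Accept with probability min(1, weight ratio); otherwise stay put.
        if rng.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal
        visits[state] += 1
    return {s: v / steps for s, v in visits.items()}

# Hypothetical target: four states on a ring with weights 1:2:3:4.
weights = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
neighbors = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
freqs = metropolis_visits(weights, neighbors, steps=200_000)

total = sum(weights.values())
for s in weights:
    print(s, round(freqs[s], 3), "target", weights[s] / total)
```

Only ratios of weights are needed, which is why the normalizing constant of the posterior never has to be computed.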
Here we want a posterior on trees
If we want to use the same strategy to get a posterior on
phylogenetic trees. . .
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...

...

we need a way to move from one phylogenetic tree to another.
Subtree-prune-regraft (SPR) definition

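[Figure: an SPR move on a tree with leaves 1 to 6. The subtree containing leaves 2 and 3 is pruned and regrafted elsewhere, changing the leaf order from 1 2 3 4 5 6 to 1 4 5 2 3 6.]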
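
An SPR move can be sketched in a few lines of code. The version below assumes rooted binary trees written as nested tuples; the slides do not say whether rooted or unrooted trees are meant, so that choice, the helper names, and the six-leaf example tree are all assumptions made for illustration.

```python
def canon(t):
    """Canonical form of a tree: put the two children of each node in a fixed order."""
    if not isinstance(t, tuple):
        return t
    a, b = canon(t[0]), canon(t[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)

def subtrees(t, path=()):
    """Yield (path, subtree) for every node of t, the root included."""
    yield path, t
    if isinstance(t, tuple):
        yield from subtrees(t[0], path + (0,))
        yield from subtrees(t[1], path + (1,))

def prune(t, path):
    """Cut off the subtree at a non-empty path and suppress its parent node."""
    if len(path) == 1:
        return t[1 - path[0]]  # the sibling takes the parent's place
    kids = list(t)
    kids[path[0]] = prune(t[path[0]], path[1:])
    return tuple(kids)

def graft(t, path, sub):
    """Attach sub on the edge above the node at path (empty path = root edge)."""
    if not path:
        return (t, sub)
    kids = list(t)
    kids[path[0]] = graft(t[path[0]], path[1:], sub)
    return tuple(kids)

def spr_neighbors(t):
    """All distinct trees reachable from t by one prune-and-regraft."""
    t = canon(t)
    out = set()
    for path, sub in subtrees(t):
        if not path:
            continue  # the whole tree cannot be pruned
        rest = prune(t, path)
        for rpath, _ in subtrees(rest):
            out.add(canon(graft(rest, rpath, sub)))
    out.discard(t)  # regrafting in the original place is not a move
    return out

# Hypothetical example: prune the subtree (2, 3) and regraft it next to leaf 6.
start = ((1, (2, 3)), ((4, 5), 6))
moved = (1, ((4, 5), ((2, 3), 6)))
print(canon(moved) in spr_neighbors(start))
print(len(spr_neighbors(start)), "one-move neighbors of the start tree")
```

Canonicalizing each tree before storing it is what lets the set treat two drawings of the same tree as one node of the graph.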
The set of trees as a graph connected by SPR moves
(Figure from Mossel and Vigoda, Science, 2005).
This graph is connected, and every tree has nonzero
posterior probability, so MCMC works†

We are guaranteed to converge to the posterior distribution on
trees by using Metropolis-Hastings moves built on these SPRs.
That is, by bouncing around “tree space” we can get a good idea
of a set of good trees.

† That is, it works if we run the MCMC forever.
We can’t run it forever.

News flash:
5 million < ∞
With pathological data, it can be hard to move between peaks
[Figure: landscape with multiple peaks; axis labeled "goodness".]
We wanted to know: does this happen in real data sets?

Lots of discussion in literature, but few clear conclusions.

In order to understand the reasons differentiating “easy” and
“difficult” data sets for phylogenetic MCMC, we wanted to make it
possible to visualize tree space with a relevant geometry.

So, what trees are close to each other in terms of SPR moves?
dSPR : how many SPR moves from one tree to another?
Say $T_1 \sim T_2$ if there is an SPR transformation of $T_1$ to $T_2$.

$$d_{\mathrm{SPR}}(T, S) = \min_{T \sim T_1 \sim \cdots \sim T_k = S} k$$

This distance is NP-hard to compute. That’s no fun!
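
The definition itself is just a shortest-path length in the SPR graph, which a breadth-first search computes directly. The sketch below is a brute-force illustration on a hypothetical five-node graph, not the exact algorithms discussed on the next slide. For real trees the `neighbors` function would have to enumerate all one-move SPR neighbors, and the number of trees explored grows far too quickly for this to be practical.

```python
from collections import deque

def shortest_path_length(start, goal, neighbors):
    """Minimum number of moves from start to goal, or None if unreachable.

    neighbors: function mapping a node to an iterable of adjacent nodes.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbors(node):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical stand-in for a tiny piece of the SPR graph:
# T - A - B - S is a three-move route, T - C - S a two-move route.
toy_graph = {
    "T": ["A", "C"],
    "A": ["T", "B"],
    "B": ["A", "S"],
    "C": ["T", "S"],
    "S": ["B", "C"],
}
dist = shortest_path_length("T", "S", toy_graph.__getitem__)
print(dist)  # 2
```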
Meet Chris Whidden, algorithms strongman

In a series of four very technical papers, Chris took exact
computation of dSPR from O(infeasible) to O(feasible).
Then he joined my group!
Let’s take some common data sets and see what we see

These are completely standard data sets of the sort that biologists
analyze every day: slowly evolving nuclear, mitochondrial, or
chloroplast genes.

Also used as examples in:
Lakner et al., Syst. Biol., 2008
Höhna and Drummond, Syst. Biol., 2012
Larget, Syst. Biol., 2013
Interested in high probability subsets of the SPR graph
Summarize by subsetting to high probability nodes

node size proportional to
posterior probability, and
color shows distance to
the highest PP tree.
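
One plausible way to pick the high-probability nodes, sketched here under the assumption that posterior probabilities are estimated from sample frequencies, is to count how often each topology appears in the MCMC sample and keep either the top k trees or the smallest set covering a given share of the posterior. The function names and the toy sample are hypothetical.

```python
from collections import Counter

def posterior_estimates(samples):
    """Estimate each tree's posterior probability by its sample frequency."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {tree: c / total for tree, c in counts.most_common()}

def top_k(post, k):
    """The k trees with the highest estimated posterior probability."""
    return sorted(post, key=post.get, reverse=True)[:k]

def credible_set(post, mass=0.95):
    """Smallest prefix of the probability-sorted trees whose total reaches mass."""
    chosen, running = [], 0.0
    for tree in sorted(post, key=post.get, reverse=True):
        if running >= mass:
            break
        chosen.append(tree)
        running += post[tree]
    return chosen

# Toy sample: tree labels stand in for sampled topologies.
samples = ["t1"] * 60 + ["t2"] * 25 + ["t3"] * 12 + ["t4"] * 3
post = posterior_estimates(samples)
print(top_k(post, 2))
print(credible_set(post, 0.95))
```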
The top 4096 trees for a data set

What's up with this stuff?
Is it important? Is it difficult
for the MCMC to see?
Commute time definition
Commute time for a node y : how long to make the round trip
from y to the highest posterior probability tree and back?

Any round trip path counts!
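
For a small chain with a known transition matrix, the expected commute time can be computed exactly as the sum of two expected hitting times. The sketch below does this for a hypothetical three-state Metropolis chain on a path (peak, valley, peak). The slides do not say how commute times were obtained for real posteriors, so the exact linear-algebra approach, the weights, and the function names are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def hitting_time(P, start, target):
    """Expected number of steps for chain P to first reach target from start."""
    if start == target:
        return 0.0
    others = [s for s in range(len(P)) if s != target]
    # h = 1 + P_restricted h, i.e. (I - P_restricted) h = 1
    a = [[(1.0 if i == j else 0.0) - P[i][j] for j in others] for i in others]
    h = solve(a, [1.0] * len(others))
    return h[others.index(start)]

def commute_time(P, y, x):
    """Expected round trip: y to x, then back to y."""
    return hitting_time(P, y, x) + hitting_time(P, x, y)

def metropolis_matrix(weights, edges):
    """Metropolis transition matrix: propose each neighbor with probability 1/dmax."""
    n = len(weights)
    nbrs = {i: set() for i in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    dmax = max(len(v) for v in nbrs.values())
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            P[i][j] = min(1.0, weights[j] / weights[i]) / dmax
        P[i][i] = 1.0 - sum(P[i])  # rejected or unused proposals stay put
    return P

edges = [(0, 1), (1, 2)]
easy = metropolis_matrix([4.0, 2.0, 4.0], edges)    # shallow valley
hard = metropolis_matrix([4.0, 0.04, 4.0], edges)   # deep valley
commute_easy = commute_time(easy, 0, 2)
commute_hard = commute_time(hard, 0, 2)
print(commute_easy, commute_hard)
```

Deepening the valley between the two peaks leaves the graph unchanged but makes the round trip far longer, which is the kind of bottleneck a commute time is meant to expose.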
Commute time plot for this data set
The separation is problematic indeed

Yep, those parts of the posterior
are important and MCMC has
trouble entering them.
Trees with 95% of posterior probability for another data set
We can use our methods to identify the source of bottlenecks
[Figure: two trees on the same 27 taxa, shown side by side. Taxa: Hyla_cinerea, Bufo_valliceps, Nesomantis_thomasseti, Hypogeophis_rostratus, Eleutherodactylus_cuneatus, Grandisonia_alternans, Gastrophryne_carolinensis, Amphiuma_tridactylum, Ichthyophis_bannanicus, Ambystoma_mexicanum, Siren_intermedia, Typhlonectes_natans, Plethodon_yonhalossee, Discoglossus_pictus, Scaphiopus_holbrooki, Xenopus_laevis, Homo_sapiens, Mus_musculus, Rattus_norvegicus, Oryctolagus_cuniculus, Turdus_migratorius, Gallus_gallus, Heterodon_platyrhinos, Sceloporus_undulatus, Alligator_mississippiensis, Trachemys_scripta, Latimeria_chalumnae.]


These are the trees at the two peaks of the connected components.
Indeed, it’s very tricky to get between them!
Multidimensional scaling visualizations via dSPR
In general, a new way to explore tree space
Our applications: it’s party time
Automatic identification of (multiple) peaks in posteriors
Performance of Metropolis-coupled Markov chain Monte Carlo for getting between peaks
Accuracy of new “mean-field” posterior probability approximations
The first topological convergence diagnostic

These empirical investigations set the stage for additional
theoretical development, and suggest new ways to move around
tree space.

This will translate into better phylogenetic uncertainty estimates,
and hence better preparedness and response to biological threats.
Thank you

Robert Beiko (Dalhousie University)
Aaron Darling (University of Technology, Sydney)
Connor McCoy (Fred Hutchinson Cancer Research Center)
NSF award 1223057
