A probabilistic parsimonious model for species tree reconstruction

Leonardo de Oliveira Martins (leomrtns@uvigo.es)
David Posada (dposada@uvigo.es)

with invaluable help from Klaus Schliep and Diego Mallo
What do we want
● To estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Fast computation ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
Outline
Model of gene family evolution
Parsimonious estimation of disagreement
* reconciliation
* distance between trees
Hierarchical Bayesian model
Examples
* comparing many trees
* simulation
* TreeFam data set
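Model for the evolution of gene families
[Diagram: one species tree S shared by gene families 1..n; each family i has a gene tree Gi paired with Di (G1/D1, G2/D2, ..., Gn/Dn)]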
Model for the evolution of gene families
[Diagram: species tree S, gene tree G1 and D1; the link between S and G1 is labelled P(G | S)]
Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees
● we may use several such simple explanations
● work with unrooted gene trees
● penalize gene trees very different from the species tree
● distance between G and S
ML supertrees: Rodrigo and Steel. 2008. SystBiol 57: 243
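One way to read "penalize gene trees very different from the species tree" is as an exponential penalty on the distance, in the spirit of the ML supertree reference given on the slide. The formula below is a sketch under that assumption, not the authors' stated model: the rate λ, the distance function d and the normalizer Z are notation introduced here for illustration (the slides themselves only show the conditional probability of G given S and, later, a normalization constant Z(.)).

```latex
P(G \mid S, \lambda) = \frac{\exp\{-\lambda\, d(G, S)\}}{Z(\lambda, S)},
\qquad
Z(\lambda, S) = \sum_{G'} \exp\{-\lambda\, d(G', S)\}
```

Under this reading, using "several such simple explanations" would amount to replacing the exponent by a sum over distances, $-\sum_k \lambda_k\, d_k(G, S)$, with one hypothetical rate $\lambda_k$ per distance $d_k$.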
Quantifying the disagreement
[Figure: a gene tree, a species tree and their reconciliation]
● assuming deep coalescence (deepcoal): 1 deepcoal
● assuming duplications and losses (duplosses): 1 dup, 3 losses
● assuming HGT: 1 event
● Stochastic error/nonparametric
Quantifying the disagreement – other measures
[Four figure slides; only the citations survive]
● mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
● de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
● see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
● Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
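As an illustration of the simplest of these measures, the Robinson-Foulds (RF) distance counts the bipartitions (splits) found in one unrooted tree but not the other. The sketch below is a minimal single-copy version, not the mul-tree variant cited above and not the authors' code; the nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def leaf_sets(tree, out):
    """Collect the leaf set below every internal node; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given with an arbitrary rooting."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed taxon used to pick one canonical side per split
    result = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        # keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon trees, written as nested tuples of leaf names.
species = ((("A", "B"), "C"), ("D", "E"))
gene = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(species, gene))
```

Because splits are canonicalized against a fixed taxon, the result does not depend on where the nested tuples happen to be rooted, which matches the slides' choice to work with unrooted gene trees.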
Now we have estimates for these
● assuming deepcoal: 1 deepcoal → Gene tree parsimony
● assuming duplosses: 1 dup, 3 losses → Gene tree parsimony
● assuming HGT: 1 event → (approximate) dSPR
● Stochastic error/nonparametric → RF, Hdist
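For the duplication part of gene tree parsimony, the standard approach maps each gene tree node to the last common ancestor (LCA) in the species tree of the species found below it; a node counts as a duplication when it maps to the same species node as one of its children. The sketch below assumes rooted trees written as nested tuples, with gene tree leaves labelled by species name. The example trees and function names are hypothetical and are not the ones drawn on the slide, and losses are not counted.

```python
def clusters(tree, out):
    """Append the leaf set (cluster) of every node of the species tree to out."""
    if isinstance(tree, str):
        cluster = frozenset([tree])
    else:
        cluster = frozenset().union(*(clusters(child, out) for child in tree))
    out.append(cluster)
    return cluster


def lca(species_clusters, species):
    """Smallest species-tree cluster containing all the given species."""
    return min((c for c in species_clusters if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes under the LCA mapping of gene_tree onto species_tree."""
    species_clusters = []
    clusters(species_tree, species_clusters)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([node])
        child_sets = [visit(child) for child in node]
        own = frozenset().union(*child_sets)
        mapped = lca(species_clusters, own)
        # duplication: the node maps to the same place as at least one child
        if any(lca(species_clusters, cs) == mapped for cs in child_sets):
            duplications += 1
        return own

    visit(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
gene_tree = (("A", "B"), ("A", "C"))  # two copies from species A
print(count_duplications(gene_tree, species_tree))
```

A gene tree congruent with the species tree gives zero duplications, so the count behaves like a distance-style penalty that grows with disagreement.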
Considering several measures of disagreement:
● Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors
● Easy to include other distances in the future
Problem: the normalization constant
Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420 – 426
Solution: importance sampling estimate of Z(.)
E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
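To make the importance sampling idea concrete, the sketch below estimates a normalization constant Z = Σ exp(-λ·d(x, ref)) by drawing from a proposal q and averaging exp(-λ·d)/q. It is a toy under loud assumptions: binary strings stand in for tree topologies and Hamming distance stands in for a tree distance, chosen only because the exact Z is then known in closed form for comparison. None of this is the authors' implementation, and all names and parameter values are illustrative.

```python
import math
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_log_z(n_sites, lam):
    """Closed-form log Z for the toy space: Z = (1 + exp(-lam)) ** n_sites."""
    return n_sites * math.log(1.0 + math.exp(-lam))


def importance_sampling_z(reference, lam, flip_prob, n_samples, rng):
    """Estimate Z by sampling strings near the reference and reweighting them."""
    n = len(reference)
    total = 0.0
    for _ in range(n_samples):
        # proposal: flip each site of the reference independently with flip_prob
        sample = [bit ^ (rng.random() < flip_prob) for bit in reference]
        d = hamming(sample, reference)
        log_q = d * math.log(flip_prob) + (n - d) * math.log(1.0 - flip_prob)
        # importance weight: unnormalized target divided by proposal density
        total += math.exp(-lam * d - log_q)
    return total / n_samples


rng = random.Random(1)
reference = [0] * 10
estimate = importance_sampling_z(reference, lam=0.5, flip_prob=0.3,
                                 n_samples=5000, rng=rng)
print(estimate, math.exp(exact_log_z(10, 0.5)))
```

The proposal concentrates samples close to the reference, where the weights exp(-λ·d) are largest. For real tree spaces the exact sum is unavailable, which is why an estimate of this kind is needed at all.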
Distribution of gene trees: probabilistic model
[Diagram, shown over several builds: the species tree S is shared by all gene families. Each gene family i = 1..n has nodes Gi, Di and Qi, plus per-family parameters λdup_i, λloss_i and λspr_i. These parameters are drawn next to the shared nodes λdupprior, λlossprior and λsprprior. Later builds cover the Di and Qi nodes with an "Importance Sampling" box and mark parts of the diagram as "Input" and "Output".]
● Importance sampling: so we can use complex, state-of-the-art software for phylogenetic inference
● We should not rely on single estimates of gene phylogenies
E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome research 23: 323-330.
Example: distances between gene families
● 567 single-copy gene trees for 23 species
  Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331
● Analysis under a model where only RF, Hdist and dSPR are considered
● Not interested in the data set per se (unreliable)
● Use it just as a didactic tool to show how the model works
[Figures: panels labelled RF, Hdist and SPR; later builds add "Posterior samples" and the "best estimate"]
Analysis of simulated data sets
● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada
● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
We use gene trees only, and simulate tree inference error
Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
Analysis of simulated data sets – results
[Three slides of result figures, not recovered]
Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
● (TreeFam database has 14250 informative gene families)
Estimated species tree: [figure not recovered]
● Root location uncertain
● Only one unrooted topology
Large gene families from Drosophila (TreeFam)
● 43 gene families with 102–295 tips
best species tree: ~100% [figure not recovered]
To recap, our model can
● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
  The larger, the better – especially for rooting the species tree
● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
  Do not assume gene trees are known – embrace ignorance!
● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
  Different gene families may be the product of distinct processes
● Be fast ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
  It's parallelized, and all distances can be calculated very fast.
Check out http://darwin.uvigo.es for announcements, code, slides...

Thank you!

A probabilistic parsimonious model for species tree reconstruction

  • 1. A probabilistic parsimonious model for species tree reconstruction Leonardo de Oliveira Martins David Posada ● leomrtns@uvigo.es ● dposada@uvigo.es with invaluable help from Klaus Schliep and Diego Mallo
  • 2. What do we want ● To estimate species trees given arbitrary gene families ← can contain paralogous, missing data, etc. To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or ● maybe we don't have signal at all ● To allow for several sources of disagreement ← real data seldomly can be explained by just one biological phenomenon ● Fast computation ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
  • 3. Outline Model of gene family evolution Parsimonious estimation of disagreement * reconciliation * distance between trees Hierarchical Bayesian model Examples * comparing many trees * simulation * TreeFam data set
  • 4. Model for the evolution of gene families S G1 D1 G2 D2 Gn Dn . . .
  • 5. Model for the evolution of gene families S G1 D1 We just need to consider the simplest explanation for the P(G/S) Our assumption: difference between the gene and species trees we may use several such simple explanations ● distance between G and S
  • 6. Model for the evolution of gene families S G1 D1 We just need to consider the simplest explanation for the difference between the gene and species trees P(G/S) Our assumption: Rodrigo and Steel. 2008. SystBiol 57: 243 ML supertrees we may use several such simple explanations ● work with unrooted gene trees ● penalize gene trees very different from species tree ● distance between G and S
  • 7. Outline Model of gene family evolution Parsimonious estimation of disagreement * reconciliation * distance between trees Hierarchical Bayesian model Examples * comparing many trees * simulation * TreeFam data set
  • 8. Quantifying the disagreement assuming deepcoal: gene tree species tree reconciliation 1 deepcoal assuming duplosses: 1 dup 3 losses assuming HGT: 1 event
  • 9. Quantifying the disagreement assuming deepcoal: gene tree species tree reconciliation 1 deepcoal assuming duplosses: 1 dup 3 losses assuming HGT: 1 event Stochastic error/nonparametric
  • 10. Outline Model of gene family evolution Parsimonious estimation of disagreement * reconciliation * distance between trees Hierarchical Bayesian model Examples * comparing many trees * simulation * TreeFam data set
  • 11. Quantifying the disagreement – other measures mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
  • 12. Quantifying the disagreement – other measures de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
  • 13. Quantifying the disagreement – other measures see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
  • 14. Quantifying the disagreement – other measures Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
  • 15. Now we have estimates for these assuming deepcoal: 1 deepcoal assuming duplosses: 1 dup 3 losses assuming HGT: 1 event Stochastic error/nonparametric
  • 16. Now we have estimates for these assuming deepcoal: Gene tree parsimony 1 deepcoal assuming duplosses: 1 dup 3 losses assuming HGT: 1 event Stochastic error/nonparametric
  • 17. Now we have estimates for these assuming deepcoal: Gene tree parsimony 1 deepcoal assuming duplosses: Gene tree parsimony 1 dup 3 losses assuming HGT: 1 event Stochastic error/nonparametric
  • 18. Now we have estimates for these assuming deepcoal: Gene tree parsimony 1 deepcoal assuming duplosses: Gene tree parsimony 1 dup 3 losses assuming HGT: (approximate) dSPR 1 event Stochastic error/nonparametric
  • 19. Now we have estimates for these assuming deepcoal: Gene tree parsimony 1 deepcoal assuming duplosses: Gene tree parsimony 1 dup 3 losses assuming HGT: (approximate) dSPR 1 event RF, Hdist Stochastic error/nonparametric
  • 20. Considering several measures of disagreement: Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors Easy to include other distances in the future
  • 21. Considering several measures of disagreement: Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors Easy to include other distances in the future Problem: the normalization constant Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420 – 426 Solution: importance sampling estimate of Z(.) E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
  • 23-25. Distribution of gene trees: probabilistic model. [Diagram of the graphical model. Nodes: species tree S; for each gene family i = 1, ..., n: Gi, Di, Qi, λdup_i, λloss_i, ..., λspr_i; shared hyperparameters: λdupprior, λlossprior, ..., λsprprior]
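The diagram itself is lost in this transcript, but the node labels suggest a factorization along the following lines. This is a sketch under stated assumptions, not the authors' exact model: the arrows are assumed to run from S and λ_i to G_i, from G_i (and Q_i) to D_i, and from the shared hyperparameters to each λ_i; the exponential form of the penalty and the symbols d_k and Z are introduced here for illustration.

```latex
% Assumed factorization of the hierarchical model (k ranges over dup, loss, spr, ...)
P\big(S, \{G_i, \lambda_i\}, \lambda^{\mathrm{prior}} \mid \{D_i\}\big)
  \;\propto\;
  P(S)\, P(\lambda^{\mathrm{prior}})
  \prod_{i=1}^{n}
    P(D_i \mid G_i, Q_i)\,
    P(G_i \mid S, \lambda_i)\,
    P(\lambda_i \mid \lambda^{\mathrm{prior}})

% Assumed penalty on gene trees that differ from the species tree
P(G_i \mid S, \lambda_i)
  = \frac{1}{Z(\lambda_i, S)}
    \exp\!\Big( -\sum_{k} \lambda_{k,i}\, d_k(G_i, S) \Big)
```

Under this reading, each gene family has its own penalties λ_{k,i}, so different families can be explained by different processes, while the shared hyperparameters let families borrow information from each other.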
  • 26-29. Distribution of gene trees: probabilistic model. Importance sampling, so we can use complex, state-of-the-art software for phylogenetic inference. We should not rely on single estimates of gene phylogenies. [Diagram: the same graphical model, with "Input" and "Output" labels] E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome Research 23: 323-330.
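One way to read the importance sampling step is that gene trees sampled by external software are reweighted by how well they agree with a candidate species tree, instead of being collapsed into a single estimate. A minimal sketch follows, assuming the external sampler used a flat prior over topologies and that the penalty is exp(-λ·d(G,S)); the function name, the tree encoding and the optional `log_z` argument are hypothetical.

```python
import math
from collections import Counter


def gene_family_weight(sampled_trees, species_tree, lam, distance, log_z=0.0):
    """Average penalty over sampled gene trees.

    Up to a constant, this estimates sum_G P(D|G) P(G|S) when the trees were
    sampled from P(G|D) under a flat topology prior (an assumption of this sketch).
    """
    counts = Counter(sampled_trees)
    n = sum(counts.values())
    return sum(
        (c / n) * math.exp(-lam * distance(g, species_tree) - log_z)
        for g, c in counts.items()
    )


def rf(tree1, tree2):
    """RF distance for 5-leaf trees encoded by their two cherries."""
    return len(tree1 ^ tree2)


tree_a = frozenset([frozenset("ab"), frozenset("cd")])
tree_b = frozenset([frozenset("ab"), frozenset("ce")])
posterior_sample = [tree_a] * 70 + [tree_b] * 30  # stand-in for an external MCMC sample

print(gene_family_weight(posterior_sample, tree_a, 1.0, rf))
print(gene_family_weight(posterior_sample, tree_b, 1.0, rf))
```

A gene family with a diffuse sample contributes similar weights to many candidate species trees, which is how uninformative families end up having little influence.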
  • 31-32. Example: distances between gene families. ● 567 single-copy gene trees for 23 species. Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327-331. ● Analysis under a model where only RF, Hdist and dSPR are considered. ● Not interested in the data set per se (unreliable). ● Use it just as a didactic tool to show how the model works. [Figure panels: RF, Hdist, SPR]
  • 33-35. Example: distances between gene families. [Figure: posterior samples and best estimate; panels: RF, Hdist, SPR]
  • 37. Analysis of simulated data sets. ● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada. ● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution. We use gene trees only, and simulate tree inference error. Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
  • 38-40. Analysis of simulated data sets – results [figures only]
  • 42-43. Single-copy genes from Drosophila (TreeFam). ● 4591 informative, single-copy gene families. ● (The TreeFam database has 14250 informative gene families.)
  • 44-45. Single-copy genes from Drosophila (TreeFam). ● 4591 informative, single-copy gene families. Estimated species tree: ● Root location uncertain. ● Only one unrooted topology.
  • 46-47. Large gene families from Drosophila (TreeFam). ● 43 gene families with 102 to 295 tips. Best species tree: ~100%
  • 48. To recap, our model can: ● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc. The larger, the better, especially for rooting the species tree. ● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all. Do not assume gene trees are known: embrace ignorance! ● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon. Different gene families may be the product of distinct processes. ● Be fast ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless. It's parallelized, and all distances can be calculated very fast.
  • 49. Check out http://darwin.uvigo.es for announcements, code, slides... Thank you!