Phylogenetic approaches to metagenomic analysis #KSMicro talk by Jonathan Eisen
1. Phylogenetic & Phylogenomic
Approaches to
Metagenomic Analysis
Jonathan A. Eisen
UC Davis
Keystone Meeting #KSMicro
March 26, 2011
2. Outline
• Introduction to phylogeny
• Phylogeny and metagenomics
– Phylotyping
– Phylogenetic Binning
– Functional diversity and prediction
– Phylogenetic ecology
– Selecting species or genes for study
3. T. H. Dobzhansky (1973)
“Nothing in biology makes sense
except in the light of evolution.”
4. Evolutionary Perspective and
Comparative Biology
• Comparative biology is the analysis of differences
and similarities between species.
• An evolutionary perspective is useful in such studies
because it allows one to focus on how and why
similarities and differences came to be.
• In other words, biological objects have a history and
understanding that history is important
5. Phylogeny
• Phylogeny is a description of the
evolutionary history of
relationships among organisms (or
their parts).
• This is frequently portrayed in a
diagram called a phylogenetic tree.
• Phylogenies can be more complex
than a bifurcating tree (e.g., lateral
gene transfer, recombination,
hybridization)
• History allows one to distinguish
homology from convergence; tease
apart issues with rate variation
12. rRNA Phylotyping
• Note - using a tree does
not mean phylogeny
always matters per se
• But allows one to test
whether and how it
impacts biology, ecology,
etc
• When it does =
homology
• When it does not =
convergence, HGT, etc
13. rRNA Phylotyping in Sargasso Sea
Metagenomic Data
Venter et al., Science
304: 66. 2004
15. PhylOTU: A High-Throughput Procedure Quantifies
Microbial Community Diversity and Resolves Novel Taxa
from Metagenomic Data
Thomas J. Sharpton1*, Samantha J. Riesenfeld1, Steven W. Kembel2, Joshua Ladau1, James P.
O’Dwyer2,3, Jessica L. Green2, Jonathan A. Eisen4, Katherine S. Pollard1,5
1 The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America, 2 Center for Ecology and Evolutionary
Biology, University of Oregon, Eugene, Oregon, United States of America, 3 Institute of Integrative and Comparative Biology, University of Leeds, Leeds, United Kingdom,
4 Department of Evolution and Ecology, University of California Davis, Davis, California, United States of America, 5 Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America
Finding Metagenomic OTUs
Abstract
Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic
units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in
priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids
amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-
finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs
from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles.
Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons
of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries
identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In
addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by
analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the
biosphere currently hidden from PCR-based surveys of diversity.
Citation: Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, et al. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community
Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
Editor: Oded Béjà, Technion-Israel Institute of Technology, Israel
Received July 22, 2010; Accepted December 17, 2010; Published January 20, 2011
Copyright: © 2011 Sharpton et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
16. Shotgun Sequencing Allows Use of
Alternative Anchors (e.g., RecA)
Venter et al., Science
304: 66. 2004
17. Shotgun Sequencing Allows Use of Other Markers
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (y axis, 0 to 0.5) by Major Phylogenetic Group (x axis: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota), with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
19. [Figure: bar chart (y-axis ticks 0 to 0.7) by major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria), with one series per marker gene: frr, tsf, pgk, rpsI, rplL, rplT, rplF, rplE, rplS, infC, rplP, rplA, rplK, rplB, rplN, rplD, rplC, rpsJ, rplM, rpsE, rpsS, rpsK, rpsB, rpsC, pyrG, rpoB, nusA, rpsM, rpmA, dnaG, smpB.]
20. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
21. rRNA Tree of Life
Bacteria
Archaea
??????
Eukaryotes
Wu et al. (2011) PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
22. rRNA Tree of Life
Bacteria
Archaea
Scanned through
GOS data for
rRNAs that fit
this pattern
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
23. rRNA Tree of Life
Bacteria
Archaea
Found many, but
closer
examination
revealed all to
have issues
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
24. RecA
J Mol Evol (1995) 41:1105-1123
Journal of Molecular Evolution
© Springer-Verlag New York Inc. 1995
The RecA Protein as a Model Molecule for Molecular Systematic Studies of Bacteria: Comparison of Trees of RecAs and 16S rRNAs from the Same Species
Jonathan A. Eisen
Department of Biological Sciences, Stanford University, Stanford, CA 94305-5020, USA (email: jeisen@leland.stanford.edu)
Received: 1 July 1995 / Accepted: 25 July 1995
Abstract. The evolution of the RecA protein was analyzed using molecular phylogenetic techniques. Phylogenetic trees of all currently available complete RecA proteins were inferred using multiple maximum parsimony and distance matrix methods. Comparison and analysis of the trees reveal that the inferred relationships among these proteins are highly robust. The RecA trees show consistent subdivisions corresponding to many of
Introduction
Molecular systematics has become the primary way to determine evolutionary relationships among microorganisms because morphological and other phenotypic characters are either absent or change too rapidly to be useful for phylogenetic inference (Woese 1987). Not all molecules are equally useful for molecular systematic studies and the molecule of choice for most such studies
41. Uses of Phylogeny
in Metagenomics
Example 3:
Functional Diversity and
Functional Predictions
42. Predicting Function
• Identification of motifs
– Short regions of sequence similarity that are indicative of
general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other
sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs
and multiple genes, not all with the same function
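The similarity-based approach, and the problem noted on this slide, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit tuples, the gene names, the function labels and the `max_evalue` cutoff are all hypothetical, and no particular search tool is implied.

```python
def transfer_annotations(hits, max_evalue=1e-5):
    """Return the distinct functions of all hits at or below the E-value cutoff.

    `hits` is a list of (gene_name, evalue, function) tuples; the format
    and the default cutoff are assumptions made for this sketch.
    """
    significant = [hit for hit in hits if hit[1] <= max_evalue]
    return sorted({function for _, _, function in significant})


# Hypothetical search results for one query gene.
example_hits = [
    ("geneX", 1e-40, "DNA repair"),
    ("geneY", 1e-35, "DNA repair"),
    ("geneZ", 1e-30, "chromatin remodeling"),
    ("geneW", 0.5, "transport"),  # not significant, ignored
]

predicted = transfer_annotations(example_hits)
print(predicted)
```

When more than one function comes back, similarity alone gives no principled way to choose among them, which is the gap the phylogenetic approach is meant to fill.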
43. PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart of the method, illustrated with Example A (genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2 and 3, with a possible duplication) and Example B (genes 1 to 6, ambiguous).]
METHOD
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)
Based on Eisen, 1998 Genome Res 8: 163-167.
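The last two steps of the method (overlay known functions, then infer the function of the gene of interest) can be sketched as follows. This is a minimal illustration, not the procedure from the cited paper: the tree is a hypothetical nested tuple, the function labels are invented, and the rule used here (take the smallest clade around the query that contains annotated genes, and call the result ambiguous if they disagree) is an assumption made for the sketch.

```python
def leaves(node):
    """List all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def clades_containing(node, query):
    """Return the clades that contain `query`, smallest first."""
    if isinstance(node, str):
        return [node] if node == query else []
    for child in node:
        path = clades_containing(child, query)
        if path:
            return path + [node]
    return []


def predict_function(tree, query, known):
    """Infer a function for `query` from its closest annotated relatives."""
    for clade in clades_containing(tree, query):
        functions = {known[g] for g in leaves(clade) if g != query and g in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return "unknown"


# Example A: two subfamilies (A and B) separated by a duplication.
tree_a = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_a = {"1A": "function A", "3A": "function A",
           "1B": "function B", "3B": "function B"}
print(predict_function(tree_a, "2A", known_a))
print(predict_function(tree_a, "2B", known_a))

# A case where the closest annotated relatives disagree.
tree_b = ((("1", "2"), "3"), "4")
known_b = {"1": "function X", "2": "function Y"}
print(predict_function(tree_b, "3", known_b))
```

In Example A the two query genes receive different predictions even though all six genes are homologs, because the prediction follows the subfamily, not overall similarity.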
46. Uses of Phylogeny
in Metagenomics
Example 5:
Selecting Organisms for Study
47. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
51. Protein Family Rarefaction
Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
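The three steps above can be sketched as follows, assuming the protein families have already been identified (for example by MCL) and are available as a mapping from genome to a set of family IDs. The genome names, family IDs, number of permutations and seed are hypothetical; averaging over random genome orderings is a common way to smooth such curves and is an assumption here, not something stated on the slide.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct protein families after sampling 1..N genomes.

    `genome_families` maps a genome name to the set of family IDs found
    in that genome. The curve is averaged over random genome orderings.
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


# Hypothetical toy data: three genomes, four families in total.
toy_families = {
    "genome1": {"fam1", "fam2"},
    "genome2": {"fam2", "fam3"},
    "genome3": {"fam1", "fam4"},
}

curve = rarefaction_curve(toy_families)
for n_genomes, n_families in enumerate(curve, start=1):
    print(n_genomes, n_families)
```

A curve that keeps climbing as genomes are added suggests that new genomes are still contributing new protein families; a curve that flattens suggests the sampled group is close to saturated.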
60. Shotgun Sequencing Allows Use of Other Markers
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
61. Shotgun Sequencing Allows Use of Other Markers
Cannot be done without good sampling of genomes
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
62. Phylogenetic Binning
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
63. Shotgun Sequencing Allows Use of Other Markers
Cannot be done without good sampling of genomes
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
64. Shotgun Sequencing Allows Use of Other Markers
GEBA Project improves metagenomic analysis
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
65. Shotgun Sequencing Allows Use of Other Markers
But not a lot
[Figure: Sargasso Phylotypes. Bar chart of Weighted % of Clones (0 to 0.5) by Major Phylogenetic Group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
66. Phylogeny and Metagenomics
Future 1
Need to adapt genomic and
metagenomic methods to make better
use of data
69. AMPHORA ALL
• Build reference tree with concatenated alignment
• Align reads that match any of the HMMs to concatenated alignment
• Place reads into reference tree one at a time
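The "one read at a time" idea can be illustrated with a deliberately simplified stand-in. The sketch below does not insert reads into a tree and is not how AMPHORA works: it assigns each aligned read to the nearest reference sequence in the concatenated alignment, using the fraction of mismatches over the columns the read covers. The sequences, names and distance measure are hypothetical and chosen only to show that each read is handled independently of the others.

```python
def p_distance(read, ref):
    """Fraction of mismatches over columns where neither sequence has a gap.

    Returns None when the two sequences share no comparable column.
    """
    compared = mismatches = 0
    for a, b in zip(read, ref):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else None


def place_reads(reference_alignment, aligned_reads):
    """Assign each read, independently, to its closest reference sequence."""
    placements = {}
    for read_id, read in aligned_reads.items():
        best = None
        for ref_id, ref in reference_alignment.items():
            d = p_distance(read, ref)
            if d is None:
                continue
            if best is None or d < best[1]:
                best = (ref_id, d)
        placements[read_id] = best
    return placements


# Hypothetical concatenated reference alignment and aligned read fragments.
references = {
    "alpha_ref": "ACGTACGTAC",
    "gamma_ref": "ACGTTTGTAC",
}
reads = {
    "read1": "--GTAC----",
    "read2": "-----TGTAC",
}

placements = place_reads(references, reads)
print(placements)
```

Because each read is compared only against the fixed reference set, reads never influence one another's placement, which is also what keeps fragmentary, non-overlapping reads usable.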
71. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
76. GEBA uncultured
Number of SAGs from Candidate Phyla
Site                                    OD1   OP1   OP3   SAR406
Site A: Hydrothermal vent               4     1     -     -
Site B: Gold Mine                       6     13    2     -
Site C: Tropical gyres (Mesopelagic)    -     -     -     2
Site D: Tropical gyres (Photic zone)    1     -     -     -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
77. Earth Microbiome Project
• Goal – to systematically approach the problem of
characterizing microbial life on Earth
• Strategy:
– Explore microbes in environmental parameter space
– Design ‘ideal’ strategy to interrogate these biomes
– Acquire samples and sequence DNA, mRNA and rRNA,
both broadly and deeply
– Define microbial community structure and the protein universe
• Gilbert et al., 2010a,b SIGS
Phylogenetic analysis of rRNAs led to the discovery of archaea
This is a tree of an rRNA gene that was found on a large DNA fragment isolated from Monterey Bay. This rRNA gene groups in a tree with genes from members of the Gammaproteobacteria, a group that includes E. coli as well as many environmental bacteria. This rRNA phylotype has been found to be dominant in many ocean ecosystems.
clone from the Sargasso Sea. This shows that this
Gets better with more markers, but we do not have lots of sequences for these markers. We can get them from genomes. The more diverse the genomes, the better the marker set will be.
Sites of imaginative potential – iconic locations – Iconic Sampling
Life is strange – microbes are stranger – how do we capitalize on this?