Talk by Jonathan Eisen, March 7, 2012, at the National Academy of Sciences Institute of Medicine "Forum on Microbial Threats" meeting on the "Social Biology of Microbes"
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metagenomes
1. Phylogenetic and Phylogenomic
Approaches to the
Study of Microbial Communities
March 7, 2012
IOM Forum on Microbial Threats
Social Biology of Microbes
Jonathan A. Eisen
University of California, Davis
Wednesday, March 7, 12
2. Acknowledgements
• $$$
• DOE
• NSF
• GBMF
• Sloan
• DARPA
• DSMZ
• DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches,
Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack
Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk
4. Phylogeny
• Phylogeny is a description of the
evolutionary history of
relationships among organisms (or
their parts).
• This is frequently portrayed in a
diagram called a phylogenetic tree.
• Phylogenies can be more complex
than a bifurcating tree (e.g.,
because of lateral gene transfer,
recombination, or hybridization)
5. Whatever the History: Trying to Incorporate it is Critical
Four Models for Rooting TOL
from Lake et al. doi: 10.1098/rstb.2009.0035
6. Uses of Phylogeny
in Genomics and Metagenomics
Example 1:
Phylotyping and Phylogenetic
Ecology
7. rRNA Phylotyping
• Collect DNA from
environment
• PCR amplify rRNA genes
using broad (so-called
universal) primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified” by
placement on tree
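The last steps above (align, then "identify" unknowns by where they fall relative to known sequences) can be sketched roughly as follows. This is a simplified stand-in, not the method from the talk: it skips tree inference and assigns a query to the group of its nearest reference by p-distance. The function names, the reference sequences, and the group labels are all made up for illustration, and the sequences are assumed to be already aligned.

```python
def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def phylotype(query, references):
    """Assign a query to the group of its closest reference sequence.

    references maps a sequence name to (group label, aligned sequence).
    Returns (group, reference name, distance). This nearest-neighbor rule
    is a crude substitute for placing the query on an inferred tree.
    """
    name, (group, seq) = min(
        references.items(), key=lambda item: p_distance(query, item[1][1])
    )
    return group, name, p_distance(query, seq)


# Hypothetical, pre-aligned toy references (not real rRNA data).
REFERENCES = {
    "ref_alpha": ("Alphaproteobacteria", "ACGTACGTAC"),
    "ref_cyano": ("Cyanobacteria", "ACGTTCGAAC"),
    "ref_firm": ("Firmicutes", "TCGAACGTTT"),
}

if __name__ == "__main__":
    print(phylotype("ACGTACGTAA", REFERENCES))
```

A tree-based approach differs from this sketch in that it can place a query on a deep branch with no close reference, which a nearest-hit rule would hide.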
9. Three Major Issues in Phylotyping
Beyond Moore’s Law Metagenomics
Short reads
10. rRNA Phylotyping in
Sargasso Sea
Metagenomic Data
Venter et al., Science
304: 66. 2004
11. RecA
Phylotyping in
Sargasso Data
Venter et al., Science
304: 66. 2004
13. Sargasso Phylotypes
[Figure: bar chart. y-axis: Weighted % of Clones (0 to 0.500); x-axis: Major Phylogenetic Group; one series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
14. Solution: More Automation
• BLAST????
• Composition/word frequencies
• Automation of trees
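As a rough illustration of the "composition/word frequencies" option above, the sketch below builds a k-mer frequency profile for a DNA sequence and compares two profiles. It is only a minimal example under assumed choices (k = 2 by default, Euclidean distance); the function names are hypothetical and no specific binning tool is implied.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Relative frequency of every DNA k-mer, in a fixed alphabetical order.

    Windows containing characters other than A, C, G, T are not counted.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


if __name__ == "__main__":
    a = kmer_profile("ACGTACGTACGT")
    b = kmer_profile("GGGCCCGGGCCC")
    print(profile_distance(a, b))
```

Composition profiles need no alignment, which makes them fast, but they carry no explicit evolutionary model, unlike the tree-based option in the last bullet.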
16. STAP
Wu et al. 2008 PLoS One
Figure 1. A flow chart of the STAP pipeline.
17. STAP
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ...
Each sequence analyzed separately
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs.
doi:10.1371/journal.pone.0002566.g002
Wu et al. 2008 PLoS One
18. AMPHORA
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
19. WGT
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
20. AMPHORA
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
Guide tree
21. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
22. Comparison of the phylotyping performance by AMPHORA and MEGAN. The sensitivity and specificity of the phylotyping
methods were measured across taxonomic ranks using simulated Sanger shotgun sequences of 31 genes from 100
representative bacterial genomes. The figure shows that AMPHORA significantly outperforms MEGAN in sensitivity without
sacrificing specificity.
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
24. Metagenomic Phylogenetic challenge
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxx
A single tree with everything
26. rRNA Phylotyping in
Sargasso Sea
Metagenomic Data
Venter et al., Science
304: 66. 2004
27. Combine all into
one alignment
Figure 1. A flow chart of the STAP pipeline.
29. RecA
Phylotyping in
Sargasso Data
Venter et al., Science
304: 66. 2004
31. Sargasso Phylotypes
[Figure: bar chart. y-axis: Weighted % of Clones (0 to 0.500); x-axis: Major Phylogenetic Group; one series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
33. Metagenomic Phylogenetic challenge
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxx
A single tree with everything
35. PhylOTU - Sharpton et al. PLoS Comp. Bio 2011
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylin... workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
41. the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
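The unweighted measure described in panel (a) can be sketched in a few lines. This is an illustrative plain-Python version, not the UniFrac software: the tree encoding (a parent map plus a branch-length map), the node names, and the toy tree are assumptions made for the example.

```python
def unweighted_unifrac(parent, length, leaf_env):
    """Fraction of branch length leading to only one of two environments.

    parent:   maps each non-root node to its parent node.
    length:   maps each non-root node to the length of the branch above it.
    leaf_env: maps each leaf to its environment label.
    """
    # Record, for every branch, which environments occur among its descendants.
    envs_below = {node: set() for node in length}
    for leaf, env in leaf_env.items():
        node = leaf
        while node in parent:  # stop at the root, which has no parent entry
            envs_below[node].add(env)
            node = parent[node]
    # Branches seen from exactly one environment count as "unique".
    unique = sum(length[n] for n, envs in envs_below.items() if len(envs) == 1)
    total = sum(length[n] for n, envs in envs_below.items() if envs)
    return unique / total


# Toy tree: root -> X and c; X -> a and b (names are hypothetical).
PARENT = {"a": "X", "b": "X", "X": "root", "c": "root"}
LENGTH = {"a": 1.0, "b": 1.0, "X": 0.5, "c": 1.0}
LEAF_ENV = {"a": "square", "b": "circle", "c": "circle"}

if __name__ == "__main__":
    # Only branch X is shared, so the result is 3.0 / 3.5 (about 0.857).
    print(unweighted_unifrac(PARENT, LENGTH, LEAF_ENV))
```

The weighted variant in panel (b) would replace the presence/absence test on each branch with a difference in relative abundance, which this sketch does not attempt.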
Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.
42. AutoPhylotyping 5:
Novel lineages and decluttering
43. RecA Tree of Life
Bacteria
Archaea
Other lineages?
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
49. Sulcia makes amino acids
Baumannia makes vitamins and cofactors
Wu et al. 2006 PLoS Biology 4: e188.
50. Uses of Phylogeny
in Genomics and Metagenomics
Example 2:
Functional Diversity and
Functional Predictions
51. Predicting Function
• Key step in genome projects
• More accurate predictions help guide
experimental and computational analyses
• Many diverse approaches
• All improved by “phylogenomic” type
analyses that integrate both evolutionary
reconstructions and an understanding of how new
functions evolve
52. PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flow chart of the method, worked through for EXAMPLE A (genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, Species 2, and Species 3, with a duplication) and EXAMPLE B (genes 1 to 6). Branches are annotated "Duplication?" and one inferred function is marked "Ambiguous". Steps of the method:]
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
• ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)
Based on Eisen, 1998 Genome Res 8: 163-167.
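The last two steps of the method (overlay known functions onto the gene tree, then infer the function of the gene of interest) might be sketched as below. This is a deliberately simplified illustration, not the procedure from Eisen 1998: it ignores branch lengths and duplication inference, and simply reports the function shared by annotated genes in the smallest clade around the query, or "ambiguous" if they disagree. The tree encoding (nested tuples), the function labels, and the helper names are assumptions.

```python
def leaves(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def predict_function(tree, gene, known):
    """Predict a function for `gene` from annotated relatives on a gene tree.

    Walks outward from the gene to the smallest clade that contains at least
    one annotated gene. Returns that clade's function if all annotations
    agree, 'ambiguous' if they conflict, or None if nothing can be inferred.
    """
    if isinstance(tree, str):
        return None  # a single leaf carries no information about neighbors
    for child in tree:
        if gene in leaves(child):
            result = predict_function(child, gene, known)
            if result is not None:
                return result
            break  # nothing found deeper; fall back to this clade
    else:
        return None  # gene is not in this subtree
    functions = {known[l] for l in leaves(tree) if l in known and l != gene}
    if not functions:
        return None
    return functions.pop() if len(functions) == 1 else "ambiguous"


# Gene names follow the slide; the tree shape and function labels are hypothetical.
TREE_A = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
KNOWN_A = {"1A": "repair", "3A": "repair", "1B": "transport"}

if __name__ == "__main__":
    print(predict_function(TREE_A, "2A", KNOWN_A))
    print(predict_function(TREE_A, "3B", KNOWN_A))
```

With these toy inputs, 2A inherits the annotation of its sister 1A, and 3B inherits the annotation found in the B clade rather than that of any A-clade gene, which is the point of using the tree instead of the top database hit.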
54. [Figure: phylogenetic tree of haloarchaeal genomes with branch support values (scale bar 0.01). Taxa: Halorubrum lacusprofundi, Haloquadratum walsbyi, Halogeometricum borinquense, Haloferax mediterranei, Haloferax mucosum, Haloferax volcanii, Haloferax sulfurifontis, Haloferax denitrificans, Halalkalicoccus jeotgali, Halopiger xanaduensis, Natrialba magadii, Haloterrigena turkmenica, Halobacterium sp. NRC 1, Halobacterium salinarum R1, Natronomonas pharaonis, Halorhabdus utahensis, Halomicrobium mukohataei, Haloarcula vallismortis, Haloarcula marismortui, Haloarcula sinaiiensis, Haloarcula californiae.
Legend, dataset genes: MA ammonia-lyase, MA mutase S subunit, MA mutase E subunit, PHA synthase, cellulase, CRISPRs, CAS. Color ranges: New Genomes.]
55. Haloarchaea TBPs
[Figure: phylogenetic tree of TBP homologs with bootstrap values.]
Figure 8. Independent expansion of the TATA-binding protein family in two haloarchaeal genera. Phylogeny of TATA-binding protein (TBP) homologs identified by RAST with Bootstrap values
shown. Colored branches represent duplication events (with the dark blue branch representing four duplications). Ancestral TBP (found in all genomes) is shown on the purple branch. Successive
duplications are shown in darkening shades of green (Halobacterium) or blue (Haloferax).
Lynch et al. in preparation
56. Massive Diversity of Proteorhodopsins
Venter et al., 2004
57. Characterizing the niche-space distributions of components Metagenomics DARPA
[Figure: three aligned matrix panels, (a), (b), and (c), with one row per sampling site. Sites are GS-numbered samples from the Sargasso Sea, North American East Coast, Caribbean Sea, Eastern Tropical Pacific, Galapagos Islands, Polynesia Archipelagos, and Indian Ocean. Panel (a) columns: Component 1 to Component 5. Panel (c) columns: Chlorophyll, Water Depth, Salinity, Temperature, Sample Depth, Insolation. Legends: General (High, Medium, Low, NA); Water depth (>4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m).]
Figure 3: a) Niche-space distributions for our five components ($H^T$); b) the site-similarity matrix ($\hat{H}^T \hat{H}$); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
58. Uses of Phylogeny
in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
59. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science
276:734-740
60. As of 2002
• At least 40 phyla of bacteria
[Figure: tree of bacterial phyla: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11.]
Based on Hugenholtz, 2002
61. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
[Figure: tree of bacterial phyla, as on the previous slide.]
Based on Hugenholtz, 2002
62. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some studies in other phyla
[Figure: tree of bacterial phyla, as on the previous slide.]
Based on Hugenholtz, 2002
63. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
[Figure: tree of bacterial phyla, as on the previous slide.]
Based on Hugenholtz, 2002
64. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
[Figure: tree of bacterial phyla, as on the previous slide.]
Based on Hugenholtz, 2002
67. GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen,
Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat
Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor
Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer,
Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova,
Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
68. GEBA Lesson 1:
Phylogeny driven genome selection (and
phylogenetics) improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based
predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
69. GEBA Lesson 2
Phylogeny-driven genome selection
helps discover new genetic diversity
70. Protein Family Rarefaction
Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
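The plotting step above can be sketched as follows, assuming protein families have already been assigned (the slide uses MCL for that; the clustering itself is not shown here). The function name, the number of random orderings, and the toy family IDs are illustrative assumptions.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct protein families versus number of genomes.

    genome_families: list of sets of family IDs, one set per genome.
    Entry i of the result is the average count of distinct families seen
    after adding i + 1 genomes, averaged over random genome orderings.
    """
    rng = random.Random(seed)
    totals = [0] * len(genome_families)
    for _ in range(n_permutations):
        order = genome_families[:]
        rng.shuffle(order)
        seen = set()
        for i, families in enumerate(order):
            seen |= families  # accumulate families found so far
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]


# Toy input: three genomes with hypothetical family IDs.
GENOMES = [{"f1", "f2"}, {"f2", "f3"}, {"f3", "f4", "f5"}]

if __name__ == "__main__":
    for n, count in enumerate(rarefaction_curve(GENOMES), start=1):
        print(n, count)
```

A curve that keeps rising as genomes are added suggests that new genomes are still contributing new families; a plateau suggests the sampled group is close to saturated.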
71. Wu et al. 2009 Nature 462, 1056-1060
78. GEBA Lesson 3
Improves analysis of genome data from
uncultured organisms
79. Shotgun Sequencing Allows Use of Other Markers
GEBA Project improves metagenomic analysis
Sargasso Phylotypes
[Figure: bar chart. y-axis: Weighted % of Clones (0 to 0.500); x-axis: Major Phylogenetic Group; one series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
80. Shotgun Sequencing Allows Use of Other Markers
But not a lot
Sargasso Phylotypes
[Figure: bar chart. y-axis: Weighted % of Clones (0 to 0.500); x-axis: Major Phylogenetic Group; one series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
81. Phylogeny and Metagenomics
Future 1
Need to adapt genomic and
metagenomic methods to make better
use of data