Lecture 15: Era IV Shotgun Metagenomics

EVE 161: Microbial Phylogenomics

Lecture #15: Era IV: Shotgun Metagenomics

UC Davis, Winter 2014
Instructor: Jonathan Eisen

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014


Where we are going and where we have been

• Previous lecture:
  - 14: Era IV: Metagenomics
• Current lecture:
  - 15: Era IV: Shotgun Metagenomics
• Next lecture:
  - 16: Era IV: Function in Metagenomics



Era IV: Genomes in the environment

Era IV:
Shotgun Metagenomics


Environmental Shotgun Sequencing

• ESS first applied to endosymbiont genomes
• Buchnera genome sequenced with ESS
• Endosymbionts relatively clonal within one host and even within one species sometimes
• Many others too


Wolbachia Metagenomic Sequencing

shotgun

sequence


Wolbachia pipientis wMel

Wu et al., 2004. Collaboration between Jonathan Eisen and Scott O’Neill (Yale, U. Queensland).


[Slide: overlaid title pages of the two 2004 shotgun metagenomics papers discussed in this lecture]

Tyson et al., 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, doi:10.1038/nature02340.

Venter et al., 2004. Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science 304, 2 April 2004.

Shotgun metagenomics

shotgun
sequence



articles

Community structure and metabolism
through reconstruction of microbial
genomes from the environment
Gene W. Tyson1, Jarrod Chapman3,4, Philip Hugenholtz1, Eric E. Allen1, Rachna J. Ram1, Paul M. Richardson4, Victor V. Solovyev4,
Edward M. Rubin4, Daniel S. Rokhsar3,4 & Jillian F. Banfield1,2
1 Department of Environmental Science, Policy and Management, 2 Department of Earth and Planetary Sciences, and 3 Department of Physics, University of California, Berkeley, California 94720, USA
4 Joint Genome Institute, Walnut Creek, California 94598, USA

Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their
roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report
reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other
genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of
genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different
individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level.
The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance
variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous
recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the
pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme
environment.
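The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis1–3. However, isolates have been the main source of sequence data, and only a small fraction of microorganisms have been cultivated4–6. Consequently, focus has shifted towards the analysis of uncultivated microorganisms via cloning of conserved genes and genome fragments directly from the environment7–9. To date, only a small fraction of genes have been recovered from individual environments, limiting the analysis of ...

... fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other members of the Thermoplasmatales). The genome of one of these archaea, Ferroplasma acidarmanus fer1, isolated from the Richmond mine, has been sequenced previously (http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html). A pink biofilm (Fig. 1a) typical of AMD communities was ...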

Acid Mine Drainage 2004

... is internally self consistent, with 97.2% of end pairs from the same clone assembled with the appropriate orientation and separation, as expected for a low rate of mispairing error (tracking and chimaeric clones).

The first step in assignment of scaffolds to organism types was to ...

Figure 1 The pink biofilm. a, Photograph of the biofilm in the Richmond mine (hand included for scale). b, FISH image of a. Probes targeting bacteria (EUBmix; fluorescein isothiocyanate (green)) and archaea (ARC915; Cy5 (blue)) were used in combination with a probe targeting the Leptospirillum genus (LF655; Cy3 (red)). Overlap of red and green (yellow) indicates Leptospirillum cells and shows the dominance of Leptospirillum. c, Relative microbial abundances determined using quantitative FISH counts.

... uncultured Ferroplasma species distinct from fer1. We designate this as Ferroplasma type II. The dominance of this organism type was unexpected before the genomic analysis.

We assigned the roughly 3× coverage, high G+C scaffolds to Leptospirillum group III on the basis of rRNA markers (474 scaffolds up to 31 kb, totalling 2.66 Mb). Comparison of these scaffolds with those assigned to Leptospirillum group II indicates significant sequence divergence and only locally conserved gene order, confirming that the scaffolds belong to a relatively distant relative of Leptospirillum group II. A partial 16S rRNA gene sequence from Sulfobacillus thermosulfidooxidans was identified in the unassembled reads, suggesting very low coverage of this organism. If any Sulfobacillus scaffolds >2 kb were assembled, they would be grouped with the Leptospirillum group III scaffolds.

We compared the 3× coverage, low G+C scaffolds (580 scaffolds, 4.12 Mb) to the fer1 genome in order to assign them to organism types (Supplementary Fig. S6). Scaffolds with ≥96% nucleotide identity to fer1 were assigned to an environmental Ferroplasma type I genome (170 scaffolds up to 47 kb in length and comprising 1.48 Mb of sequence). The remaining low-coverage, low G+C scaffolds are tentatively assigned to G-plasma. The largest scaffold in this bin (62 kb) contains the G-plasma 16S rRNA gene. The 410 scaffolds assigned to G-plasma comprise 2.65 Mb of sequence. A partial 16S rRNA gene sequence from A-plasma was identified in the unassembled reads, suggesting low coverage of this organism. Any scaffolds from A-plasma >2 kb would be included in the G-plasma bin. Although eukaryotes are present in the AMD system, they were in low abundance in the biofilm studied. So far, no scaffolds from eukaryotes have been detected.

As independent evidence that the Leptospirillum group II and Ferroplasma type II genomes are nearly complete, we located a full complement of transfer RNA synthetases in each genome data set. An almost complete set of these genes was also recovered from Leptospirillum group III. The G-plasma bin contains more than a full set of tRNA synthetases, consistent with inclusion of some A-plasma scaffolds. In addition, we established that the Leptospirillum group II, Leptospirillum group III, Ferroplasma type I, Ferroplasma type II and G-plasma bins contained only one set of rRNA genes.
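
The completeness argument above can be sketched as a simple tally. This is a minimal illustration and not the authors' procedure: the function name, the input format (one amino-acid code per annotated aminoacyl-tRNA synthetase gene), and the assumption that a complete genome carries one synthetase for each of the 20 standard amino acids are all simplifications introduced here (some organisms charge Gln or Asn indirectly and lack the matching synthetase). The example bins are hypothetical, not data from the paper.

```python
from collections import Counter

# Assumption for illustration: one synthetase per standard amino acid.
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]

def synthetase_summary(annotated):
    """Summarize tRNA synthetase genes found in one genome bin.

    annotated: list of amino-acid codes, one entry per synthetase gene found.
    Returns (missing, duplicated) as sorted lists of amino-acid codes.
    """
    counts = Counter(annotated)
    missing = sorted(aa for aa in AMINO_ACIDS if counts[aa] == 0)
    duplicated = sorted(aa for aa in AMINO_ACIDS if counts[aa] > 1)
    return missing, duplicated

# Hypothetical bins (not data from the paper).
complete_bin = list(AMINO_ACIDS)                         # full set, no extras
mixed_bin = list(AMINO_ACIDS) + ["Leu", "Val"]           # extras hint at a second genome
partial_bin = [aa for aa in AMINO_ACIDS if aa != "Trp"]  # one synthetase not recovered
```

A bin with nothing missing and nothing duplicated is consistent with a single near-complete genome; duplicates suggest a bin that mixes scaffolds from more than one organism, as the text proposes for G-plasma and A-plasma.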

NATURE | doi:10.1038/nature02340 | www.nature.com/nature
©2004 Nature Publishing Group

Methods
• Plasmid library
• Shotgun sequence
• Assembled
• Binning
  - GC content
  - Coverage
• Potential “nearly” complete genomes
  - Leptospirillum group II
  - Ferroplasma type II
  - Evidence for completeness: housekeeping genes
• Annotation, population analysis
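
The binning step in the list above can be sketched as a two-way split on GC content and read depth. This is a minimal illustration under stated assumptions: the function names, the cutoff values (`gc_cutoff`, `coverage_cutoff`), and the example scaffolds are all hypothetical and are not taken from the paper.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def assign_bin(seq, coverage, gc_cutoff=0.45, coverage_cutoff=6.0):
    """Place a scaffold in one of four bins by GC content and read depth.

    gc_cutoff and coverage_cutoff are hypothetical values for illustration.
    """
    gc_label = "high_gc" if gc_content(seq) >= gc_cutoff else "low_gc"
    cov_label = "high_cov" if coverage >= coverage_cutoff else "low_cov"
    return f"{gc_label}/{cov_label}"

# Hypothetical scaffolds: name -> (sequence, mean read depth)
scaffolds = {
    "s1": ("GCGCGGCCATGC", 10.2),
    "s2": ("ATATTTAAGCAT", 9.8),
    "s3": ("GGCCGCGATCGC", 3.1),
    "s4": ("AATTATATGAAT", 2.9),
}
bins = {name: assign_bin(seq, cov) for name, (seq, cov) in scaffolds.items()}
```

Each of the four resulting bins would then be checked against marker genes (rRNA, housekeeping genes) before being treated as a candidate genome.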

... Leptospirillum group II genome may reflect strong recent environmental selection for this genome type or be the result of a founder effect.

... undergone homologous recombination. It is unlikely that the reads with pattern transitions represent variants that arose simply through accumulation of nucleotide polymorphisms, because this ...

Figure 2 Segment of the Ferroplasma type II composite genome. a, A 4.2-kb region showing annotated open reading frames (ORFs) (red), average read depth (blue line), and the number of nucleotide polymorphisms in the ‘green’ and ‘yellow’ relative to the ‘pink’ strain (green and yellow lines) averaged over 60-bp windows. Black dots indicate recombination sites. b, Alignment of individual reads (XYG) for a 96-bp region in a. Letters indicate nucleotide polymorphisms in the green and yellow strains relative to the pink strain. Note the recombinant sequence (XYG48207). c, Evolutionary distance tree inferred from the ancestral strain sequences in a.



Figure 3 Schematic diagram illustrating a diversity of mosaic genome types within the
Ferroplasma type II population that are inferred to have arisen by homologous recombination
between three closely related ancestral genome types (pink, yellow and green).

... genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system ...

... fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type ...

Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell ... drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.

RESEARCH ARTICLE
Environmental Genome Shotgun
Sequencing of the Sargasso Sea
J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3
Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3
Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3
Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6
Michael W. Lomas,6 Ken Nealson,5 Owen White,3
Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6
Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4
Hamilton O. Smith1
We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new ...

http://www.sciencemag.org/content/304/5667/66


Sargasso Sea

... two groups of scaffolds representing two distinct strains closely related to the published ...

... at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in ...

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

• Sampling Protocols. Sampling on the RV Weatherbird II was done as follows: Seawater (170 liters) from stations 11 and 13 was directly filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences) followed in series by a 0.22 µm Supor membrane disc filter (Pall Life Sciences). The sample from station 3 was pumped into a 250 L carboy prior to being filtered through the impact filters. The length of time from collection of the sample until the end of the filtration step was approximately one hour. Filters were placed in 5 ml of sucrose lysis buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl, pH 9.0) and stored in liquid nitrogen on the Weatherbird, then placed at -80°C until DNA extractions were done. Alternatively, seawater (340 liters) was collected from 5 meters below the surface into a carboy, then filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences), followed by concentration to 1 liter using a Pellicon tangential flow filtration (TFF) system (Millipore) with a 0.1 µm Durapore VVPP cartridge (Millipore); again the total time for the filtration and concentration was approximately one hour. Cells were pelleted at 10,000 rpm, 4°C for 30 minutes. The impact filters and the retentate from the TFF were then handled as described above. The carboys, tubing and filter systems were cleaned with a 10% hydrochloric acid wash prior to each leg of the sampling. Any of the sampling equipment (tubing, etc.) that could reasonably be soaked was soaked in an acid bath for at least 24 hours. Sampling carboys were filled with the acid wash and “soaked” for at least 24 hours as well. All acid-washed items were subsequently rinsed very liberally with Milli-Q water. A liberal Milli-Q water rinse was also conducted between samples on the same leg. All spigots from the carboys were covered with a ziploc bag until needed. Tubing was stored in clean ziploc bags until needed.


Sample preparation. The impact filters were cut into quarters and placed in individual 50 ml conical tubes. TE buffer (5 ml, pH 8) containing 150 µg/ml lysozyme was added to each tube. The tubes were incubated at 37°C for 2 hours. SDS was added to 0.1% and the samples were then put through three freeze/thaw cycles. The lysate was then treated with Proteinase K (100 µg/ml) for one hour at 55°C followed by three aqueous phenol extractions and one extraction with phenol/chloroform. The supernatant was then precipitated with two volumes of 100% ethanol and the DNA pellet washed with 70% ethanol.


DNA preparation. DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3'-CACA overhangs, were inserted into Bst XI-linearized plasmid vector with 3'-TGTG overhangs. Fragments were cloned in a medium-copy pBR322 derivative.


Sequence assembly. With default parameter settings, the highly covered genome sequences would have been treated as repetitive DNA by the Celera Assembler. Since the Celera Assembler constructs scaffolds only from a backbone of sequence heuristically classified as unique, these organisms would not have been eligible for scaffolding and would have been absent from the final assembly. However, by tuning the threshold parameter for classifying unique sequence, we were able to compensate for the apparent repetitiveness of these genomic regions, and scaffold them appropriately. This was accomplished by identifying the most deeply assembling, obviously nonrepetitive contigs in an initial run of the assembler (in this case, the strong assemblies at 21-36× coverage which were identified as gene-rich Burkholderia-like and plasmid scaffolds), and using a value slightly below the calculated “A-statistic” (an empirical uniqueness measure within the Assembler) of these contigs as the threshold parameter in a subsequent run. This allows the deep contigs to be treated as unique sequence, when they would otherwise be labeled as repetitive. At the other end of the spectrum, rare organisms in the sample have been sampled by sequencing only to a shallow depth of coverage. Routine assembly would not have considered the small fragment overlap based assemblies with shallow coverage as an eligible basis for scaffolding, due to a minimum length requirement of 1000 bp, which is typically in place for efficiency. Therefore, in the present use case, the organisms represented by these sequences would not have been ordered and oriented with mate-pairs without adjusting the default minimum length to compensate for the low anticipated coverage depth and assembly length. With this selection of parameters, more suitable to the environmental project at hand, we were able to adequately assemble both the dominant and rare species simultaneously.
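
The threshold tuning described above can be sketched numerically. This is a hedged illustration only: it assumes the commonly described log-odds form of the A-statistic (expected reads for a single-copy contig minus observed reads times ln 2), and the function names, the `margin` parameter, and all numbers are hypothetical. It is not the Celera Assembler's code or its parameter interface.

```python
import math

def a_statistic(rho, k, arrival_rate):
    """Log-odds that a contig is unique rather than a collapsed two-copy repeat.

    rho: contig span in bases; k: number of reads in the contig;
    arrival_rate: expected reads per base for single-copy sequence.
    Assumed form: rho * arrival_rate - k * ln(2).
    """
    return rho * arrival_rate - k * math.log(2)

def tuned_threshold(deep_contigs, arrival_rate, margin=1.0):
    """Pick a uniqueness threshold slightly below the lowest A-statistic
    among contigs judged deep but non-repetitive.

    deep_contigs: list of (rho, k) pairs; margin is a hypothetical safety gap.
    """
    scores = [a_statistic(rho, k, arrival_rate) for rho, k in deep_contigs]
    return min(scores) - margin

# Hypothetical numbers for illustration.
rate = 0.01                                   # reads per base at "typical" depth
typical = a_statistic(10_000, 100, rate)      # positive: looks unique
deep = a_statistic(10_000, 1_000, rate)       # strongly negative: looks repetitive
deep_contigs = [(10_000, 1_000), (20_000, 1_800)]
threshold = tuned_threshold(deep_contigs, rate)
```

With the threshold lowered this way, every contig in `deep_contigs` scores above it and would be treated as unique in a subsequent run, which is the effect the passage describes.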


Methods
• Plasmid library
• Shotgun sequence
• Assembled
• No Major Binning
• Potential “nearly” complete genomes
• Annotation, population analysis, phylogenetic analysis


[Excerpt from Venter et al. 2004; the left edge of the column is cropped. Recoverable points:]
• The whole-genome shotgun (WGS) assembly was deposited at DDBJ/EMBL/GenBank under project accession AACY00000000.
• Unlike a conventional WGS deposit, unassembled paired singletons and singletons were deposited along with the assembled sequence, in order to represent the diversity in the sample.
• Analysis of large assemblies focused on the well-sampled genomes. There were 333 scaffolds meeting the depth criterion (table S3), accounting for 410,000 reads, or 25% of the [...] data set.
• Several criteria were used to sort the assembly pieces into tentative organism bins, including depth of coverage, oligo[...]

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.
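
One simple way to picture the conserved-gene-order test mentioned in the caption is shown below. This is a sketch under stated assumptions, not the analysis used in the paper: it assumes each gene on an environmental fragment has already been mapped to the index of its ortholog in the reference genome (or `None` if no ortholog was found), and it calls a gene "conserved" when a neighbor on the fragment is also a neighbor in the reference. The function name and example data are hypothetical.

```python
def conserved_order_flags(ref_positions):
    """Flag genes on a fragment whose neighbor is also adjacent in the reference.

    ref_positions: for each gene along the fragment, the index of its ortholog
    in the reference genome, or None if no ortholog was found.
    Returns a list of booleans, one per gene.
    """
    n = len(ref_positions)
    flags = [False] * n
    for i in range(n - 1):
        a, b = ref_positions[i], ref_positions[i + 1]
        # Adjacent on the fragment and adjacent (either orientation) in the reference.
        if a is not None and b is not None and abs(a - b) == 1:
            flags[i] = True
            flags[i + 1] = True
    return flags

# Hypothetical fragment: a run matching reference genes 10-12, a gene without
# an ortholog, a second run (57-58) from elsewhere, and an isolated match.
fragment = [10, 11, 12, None, 57, 58, 300]
flags = conserved_order_flags(fragment)
```

In the figure's terms, flagged genes would take the color of their reference position (so the 57-58 run would show up as a color jump suggesting rearrangement), and unflagged genes would be drawn in black.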

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.



Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction
are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric
circle. The genes have been given role category assignment and colored accordingly: amino acid
biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell
envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA
metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta;
protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange;
regulatory functions and signal transduction, olive; transcription, dark green; transport and binding
proteins, blue-green; genes with no known homology to other proteins and genes with homology
to genes with no known function, white; genes of unknown function, gray. Tick marks are placed
on 10-kb intervals.


... homogenous blend of discrepancies from the consensus without any apparent separation into haplotypes, such as the Prochlorococcus scaffold region (Fig. 5). Indeed, the Prochlorococcus scaffolds display considerable heterogeneity not only at the nucleotide sequence level (Fig. 5) but also at the genomic level, where multiple scaffolds align with the same region of the MED4 (11) genome but differ due to gene or genomic island insertion, deletion, rearrangement events. This observation is consistent with previous findings (12). For instance, scaffolds 2221918 and 2223700 share gene synteny with each other and MED4 but differ by the insertion of 15 genes of probable phage origin, likely representing an integrated bacteriophage. These genomic differences are displayed graphically in Fig. 2, where it is evident that up to f[...] conflicting scaffolds can align with the same region of the MED4 genome. More than 8[...] of the Prochlorococcus MED4 genome can be aligned with Sargasso Sea scaffolds greater than 10 kb; however, there appear to be a couple of regions of MED4 that are not represented in the 10-kb scaffolds (Fig. 2). The larger of these two regions (PMM1187 to PMM1277) consists primarily of a gene cluster coding for surface polysaccharide biosynthesis, which may represent a MED4-specific polysaccharide absent or highly diverged in our Sargasso Sea Prochlorococcus bacteria. The heterogeneity of the Prochlorococcus scaffolds suggests that the scaffolds are not derived from a single discrete strain, but instead probably represent a conglomerate assembled from a population of closely related Prochlorococcus biotypes.

The gene complement of the Sargasso. The heterogeneity of the Sargasso sequences complicates the identification of microbial genes. The typical approach for microbial annotation, model-based gene finding, relies entirely on training with a subset of manu[...]


frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized
Table 1. Gene count breakdown by TIGR role category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II samples is presented in the SOM (table S4). Note that there are 28,023 genes which were classified in more than one role category.

TIGR role category                                            Total genes
Amino acid biosynthesis                                            37,118
Biosynthesis of cofactors, prosthetic groups, and carriers         25,905
Cell envelope                                                      27,883
Cellular processes                                                 17,260
Central intermediary metabolism                                    13,639
DNA metabolism                                                     25,346
Energy metabolism                                                  69,718
Fatty acid and phospholipid metabolism                             18,558
Mobile and extrachromosomal element functions                       1,061
Protein fate                                                       28,768
Protein synthesis                                                  48,012
Purines, pyrimidines, nucleosides, and nucleotides                 19,912
Regulatory functions                                                8,392
Signal transduction                                                 4,817
Transcription                                                      12,756
Transport and binding proteins                                     49,185
Unknown function                                                   38,067
Miscellaneous                                                       1,864
Conserved hypothetical                                            794,061

Total number of roles assigned                                  1,242,230
Total number of genes                                           1,214,207
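
The two totals in Table 1 can be checked against the caption with one subtraction. The reading that the difference equals the number of multiply classified genes assumes each such gene was counted in exactly two role categories; that assumption is an inference made here, not something the caption states.

```python
total_roles = 1_242_230   # "Total number of roles assigned" in Table 1
total_genes = 1_214_207   # "Total number of genes" in Table 1

# Role assignments beyond one per gene; the caption reports 28,023 genes
# classified in more than one role category.
extra_role_assignments = total_roles - total_genes
```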

Fig. 5. Prochlorococcus-related scaffold 2223290 illustra[...] nity of closely related organisms, distinctly nonpunctat[...] global structure of Scaffold 2223290 with respect to asse[...] sequence alignment. Blue segments, contigs; green segm[...] stages of the assembly of fragments into the resulting [...] fragments were initially assembled in several different [...] form the final contig structure. The multiple sequenc[...] homogenous blend of haplotypes, none with sufficie[...] separate assembly.


[...]d curated genes. With the vast majority of the Sargasso sequence in short (less [...] unassociated scaffolds and singletons from hundreds of different organisms, it is [...] to apply this approach. Instead, we [...] an evidence-based gene finder (5). [...] evidence in the form of protein alignments [...] sequences in the bacterial portion of [...] nonredundant amino acid (nraa) data set [...] used to determine the most likely [...]. Likewise, approximate start and [...] were determined from the bound[...] of the alignments and refined to [...] specific start and stop codons. This [...] identified 1,214,207 genes covering [...]B of the total data set. This represents approximately an order of magnitude [...] sequences than currently archived in the [...] SwissProt database (14), which con[...]

rRNA phylotyping from metagenomics




Shotgun Sequencing Allows Alternative Anchors (e.g., RecA)




nomic group using the phylogenetic analysis
described for rRNA. For example, our data set

marker genes, is roughly comparable to the
97% cutoff traditionally used for rRNA. Thus


Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The
relative contribution of organisms from different major phylogenetic groups (phylotypes) was
measured using multiple phylogenetic markers that have been used previously in phylogenetic
studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The
relative proportion of different phylotypes for each sequence (weighted by the depth of coverage
of the contigs from which those sequences came) is shown. The phylotype distribution was
determined as follows: (i) Sequences in the Sargasso data set corresponding to each of these genes
were identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for each
phylogenetic marker identified in the Sargasso data separately compared with all members of that
gene family in all complete genome sequences (only complete genomes were used to control for
the differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of each
sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.
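
The depth-weighted tally described in the Fig. 6 caption can be sketched as follows. This is a minimal illustration, assuming each marker-gene sequence has already been assigned a phylotype from its nearest neighbor in a tree and carries the depth of coverage of its contig; the function name and the example hits are hypothetical, not data from the paper.

```python
from collections import defaultdict

def weighted_phylotype_proportions(hits):
    """Relative contribution of each phylotype for one marker gene.

    hits: iterable of (phylotype, depth) pairs, where depth is the depth of
    coverage of the contig the marker sequence came from.
    Returns a dict mapping phylotype -> fraction of total depth.
    """
    totals = defaultdict(float)
    for phylotype, depth in hits:
        totals[phylotype] += depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: w / grand_total for p, w in totals.items()}

# Hypothetical RecA hits: (assigned phylotype, contig depth of coverage)
reca_hits = [
    ("Alphaproteobacteria", 6.0),
    ("Alphaproteobacteria", 2.0),
    ("Cyanobacteria", 1.5),
    ("Gammaproteobacteria", 0.5),
]
proportions = weighted_phylotype_proportions(reca_hits)
```

Running the same tally separately for each marker (16S rRNA, RecA, EF-Tu, and so on) gives one bar per marker, which is how the figure lets the markers be compared side by side.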


method based on fitting the observed depth of
coverage to a theoretical model of assembly
progress for a sample corresponding to a mix-

that a minimum of 12-fold deeper sampling
would be required to obtain 95% of the unique
sequence. However, these are only lower

Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column uses
symbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. Sequence
ID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTx
e-value chosen to identify orthologs when querying the E. coli sequence against the complete Sargasso
Sea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at the
point along the query for which this value is maximal. Observed “species” shows the number of distinct
clusters of reads from the maximum fragment depth column, after grouping reads whose containing
assemblies had an overlap of at least 40 bp with >94% nucleotide identity (single-link clustering).
Singleton “species” shows the number of distinct clusters from the observed “species” column that
consist of a single read. Most abundant column shows the fraction of the maximum fragment depth that
consists of the single largest cluster.

Protein  Sequence ID   Ortholog cutoff  Max. fragment depth  Observed “species”  Singleton “species”  Most abundant (%)
AtpD     NTL01EC03653  1e-32            836                  456                 317                  6
GyrB     NTL01EC03620  1e-11            924                  569                 429                  4
Hsp70    NT01EC0015    1e-31            812                  515                 394                  4
RecA     NTL01EC02639  1e-21            592                  341                 244                  8
RpoB     NTL01EC03885  1e-41            669                  428                 331                  7
TufA     NTL01EC03262  1e-41            597                  397                 307                  3
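The single-link clustering behind the observed “species”, singleton “species”, and most abundant columns of Table 2 can be sketched with a union-find structure. This is a simplified illustration under stated assumptions: reads are compared directly rather than through their containing assemblies, the pairwise overlap records are hypothetical toy values, and all function names are invented for this example. Only the two thresholds (at least 40 bp, more than 94% identity) come from the caption.

```python
def passes_cutoff(overlap_bp, identity, min_overlap=40, min_identity=0.94):
    """Link rule from the Table 2 caption: >= 40 bp overlap, > 94% identity."""
    return overlap_bp >= min_overlap and identity > min_identity


def single_link_clusters(n_items, links):
    """Group items 0..n_items-1 so that any chain of links joins a cluster."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def summarize(clusters):
    """Compute the last three columns of Table 2 for one marker."""
    depth = sum(len(c) for c in clusters)
    return {
        "observed": len(clusters),
        "singletons": sum(1 for c in clusters if len(c) == 1),
        "most_abundant_pct": 100.0 * max(len(c) for c in clusters) / depth,
    }


# Hypothetical pairwise comparisons: (read_a, read_b, overlap_bp, identity).
overlaps = [(0, 1, 120, 0.99), (1, 2, 60, 0.95), (3, 4, 35, 0.99), (4, 5, 80, 0.94)]
links = [(a, b) for a, b, bp, ident in overlaps if passes_cutoff(bp, ident)]
clusters = single_link_clusters(6, links)
stats = summarize(clusters)
```

Single-link clustering merges reads 0 and 2 even though they were never compared directly, because both link to read 1. This chaining is why the method tends to give a conservative (low) count of “species”.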


Table 3. Diversity models based on depth of coverage. Each row corresponds
to an abundance class of organisms. The first column in each model,
“fr(asm)”, gives the fraction of the assembly consensus modeled [...]
[...] column) in the sample. The thi[...] a genome expected to be s[...]
gives the resulting estimat[...]

Figure S6. Accumulation curve for rpoB. Observed (black) OTU counts for rpoB (based
on the fragment grouping summarized in Table 2), as well as the Chao1-corrected
estimate of total species (red; see (3)). Points are mean values of 1000 shufflings of the
observed data, while bars show 90% confidence intervals.
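The two quantities plotted in Figure S6 can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), which may differ from the variant cited as (3), and it takes the 90% interval from the 5th and 95th percentiles of the shuffled curves. The function names and the toy OTU labels are hypothetical.

```python
import random
from collections import Counter


def chao1(labels):
    """Bias-corrected Chao1 estimate of total OTUs from per-read OTU labels."""
    counts = Counter(labels)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # OTUs seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # OTUs seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def accumulation_curve(labels, n_shuffles=1000, seed=0):
    """Mean observed OTU count, with a 90% interval, at each sample size."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_shuffles):
        order = list(labels)
        rng.shuffle(order)
        seen, curve = set(), []
        for label in order:
            seen.add(label)
            curve.append(len(seen))
        runs.append(curve)

    summary = []
    for k in range(len(labels)):
        values = sorted(run[k] for run in runs)
        mean = sum(values) / n_shuffles
        low = values[int(0.05 * (n_shuffles - 1))]
        high = values[int(0.95 * (n_shuffles - 1))]
        summary.append((k + 1, mean, low, high))
    return summary


# Hypothetical toy sample: one OTU label per rpoB read.
toy_reads = ["a", "a", "a", "b", "b", "c", "d"]
estimate = chao1(toy_reads)
curve = accumulation_curve(toy_reads, n_shuffles=200)
```

Applying `chao1` to each prefix of a shuffled ordering, instead of only to the full sample, would give a corrected curve alongside the observed one.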

MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea
Venter et al., revised

Figure S7. Each point in the figure corresponds to a scaffold from the assembly
(restricted to scaffolds > 10kb). Scaffolds were placed in separate panels of the figure
according to the most closely related organism as indicated by the BLAST searches
described in the text. Within a panel, a scaffold is shown with x coordinate equal to its
length, y coordinate equal to its estimated depth of coverage, and color determined by
which of 6 k-mer composition clusters it was assigned to. Depth of coverage was
estimated as the total base pairs in reads belonging to a given assembly piece divided by
the length of the consensus sequence for the piece. K-mer composition clusters were
determined by representing each scaffold as a vector of the frequencies of all possible 4-mers, considering both the forward and reverse strands of the sequence, and then
applying the K-means clustering algorithm.
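The two per-scaffold measurements in the Figure S7 caption (depth of coverage and a 4-mer composition vector) can be sketched as follows. This is a minimal illustration with hypothetical function names and toy inputs. The clustering step itself is not shown: the resulting 256-element vectors would be passed to any standard K-means implementation with 6 clusters.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 4-mers


def reverse_complement(seq):
    """Return the reverse strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def depth_of_coverage(read_lengths, consensus_length):
    """Total base pairs in the reads divided by the consensus length."""
    return sum(read_lengths) / consensus_length


def tetramer_frequencies(seq):
    """Frequency of each 4-mer, counted on both strands of the sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:  # skip windows with ambiguous bases such as N
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


# Hypothetical toy scaffold: three reads assembled into a 1,000 bp consensus.
coverage = depth_of_coverage([800, 750, 900], 1000)
vector = tetramer_frequencies("ACGTACGTAA")
```

Counting both strands makes the vector independent of the orientation in which a scaffold happened to be assembled.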


Functional Diversity of Proteorhodopsins?




Figure S10. Scaffold 2217664, containing the gene encoding Proteorhodopsin. Genes are
colored using color assignments described in Fig. 2, and contig boundaries are indicated
with red vertical lines. In this scaffold, rhodopsin is associated with a DNA-directed
RNA polymerase, sigma subunit (rpoD) originating in the CFB group.

Binning challenge

A B C D E F G
T U V W X Y Z

Binning challenge

A B C D E F G
T U V W X Y Z

Best binning method: reference genomes

Glassy Winged Sharpshooter
• Feeds on xylem sap

• Vector for Pierce’s Disease

• Potential bioterror agent

• Collaboration with Nancy Moran to sequence symbiont genomes

• Funded by NSF

• Published in PLOS Biology 2006


Wu et al. 2006 PLoS Biology 4: e188.

Sharpshooter Shotgun Sequencing

shotgun

Collaboration with Nancy Moran’s lab

Binning challenge

A B C D E F G
T U V W X Y Z

No reference genome? What do you do?

Phylogeny ....

CFB Phyla


Sulcia makes vitamins and cofactors

Baumannia makes amino acids



Sorcerer II GOS Expedition

Figure 1. Sampling Sites
Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the
eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the
Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part
of this study.
doi:10.1371/journal.pbio.0050077.g001

environments as well as a few nonmarine aquatic samples for contrast (Table 1).

the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis.

Stalking the Fourth Domain in Metagenomic Data:
Searching for, Discovering, and Interpreting Novel, Deep
Branches in Marker Gene Phylogenetic Trees
Dongying Wu1, Martin Wu1,4, Aaron Halpern2,3, Douglas B. Rusch2,3, Shibu Yooseph2,3, Marvin Frazier2,3,
J. Craig Venter2,3, Jonathan A. Eisen1*
1 Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, University of California Davis Genome Center, University of California
Davis, Davis, California, United States of America, 2 The J. Craig Venter Institute, Rockville, Maryland, United States of America, 3 The J. Craig Venter Institute, La Jolla,
California, United States of America, 4 University of Virginia, Charlottesville, Virginia, United States of America

Abstract
Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data
associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and
culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated
directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we
argue here, in studies of very early events in the evolution of gene families and of species.
Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used
them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly
used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies.
Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in
making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel
branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these
novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.
Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from
uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third
possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which
sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree
of life, we suggest that methods such as those described herein currently offer the best way to search for them.
Citation: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and
Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Editor: Robert Fleischer, Smithsonian Institution National Zoological Park, United States of America
Received October 25, 2010; Accepted February 20, 2011; Published March 18, 2011
This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public
domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.


Funding: The development and main work on this project was supported by the National Science Foundation via an ‘‘Assembling the Tree of Life’’ grant
(number 0228651) to Jonathan A. Eisen and Naomi Ward. The final work on this project was funded by the Gordon and Betty Moore Foundation (through

Stalking the Fourth Domain

Figure 1. Phylogenetic tree of the RecA superfamily. All RecA sequences were grouped into clusters using the Lek algorithm. Representatives
of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment
using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA
superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial
analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in
the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily.
doi:10.1371/journal.pone.0018011.g001

PLoS ONE | www.plosone.org | March 2011 | Volume 6 | Issue 3 | e18011


Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of
these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits
against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
doi:10.1371/journal.pone.0018011.t002

Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily
Unknown 2). This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to
known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2
subfamily (cluster ID 9).


Figure 3. Phylogenetic tree of the RpoB superfamily. All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives
of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment
using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB
superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored
panels.

Methods

Identification of deeply-branching ss-rRNA sequences

these 340 sequences were extracted from the European Ribosomal RNA database [66] and then manually curated to remove alignment columns with more than 90% gaps or with poor
