This document summarizes a lecture on shotgun metagenomics from a course on microbial phylogenomics. The lecture discusses how shotgun sequencing was applied to sequence microbial communities directly from environmental samples, without culturing. This allowed reconstruction of near-complete genomes from dominant species in an acid mine drainage biofilm sample. The sample was dominated by a few microbial populations, and shotgun sequencing generated enough data to assemble genomes representing Leptospirillum group II and Ferroplasma type II. Analysis of the assembled genomes provided insights into the metabolic pathways and survival strategies of these uncultivated organisms inhabiting an extreme environment.
1. Lecture 15:
EVE 161:
Microbial Phylogenomics
Lecture #15:
Era IV: Shotgun Metagenomics
UC Davis, Winter 2014
Instructor: Jonathan Eisen
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
2. Where we are going and where we have been
• Previous lecture:
  - 14: Era IV: Metagenomics
• Current lecture:
  - 15: Era IV: Shotgun Metagenomics
• Next lecture:
  - 16: Era IV: Function in Metagenomics
3. Era IV: Genomes in the environment
Era IV:
Shotgun Metagenomics
4. Environmental Shotgun Sequencing
• ESS first applied to endosymbiont genomes
• Buchnera genome sequenced with ESS
• Endosymbionts relatively clonal within one host and even within one species sometimes
• Many others too
10. Wolbachia pipientis wMel
Wu et al., 2004. Collaboration between Jonathan Eisen and Scott O’Neill (Yale, U. Queensland).
11. articles
Community structure and metabolism through reconstruction of microbial genomes from the environment
Gene W. Tyson1, Jarrod Chapman3,4, Philip Hugenholtz1, Eric E. Allen1, Rachna J. Ram1, Paul M. Richardson4, Victor V. Solovyev4, Edward M. Rubin4, Daniel S. Rokhsar3,4 & Jillian F. Banfield1,2
1Department of Environmental Science, Policy and Management, 2Department of Earth and Planetary Sciences, and 3Department of Physics, University of California, Berkeley, California 94720, USA
4Joint Genome Institute, Walnut Creek, California 94598, USA
Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.
The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis1–3. However, isolates have been the main source of sequence data, and only a small fraction of microorganisms have been cultivated4–6. Consequently, focus has shifted towards the analysis of uncultivated microorganisms via cloning of conserved genes and genome fragments directly from the environment7–9. To date, only a small fraction of genes have been recovered from individual environments, limiting the analysis of [...]
[...] fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other members of the Thermoplasmatales). The genome of one of these archaea, Ferroplasma acidarmanus fer1, isolated from the Richmond mine, has been sequenced previously (http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html).
A pink biofilm (Fig. 1a) typical of AMD communities was [...]
(Overlaid on the slide: the title page of Venter et al., "Environmental Genome Shotgun Sequencing of the Sargasso Sea", reproduced on slide 19.)
13. articles
Community structure and metabolism
through reconstruction of microbial
genomes from the environment
Gene W. Tyson1, Jarrod Chapman3,4, Philip Hugenholtz1, Eric E. Allen1, Rachna J. Ram1, Paul M. Richardson4, Victor V. Solovyev4,
Edward M. Rubin4, Daniel S. Rokhsar3,4 & Jillian F. Banfield1,2
1Department of Environmental Science, Policy and Management, 2Department of Earth and Planetary Sciences, and 3Department of Physics, University of California, Berkeley, California 94720, USA
4Joint Genome Institute, Walnut Creek, California 94598, USA
Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their
roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report
reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other
genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of
genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different
individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level.
The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance
variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous
recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the
pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme
environment.
The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis1–3. However, isolates have been the main source of sequence data, and only a small fraction of [...]
[...] fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other [...]
18. [...] genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system [...]
[...] fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type [...]
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell [...] drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.
19. RESEARCH ARTICLE
Environmental Genome Shotgun
Sequencing of the Sargasso Sea
J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3
Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3
Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3
Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6
Michael W. Lomas,6 Ken Nealson,5 Owen White,3
Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6
Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4
Hamilton O. Smith1
We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new [...]
http://www.sciencemag.org/content/304/5667/66
[Right-hand column of the article is cropped on the slide. Legible fragments: "Surface wate[...] were collected a[...] from three sites [...] February 2003 [...] station S” in Ma[...] genomic libraries [...] 2 to 6 kb were [...] prepared plasmid [...] both ends to prov[...] (Applied Bi[...] Whole-genome ra[...] the Weatherbird II [...] 4) produced 1.66 [...] in length, for a tota[...] microbial DNA se[...]"]
20. Sargasso Sea
[...] two groups of scaffolds representing two distinct strains closely related to the published [...]
[...] at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in [...]
Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.
http://www.sciencemag.org/content/304/5667/66
Fig. 2. Gene conservation among closely [...]
21. • Sampling Protocols. Sampling on the RV Weatherbird II was done as follows: Seawater (170
liters) from stations 11 and 13 was directly filtered through a 0.8µm Supor membrane disc
filter (Pall Life Sciences) followed in series by a 0.22µm Supor membrane disc filter (Pall Life
Sciences). The sample from station 3 was pumped into a 250 L carboy prior to being filtered
through the impact filters. The length of time from collection of the sample until the end of the
filtration step was approximately one hour. Filters were placed in 5ml of sucrose lysis buffer
(20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl, pH 9.0) and stored in liquid
nitrogen on the Weatherbird then placed at -80 °C until DNA extractions were done.
Alternatively seawater (340 liters) was collected from 5 meters below the surface into a
carboy then filtered through a 0.8µm Supor membrane disc filter (Pall Life Sciences), followed
by concentration to 1 liter using a Pellicon tangential flow filtration system (Millipore) with a
0.1µm Durapore VVPP cartridge (Millipore); again the total time for the filtration and
concentration was approximately one hour. Cells were pelleted at 10,000 rpm, 4 °C for 30
minutes. The impact filters and the retentate from the TFF were then handled as described
above. The carboys, tubing and filter systems were cleaned with a 10% hydrochloric acid
wash prior to each leg of the sampling. Any of the sampling equipment (tubing, etc.) that
could reasonably be soaked was soaked in an acid bath for at least 24 hours. Sampling
carboys were filled with the acid wash and “soaked” for at least 24 hours as well. All acid
washed items were subsequently rinsed very liberally with Milli-Q water. A liberal Milli-Q water
rinse was also conducted between samples on the same leg. All spigots from the carboys
were covered with a ziploc bag until needed. Tubing was stored in clean ziploc bags until
needed.
22. Sample preparation. The impact filters were cut into quarters and placed in
individual 50 ml conical tubes. TE buffer (5 ml, pH 8) containing 150 µg/ml lysozyme
was added to each tube. The tubes were incubated at 37 °C for 2 hours. SDS was
added to 0.1% and the samples were then put through three freeze/thaw cycles.
The lysate was then treated with Proteinase K (100 µg/ml) for one hour at 55 °C
followed by three aqueous phenol extractions and one extraction with phenol/
chloroform. The supernatant was then precipitated with two volumes of 100%
ethanol and the DNA pellet washed with 70% ethanol.
23. DNA preparation. DNA was randomly sheared, end-polished with consecutive BAL31
nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on
1% low-melting-point agarose. After ligation to Bst XI adapters (Invitrogen, catalog no.
N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess
adapters, and the fragments, now with 3'-CACA overhangs, were inserted into Bst XI-linearized plasmid vector with 3'-TGTG overhangs. Fragments were cloned in a medium-copy pBR322 derivative.
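The adapter design described above can be sanity-checked in a few lines of Python. This is only a minimal sketch: it assumes both overhang sequences are written 5'→3', and the helper names `reverse_complement` and `can_anneal` are hypothetical, not part of any protocol or library. It shows that the insert overhang (CACA) pairs with the vector overhang (TGTG) but with neither itself nor another insert, which is why such non-self-complementary overhangs discourage insert-to-insert and vector-to-vector ligation.

```python
# Hypothetical helpers; overhang sequences are assumed to be written 5'->3'.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def can_anneal(overhang_a: str, overhang_b: str) -> bool:
    """Two single-stranded overhangs can pair if one is the reverse complement of the other."""
    return reverse_complement(overhang_a) == overhang_b.upper()


insert_overhang = "CACA"  # on the adapter-ligated environmental fragments
vector_overhang = "TGTG"  # on the linearized plasmid vector

print("insert + vector:", can_anneal(insert_overhang, vector_overhang))
print("insert + insert:", can_anneal(insert_overhang, insert_overhang))
print("vector + vector:", can_anneal(vector_overhang, vector_overhang))
```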
24. Sequence assembly. With default parameter settings, the highly covered genome sequences would
have been treated as repetitive DNA by the Celera Assembler. Since the Celera Assembler
constructs scaffolds only from a backbone of sequence heuristically classified as unique, these
organisms would not have been eligible for scaffolding and would have been absent from the final
assembly. However, by tuning the threshold parameter for classifying unique sequence, we were
able to compensate for the apparent repetitiveness of these genomic regions, and scaffold them
appropriately. This was accomplished by identifying the most deeply assembling, obviously nonrepetitive contigs in an initial run of the assembler (in this case, the strong assemblies at 21-36x
coverage which were identified as gene-rich Burkholderia-like and plasmid scaffolds), and using a
value slightly below the calculated “A-statistic” (an empirical uniqueness measure within the
Assembler) of these contigs as the threshold parameter in a subsequent run. This allows the deep
contigs to be treated as unique sequence, when they would otherwise be labeled as repetitive. At
the other end of the spectrum, rare organisms in the sample have been sampled by sequencing
only to a shallow depth of coverage. Routine assembly would not have considered the small
fragment overlap based assemblies with shallow coverage as an eligible basis for scaffolding, due
to a minimum length requirement of 1000bp, which is typically in place for efficiency. Therefore, in
the present use case, the organisms represented by these sequences would not have been ordered
and oriented with mate-pairs without adjusting the default minimum length to compensate for the
low anticipated coverage depth and assembly length. With this selection of parameters, more
suitable to the environmental project at hand, we were able to adequately assemble both the
dominant and rare species simultaneously.
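The threshold tuning described above can be illustrated with a small Python sketch. It is not the Celera Assembler's code. The formula used here, A = length × (global read arrival rate) − (number of reads) × ln 2, is an assumed form of the A-statistic, and every number below (arrival rate, contig sizes, the "illustrative default" threshold) is hypothetical. The point is only to show why deeply covered contigs score as repetitive under a fixed threshold and how picking a threshold slightly below their score reclassifies them as unique.

```python
import math


def a_statistic(contig_length_bp: int, n_reads: int, reads_per_bp: float) -> float:
    """Assumed form of the A-statistic: a log-odds score for a contig being unique.

    Contigs with many more reads than expected for their length get low
    (negative) scores and look like collapsed repeats.
    """
    return contig_length_bp * reads_per_bp - n_reads * math.log(2)


def tuned_threshold(trusted_contigs, reads_per_bp: float, margin: float = 1.0) -> float:
    """Pick a threshold slightly below the lowest score among trusted deep contigs."""
    lowest = min(a_statistic(length, reads, reads_per_bp)
                 for length, reads in trusted_contigs)
    return lowest - margin


def is_unique(contig_length_bp, n_reads, reads_per_bp, threshold) -> bool:
    return a_statistic(contig_length_bp, n_reads, reads_per_bp) >= threshold


# Hypothetical numbers for illustration only.
READS_PER_BP = 0.004          # sample-wide average read arrival rate
ILLUSTRATIVE_DEFAULT = 10.0   # stand-in for a fixed default threshold
deep_contigs = [(50_000, 1_875), (20_000, 900)]  # (length in bp, reads), deeply covered

threshold = tuned_threshold(deep_contigs, READS_PER_BP)
for length, reads in deep_contigs:
    print(length, reads,
          "default:", is_unique(length, reads, READS_PER_BP, ILLUSTRATIVE_DEFAULT),
          "tuned:", is_unique(length, reads, READS_PER_BP, threshold))
```

With these made-up values the deep contigs are rejected under the fixed threshold and accepted under the tuned one, mirroring the behaviour the slide describes.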
25. Methods
• Plasmid library
• Shotgun sequence
• Assembled
• No Major Binning
• Potential “nearly” complete genomes
• Annotation, population analysis, phylogenetic analysis
26. [...]e relatively limited depth of se[...]erage given the level of diversity [...] sample.
[...] genome shotgun (WGS) assembly [...] deposited at DDBJ/EMBL/GenBank [...] project accession AACY00000000, [...] have been deposited in a corre[...]eDB trace archive. The version [...] this paper is the first version, [...]00. Unlike a conventional WGS [...] deposited not just contigs and [...]e unassembled paired singletons [...] singletons in order to accurate[...] diversity in the sample and [...] across the entire sample with[...]abase.
[...] and large assemblies. Our [...] focused on the well-sampled ge[...] characterizing scaffolds with at least [...] depth. There were 333 scaffolds [...]26 contigs and spanning 30.9 [...] this criterion (table S3), account[...]y 410,000 reads, or 25% of the [...]ly data set. From this set of well[...]al, we were able to cluster and [...] assemblies by organism; from the rare [...] sample, we used sequence similar[...] methods together with computational [...] obtain both qualitative and quan[...]s of genomic and functional diver[...] particular marine environment. [...]yed several criteria to sort the [...]y pieces into tentative organism [...] include depth of coverage, oligo[...]
Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.
http://www.sciencemag.org/content/304/5667/66
www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004
27. RESEARCH ARTICLE
Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.
http://www.sciencemag.org/content/304/5667/66
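The match criteria in the Fig. 3 caption (at least 25% similarity, e value better than 1e-5) amount to a simple filter over protein-to-protein hits. The sketch below is an illustration under stated assumptions: the `Hit` record, its field names, and the example hits are all hypothetical, and it does not parse any real BLAST output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one protein-to-protein match."""
    query: str                 # protein predicted on clone 4B7 (made-up IDs below)
    subject: str               # protein predicted on a Sargasso Sea scaffold
    percent_similarity: float
    evalue: float


def passes_caption_cutoffs(hit: Hit,
                           min_similarity: float = 25.0,
                           max_evalue: float = 1e-5) -> bool:
    """Keep hits with >= 25% similarity and an e value better (smaller) than 1e-5."""
    return hit.percent_similarity >= min_similarity and hit.evalue < max_evalue


# Made-up hits for illustration only.
hits = [
    Hit("4B7_orf01", "scaffold_A_orf07", 62.0, 1e-40),
    Hit("4B7_orf02", "scaffold_A_orf08", 24.0, 1e-12),   # fails the similarity cutoff
    Hit("4B7_orf03", "scaffold_B_orf02", 48.0, 1e-3),    # fails the e-value cutoff
]

kept = [h for h in hits if passes_caption_cutoffs(h)]
for h in kept:
    print(h.query, "->", h.subject)
```

Each retained hit would correspond to one colored box in a dot-plot-style figure like the one described.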
28. Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction
are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric
circle. The genes have been given role category assignment and colored accordingly: amino acid
biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell
envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA
metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta;
protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange;
regulatory functions and signal transduction, olive; transcription, dark green; transport and binding
proteins, blue-green; genes with no known homology to other proteins and genes with homology
to genes with no known function, white; genes of unknown function, gray. Tick marks are placed
at 10-kb intervals.
[...] homogenous blend of discrepancies from [...] consensus without any apparent separation [...] haplotypes, such as the Prochlorococcus scaffold region (Fig. 5). Indeed, the Prochlorococcus scaffolds display considerable heterogeneity not only at the nucleotide sequence level (Fig. 5) but also at the genomic level, where multiple scaffolds align with the same region of the MED4 (11) genome but differ due to g[...] or genomic island insertion, deletion, rearrangement events. This observation is consistent with previous findings (12). For instance, scaffolds 2221918 and 2223700 share gene synteny with each other and MED4 but differ by the insertion of 15 genes of probable phage origin, likely representing an integrated bacteriophage. These genomic differences are displayed graphically in Fig. 2, where it is evident that up to f[...] conflicting scaffolds can align with the same region of the MED4 genome. More than 8[...] of the Prochlorococcus MED4 genome can be aligned with Sargasso Sea scaffolds greater than 10 kb; however, there appear to be a couple of regions of MED4 that are not represented in the 10-kb scaffolds (Fig. 2). The larger of these two regions (PMM1187 [...] PMM1277) consists primarily of a gene cluster coding for surface polysaccharide biosynthesis, which may represent a MED4-specific polysaccharide absent or highly diverged in our Sargasso Sea Prochlorococcus bacteria. The heterogeneity of the Prochlorococcus scaffolds suggests that the scaffolds are not derived from a single discrete strain, but instead probably represent a conglomerate assembled from a population of closely related Prochlorococcus biotypes.
The gene complement of the Sargas[...] The heterogeneity of the Sargasso sequences complicates the identification of microbial genes. The typical approach for microbial annotation, model-based gene finding, relies entirely on training with a subset of manu[...]
http://www.sciencemag.org/content/304/5667/66
29. frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized
Table 1. Gene count breakdown by TIGR role
category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads
from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II
samples is presented in the SOM (table S4). Note
that there are 28,023 genes which were classified
in more than one role category.
TIGR role category                                            Total genes
Amino acid biosynthesis                                            37,118
Biosynthesis of cofactors, prosthetic groups, and carriers         25,905
Cell envelope                                                      27,883
Cellular processes                                                 17,260
Central intermediary metabolism                                    13,639
DNA metabolism                                                     25,346
Energy metabolism                                                  69,718
Fatty acid and phospholipid metabolism                             18,558
Mobile and extrachromosomal element functions                       1,061
Protein fate                                                       28,768
Protein synthesis                                                  48,012
Purines, pyrimidines, nucleosides, and nucleotides                 19,912
Regulatory functions                                                8,392
Signal transduction                                                 4,817
Transcription                                                      12,756
Transport and binding proteins                                     49,185
Unknown function                                                   38,067
Miscellaneous                                                       1,864
Conserved hypothetical                                            794,061
Total number of roles assigned                                  1,242,230
Total number of genes                                           1,214,207
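The two totals in Table 1 can be related to the caption's note about multiply classified genes with a line of arithmetic. The sketch below uses only numbers from the table; the interpretation that the surplus of role assignments over genes corresponds to one extra role per multiply classified gene is an assumption made here, not something the table states.

```python
# Values copied from Table 1.
total_roles_assigned = 1_242_230
total_genes = 1_214_207
conserved_hypothetical = 794_061

# Role assignments in excess of one per gene.
extra_role_assignments = total_roles_assigned - total_genes
print("extra role assignments:", extra_role_assignments)  # 28023

# Share of all role assignments that fall in "Conserved hypothetical".
share_conserved_hypothetical = conserved_hypothetical / total_roles_assigned
print(f"conserved hypothetical share: {share_conserved_hypothetical:.1%}")
```

The surplus equals the 28,023 genes the caption reports as classified in more than one role category, and well over half of all assignments are conserved hypotheticals.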
Fig. 5. Prochlorococcus-related scaffold 2223290 illustra[...]nity of closely related organisms, distinctly nonpunctat[...] global structure of Scaffold 2223290 with respect to asse[...] sequence alignment. Blue segments, contigs; green segm[...] stages of the assembly of fragments into the resulting [...] fragments were initially assembled in several different [...] form the final contig structure. The multiple sequenc[...] homogenous blend of haplotypes, none with sufficie[...] separate assembly.
http://www.sciencemag.org/content/304/5667/66
30. [...]d curated genes. With the vast ma[...] Sargasso sequence in short (less [...] unassociated scaffolds and single[...] hundreds of different organisms, it is [...] to apply this approach. Instead, we [...]n evidence-based gene finder (5). [...] evidence in the form of protein align[...] sequences in the bacterial portion of [...] nonredundant amino acid (nraa) data set [...] used to determine the most likely [...]e. Likewise, approximate start and [...]s were determined from the bound[...]tes of the alignments and refined to [...] specific start and stop codons. This [...] identified 1,214,207 genes covering [...]B of the total data set. This repre[...] approximately an order of magnitude [...] sequences than currently archived in the [...] SwissProt database (14), which con[...]
http://www.sciencemag.org/content/304/5667/66
32. rRNA phylotyping from metagenomics
http://www.sciencemag.org/content/304/5667/66
33. Shotgun Sequencing Allows Alternative Anchors (e.g., RecA)
http://www.sciencemag.org/content/304/5667/66
34. nomic group using the phylogenetic analysis
described for rRNA. For example, our data set
marker genes, is roughly comparable to the
97% cutoff traditionally used for rRNA. Thus
http://www.sciencemag.org/content/304/5667/66
Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The
relative contribution of organisms from different major phylogenetic groups (phylotypes) was
measured using multiple phylogenetic markers that have been used previously in phylogenetic
studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The
relative proportion of different phylotypes for each sequence (weighted by the depth of coverage
of the contigs from which those sequences came) is shown. The phylotype distribution was
determined as follows: (i) Sequences in the Sargasso data set corresponding to each of these genes
were identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for each
phylogenetic marker identified in the Sargasso data separately compared with all members of that
gene family in all complete genome sequences (only complete genomes were used to control for
the differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of each
sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.
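The coverage weighting mentioned in the Fig. 6 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each marker-gene sequence has already been given a phylotype (step iii) and that the depth of coverage of its source contig is known. The function name and the example values are hypothetical.

```python
from collections import defaultdict


def weighted_phylotype_fractions(assignments):
    """Relative contribution of each phylotype, weighting every sequence by contig depth.

    `assignments` is an iterable of (phylotype, depth_of_coverage) pairs,
    one pair per marker-gene sequence.
    """
    totals = defaultdict(float)
    for phylotype, depth in assignments:
        totals[phylotype] += depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no coverage-weighted sequences to summarize")
    return {phylotype: weight / grand_total for phylotype, weight in totals.items()}


# Made-up RecA assignments: (phylotype of nearest neighbor, depth of source contig).
reca_assignments = [
    ("Alphaproteobacteria", 4.0),
    ("Cyanobacteria", 2.0),
    ("Alphaproteobacteria", 2.0),
]

fractions = weighted_phylotype_fractions(reca_assignments)
for phylotype, fraction in sorted(fractions.items()):
    print(f"{phylotype}: {fraction:.2f}")
```

Repeating this per marker (16S rRNA, RecA, EF-Tu, and so on) would give one phylotype profile per gene, which is what makes the markers comparable side by side.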
35. method based on fitting the observed depth of
coverage to a theoretical model of assembly
progress for a sample corresponding to a mix-
that a minimum of 12-fold deeper sampling
would be required to obtain 95% of the unique
sequence. However, these are only lower
Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column uses
symbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. Sequence
ID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTx
e-value chosen to identify orthologs when querying the E. coli sequence against the complete Sargasso
Sea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at the
point along the query for which this value is maximal. Observed “species” shows the number of distinct
clusters of reads from the maximum fragment depth column, after grouping reads whose containing
assemblies had an overlap of at least 40 bp with > 94% nucleotide identity (single-link clustering).
Singleton “species” shows the number of distinct clusters from the observed “species” column that
consist of a single read. Most abundant column shows the fraction of the maximum fragment depth that
consists of single largest cluster.
Protein  Sequence ID    Ortholog cutoff  Max. fragment depth  Observed “species”  Singleton “species”  Most abundant (%)
AtpD     NTL01EC03653   1e-32            836                  456                 317                  6
GyrB     NTL01EC03620   1e-11            924                  569                 429                  4
Hsp70    NT01EC0015     1e-31            812                  515                 394                  4
RecA     NTL01EC02639   1e-21            592                  341                 244                  8
RpoB     NTL01EC03885   1e-41            669                  428                 331                  7
TufA     NTL01EC03262   1e-41            597                  397                 307                  3
Table 3. Diversity models based on depth of coverage. Each row corresponds to an abundance class of organisms. The first column in each model “fr(asm)” gives the fraction of the assembly consensus modeled [...] column) in the sample. The thi[...] a genome expected to be s[...] gives the resulting estimat[...]
http://www.sciencemag.org/content/304/5667/66
36. Figure S6. Accumulation curve for rpoB. Observed (black) OTU counts for rpoB (based
on the fragment grouping summarized in Table 2), as well as the Chao1-corrected
estimate of total species (red; see (3)). Points are mean values of 1000 shufflings of the
observed data, while bars show 90% confidence intervals.
http://www.sciencemag.org/content/304/5667/66
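The kind of curve shown in Figure S6 can be reproduced in outline with the Python sketch below. It is an illustration under stated assumptions, not the authors' analysis: the reads are represented only by hypothetical OTU labels, the Chao1 estimate uses the bias-corrected form S_obs + F1(F1 − 1) / (2(F2 + 1)) (an assumption, chosen here because it is defined even without doubletons), and confidence intervals are omitted.

```python
import random
from collections import Counter


def chao1(labels):
    """Bias-corrected Chao1 richness estimate from a list of OTU labels."""
    counts = Counter(labels)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # OTUs seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # OTUs seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def accumulation_curve(labels, n_shuffles=1000, seed=0):
    """Mean number of distinct OTUs observed after each read, over random orderings."""
    rng = random.Random(seed)
    labels = list(labels)
    sums = [0] * len(labels)
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        seen = set()
        for i, label in enumerate(labels):
            seen.add(label)
            sums[i] += len(seen)
    return [s / n_shuffles for s in sums]


# Made-up OTU labels for eight rpoB reads.
reads = ["a", "a", "a", "b", "b", "c", "d", "e"]

curve = accumulation_curve(reads)
print("observed OTUs:", len(set(reads)))
print("Chao1 estimate:", chao1(reads))
print("mean accumulation curve:", [round(x, 2) for x in curve])
```

A curve that is still climbing at its right-hand end, together with a Chao1 estimate well above the observed count, is the signature of an undersampled community.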
37. MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea
Venter et al., revised
Figure S7. Each point in the figure corresponds to a scaffold from the assembly
(restricted to scaffolds > 10kb). Scaffolds were placed in separate panels of the figure
according to the most closely related organism as indicated by the BLAST searches
described in the text. Within a panel, a scaffold is shown with x coordinate equal to its
length, y coordinate equal to its estimated depth of coverage, and color determined by
which of 6 k-mer composition clusters it was assigned to. Depth of coverage was
estimated as the total base pairs in reads belonging to a given assembly piece divided by
the length of the consensus sequence for the piece. K-mer composition clusters were
determined by representing each scaffold as a vector of the frequencies of all possible 4-mers, considering both the forward and reverse strands of the sequence, and then
applying the K-means clustering algorithm.
http://www.sciencemag.org/content/304/5667/66
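The two per-scaffold quantities defined in the Figure S7 legend, depth of coverage and a 4-mer frequency vector over both strands, can be computed as in the sketch below. Function names and example inputs are hypothetical, and the final K-means step is deliberately left out; the resulting vectors would simply be passed to any K-means implementation.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def depth_of_coverage(read_lengths, consensus_length):
    """Total base pairs in the reads of an assembly piece divided by its consensus length."""
    return sum(read_lengths) / consensus_length


def kmer_frequencies(seq, k=4):
    """Frequencies of all possible k-mers, counted on the forward and reverse strands.

    Windows containing characters other than A, C, G, T are skipped.
    The vector is ordered alphabetically by k-mer (AAAA, AAAC, ...).
    """
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for strand in (forward, reverse):
        for i in range(len(strand) - k + 1):
            window = strand[i:i + k]
            if window in counts:
                counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Made-up scaffold and reads for illustration only.
scaffold = "ACGTACGT"
vector = kmer_frequencies(scaffold)
print("vector length:", len(vector))
print("depth:", depth_of_coverage([800, 800, 800], 1200))
```

Counting both strands makes the vector independent of which strand the assembler happened to report, so scaffolds from the same genome in opposite orientations still land close together.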
38. Functional Diversity of Proteorhodopsins?
http://www.sciencemag.org/content/304/5667/66
39. MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea
Venter et al., revised
Figure S10. Scaffold 2217664, containing the gene encoding Proteorhodopsin. Genes are
colored using color assignments described in Fig. 2, and contig boundaries are indicated
with red vertical lines. In this scaffold, rhodopsin is associated with a DNA-directed
RNA polymerase, sigma subunit (rpoD) originating in the CFB group.
http://www.sciencemag.org/content/304/5667/66
42. Glassy Winged Sharpshooter
• Feeds on xylem sap
• Vector for Pierce’s Disease
• Potential bioterror agent
• Collaboration with Nancy Moran to sequence symbiont genomes
• Funded by NSF
• Published in PLOS Biology, 2006
43. Wu et al. 2006 PLoS Biology 4: e188.
47. CFB Phyla
48. Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Wu et al. 2006 PLoS Biology 4: e188.
49. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
50. Sorcerer II GOS Expedition
Figure 1. Sampling Sites
Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the
eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the
Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part
of this study.
doi:10.1371/journal.pbio.0050077.g001
[...] environments as well as a few nonmarine aquatic samples for contrast (Table 1).
[...] the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis.
51. Stalking the Fourth Domain in Metagenomic Data:
Searching for, Discovering, and Interpreting Novel, Deep
Branches in Marker Gene Phylogenetic Trees
Dongying Wu1, Martin Wu1,4, Aaron Halpern2,3, Douglas B. Rusch2,3, Shibu Yooseph2,3, Marvin Frazier2,3,
J. Craig Venter2,3, Jonathan A. Eisen1*
1 Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, University of California Davis Genome Center, University of California
Davis, Davis, California, United States of America, 2 The J. Craig Venter Institute, Rockville, Maryland, United States of America, 3 The J. Craig Venter Institute, La Jolla,
California, United States of America, 4 University of Virginia, Charlottesville, Virginia, United States of America
Abstract
Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data
associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and
culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated
directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we
argue here, in studies of very early events in the evolution of gene families and of species.
Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used
them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly
used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies.
Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in
making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel
branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these
novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.
Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from
uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third
possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which
sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree
of life, we suggest that methods such as those described herein currently offer the best way to search for them.
Citation: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and
Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Editor: Robert Fleischer, Smithsonian Institution National Zoological Park, United States of America
Received October 25, 2010; Accepted February 20, 2011; Published March 18, 2011
This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public
domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
Funding: The development and main work on this project was supported by the National Science Foundation via an “Assembling the Tree of Life” grant
(number 0228651) to Jonathan A. Eisen and Naomi Ward. The final work on this project was funded by the Gordon and Betty Moore Foundation (through
52. Stalking the Fourth Domain
Figure 1. Phylogenetic tree of the RecA superfamily. All RecA sequences were grouped into clusters using the Lek algorithm. Representatives
of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment
using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA
superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial
analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in
the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily.
doi:10.1371/journal.pone.0018011.g001
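Two steps in the Figure 1 caption, keeping only clusters with more than two members and drawing 100 bootstrap replicates, can be sketched as follows. This is a hedged illustration only: it does not run Lek, MUSCLE, or PHYML, the cluster and sequence names are made up, and picking the longest member as the representative is an assumption, since the caption does not say how representatives were chosen.

```python
import random


def select_representatives(clusters, min_members=3):
    """Return one representative per cluster that has more than two members.

    The longest sequence is used as the representative (an assumption made
    for illustration; ties are broken by sequence name).
    """
    reps = {}
    for cluster_id, members in clusters.items():
        if len(members) >= min_members:
            reps[cluster_id] = max(members, key=lambda name: (len(members[name]), name))
    return reps


def bootstrap_alignment(alignment, rng):
    """One bootstrap replicate: resample alignment columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Hypothetical clusters: cluster ID -> {sequence name: sequence}.
clusters = {
    "9": {"s1": "MAIDENK", "s2": "MAIDE", "s3": "MAID"},
    "12": {"s4": "MKV", "s5": "MKVL"},  # only two members, so it is dropped
}
reps = select_representatives(clusters)

# Hypothetical aligned representatives (all rows have equal length).
alignment = {"9|s1": "MA-DENK", "3|x1": "MAIDENK", "7|y1": "MSIDE-K"}
rng = random.Random(0)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

In a real analysis each replicate would be passed to the tree-building program, and the bootstrap value of a branch would be the percentage of replicate trees that contain it.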
53. Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of
these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits
against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
doi:10.1371/journal.pone.0018011.t002
Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily
Unknown 2). This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to
known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2
subfamily (cluster ID 9).
doi:10.1371/journal.pone.0018011.g002
54. Stalking the Fourth Domain
Figure 3. Phylogenetic tree of the RpoB superfamily. All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives
of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment
using PHYML; bootstrap values are based on 100 replicas. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB
superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored
panels.
doi:10.1371/journal.pone.0018011.g003
Methods
Identification of deeply-branching ss-rRNA sequences
[...] these 340 sequences were extracted from the European Ribosomal RNA database [66] and then manually curated to remove alignment columns with more than 90% gaps or with poor [...]
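The gap-based part of this curation step, dropping alignment columns in which more than 90% of sequences have a gap, can be sketched as below. This is a minimal sketch under assumptions: the function name, the gap characters, and the toy alignment are hypothetical, and the manual removal of poorly aligned columns mentioned in the text is not automated here.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.9, gap_chars="-."):
    """Remove columns in which more than max_gap_fraction of the rows are gaps."""
    rows = list(alignment.values())
    n_rows = len(rows)
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] in gap_chars for row in rows) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical toy alignment of 10 sequences:
# column 2 is a gap in every row (removed),
# column 4 is a gap in 9 of 10 rows (exactly 90%, so it is kept).
toy_alignment = {"seq0": "AC-GT"}
toy_alignment.update({f"seq{i}": "AC-G-" for i in range(1, 10)})
filtered = drop_gappy_columns(toy_alignment)
```

The threshold is applied as "more than 90%", so a column with exactly 90% gaps survives. Whether the original curation treated that boundary the same way is not stated in the excerpt.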