Shotgun
metagenomics
C. Titus Brown
UC Davis / School of Veterinary Medicine
ctbrown@ucdavis.edu
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun
sequencing.png
Computational protocol for
assembly
Genome assembly works pretty well
                              Velvet     IDBA       SPAdes
                              Quality    Quality    Quality
Total length (>= 0 bp)        1.6E+08    2.0E+08    2.0E+08
Total length (>= 1000 bp)     1.6E+08    1.9E+08    1.9E+08
Largest contig                561,449    979,948    1,387,918
# misassembled contigs        631        1032       752
Genome fraction (%)           72.949     90.969     90.424
Duplication ratio             1.004      1.007      1.004
Conclusion: SPAdes and IDBA achieve similar results.
Dr. Sherine Awad
Evaluating short-read
data / coverage plots
Assembly starts to work @ ~10x
What’s coming?
Lots more data.
Who ya gonna call??
…to do your bioinformatics?
Topics to explore -
• Experimental design to enable good data analysis.
• Extracting whole genomes and evaluating strain
variation.
• Technology review and open problems.
• Assembly graphs.
• Long reads and infinite sequencing – what next?
Or just a Q&A.
I can talk for 90 minutes straight, too, but I’d rather not.
Assembly graphs.
Assembly “graphs” are a way of representing
sequencing data that retains all variation;
assemblers then crunch this down into a single
sequence.
Image from Iqbal et al., 2012. PMID 23172865
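A minimal sketch of the idea, assuming a plain de Bruijn graph built from two made-up reads that differ at one base. The function name `build_de_bruijn` and the reads are illustrative only and do not come from any particular assembler. The branch in the graph is where the variation is retained.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return graph

# Two hypothetical reads that differ at a single position.
reads = ["ACGTACCA", "ACGTTCCA"]
graph = build_de_bruijn(reads, k=4)

# Nodes with more than one successor mark where the reads disagree.
branches = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}
print(branches)
```

An assembler has to pick one path through such a branch (or break the contig there) to report a single sequence.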
Long read sequencing
See:
http://angus.readthedocs.org/en/2015/_static/Torsten_Seemann_LRS.pdf
& nanopore etc.
Intro, overview,
and case studies
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun
sequencing.png
FASTQ etc.
@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
Name
Quality score
Note: /1 and /2 => interleaved
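A small sketch of reading the interleaved records shown above. It assumes four lines per record and Phred+33 quality encoding (so `#` decodes to Q2); both are assumptions about this file, and the helper names are made up for illustration.

```python
import io

SAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
"""

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples, reading four lines per record.

    Reading by position matters: a quality line may itself start with '@'.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header.strip()[1:], seq, qual

def phred33(qual):
    """Decode a quality string into integer scores, assuming Phred+33."""
    return [ord(c) - 33 for c in qual]

def interleaved_pairs(records):
    """Group consecutive /1 and /2 records into pairs."""
    it = iter(records)
    for r1 in it:
        r2 = next(it, None)
        if r2 is None or r1[0][:-2] != r2[0][:-2]:
            raise ValueError("file is not cleanly interleaved at " + r1[0])
        yield r1, r2

pairs = list(interleaved_pairs(read_fastq(io.StringIO(SAMPLE))))
for r1, r2 in pairs:
    print(r1[0], r2[0], "lowest quality in read 1:", min(phred33(r1[2])))
```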
Shotgun sequencing &
assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Dealing with shotgun
metagenome reads.
1) Annotate/analyze individual reads.
2) Assemble reads into contigs, genomes, etc.
Annotating individual
reads
• Works really well when you have EITHER
o (a) evolutionarily close references
o (b) rather long sequences
(This is obvious, right?)
Annotating individual
reads #2
• We have found that this does not work well
with Illumina samples from unexplored
environments (e.g. soil).
• Sensitivity is fine (correct match is usually
there)
• Specificity is bad (correct match may be
drowned out by incorrect matches)
Assembly: good.
Howe et al., 2014, PMID 24632729
Assemblies yield much
more significant
homology matches.
Recommendation:
For reads < 200-300 bp,
• Annotate individual reads for human-
associated samples, or exploration of well-
studied systems.
• Otherwise, look to assembly.
So, why assemble?
• Increase your ability to assign
homology/orthology correctly!!
Essentially all functional annotation systems
depend on sequence similarity to assign
homology. This is why you want to assemble
your data.
Why else would you want
to assemble?
• Assemble new “reference”.
• Look for large-scale variation from reference
– pathogenicity islands, etc.
• Discriminate between different members of
gene families.
• Discover operon assemblages & annotate
on co-incidence of genes.
• Reduce size of data!!
Why don’t you want to
assemble??
• Abundance threshold – low-coverage filter.
• Strain variation
• Chimerism
Assembly, in
general
Short read lengths are
hard.
Whiteford et al., Nucleic Acids Res., 2005
Conclusion: even with
a read length of 200, the
E. coli genome cannot be
assembled completely.
Why? REPEATS.
This is why paired-end
sequencing is so important
for assembly.
Four main challenges for
de novo sequencing.
• Repeats.
• Low coverage.
• Errors
These introduce breaks in the
construction of contigs.
• Variation in coverage – transcriptomes and
metagenomes, as well as amplified genomic.
This challenges the assembler to distinguish between
erroneous connections (e.g. repeats) and real
connections.
Repeats
• Overlaps don’t place sequences uniquely when
there are repeats present.
UMD assembly primer (cbcb.umd.edu)
Metagenomics: Mixed
community sampling
Coverage distribution
Conclusions re mixed
community sampling
In shotgun metagenomics, you are sampling
randomly from the mixed population.
Therefore, the lowest abundance member of the
population (that you want to observe) drives the
required depth of sequencing!
1 in a million => ~50 Tbp sequencing for 10x coverage.
Assembly depends on high coverage
HMP mock community assembly; Velvet-based protocol.
Conclusions from
previous slide
• To recover any contigs at all, you need coverage >
10 (green line).
• To recover long sequences, you want coverage >
20 (blue line).
A few assembly stories
• Low complexity/Osedax symbionts
• High complexity/soil.
Osedax symbionts
Table S1. Specimens, and collection sites, used in this study
Osedax frankpressi with Rs1 symbiont
Site     Dive¹     Date        Time Zone (months)   # of specimens
2890m    T486      Oct 2002    8                    2
         T610      Aug 2003    18                   3
         T1069     Jan 2007    59                   2
         DR010     Mar 2009    85                   3
         DR098     Nov 2009    93                   4
         DR204     Oct 2010    104                  1
         DR234     Jun 2011    112                  2
1820m    T1048     Oct 2006    7                    3
         T1071     Jan 2007    10                   3
         DR012     Mar 2009    36                   4
         DR236²    Jun 2011    63                   2
O. frankpressi from DR236
¹ Dive numbers begin with the remotely operated vehicle name; T = Tiburon, DR = Doc
Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute).
² Samples used for genomic analysis were collected during dive DR236.
Goffredi et al., submitted.
16s
shotgun
Metagenomic assembly followed by
binning enabled isolation of fairly
complete genomes
Assembly allowed genomic content
comparisons to nearest cultured relative
Osedax assembly story
Conclusions include:
• Osedax symbionts have genes needed for
free living stage;
• metabolic versatility in carbon, phosphate,
and iron uptake strategies;
• Genome includes mechanisms for
intracellular survival, and numerous potential
virulence capabilities.
Osedax assembly story
• Low diversity metagenome!
• Physical isolation => MDA => sequencing =>
diginorm => binning =>
o 94% complete Rs1
o 66-89% complete Rs2
• Note: many interesting critters are hard to
isolate => so, basically, metagenomes.
Human-associated
communities
• “Time series community genomics analysis
reveals rapid shifts in bacterial species,
strains, and phage during infant gut
colonization.” Sharon et al. (Banfield lab);
Genome Res. 2013 23: 111-120
Setup
• Collected 11 fecal samples from premature
female, days 15-24.
• 260m 100-bp paired-end Illumina HiSeq
reads; 400 and 900 base fragments.
• Assembled > 96% of reads into contigs >
500bp; 8 complete genomes; reconstructed
genes down to 0.05% of population
abundance.
Sharon et al., 2013; pmid 22936250
Key strategy: abundance
binning
Bin reads by k-mer
abundance
Assemble most
abundant bin
Remove reads that
map to assembly
Sharon et al., 2013; pmid 22936250
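A toy sketch of the first step (binning reads by k-mer abundance), not the pipeline used by Sharon et al. It assumes exact k-mer counting in memory, ignores reverse complements, and uses a made-up `boundaries` parameter to define the bins.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bin_reads_by_abundance(reads, k, boundaries):
    """Assign each read to a bin according to the median count of its k-mers.

    Bin index = number of boundaries that the read's median count reaches.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    bins = {}
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k
        med = median(counts[km] for km in kms)
        index = sum(med >= b for b in boundaries)
        bins.setdefault(index, []).append(read)
    return bins

# Hypothetical data: one sequence sampled five times, another sampled once.
reads = ["ACGTACGGTC"] * 5 + ["TTGACCATGA"]
bins = bin_reads_by_abundance(reads, k=4, boundaries=[3])
print({index: len(members) for index, members in bins.items()})
```

In the strategy on the slide, the highest bin would be assembled first, reads mapping to that assembly removed, and the process repeated on what remains.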
Sharon et al., 2013; pmid 22936250
Tracking abundance
Sharon et al., 2013; pmid 22936250
Conclusions
• Recovered strain variation, phage variation,
abundance variation, lateral gene transfer.
• Claim that “recovered genomes are
superior to draft genomes generated in
most isolate genome sequencing projects.”
Environmental
metagenomics:
Deepwater Horizon spill
• “Transcriptional response of bathypelagic
marine bacterioplankton to the Deepwater
Horizon oil spill.” Rivers et al., 2013, Moran
lab. Pmid 23902988.
Sequencing strategy
Protocol for messenger RNA
extraction, assembly,
downstream analysis, including
differential expression.
Note: used overlapping paired-
end reads.
Great Prairie Grand
Challenge - soil
• Together with Janet Jansson, Jim Tiedje, et
al.
• “Can we make sense of soil with deep
Illumina sequencing?”
Great Prairie Grand
Challenge - soil
• What ecosystem level functions are present, and
how do microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without
shotgun sequencing:
o What kind of strain-level heterogeneity is present in the population?
o What does the phage and viral population look like?
o What species are where?
A “Grand Challenge”
dataset (DOE/JGI)
[Bar chart: basepairs of sequencing (Gbp, axis 0 to 600) per sample, split by GAII and HiSeq. Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.]
Reference lines: Rumen (Hess et al., 2011), 268 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp; Rumen k-mer filtered, 111 Gbp.
Total: 1,846 Gbp soil metagenome
Approach 1: Digital normalization
(a computational version of library normalization)
Suppose you
have a dilution
factor of A (10) to
B(1). To get 10x
of B you need to
get 100x of A!
Overkill!!
This 100x will
consume disk
space and,
because of errors,
memory.
We can discard it
for you…
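A minimal sketch of the digital normalization idea: keep a read only if the median count of its k-mers, among reads kept so far, is still below a cutoff. It assumes an exact in-memory counter rather than the memory-efficient counting a real implementation would need, and the reads are made up to mirror the A(10) to B(1) example.

```python
from collections import Counter
from statistics import median

def digital_normalization(reads, k, cutoff):
    """Single pass over the reads; returns the subset that is kept."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kms:
            continue  # read shorter than k
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)  # only kept reads contribute to the counts
    return kept

# Hypothetical community: A is sampled 100 times, B only 10 times.
read_a = "ACGTTGCAAGGC"
read_b = "TTAGGCCATAGC"
reads = [read_a] * 100 + [read_b] * 10

kept = digital_normalization(reads, k=5, cutoff=10)
print(kept.count(read_a), kept.count(read_b))
```

The excess coverage of A is discarded while every read from B survives, which is the point of the slide.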
Approach 2: Data partitioning
(a computational version of cell sorting)
Split reads into “bins”
belonging to different
source species.
Can do this based
almost entirely on
connectivity of
sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
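A toy sketch of partitioning by connectivity: reads that share any k-mer end up in the same partition, tracked with a union-find structure. This is an illustration under simplifying assumptions (exact k-mer matches, no error handling, everything in memory), not the memory-efficient implementation the slide refers to.

```python
def partition_reads(reads, k):
    """Group reads into connected components, linking reads that share a k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            km = read[j:j + k]
            if km in first_seen:
                a, b = find(idx), find(first_seen[km])
                if a != b:
                    parent[a] = b
            else:
                first_seen[km] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())

# Hypothetical reads from two unrelated sequences.
reads = ["ACGTTGCAAG", "GGATCCTAGA", "TTGCAAGGCT", "TCCTAGATTC"]
parts = partition_reads(reads, k=5)
print(parts)
```

Each partition can then be assembled on its own, which is the "divide and conquer" step.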
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)
Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Adina Howe
Resulting contigs are low
coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
Panels: Corn, Prairie.
Iowa prairie & corn -
very even.
Taxonomy – Iowa prairie
Note: this is predicted taxonomy of contigs w/o considering abundance
or length. (MG-RAST)
Taxonomy – Iowa corn
Note: this is predicted taxonomy of contigs w/o
considering abundance or length. (MG-RAST)
Strain variation?
[Plot: top two allele frequencies vs. position within contig.]
Of 5000 most
abundant
contigs, only 1
has a
polymorphism
rate > 5%
Can measure by
read mapping.
Concluding thoughts on
assembly (I)
• There’s no standard approach yet; almost
every paper uses a specialized pipeline of
some sort.
o More like genome assembly
o But unlike transcriptomics…
Concluding thoughts on
assembly (II)
• Anecdotally, everyone worries about strain
variation.
o Some groups (e.g. Banfield, us) have found that
this is not a problem in their system so far.
o Others (viral metagenomes! HMP!) have found
this to be a big concern.
Concluding thoughts on
assembly (III)
• Some groups have found metagenome assembly
to be very useful.
• Others (us! soil!) have not yet proven its utility.
High coverage is critical.
Low coverage is the dominant
problem blocking assembly of
metagenomes
Questions --
• How much sequencing should I do?
• How do I evaluate metagenome
assemblies?
• Which assembler is best?
Assembly strategies &
techniques
Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
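A small worked sketch of the definition. The average is just total sequenced bases divided by genome size; the second function assumes an idealized model in which per-base depth is Poisson-distributed (uniform random sampling, no biases), which is an assumption and not something real data guarantees.

```python
import math

def average_coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base."""
    return n_reads * read_length / genome_size

def fraction_with_at_least(depth, mean_coverage):
    """Fraction of bases covered at least `depth` times under a Poisson model."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
        for i in range(depth)
    )
    return 1.0 - below

# Hypothetical numbers: 500,000 reads of 100 bp from a 5 Mbp genome.
cov = average_coverage(500_000, 100, 5_000_000)
print(cov)
print(fraction_with_at_least(1, cov))   # bases seen at least once
print(fraction_with_at_least(10, cov))  # bases at or above the average
```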
Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
Coverage distribution
matters!
(MD amplified)
Assembly depends on high
coverage
Shared low-level
transcripts may
not reach the
threshold for
assembly.
K-mer based assemblers
scale poorly
Why do big data sets require big machines??
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486
De Bruijn graphs scale poorly with data size
Practical memory
measurements
Velvet measurements (Adina Howe)
How much data do we
need? (I)
(“More” is rather vague…)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)
Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Adina Howe
Resulting contigs are low
coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
How much? (II)
• Suppose we need 10x coverage to assemble a
microbial genome, and microbial genomes average
5e6 bp of DNA.
• Further suppose that we want to be able to assemble
a microbial species that is “1 in a 100000”, i.e. 1 in
1e5.
• Shotgun sequencing samples randomly, so must
sample deeply to be sensitive.
10x coverage x 5e6 bp x 1e5 =~ 50e11, or 5 Tbp of
sequence.
Currently this would cost approximately $100k, for 10 full
Illumina runs, but in a year we will be able to do it for
much less.
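The arithmetic above as a small function, a sketch under the slide's own simplification that every genome in the community is the same size, so "1 in N" cells means 1 in N of the DNA. The function and parameter names are illustrative.

```python
def required_sequencing_bp(target_coverage, genome_size_bp, one_in_n):
    """Total bases to sequence so a genome at 1-in-N abundance reaches target coverage."""
    return target_coverage * genome_size_bp * one_in_n

# 10x coverage of a 5e6 bp genome present at 1 in 1e5.
total = required_sequencing_bp(10, 5_000_000, 100_000)
print(total / 1e12, "Tbp")

# The same genome at 1 in a million needs ten times more.
rarer = required_sequencing_bp(10, 5_000_000, 1_000_000)
print(rarer / 1e12, "Tbp")
```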
We can estimate
sequencing req’d:
http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
“Whoa, that’s a lot of
data…”
http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
Some approximate
metagenome sizes
• Deep carbon mine data set: 60 Mbp (x 10x => 600
Mbp)
• Great Prairie soil: 12 Gbp (4x human genome)
• Amazon Rain Forest Microbial Observatory soil: 26
Gbp
Partitioning separates reads by genome.
When computationally spiking HMP mock data with one E.
coli genome (left) or multiple E. coli strains (right), majority of
partitions contain reads from only a single genome (blue) vs
multi-genome partitions (green).
Partitions containing spiked data indicated with a *.
Adina Howe
Is your assembly good?
• Truly reference-free assembly is hard to evaluate.
• Traditional “genome” measures like N50 and NG50
simply do not apply to metagenomes, because
very often you don’t know what the genome “size”
is.
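A short sketch of why these measures are awkward here. N50 needs only the contig lengths, but NG50 needs a genome size, and the answer changes with whatever size is assumed. The contig lengths and genome sizes below are made up.

```python
def n50(lengths, genome_size=None):
    """N50 of the contig lengths, or NG50 if a genome size is supplied.

    Returns None when the contigs sum to less than half of the genome size.
    """
    target = (genome_size if genome_size is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

contigs = [100, 80, 60, 40, 20]   # hypothetical contig lengths
print(n50(contigs))                     # N50, relative to the assembly itself
print(n50(contigs, genome_size=500))    # NG50 under one assumed genome size
print(n50(contigs, genome_size=1000))   # NG50 under another: undefined
```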
Evaluating assembly
[Diagram: reads (noisy observations of some genome) => assembler (a Big Black Box) => predicted genome.]
Evaluating correctness of metagenomes is still undiscovered country.
Assembler overlap?
See tutorial.
What’s the best
assembler?
[Bar chart, bird assembly: cumulative Z-score from ranking metrics (axis -25 to 10) for BCM*, BCM, ALLP, NEWB, SOAP**, MERAC, CBCB, SOAP*, SOAP, SGA, PHUS, RAY, MLK, ABL.]
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
What’s the best
assembler?
[Bar chart, fish assembly: cumulative Z-score from ranking metrics (axis -8 to 8) for BCM, CSHL, CSHL*, SYM, ALLP, SGA, SOAP*, MERAC, ABYSS, RAY, IOB, CTD, IOB*, CSHL**, CTD**, CTD*.]
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
What’s the best
assembler?
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
[Bar chart, snake assembly: cumulative Z-score from ranking metrics (axis -14 to 10) for SGA, PHUS, MERAC, SYMB, ABYSS, CRACS, BCM, RAY, SOAP, GAM, CURT.]
Answer: for eukaryotic
genomes, it depends
• Different assemblers perform differently, depending
on
o Repeat content
o Heterozygosity
• Generally the results are very good (est
completeness, etc.) but different between different
assemblers (!)
• There Is No One Answer.
Estimated completeness: CEGMA
Each assembler lost a different ~5% of CEGs
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
Tradeoffs in N50 and %
incl.
[Chart, bird assembly: NG50 scaffold length (bp, axis 0 to 18,000,000) and % of estimated genome size present in scaffolds >= 25 Kbp (axis 0 to 110) for BCM, ALLP, SOAP, NEWB, MERAC, SGA, CBCB, PHUS, RAY, MLK, ABL.]
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
Evaluating assemblies
• Every assembly returns different results for eukaryotic
genomes.
• For metagenomes, it’s even worse.
o No systematic exploration of precision, recall, etc.
o Very little in the way of cross-comparable data sets
o Often sequencing technology being evaluated is out of date
o etc. etc.
Our experience
• Our metagenome assemblies compare well with
others, but we have little in the way of ground truth
with which to evaluate.
• Scaffold assembly is tricky; we believe in contig
assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex
metagenomes”, for our suggested pipeline and
statistics & references.
Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159
How to choose a
metagenome assembler
• Try a few.
• Use what seems to perform best (most bp > some
minimum)
• Try: MEGAHIT and SPAdes and IDBA.
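One way to compare candidate assemblies on the "most bp above some minimum" criterion is sketched below. It assumes each assembler wrote its contigs as FASTA; the two tiny assemblies and their names are made up for illustration.

```python
def contig_lengths(fasta_lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths, current = [], None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def bp_above(lengths, min_length):
    """Total bases in contigs of at least `min_length`."""
    return sum(n for n in lengths if n >= min_length)

# Hypothetical output from two assemblers, as in-memory FASTA text.
assemblies = {
    "assembler_one": [">c1", "ACGTACGTAC", "GGTT", ">c2", "ACG"],
    "assembler_two": [">c1", "ACGTACG", ">c2", "TTGACCA", ">c3", "AC"],
}
scores = {name: bp_above(contig_lengths(lines), 5) for name, lines in assemblies.items()}
print(scores, "->", max(scores, key=scores.get))
```

With real files, passing an open file handle in place of the lists works the same way, since the function only iterates over lines.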
An example
assembly
Deep Carbon data set
• Name: DCO_TCO_MM5
• Masimong Gold Mine; microbial cells filtered from
fracture water from within a 1.9km borehole.
(32,000 year old water?!)
• M.C.Y. Lau, C. Magnabosco, S. Grim, G. Lacrampe
Couloume, K. Wilkie, B. Sherwood Lollar, D.N. Simkus,
G.F. Slater, S. Hendrickson, M. Pullin, T.L. Kieft, O.
Kuloyo, B. Linage, G. Borgonie, E. van Heerden, J.
Ackerman, C. van Jaarsveld, and T.C. Onstott
DCO_TCO_MM5
20m reads / 2.1 Gbp => adapter trim & quality filter => diginorm to C=10 => trim high-coverage reads at low-abundance k-mers => 5.6m reads / 601.3 Mbp
“Could you take a look at this?
MG-RAST is telling us we
have a lot of artificially
duplicated reads, i.e. the data
is bad.”
Entire process took ~4 hours of
computation, or so.
(Minimum genome size est: 60.1 Mbp)
Assembly stats:
      All contigs               Contigs > 1kb
k     N contigs   Sum BP        N contigs   Sum BP        Max contig
21    343263      63217837      6271        10537601      9987
23    302613      63025311      7183        13867885      21348
25    276261      62874727      7375        15303646      34272
27    258073      62500739      7424        16078145      48742
29    242552      62001315      7349        16426147      48746
31    228043      61445912      7307        16864293      48750
33    214559      60744478      7241        17133827      48768
35    203292      60039871      7129        17249351      45446
37    189948      58899828      7088        17527450      59437
39    180754      58146806      7027        17610071      54112
41    172209      57126650      6914        17551789      65207
43    165563      56440648      6925        17654067      73231
DCO_TCO_MM5
(Minimum genome size est: 60.1 Mbp)
Chose two:
• A: k=43 (“long contigs”)
• 165563 contigs
• 56.4 Mbp
• longest contig: 73231 bp
• B: k=21 (“high recall”)
• 343263 contigs
• 63.2 Mbp
• longest contig is 9987 bp
DCO_TCO_MM5
How to evaluate??
How many reads map
back?
Mapped 3.8m paired-end reads (one subsample):
• high-recall: 41% of pairs map
• longer-contigs: 70% of pairs map
+ 150k single-end reads:
• high-recall: 49% of sequences map
• longer-contigs: 79% of sequences map
Annotation/exploration
with MG-RAST
• You can upload genome assemblies to MG-RAST,
and annotate them with coverage; tutorial to
follow.
• What does MG-RAST do?
Conclusion
• This is a pretty good metagenome assembly – >
80% of reads map!
• Surprised that the larger dataset (63.2 Mbp, “high
recall”) accounts for a smaller percentage of the
reads – 49% vs 79% for the 56.4 Mbp “long contigs”
data set.
• I now suspect that different parameters are
recovering different subsets of the sample…
• Don’t trust MG-RAST ADR calls.
You can estimate
metagenome size…
Estimates of metagenome
size
Calculation: # reads * (avg read len) / (diginorm
coverage)
Assumes: few entirely erroneous reads (upper bound);
saturation (lower bound).
• E. coli: 384k * 86.1 / 5.0 => 6.6 Mbp est. (true: 4.5
Mbp)
• MM5 deep carbon: 60 Mbp
• Great Prairie soil: 12 Gbp
• Amazon Rain Forest Microbial Observatory: 26 Gbp
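The calculation above as a short function, using the E. coli numbers from the slide and, for MM5, the 601.3 Mbp retained after diginorm to C=10. The function name is illustrative; the caveats in "Assumes" still apply.

```python
def estimate_metagenome_size(total_bp, diginorm_coverage):
    """Estimated size = bases kept by diginorm / the diginorm coverage cutoff."""
    return total_bp / diginorm_coverage

# E. coli: 384k reads * 86.1 bp average length, normalized to coverage 5.
ecoli = estimate_metagenome_size(384_000 * 86.1, 5.0)
print(round(ecoli / 1e6, 1), "Mbp")

# MM5 deep carbon: 601.3 Mbp retained, normalized to coverage 10.
mm5 = estimate_metagenome_size(601.3e6, 10)
print(round(mm5 / 1e6, 1), "Mbp")
```

The E. coli estimate overshoots the true 4.5 Mbp, which is a useful reminder of how rough these numbers are.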
Extracting whole
genomes?
So far, we have only assembled contigs, but not whole
genomes.
Can entire genomes be
assembled from metagenomic
data?
Iverson et al. (2012), from
the Armbrust lab, contains a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
Pmid 22301318, Iverson et al.
Concluding thoughts
• What works?
• What needs work?
• What will work?
What works?
Today,
• From deep metagenomic data, you can get the
gene and operon content (including abundance of
both) from communities.
• You can get microarray-like expression information
from metatranscriptomics.
What needs work?
• Assembling ultra-deep samples is going to require
more engineering, but is straightforward. (“Infinite
assembly.”)
• Building scaffolds and extracting whole genomes
has been done, but I am not yet sure how feasible it
is to do systematically with existing tools (c.f.
Armbrust Lab).
What will work,
someday?
• Sensitive analysis of strain variation.
o Both assembly and mapping approaches do a poor job detecting many
kinds of biological novelty.
o The 1000 Genomes Project has developed some good tools that need to
be evaluated on community samples.
• Ecological/evolutionary dynamics in vivo.
o Most work done on 16s, not on genomes or functional content.
o Here, sensitivity is really important!
The interpretation
challenge
• For soil, we have generated approximately 1200
bacterial genomes worth of assembled genomic DNA
from two soil samples.
• The vast majority of this genomic DNA contains unknown
genes with largely unknown function.
• Most annotations of gene function & interaction are
from a few phylogenetically limited model organisms
o Est 98% of annotations are computationally inferred: transferred from model
organisms to genomic sequence, using homology.
o Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
What are future needs?
• High-quality, medium+ throughput annotation of
genomes?
o Extrapolating from model organisms is both immensely important and yet
lacking.
o Strong phylogenetic sampling bias in existing annotations.
• Synthetic biology for investigating non-model
organisms?
(Cleverness in experimental biology doesn’t scale.)
• Integration of microbiology, community
ecology/evolution modeling, and data analysis.

Mais conteúdo relacionado

Mais procurados

Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics Christopher Mason
 
20170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_10120170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_101Ino de Bruijn
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008Saul Kravitz
 
Intro to metagenomic binning
Intro to metagenomic binningIntro to metagenomic binning
Intro to metagenomic binningA. Murat Eren
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencingcdgenomics525
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Integrated DNA Technologies
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisdrelamuruganvet
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineCandy Smellie
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Aug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansAug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansGenomeInABottle
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...Larry Smarr
 
Errors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation SequencingErrors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation SequencingNixon Mendez
 
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1QIAGEN
 
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...Thermo Fisher Scientific
 
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...eventi-ITBbari
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...VHIR Vall d’Hebron Institut de Recerca
 

Mais procurados (20)

Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
 
20170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_10120170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_101
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
 
Intro to metagenomic binning
Intro to metagenomic binningIntro to metagenomic binning
Intro to metagenomic binning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencing
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...
 
Embed Repro Test
Embed Repro TestEmbed Repro Test
Embed Repro Test
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
 
Clinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation SequencingClinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation Sequencing
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics Pipeline
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Aug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansAug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plans
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
 
Errors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation SequencingErrors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation Sequencing
 
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1
Single-Cell Analysis - Powered by REPLI-g: Single Cell Analysis Series Part 1
 
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
 
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...
Ernesto Picardi – Bioinformatica e genomica comparata: nuove strategie sperim...
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
 

2015 beacon-metagenome-tutorial

  • 1. Shotgun metagenomics C. Titus Brown UC Davis / School of Veterinary Medicine ctbrown@ucdavis.edu
  • 2. Shotgun metagenomics • Collect samples; • Extract DNA; • Feed into sequencer; • Computationally analyze. Wikipedia: Environmental shotgun sequencing.png
  • 5. Genome assembly works pretty well
       Quality metric               Velvet      IDBA        SPAdes
       Total length (>= 0 bp)       1.6E+08     2.0E+08     2.0E+08
       Total length (>= 1000 bp)    1.6E+08     1.9E+08     1.9E+08
       Largest contig               561,449     979,948     1,387,918
       # misassembled contigs       631         1032        752
       Genome fraction (%)          72.949      90.969      90.424
       Duplication ratio            1.004       1.007       1.004
       Conclusion: SPAdes and IDBA achieve similar results. Dr. Sherine Awad
  • 6. Evaluating short-read data / coverage plots Assembly starts to work @ ~10x
  • 8. Who ya gonna call?? …to do your bioinformatics?
  • 9. Topics to explore - • Experimental design to enable good data analysis. • Extracting whole genomes and evaluating strain variation. • Technology review and open problems. • Assembly graphs. • Long reads and infinite sequencing – what next? Or just a Q&A. I can talk for 90 minutes straight, too, but I’d rather not.
  • 10. Assembly graphs. Assembly “graphs” are a way of representing sequencing data that retains all variation; assemblers then crunch this down into a single sequence. Image from Iqbal et al., 2012. PMID 23172865
  • 13. Shotgun metagenomics • Collect samples; • Extract DNA; • Feed into sequencer; • Computationally analyze. Wikipedia: Environmental shotgun sequencing.png
  • 15. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 16. Dealing with shotgun metagenome reads. 1) Annotate/analyze individual reads. 2) Assemble reads into contigs, genomes, etc.
  • 17. Annotating individual reads • Works really well when you have EITHER o (a) evolutionarily close references o (b) rather long sequences (This is obvious, right?)
  • 18. Annotating individual reads #2 • We have found that this does not work well with Illumina samples from unexplored environments (e.g. soil). • Sensitivity is fine (correct match is usually there) • Specificity is bad (correct match may be drowned out by incorrect matches)
  • 19. Assembly: good. Howe et al., 2014, PMID 24632729 Assemblies yield much more significant homology matches.
  • 20. Recommendation: For reads < 200-300 bp, • Annotate individual reads for human-associated samples, or exploration of well-studied systems. • Otherwise, look to assembly.
  • 21. So, why assemble? • Increase your ability to assign homology/orthology correctly!! Essentially all functional annotation systems depend on sequence similarity to assign homology. This is why you want to assemble your data.
  • 22. Why else would you want to assemble? • Assemble new “reference”. • Look for large-scale variation from reference – pathogenicity islands, etc. • Discriminate between different members of gene families. • Discover operon assemblages & annotate on co-incidence of genes. • Reduce size of data!!
  • 23. Why don’t you want to assemble?? • Abundance threshold – low-coverage filter. • Strain variation • Chimerism
  • 25. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005
  • 26. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005 Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely. Why?
  • 27. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005 Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely. Why? REPEATS. This is why paired-end sequencing is so important for assembly.
  • 28. Four main challenges for de novo sequencing. • Repeats. • Low coverage. • Errors. These introduce breaks in the construction of contigs. • Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic DNA. This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
  • 29. Repeats • Overlaps don’t place sequences uniquely when there are repeats present. UMD assembly primer (cbcb.umd.edu)
  • 32. Conclusions re mixed community sampling In shotgun metagenomics, you are sampling randomly from the mixed population. Therefore, the lowest abundance member of the population (that you want to observe) drives the required depth of sequencing! 1 in a million => ~50 Tbp sequencing for 10x coverage.
  • 33. Assembly depends on high coverage HMP mock community assembly; Velvet-based protocol.
  • 34. Conclusions from previous slide • To recover any contigs at all, you need coverage > 10 (green line). • To recover long sequences, you want coverage > 20 (blue line).
  • 35. A few assembly stories • Low complexity/Osedax symbionts • High complexity/soil.
  • 36. Osedax symbionts. Table S1. Specimens, and collection sites, used in this study.
       Osedax frankpressi with Rs1 symbiont
       Site      Dive (1)     Date        Time Zone (months)   # of specimens
       2890m     T486         Oct 2002    8                    2
                 T610         Aug 2003    18                   3
                 T1069        Jan 2007    59                   2
                 DR010        Mar 2009    85                   3
                 DR098        Nov 2009    93                   4
                 DR204        Oct 2010    104                  1
                 DR234        Jun 2011    112                  2
       1820m     T1048        Oct 2006    7                    3
                 T1071        Jan 2007    10                   3
                 DR012        Mar 2009    36                   4
                 DR236 (2)    Jun 2011    63                   2
       O. frankpressi from DR236
       (1) Dive numbers begin with the remotely operated vehicle name; T = Tiburon, DR = Doc Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute).
       (2) Samples used for genomic analysis were collected during dive DR236.
       Goffredi et al., submitted.
  • 38. Metagenomic assembly followed by binning enabled isolation of fairly complete genomes
  • 39. Assembly allowed genomic content comparisons to nearest cultured relative
  • 40. Osedax assembly story Conclusions include: • Osedax symbionts have genes needed for free living stage; • metabolic versatility in carbon, phosphate, and iron uptake strategies; • Genome includes mechanisms for intracellular survival, and numerous potential virulence capabilities.
  • 41. Osedax assembly story • Low diversity metagenome! • Physical isolation => MDA => sequencing => diginorm => binning => o 94% complete Rs1 o 66-89% complete Rs2 • Note: many interesting critters are hard to isolate => so, basically, metagenomes.
  • 42. Human-associated communities • “Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.” Sharon et al. (Banfield lab); Genome Res. 2013 23: 111-120
  • 43. Setup • Collected 11 fecal samples from premature female, days 15-24. • 260m 100-bp paired-end Illumina HiSeq reads; 400 and 900 base fragments. • Assembled > 96% of reads into contigs > 500bp; 8 complete genomes; reconstructed genes down to 0.05% of population abundance.
  • 44. Sharon et al., 2013; pmid 22936250
  • 45. Key strategy: abundance binning. Bin reads by k-mer abundance; assemble the most abundant bin; remove reads that map to the assembly. Sharon et al., 2013; pmid 22936250
  • 46. Sharon et al., 2013; pmid 22936250
  • 47. Tracking abundance Sharon et al., 2013; pmid 22936250
  • 48. Conclusions • Recovered strain variation, phage variation, abundance variation, lateral gene transfer. • Claim that “recovered genomes are superior to draft genomes generated in most isolate genome sequencing projects.”
  • 49. Environmental metagenomics: Deepwater Horizon spill • “Transcriptional response of bathypelagic marine bacterioplankton to the Deepwater Horizon oil spill.” Rivers et al., 2013, Moran lab. Pmid 23902988.
  • 51. Protocol for messenger RNA extraction, assembly, downstream analysis, including differential expression. Note: used overlapping paired-end reads.
  • 52. Great Prairie Grand Challenge - soil • Together with Janet Jansson, Jim Tiedje, et al. • “Can we make sense of soil with deep Illumina sequencing?”
  • 53. Great Prairie Grand Challenge - soil • What ecosystem level functions are present, and how do microbes do them? • How does agricultural soil differ from native soil? • How does soil respond to climate perturbation? • Questions that are not easy to answer without shotgun sequencing: o What kind of strain-level heterogeneity is present in the population? o What does the phage and viral population look like? o What species are where?
  • 54. A “Grand Challenge” dataset (DOE/JGI). [Bar chart: base pairs of sequencing (Gbp), GAII and HiSeq, for Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Reference lines: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.] Total: 1,846 Gbp soil metagenome.
  • 55. Approach 1: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
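The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only, assuming a plain in-memory k-mer counter and a median-count cutoff; the function names, `k`, and `cutoff` are hypothetical choices for the example, and a production implementation would use a memory-efficient counting structure and handle reverse complements.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=3):
    """Keep a read only if its median k-mer count so far is below the cutoff.

    Reads from already well-covered regions are discarded; reads from
    low-coverage regions are kept and added to the running counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate coverage from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy data: species A is 10x more abundant than species B.
abundant = ["ACGTACGGTCA"] * 100
rare = ["TTGACCATGGA"] * 10
kept = digital_normalize(abundant + rare, k=4, cutoff=3)
print(kept.count(abundant[0]), kept.count(rare[0]))
```

In this toy case both species end up at the same retained depth, which is the point of the slide: the excess coverage of A is thrown away, while everything needed to reach the cutoff for B is kept.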
  • 56. Approach 2: Data partitioning (a computational version of cell sorting) Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer” Memory-efficient implementation helps to scale assembly.
  • 57. Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)
       Total Assembly    Total Contigs (> 300 bp)    % Reads Assembled    Predicted protein coding
       2.5 bill          4.5 mill                    19%                  5.3 mill
       3.5 bill          5.9 mill                    22%                  6.8 mill
       Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Adina Howe
  • 58. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  • 59. Corn Prairie Iowa prairie & corn - very even.
  • 60. Taxonomy – Iowa prairie Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  • 61. Taxonomy – Iowa corn Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  • 62. Strain variation? [Plot: top two allele frequencies vs. position within contig.] Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%. Can measure by read mapping.
  • 63. Concluding thoughts on assembly (I) • There’s no standard approach yet; almost every paper uses a specialized pipeline of some sort. o More like genome assembly o But unlike transcriptomics…
  • 64. Concluding thoughts on assembly (II) • Anecdotally, everyone worries about strain variation. o Some groups (e.g. Banfield, us) have found that this is not a problem in their system so far. o Others (viral metagenomes! HMP!) have found this to be a big concern.
  • 65. Concluding thoughts on assembly (III) • Some groups have found metagenome assembly to be very useful. • Others (us! soil!) have not yet proven its utility.
  • 66. High coverage is critical. Low coverage is the dominant problem blocking assembly of metagenomes
  • 67. Questions -- • How much sequencing should I do? • How do I evaluate metagenome assemblies? • Which assembler is best?
  • 69. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
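As a quick worked example of the definition above, average coverage is just total sequenced bases divided by genome size. The numbers below are made up for illustration (a 5 Mbp genome and 100 bp reads); the function name is hypothetical.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return num_reads * read_length / genome_size


# 500,000 reads of 100 bp over a 5 Mbp genome.
cov = mean_coverage(num_reads=500_000, read_length=100, genome_size=5_000_000)
print(cov)
```

Because sampling is random, this is only the average; individual positions will be covered more or less deeply than this.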
  • 70. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human)
  • 72. Assembly depends on high coverage
  • 73. Shared low-level transcripts may not reach the threshold for assembly.
  • 74. K-mer based assemblers scale poorly Why do big data sets require big machines?? Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
  • 75. De Bruijn graphs scale poorly with data size. (Figure from Conway T C, Bromage A J, Bioinformatics 2011;27:479-486.)
  • 77. How much data do we need? (I) (“More” is rather vague…)
  • 78. Putting it in perspective: total equivalent of ~1200 bacterial genomes (human genome: ~3 billion bp). Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes):
      Total assembly | Total contigs (> 300 bp) | % reads assembled | Predicted protein coding
      2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
      3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
    (Adina Howe)
  • 79. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  • 80. How much? (II) • Suppose we need 10x coverage to assemble a microbial genome, and microbial genomes average 5e6 bp of DNA. • Further suppose that we want to be able to assemble a microbial species that is “1 in 100000”, i.e. 1 in 1e5. • Shotgun sequencing samples randomly, so we must sample deeply to be sensitive. 10x coverage x 5e6 bp x 1e5 =~ 50e11 bp, or 5 Tbp of sequence. Currently this would cost approximately $100k, for 10 full Illumina runs, but in a year we will be able to do it for much less.
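The arithmetic on the slide above can be sketched as a small function. The function and parameter names are assumptions made for illustration; the inputs are the slide's own numbers (10x coverage, 5e6 bp genome, abundance of 1 in 1e5).

```python
def required_sequencing_bp(coverage: float, genome_size_bp: float, abundance: float) -> float:
    """Total bases to sequence so that a genome making up `abundance`
    (a fraction of the community) is sampled to `coverage` on average."""
    return coverage * genome_size_bp / abundance


# Slide values: 10x coverage, 5e6 bp genome, species at 1 in 1e5.
total_bp = required_sequencing_bp(10, 5e6, 1e-5)
print(f"{total_bp / 1e12:.1f} Tbp")
```

Changing the abundance to 1 in 1e6 multiplies the requirement by ten, which is why rare community members dominate the sequencing budget.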
  • 81. We can estimate sequencing req’d: http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
  • 82. “Whoa, that’s a lot of data…” http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
  • 83. Some approximate metagenome sizes • Deep carbon mine data set: 60 Mbp (x 10x => 600 Mbp) • Great Prairie soil: 12 Gbp (4x human genome) • Amazon Rain Forest Microbial Observatory soil: 26 Gbp
  • 84. Partitioning separates reads by genome. When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green). Partitions containing spiked data are indicated with a *. (Adina Howe)
  • 85. Is your assembly good? • Truly reference-free assembly is hard to evaluate. • Traditional “genome” measures like N50 and NG50 simply do not apply to metagenomes, because very often you don’t know what the genome “size” is.
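To make the point about N50 and NG50 concrete, here is a minimal sketch of both measures. The function name and the contig lengths are hypothetical. NG50 is the same calculation as N50 but measured against a known genome size, which is exactly the quantity that is usually unknown for a metagenome.

```python
def n50(contig_lengths, genome_size=None):
    """Length L such that contigs of length >= L contain at least half of
    the reference total. With genome_size=None the total is the assembly
    size (N50); with a genome size supplied this is NG50."""
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= total / 2:
            return length
    return None  # the contigs never reach half of the genome size


contigs = [80, 70, 50, 40, 30, 20]       # hypothetical contig lengths
print(n50(contigs))                      # N50, relative to the assembly itself
print(n50(contigs, genome_size=400))     # NG50 under one assumed genome size
print(n50(contigs, genome_size=1000))    # NG50 under a larger assumed size
```

The three calls give different answers for the same contigs, which shows how strongly the result depends on a genome size that a metagenome does not supply.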
  • 86. Evaluating assembly. (Diagram: reads, which are noisy observations of some genome, go into the assembler, a Big Black Box, which outputs a predicted genome.) Evaluating correctness of metagenomes is still undiscovered country.
  • 88. What’s the best assembler? (Figure, bird assembly: cumulative Z-score from ranking metrics. Entries in plot order: BCM*, BCM, ALLP, NEWB, SOAP**, MERAC, CBCB, SOAP*, SOAP, SGA, PHUS, RAY, MLK, ABL.) Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  • 89. What’s the best assembler? (Figure, fish assembly: cumulative Z-score from ranking metrics. Entries in plot order: BCM, CSHL, CSHL*, SYM, ALLP, SGA, SOAP*, MERAC, ABYSS, RAY, IOB, CTD, IOB*, CSHL**, CTD**, CTD*.) Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  • 90. What’s the best assembler? (Figure, snake assembly: cumulative Z-score from ranking metrics. Entries in plot order: SGA, PHUS, MERAC, SYMB, ABYSS, CRACS, BCM, RAY, SOAP, GAM, CURT.) Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  • 91. Answer: for eukaryotic genomes, it depends • Different assemblers perform differently, depending on o Repeat content o Heterozygosity • Generally the results are very good (est completeness, etc.) but different between different assemblers (!) • There Is No One Answer.
  • 92. Estimated completeness: CEGMA. Each assembler lost a different ~5% of CEGs. Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  • 93. Tradeoffs in N50 and % incl. (Figure, bird assembly: % of estimated genome size present in scaffolds >= 25 Kbp, and NG50 scaffold length (bp), for BCM, ALLP, SOAP, NEWB, MERAC, SGA, CBCB, PHUS, RAY, MLK, ABL.) Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  • 94. Evaluating assemblies • Every assembler returns different results for eukaryotic genomes. • For metagenomes, it’s even worse. o No systematic exploration of precision, recall, etc. o Very little in the way of cross-comparable data sets o Often the sequencing technology being evaluated is out of date o etc. etc.
  • 95. Our experience • Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate. • Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding. • See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.
  • 96. Metagenomic assemblies are highly variable Adina Howe et al., arXiv 1212.0159
  • 97. How to choose a metagenome assembler • Try a few. • Use what seems to perform best (most bp > some minimum) • Try: MEGAHIT and SPAdes and IDBA.
  • 99. Deep Carbon data set • Name: DCO_TCO_MM5 • Masimong Gold Mine; microbial cells filtered from fracture water from within a 1.9km borehole. (32,000 year old water?!) • M.C.Y. Lau, C. Magnabosco, S. Grim, G. Lacrampe Couloume, K. Wilkie, B. Sherwood Lollar, D.N. Simkus, G.F. Slater, S. Hendrickson, M. Pullin, T.L. Kieft, O. Kuloyo, B. Linage, G. Borgonie, E. van Heerden, J. Ackerman, C. van Jaarsveld, and T.C. Onstott
  • 100. DCO_TCO_MM5 “Could you take a look at this? MG-RAST is telling us we have a lot of artificially duplicated reads, i.e. the data is bad.” Pipeline: 20m reads / 2.1 Gbp -> adapter trim & quality filter -> diginorm to C=10 -> trim high-coverage reads at low-abundance k-mers -> 5.6m reads / 601.3 Mbp. Entire process took ~4 hours of computation, or so. (Minimum genome size est: 60.1 Mbp)
  • 101. Assembly stats for DCO_TCO_MM5 (minimum genome size est: 60.1 Mbp):
         |      All contigs     |           Contigs > 1kb
      k  | N contigs | Sum BP   | N contigs | Sum BP   | Max contig
      21 | 343263    | 63217837 | 6271      | 10537601 | 9987
      23 | 302613    | 63025311 | 7183      | 13867885 | 21348
      25 | 276261    | 62874727 | 7375      | 15303646 | 34272
      27 | 258073    | 62500739 | 7424      | 16078145 | 48742
      29 | 242552    | 62001315 | 7349      | 16426147 | 48746
      31 | 228043    | 61445912 | 7307      | 16864293 | 48750
      33 | 214559    | 60744478 | 7241      | 17133827 | 48768
      35 | 203292    | 60039871 | 7129      | 17249351 | 45446
      37 | 189948    | 58899828 | 7088      | 17527450 | 59437
      39 | 180754    | 58146806 | 7027      | 17610071 | 54112
      41 | 172209    | 57126650 | 6914      | 17551789 | 65207
      43 | 165563    | 56440648 | 6925      | 17654067 | 73231
  • 102. Chose two (DCO_TCO_MM5): • A: k=43 (“long contigs”) • 165563 contigs • 56.4 Mbp • longest contig: 73231 bp • B: k=21 (“high recall”) • 343263 contigs • 63.2 Mbp • longest contig: 9987 bp. How to evaluate??
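One way to read the assembly stats table is to ask which k maximizes each column. The sketch below does that with the numbers from the table. The tuple layout and the helper name are assumptions for illustration, and this is not necessarily how the two assemblies were actually selected.

```python
from collections import namedtuple

# Columns follow the assembly stats table for DCO_TCO_MM5.
Row = namedtuple("Row", "k n_all sum_bp_all n_1kb sum_bp_1kb max_contig")

ROWS = [
    Row(21, 343263, 63217837, 6271, 10537601, 9987),
    Row(23, 302613, 63025311, 7183, 13867885, 21348),
    Row(25, 276261, 62874727, 7375, 15303646, 34272),
    Row(27, 258073, 62500739, 7424, 16078145, 48742),
    Row(29, 242552, 62001315, 7349, 16426147, 48746),
    Row(31, 228043, 61445912, 7307, 16864293, 48750),
    Row(33, 214559, 60744478, 7241, 17133827, 48768),
    Row(35, 203292, 60039871, 7129, 17249351, 45446),
    Row(37, 189948, 58899828, 7088, 17527450, 59437),
    Row(39, 180754, 58146806, 7027, 17610071, 54112),
    Row(41, 172209, 57126650, 6914, 17551789, 65207),
    Row(43, 165563, 56440648, 6925, 17654067, 73231),
]


def best_k(rows, column):
    """Return the k whose row has the largest value in the named column."""
    return max(rows, key=lambda row: getattr(row, column)).k


print("most total bp:          k =", best_k(ROWS, "sum_bp_all"))
print("most bp in contigs >1kb: k =", best_k(ROWS, "sum_bp_1kb"))
print("longest contig:          k =", best_k(ROWS, "max_contig"))
```

On this table the smallest k gives the most total sequence, while the largest k gives the most sequence in long contigs and the longest single contig.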
  • 103. How many reads map back? Mapped 3.8m paired-end reads (one subsample): • high-recall: 41% of pairs map • longer-contigs: 70% of pairs map + 150k single-end reads: • high-recall: 49% of sequences map • longer-contigs: 79% of sequences map
  • 104. Annotation/exploration with MG-RAST • You can upload genome assemblies to MG-RAST, and annotate them with coverage; tutorial to follow. • What does MG-RAST do?
  • 105. Conclusion • This is a pretty good metagenome assembly – > 80% of reads map! • Surprised that the larger assembly (63.2 Mbp, “high recall”) accounts for a smaller percentage of the reads – 49% vs 79% for the 56.4 Mbp “long contigs” assembly. • I now suspect that different parameters are recovering different subsets of the sample… • Don’t trust MG-RAST ADR calls.
  • 107. Estimates of metagenome size Calculation: # reads * (avg read len) / (diginorm coverage) Assumes: few entirely erroneous reads (upper bound); saturation (lower bound). • E. coli: 384k * 86.1 / 5.0 => 6.6 Mbp est. (true: 4.5 Mbp) • MM5 deep carbon: 60 Mbp • Great Prairie soil: 12 Gbp • Amazon Rain Forest Microbial Observatory: 26 Gbp
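The size calculation on the slide above can be sketched as follows. The function name is an assumption; the inputs are the slide's E. coli numbers and the MM5 post-diginorm total (601.3 Mbp at C=10) reported earlier in the deck.

```python
def estimate_metagenome_size(total_bp: float, diginorm_coverage: float) -> float:
    """Bases retained after digital normalization, divided by the
    normalization coverage cutoff, as a rough size estimate in bp."""
    return total_bp / diginorm_coverage


# E. coli: 384k reads x 86.1 bp average length, normalized to coverage 5.
ecoli_est = estimate_metagenome_size(384_000 * 86.1, 5.0)
print(f"E. coli estimate: {ecoli_est / 1e6:.1f} Mbp")

# MM5 deep carbon: 601.3 Mbp retained after diginorm to C=10.
mm5_est = estimate_metagenome_size(601.3e6, 10)
print(f"MM5 estimate: {mm5_est / 1e6:.1f} Mbp")
```

The E. coli result overshoots the true genome size, consistent with the slide's caveat about erroneous reads inflating the estimate.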
  • 108. Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? YES. Iverson et al. (2012), from the Armbrust lab, describes a technique for scaffolding metagenome contigs into ~whole genomes (PMID 22301318).
  • 109. Concluding thoughts • What works? • What needs work? • What will work?
  • 110. What works? Today, • From deep metagenomic data, you can get the gene and operon content (including abundance of both) from communities. • You can get microarray-like expression information from metatranscriptomics.
  • 111. What needs work? • Assembling ultra-deep samples is going to require more engineering, but is straightforward. (“Infinite assembly.”) • Building scaffolds and extracting whole genomes has been done, but I am not yet sure how feasible it is to do systematically with existing tools (c.f. Armbrust Lab).
  • 112. What will work, someday? • Sensitive analysis of strain variation. o Both assembly and mapping approaches do a poor job detecting many kinds of biological novelty. o The 1000 Genomes Project has developed some good tools that need to be evaluated on community samples. • Ecological/evolutionary dynamics in vivo. o Most work is done on 16S, not on genomes or functional content. o Here, sensitivity is really important!
  • 113. The interpretation challenge • For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples. • The vast majority of this genomic DNA contains unknown genes with largely unknown function. • Most annotations of gene function & interaction are from a few phylogenetically limited model organisms o Est 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology. o Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.
  • 114. What are future needs? • High-quality, medium+ throughput annotation of genomes? o Extrapolating from model organisms is both immensely important and yet lacking. o Strong phylogenetic sampling bias in existing annotations. • Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn’t scale.) • Integration of microbiology, community ecology/evolution modeling, and data analysis.

Editor's Notes

  1. Both BLAST. Shotgun: annotated reads only.
  2. Beware of Janet Jansson bearing data.
  3. Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.
  4. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
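The behaviour described in note 4 can be reproduced with a small simulation: distinct k-mers from the genome level off, while distinct k-mers created by sequencing errors keep growing with the number of reads. Everything here is a toy assumption (random genome, uniform substitution errors, the function name and all parameter values). It counts distinct k-mers as a stand-in for graph size and is not an assembler.

```python
import random


def simulate_kmer_growth(genome_len=2000, read_len=50, k=15,
                         error_rate=0.01, steps=(200, 400, 800, 1600), seed=1):
    """Return (n_reads, true_kmers_seen, erroneous_kmers_seen) per step."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    true_kmers = {genome[i:i + k] for i in range(genome_len - k + 1)}

    seen = set()
    results = []
    n_done = 0
    for n_target in steps:
        while n_done < n_target:
            start = rng.randrange(genome_len - read_len + 1)
            read = list(genome[start:start + read_len])
            # Introduce random substitution errors into the read.
            for i in range(read_len):
                if rng.random() < error_rate:
                    read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
            read = "".join(read)
            seen.update(read[i:i + k] for i in range(read_len - k + 1))
            n_done += 1
        n_true = len(seen & true_kmers)   # k-mers that exist in the genome
        n_error = len(seen - true_kmers)  # k-mers that only arise from errors
        results.append((n_target, n_true, n_error))
    return results


for n_reads, n_true, n_error in simulate_kmer_growth():
    print(f"{n_reads:5d} reads: {n_true:5d} true k-mers, {n_error:6d} erroneous k-mers")
```

In this toy setting the true k-mer count is capped by the genome length, so memory spent on erroneous k-mers eventually dominates as more reads are added.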