1. Making assembly cheap & easy, and consequences thereof
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Feb 2014
ctb@msu.edu
2. Generally, yay #openscience!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio:
'diginorm arxiv'
3. Problem under consideration: shotgun
metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
4. Analogy: we seek an understanding
of humanity via our libraries.
http://eofdreams.com/library.html;
5. But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
6. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Prior knowledge.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
The more different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books.
A categorization system would be an invaluable but
not infallible guide to book topics.
Understanding the language would help you validate
& understand the books.
7. Investigating soil microbial
communities
95% or more of soil microbes cannot be cultured
in lab.
Very little transport in soil and sediment =>
slow mixing rates.
Estimates of immense diversity:
Billions of microbial cells per gram of soil.
Million+ microbial species per gram of soil (Gans et al., 2005)
One observed lower bound for genomic sequence
complexity => 26 Gbp (Amazon Rain Forest
Microbial Observatory)
8. “By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”
Microbes live in & on:
• Surfaces of
aggregate particles;
• Pores within
microaggregates;
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
9. Questions to address
Role of soil microbes in nutrient cycling:
How does agricultural soil differ from native soil?
How do soil microbial communities respond to
climate perturbation?
Genome-level questions:
What kind of strain-level heterogeneity is present in
the population?
What are the phage and viral populations &
dynamics?
What species are where, and how much is shared
between different geographical locations?
10. Must use culture independent and
metagenomic approaches
Many reasons why you can't or don't want to culture:
Cross-feeding, niche specificity, dormancy, etc.
If you want to get at underlying function, 16S
analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics
are two common ways to investigate complex
microbial communities.
11. Shotgun metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
12. Computational reconstruction of
(meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
13. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Reference genomes.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
(Sequencing error)
The more different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books. (We don’t understand
most microbial function.)
A categorization system would be an invaluable but
not infallible guide to book topics. (Phylogeny can
guide interpretation.)
Understanding the language would help you validate
& understand the books.
16. Why do we need so much data?!
20-40x coverage is necessary; 100x is ~sufficient.
Mixed population sampling => sensitivity driven by
lowest abundance.
For example, for E. coli at a 1/1000 dilution, you would
need ~100x coverage of a 5 Mbp genome multiplied by
the 1000-fold dilution factor: 500 Gbp of sequence!
(For soil, the estimate is 50 Tbp.)
Sequencing is straightforward; data analysis is not.
“$1000 genome with $1m analysis”
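A back-of-the-envelope check of the coverage arithmetic above, in Python (the function name and parameters are illustrative, not from the talk):

    def required_sequencing_bp(genome_size_bp, target_coverage, abundance):
        """Total bp of shotgun sequence needed so that a community member
        present at the given abundance fraction reaches the target coverage."""
        return genome_size_bp * target_coverage / abundance

    # E. coli at 1/1000: 5 Mbp * 100x / 0.001 = 5e11 bp = 500 Gbp.
    print(required_sequencing_bp(5e6, 100, 1 / 1000))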
17. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest soil data set ever
sequenced, ~2010.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
19. The Problem
We can cheaply gather DNA data in quantities
sufficient to swamp straightforward assembly
algorithms running on commodity hardware.
No locality to the data in terms of graph structure.
Since ~2008:
The field has engaged in lots of engineering
optimization…
…but the data generation rate has consistently
outstripped Moore's Law.
21. Primary approach: Digital
normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1).
To get 10x coverage of B, you need 100x coverage of A!
Diversity vs. richness.
The high-coverage reads in sample A are unnecessary
for assembly.
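A minimal sketch of the diginorm algorithm, assuming reads arrive as plain sequence strings; the exact dict-based counting here is illustrative (khmer's real implementation uses a fixed-memory probabilistic count table):

    from statistics import median

    def normalize_by_median(reads, k=20, cutoff=20):
        """Keep a read only if the median count of its k-mers seen so far
        is below the coverage cutoff; otherwise it adds no new coverage."""
        counts = {}
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if not kmers:
                continue  # read shorter than k
            if median(counts.get(km, 0) for km in kmers) >= cutoff:
                continue  # estimated coverage already high: discard read
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read  # low-coverage read: keep it

Each read is inspected once and kept or discarded immediately, which is what makes diginorm a streaming, online algorithm.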
28. Diginorm is “lossy compression”
Nearly perfect from an information theoretic
perspective:
Discards 95% or more of the data for genomes.
Loses < 0.02% of the information.
29. Where are we taking this?
Streaming online algorithms only look at data
~once.
Diginorm is streaming, online…
Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational
efficiency.
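One ingredient of that efficiency is a fixed-memory streaming count structure. Below is a minimal CountMin sketch for k-mer counting; this is an illustration only (khmer uses a similar probabilistic counting structure, but the sizes and hashing here are arbitrary):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1_000_003, depth=4):
            self.width = width
            self.tables = [[0] * width for _ in range(depth)]

        def _indexes(self, item):
            # One salted hash per table.
            for i, _ in enumerate(self.tables):
                digest = hashlib.blake2b(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.width

        def add(self, item):
            for table, idx in zip(self.tables, self._indexes(item)):
                table[idx] += 1

        def count(self, item):
            # Never underestimates; taking the min across tables bounds the error.
            return min(table[idx]
                       for table, idx in zip(self.tables, self._indexes(item)))

    # Streaming usage: count 4-mers from reads as they arrive, in one pass.
    cms = CountMinSketch()
    for read in ["ACGTACGTACGT"]:  # stand-in for a read stream
        for i in range(len(read) - 3):
            cms.add(read[i:i + 4])
    print(cms.count("ACGT"))  # expected: 3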
31. Prospective: sequencing tumor cells
Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
1000 cells x 3 Gbp x 20x coverage: 60 Tbp of
sequence.
Most of this data will be redundant and not useful.
Developing diginorm-based algorithms to
eliminate data while retaining variant information.
32. The real challenge:
understanding
We have gotten distracted by shiny toys:
sequencing!! Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of
an environmental metagenome “means”,
functionally.
Most data is not openly available, so we cannot
mine correlations across data sets.
Most computational science is not reproducible,
so I can't reuse other people's tools or
approaches.
33. Data intensive biology & hypothesis
generation
My interest in biological data is to enable better
hypothesis generation.
34. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”
35. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper
than data gathering (i.e. essentially free);
…plus, we can run most of our approaches in
the cloud.
36. khmer-protocols
Effort to provide standard “cheap” assembly
protocols for the cloud.
Pipeline: Read cleaning; Diginorm; Assembly;
Annotation; RSEM differential expression.
Entirely copy/paste; ~2-6 days from raw reads
to assembly, annotations, and differential
expression analysis. ~$150 on Amazon per
data set.
Open, versioned, forkable, citable.
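As one hedged illustration of a protocol stage, the diginorm step can be driven from Python via khmer's normalize-by-median.py command-line script (a real khmer script, assumed installed and on PATH; the flag values and file names below are placeholders, so consult the protocol pages for exact settings):

    import subprocess

    subprocess.run(
        [
            "normalize-by-median.py",  # khmer's diginorm script
            "-k", "20",                # k-mer size
            "-C", "20",                # coverage cutoff
            "-o", "reads.keep.fq",     # hypothetical output file
            "reads.fq",                # hypothetical input file
        ],
        check=True,
    )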
38. Can we incentivize data sharing?
~$100-$150/transcriptome in the cloud
Offer to analyze people's existing data for
free, IFF they open it up within a year.
See: “Dead Sea Scrolls & Open Marine
Transcriptome Project” blog post; CephSeq white
paper.
39. “Research singularity”
The data a researcher generates in their lab
constitutes an increasingly small component of
the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.
41. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”