1. Making assembly cheap & easy, and consequences thereof
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Feb 2014
ctb@msu.edu
2. Generally, yay #openscience!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio:
'diginorm arxiv'
3. Problem under consideration: shotgun
metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
4. Analogy: we seek an understanding
of humanity via our libraries.
http://eofdreams.com/library.html;
5. But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
6. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Prior knowledge.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
The more different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books.
A categorization system would be an invaluable but
not infallible guide to book topics.
Understanding the language would help you validate
& understand the books.
7. Investigating soil microbial
communities
95% or more of soil microbes cannot be cultured
in lab.
Very little transport in soil and sediment =>
slow mixing rates.
Estimates of immense diversity:
Billions of microbial cells per gram of soil.
Million+ microbial species per gram of soil (Gans et al., 2005)
One observed lower bound for genomic sequence
complexity => 26 Gbp (Amazon Rain Forest
Microbial Observatory)
8. “By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”
Microbes live in & on:
• Surfaces of
aggregate particles;
• Pores within
microaggregates;
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
9. Questions to address
Role of soil microbes in nutrient cycling:
How does agricultural soil differ from native soil?
How do soil microbial communities respond to
climate perturbation?
Genome-level questions:
What kind of strain-level heterogeneity is present in
the population?
What are the phage and viral populations &
dynamics?
What species are where, and how much is shared
between different geographical locations?
10. Must use culture independent and
metagenomic approaches
Many reasons why you can't or don't want to culture:
Cross-feeding, niche specificity, dormancy, etc.
If you want to get at underlying function, 16S
analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics
are two common ways to investigate complex
microbial communities.
11. Shotgun metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
12. Computational reconstruction of
(meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
13. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Reference genomes.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
(Sequencing error)
The more different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books. (We don’t understand
most microbial function.)
A categorization system would be an invaluable but
not infallible guide to book topics. (Phylogeny can
guide interpretation.)
Understanding the language would help you validate
& understand the books.
16. Why do we need so much data?!
20-40x coverage is necessary; 100x is ~sufficient.
Mixed population sampling => sensitivity driven by
lowest abundance.
For example, for E. coli at a 1/1000 dilution, you would
need ~100x coverage of a 5 Mbp genome multiplied by
the 1000-fold dilution factor: 500 Gbp of sequence!
(For soil, the estimate is 50 Tbp.)
Sequencing is straightforward; data analysis is not.
“$1000 genome with $1m analysis”
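A back-of-the-envelope check of the coverage arithmetic above, in Python (the function name and parameters are illustrative, not from the talk):

    def required_sequencing_bp(genome_size_bp, target_coverage, abundance):
        """Total bp of shotgun sequence needed so that a community member
        present at the given abundance fraction reaches the target coverage."""
        return genome_size_bp * target_coverage / abundance

    # E. coli at 1/1000: 5 Mbp * 100x / 0.001 = 5e11 bp = 500 Gbp.
    print(required_sequencing_bp(5e6, 100, 1 / 1000))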
17. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest soil data set ever
sequenced, ~2010.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
19. The Problem
We can cheaply gather DNA data in quantities
sufficient to swamp straightforward assembly
algorithms running on commodity hardware.
No locality to the data in terms of graph structure.
Since ~2008:
The field has engaged in lots of engineering
optimization…
…but the data generation rate has consistently
outstripped Moore's Law.
21. Primary approach: Digital
normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1).
To get 10x coverage of B, you need 100x coverage of A!
Diversity vs. richness.
The high-coverage reads in sample A are unnecessary
for assembly.
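A minimal sketch of the diginorm algorithm, assuming reads arrive as plain sequence strings; the exact dict-based counting here is illustrative (khmer's real implementation uses a fixed-memory probabilistic count table):

    from statistics import median

    def normalize_by_median(reads, k=20, cutoff=20):
        """Keep a read only if the median count of its k-mers seen so far
        is below the coverage cutoff; otherwise it adds no new coverage."""
        counts = {}
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if not kmers:
                continue  # read shorter than k
            if median(counts.get(km, 0) for km in kmers) >= cutoff:
                continue  # estimated coverage already high: discard read
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read  # low-coverage read: keep it

Each read is inspected once and kept or discarded immediately, which is what makes diginorm a streaming, online algorithm.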
28. Diginorm is “lossy compression”
Nearly perfect from an information theoretic
perspective:
Discards 95% or more of the data for genomes.
Loses < 0.02% of the information.
29. Where are we taking this?
Streaming online algorithms only look at data
~once.
Diginorm is streaming, online…
Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational
efficiency.
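One ingredient of that efficiency is a fixed-memory streaming count structure. Below is a minimal CountMin sketch for k-mer counting; this is an illustration only (khmer uses a similar probabilistic counting structure, but the sizes and hashing here are arbitrary):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1_000_003, depth=4):
            self.width = width
            self.tables = [[0] * width for _ in range(depth)]

        def _indexes(self, item):
            # One salted hash per table.
            for i, _ in enumerate(self.tables):
                digest = hashlib.blake2b(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.width

        def add(self, item):
            for table, idx in zip(self.tables, self._indexes(item)):
                table[idx] += 1

        def count(self, item):
            # Never underestimates; taking the min across tables bounds the error.
            return min(table[idx]
                       for table, idx in zip(self.tables, self._indexes(item)))

    # Streaming usage: count 4-mers from reads as they arrive, in one pass.
    cms = CountMinSketch()
    for read in ["ACGTACGTACGT"]:  # stand-in for a read stream
        for i in range(len(read) - 3):
            cms.add(read[i:i + 4])
    print(cms.count("ACGT"))  # expected: 3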
31. Prospective: sequencing tumor cells
Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
1000 cells x 3 Gbp x 20x coverage: 60 Tbp of
sequence.
Most of this data will be redundant and not useful.
Developing diginorm-based algorithms to
eliminate data while retaining variant information.
32. The real challenge:
understanding
We have gotten distracted by shiny toys:
sequencing!! Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of
an environmental metagenome “means”,
functionally.
Most data is not openly available, so we cannot
mine correlations across data sets.
Most computational science is not reproducible,
so I can't reuse other people's tools or
approaches.
33. Data intensive biology & hypothesis
generation
My interest in biological data is to enable better
hypothesis generation.
34. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”
35. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper
than data gathering (i.e. essentially free);
…plus, we can run most of our approaches in
the cloud.
36. khmer-protocols
Effort to provide standard “cheap” assembly
protocols for the cloud.
Pipeline: Read cleaning; Diginorm; Assembly;
Annotation; RSEM differential expression.
Entirely copy/paste; ~2-6 days from raw reads
to assembly, annotations, and differential
expression analysis. ~$150 on Amazon per
data set.
Open, versioned, forkable, citable.
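As one hedged illustration of a protocol stage, the diginorm step can be driven from Python via khmer's normalize-by-median.py command-line script (a real khmer script, assumed installed and on PATH; the flag values and file names below are placeholders, so consult the protocol pages for exact settings):

    import subprocess

    subprocess.run(
        [
            "normalize-by-median.py",  # khmer's diginorm script
            "-k", "20",                # k-mer size
            "-C", "20",                # coverage cutoff
            "-o", "reads.keep.fq",     # hypothetical output file
            "reads.fq",                # hypothetical input file
        ],
        check=True,
    )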
38. Can we incentivize data sharing?
~$100-$150/transcriptome in the cloud
Offer to analyze people's existing data for
free, IFF they open it up within a year.
See: “Dead Sea Scrolls & Open Marine
Transcriptome Project” blog post; CephSeq white
paper.
39. “Research singularity”
The data a researcher generates in their lab
constitutes an increasingly small component of
the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.
41. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”