SCALABLE APPROACHES 
TO EXPLORING 
MICROBIAL DIVERSITY 
C. Titus Brown 
ctb@msu.edu 
Asst Professor, MMG / CSE; Michigan State University 
1/15: Population Health & Reproduction, VetMed, UC Davis 
Talk slides on slideshare.net/c.titus.brown
Funding and motivation:
The central question of my lab -- 
How can we most effectively use computation to extract 
information from large sequence data sets, for the purpose 
of better understanding non- and semi-model organisms? 
Focus on environmental microbes, marine animals, 
& agricultural and veterinary animals.
Biology is becoming data rich – and a 
rising tide lifts all boats! 
http://susieinfrance.blogspot.com/2010/06/rising-tide-lifts-all-boats.html
…but sometimes the tide comes in a bit 
fast.
Our foil for today: 
Investigating soil microbial communities 
Life on earth depends on soil microbes, but: 
• 95% or more of soil microbes cannot be cultured in lab. 
• Very little transport in soil and sediment => 
slow mixing rates. 
• Estimates of immense diversity: 
• Billions of microbial cells per gram of soil. 
• Million+ microbial species per gram of soil (Gans et al, 2005) 
• One observed lower bound for genomic sequence complexity => 
26 Gbp (Amazon Rain Forest Microbial Observatory)
“By 'soil' we understand (Vil'yams, 1931) a loose surface 
layer of earth capable of yielding plant crops. In the physical 
sense the soil represents a complex disperse system 
consisting of three phases: solid, liquid, and gaseous.” 
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS 
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
Microbes live in & on: 
• Surfaces of 
aggregate particles; 
• Pores within 
microaggregates;
Specific questions to address: 
• Role of soil microbes in nutrient cycling? 
• How does agricultural soil differ from native soil? 
• How do soil microbial communities respond to climate 
perturbation? 
• Genome-level questions: 
• What kind of strain-level heterogeneity is present in the population? 
• What are the phage and viral populations & dynamics thereof? 
• What species are where, and how much is shared between 
different geographical locations?
Must use culture independent and 
metagenomic approaches 
• Many reasons why you can’t or don’t want to culture: 
Cross-feeding, niche specificity, dormancy, etc. 
• If you want to get at underlying function, 16S analysis 
alone is not sufficient. 
Single-cell sequencing & shotgun metagenomics are two 
common ways to investigate complex microbial communities.
Shotgun metagenomics 
• Collect samples; 
• Extract DNA; 
• Feed into sequencer; 
• Computationally analyze. 
“Sequence it all and let the 
bioinformaticians sort it out” 
Wikipedia: Environmental shotgun sequencing.png
Computational reconstruction of 
(meta)genomic content. 
http://eofdreams.com/library.html; 
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; 
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Points: 
• Lots of fragments needed! (Deep sampling.) 
• Having read and understood some books will help quite a bit 
(Reference genomes.) 
• Rare books will be harder to reconstruct than common books. 
• Errors in OCR process matter quite a bit. (Sequencing error) 
• The more, different specialized libraries you sample, the more 
likely you are to discover valid correlations between topics and 
books. (We don’t understand most microbial function.) 
• A categorization system would be an invaluable but not 
infallible guide to book topics. (Phylogeny can guide 
interpretation.) 
• Understanding the language would help you validate & 
understand the books.
Great Prairie Grand Challenge - SAMPLING LOCATIONS 
2008
A “Grand Challenge” dataset (DOE/JGI) 
[Bar chart: basepairs of sequencing (Gbp), axis 0 to 600, split by platform (GAII, HiSeq), for eight soil samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass.] 
Total: 1,846 Gbp soil metagenome 
Reference data sets shown: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp
My algorithm research: 3 methods. 
1. Adaptation of a suite of probabilistic data structures for 
representing set membership and counting (Bloom filters 
and CountMin Sketch). (Zhang et al., PLoS One, 2014.) 
2. An online streaming approach to lossy compression of 
sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.) 
3. Compressible de Bruijn graph representation for 
assembly. (Pell et al., PNAS, 2012.)
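Method #1 can be illustrated with a toy CountMin Sketch for k-mer counting. This is a minimal sketch only, not the data structure from the cited paper: the class name, the table sizes, and the use of salted SHA-1 as the hash family are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates it."""

    def __init__(self, width=10007, depth=4):
        # width and depth are illustrative values, not tuned parameters.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = CountMinSketch()
for kmer in kmers("ATGGCGTATGGCGT", 4):
    sketch.add(kmer)

print(sketch.count("ATGG"))  # at least 2: "ATGG" occurs twice in the sequence
```

Memory use is fixed by width x depth regardless of how many distinct k-mers are seen, which is the property that matters for very large data sets.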
Method #2 - Digital normalization 
(a computational version of library normalization) 
Suppose you have a 
dilution factor of A (10) to 
B(1). To get 10x of B you 
need to get 100x of A! 
Overkill!! 
This 100x will consume 
disk space and, because 
of errors, memory. 
We can discard it for 
you…
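The idea can be sketched in a few lines, following the published description of digital normalization (Brown et al., arXiv, 2012): keep a read only if the median count of its k-mers is still below a coverage cutoff. This is a toy version under stated assumptions; an exact `Counter` stands in for the probabilistic counting structure, and the reads, `k`, and `cutoff` values are made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=5):
    """Single pass over reads; yield only reads whose coverage is below cutoff."""
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        # Median k-mer count serves as the estimate of this read's coverage so far.
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            yield read


# Hypothetical 10:1 mixture, as in the dilution example: A at 100x, B at 10x.
abundant = ["ATGGCGTACCTGA"] * 100
rare = ["TTCAGGCATTGCA"] * 10
kept = list(normalize(abundant + rare, k=4, cutoff=5))

print(len(kept))  # 10 reads kept out of 110: 5 from A and 5 from B
```

Reads from the abundant organism are discarded once it reaches the cutoff, while the rare organism keeps accumulating reads until it reaches the same cutoff.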
Digital normalization
Assembling Iowa prairie and Iowa corn: 
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding 
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill 
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill 
Putting it in perspective: 
Total equivalent of ~1200 bacterial genomes 
Human genome ~3 billion bp 
Adina Howe
Resulting contigs are all low coverage. 
Howe et al., 2014 
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
Iowa prairie & corn DNA abundances are 
very even. 
Corn Prairie 
Howe et al., 2014
Assembly is a good idea: 
Howe et al., 2014
Analyses of 
metabolic potential 
begin to illuminate 
differences. 
Howe et al., 2014
We see little strain variation in sample. 
[Plot: top two allele frequencies vs. position within contig] 
Can measure by read mapping. 
Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%
Biogeography: Iowa sample overlap? 
Corn and prairie content graphs have 51% nucleotide 
overlap. 
Corn Prairie 
Suggests that at greater depth, samples may have similar 
genomic content.
Biogeography of genomic DNA in soil 
How much genomic richness is shared 
between different sites? 
Qingpeng Zhang
So, for soil: 
• We really do need more data; 
• But at least now we can assemble what we already have. 
• Estimate required sequencing depth at 50 Tbp; 
• Now also have 2-8 Tbp from Amazon Rain Forest 
Microbial Observatory. 
• …still not saturated coverage, but getting closer. 
Iowa soil work has been published: 
Howe et al., 2014, PNAS.
So, for soil: 
Note! There are now much faster assembly approaches…! 
See: Megahit, http://arxiv.org/abs/1409.7208 
(Technology marches on!)
So, for soil: 
But the diginorm approach also turns out to be widely 
useful.
Digital normalization is popular… 
Estimated ~1000 users of our software. 
Diginorm algorithm now included in Trinity 
software from Broad Institute (~10,000 users) 
Illumina TruSeq long-read technology now 
incorporates our approach (~100,000 users)
The data problem: Looking forward 5 
years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp each cell… 
• …or 120 Tbp of data. 
• HiSeq X10 can do the sequencing in ~3 weeks. 
• The variant calling will require 2,000 CPU weeks… 
• …so, given ~2,000 computers, can do this all in one 
month.
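The arithmetic above can be checked directly. This is a back-of-the-envelope sketch: the ~3 billion bp haploid genome size is taken from the earlier slide, and running sequencing and variant calling back to back (rather than overlapped) is an assumption.

```python
GENOME_BP = 3_000_000_000   # haploid human genome, ~3 billion bp
CELLS = 1000                # single cells from one tumor
COVERAGE = 40               # 40x haploid coverage

per_cell_bp = GENOME_BP * COVERAGE   # bases sequenced per cell
total_bp = per_cell_bp * CELLS       # bases sequenced for the whole tumor

SEQUENCING_WEEKS = 3        # HiSeq X10 estimate from the slide
CPU_WEEKS = 2000            # variant-calling cost from the slide
COMPUTERS = 2000

compute_weeks = CPU_WEEKS / COMPUTERS            # wall-clock weeks if perfectly parallel
total_weeks = SEQUENCING_WEEKS + compute_weeks   # assumes no overlap of the two stages

print(per_cell_bp / 1e9, "Gbp per cell")   # 120.0 Gbp per cell
print(total_bp / 1e12, "Tbp in total")     # 120.0 Tbp in total
print(total_weeks, "weeks end to end")     # 4.0 weeks end to end
```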
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
• Two issues: 
•Volume of data & compute 
infrastructure; 
• Latency for clinical applications.
We face an infinite data problem. 
• For all intents and purposes 
• For example, Illumina estimates that 228,000 human 
genomes will be resequenced this year, primarily by 
researchers; this is only going to grow. 
• Similar stories across all of biology (although #s lower :)
Current analysis approaches are multipass, 
e.g. variant calling: 
Data 
Mapping 
Sorting 
Calling Answer 
On infinite data, you really only want to look at the data once…
Streaming algorithms can be very efficient 
Data 
1-pass 
Answer 
See also eXpress, Roberts et al., 2013.
Some key points -- 
• Digital normalization is streaming. 
• Digital normalizing is computationally efficient (lower 
memory than other approaches; parallelizable/multicore; 
single-pass) 
• Currently, primarily used for prefiltering for assembly, but 
relies on underlying abstraction (De Bruijn graph) that is 
also used in variant calling.
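The De Bruijn graph abstraction mentioned above can be sketched as an implicit graph: store only the set of k-mers, and discover edges by testing which one-base extensions are present. This is a simplified illustration; a Python `set` stands in for a Bloom filter, the walk goes forward only, reverse complements are ignored, and the reads are invented.

```python
def build_kmer_set(reads, k):
    """Collect every k-mer in the reads; the set is the whole graph."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(kmer, kmer_set):
    """Neighbors are the one-base extensions that exist in the set."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_set]


def walk(start, kmer_set):
    """Extend a path while there is exactly one unvisited successor."""
    path = start
    current = start
    seen = {start}
    while True:
        options = successors(current, kmer_set)
        if len(options) != 1 or options[0] in seen:
            break  # dead end, branch, or cycle
        current = options[0]
        seen.add(current)
        path += current[-1]
    return path


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]   # hypothetical overlapping reads
kmer_set = build_kmer_set(reads, 4)
contig = walk("ATGG", kmer_set)

print(contig)  # ATGGCGTACCTGA
```

No edges are stored explicitly, so memory depends only on how the k-mer set is represented.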
Error correction as the solution for our ills 
Current work: error correction (??) 
Errors in sequencing data are at the root of many 
problems: 
• Assembly is 100x lower memory in the absence of errors. 
• Mapping is computationally trivial when there are no 
errors. 
• Variant calling and genotyping become simple, as does 
species detection.
We can error correct high-coverage shotgun data 
with k-mer spectra: 
Chaisson et al., 2009 
True k-mers 
Erroneous k-mers
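A k-mer spectrum approach can be sketched as follows: at high coverage, true k-mers are seen many times and erroneous k-mers only a few times, so a count threshold separates them. This is a generic toy illustration, not the algorithm of Chaisson et al.; the brute-force single-substitution search, the threshold, and the reads are assumptions.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def all_trusted(seq, counts, k, min_count):
    """True if every k-mer in seq is abundant enough to be considered real."""
    return all(counts[seq[i:i + k]] >= min_count for i in range(len(seq) - k + 1))


def correct_read(read, counts, k, min_count):
    """Return the first single-base substitution that makes all k-mers trusted."""
    if all_trusted(read, counts, k, min_count):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if all_trusted(candidate, counts, k, min_count):
                return candidate
    return read  # no single substitution fixes it; leave the read unchanged


true_read = "ATGGCGTACCTGA"
bad_read = "ATGGCGTTCCTGA"   # one substitution error (A -> T) at index 7
counts = kmer_spectrum([true_read] * 20 + [bad_read], k=4)
fixed = correct_read(bad_read, counts, k=4, min_count=5)

print(fixed == true_read)  # True
```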
Streaming error correction on E. coli data 
(Early days…) 
1% error rate, 100x coverage. 
                 | TP          | FP         | TN          | FN 
                 | (corrected) | (mistakes) | (OK)        | (missed) 
Error correction | 3,494,631   | 3,865      | 460,601,171 | 5,533 
Michael Crusoe, Jordan Fish, Jason Pell
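The four counts above can be turned into summary rates. This sketch assumes each count is a number of bases, with TP meaning errors that were corrected and FN meaning errors that were missed; the variable names are illustrative.

```python
tp = 3_494_631     # errors corrected
fp = 3_865         # correct bases changed by mistake
tn = 460_601_171   # correct bases left alone
fn = 5_533         # errors missed

total = tp + fp + tn + fn
error_rate = (tp + fn) / total   # fraction of bases that were erroneous
sensitivity = tp / (tp + fn)     # fraction of errors that were corrected
precision = tp / (tp + fp)       # fraction of changes that were right
specificity = tn / (tn + fp)     # fraction of correct bases left untouched

print(f"bases examined: {total:,}")
print(f"observed error rate: {error_rate:.2%}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"precision:   {precision:.4f}")
print(f"specificity: {specificity:.6f}")
```

Under these assumptions, the observed error rate comes out a little under the nominal 1%.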
Error correction <=> variant calling 
Single pass, reference free, tunable, streaming 
online variant calling.
Streaming with reads… 
[Diagram: reads (Sequence...) stream one at a time into a Graph, which yields Variants]
Analysis is done after sequencing. 
[Timeline: Sequencing, then Analysis]
Streaming with bases 
[Diagram: the first k bases of each read, then bases k+1, k+2, ..., stream into a Graph, which yields Variants]
Integrate sequencing and analysis 
Sequencing 
Analysis 
Are we done yet?
What does the future hold? 
• More emphasis on training and infrastructure. 
• Data integration! 
• Identifying the function of unknown genes…
Summer NGS workshop (2010-2017)
The infrastructure challenge 
In 5-10 years, we will have nigh-infinite data. 
(Genomic, transcriptomic, proteomic, metabolomic, 
…?) 
We currently have no good way of querying, 
exploring, investigating, or mining these data sets, 
especially across multiple locations.
Distributed graph database server 
[Architecture diagram. Components: web interface + API; compute server (Galaxy? Arvados?); data/info; raw data sets; graph query layer spanning public servers, a "walled garden" server, and a private server; upload/submit (NCBI, KBase); import (MG-RAST, SRA, EBI)]
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014)
My charge: We don’t know what most genes do. 
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding 
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill 
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill 
Putting it in perspective: 
Total equivalent of ~1200 bacterial genomes 
Human genome ~3 billion bp 
Howe et al., 2014; pmid 24632729
Data Intensive Biology 
Opportunities & challenges; how can we best support the 
biology? 
"I have traveled the length and breadth of this 
country and talked with the best people, and I can 
assure you that data processing is a fad that won't 
last out the year." --The editor in charge of business 
books for Prentice Hall, 1957
Thanks! 
Key points: 
• Facing nigh-infinite data situation; 
• The first stages of sequence analysis, assembly and variant 
calling, are computationally intensive (but we’re hoping to fix 
that); 
• Training in data intensive biology is critical to the future of 
biology. 
• Data sharing and data integration infrastructure is also critical.
Graph alignment can detect read saturation
Proposal: distributed graph database server 
[Architecture diagram. Components: web interface + API; compute server (Galaxy? Arvados?); data/info; raw data sets; graph query layer spanning public servers, a "walled garden" server, and a private server; upload/submit (NCBI, KBase); import (MG-RAST, SRA, EBI)]
Graph queries 
across public & walled-garden data sets: 
[Diagram labels: assembled sequence; raw sequence; SIMILARITY TO; ALSO CONTAINS; nitrite reductase; ppaZ] 
See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.

Mais conteúdo relacionado

Mais procurados

2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptxc.titus.brown
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsJonathan Eisen
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Larry Smarr
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysParallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysjrossibarra
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Keith Bradnam
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Keith Bradnam
 
Genome size and adaptation in plants
Genome size and adaptation in plantsGenome size and adaptation in plants
Genome size and adaptation in plantsjrossibarra
 
Adaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizeAdaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizejrossibarra
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Monica Munoz-Torres
 

Mais procurados (20)

2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
2014 davis-talk
2014 davis-talk2014 davis-talk
2014 davis-talk
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea maysParallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
Parallel Altitudinal Clines Reveal Adaptive Evolution Of Genome Size In Zea mays
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
 
Genome size and adaptation in plants
Genome size and adaptation in plantsGenome size and adaptation in plants
Genome size and adaptation in plants
 
Adaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maizeAdaptive evolution of genome size across altitudinal clines in maize
Adaptive evolution of genome size across altitudinal clines in maize
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
 

Destaque

2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcastc.titus.brown
 
Engage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalEngage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalWebtrends
 
SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.Gina Montgomery, V-TSP
 
Moments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorMoments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorKyle Lacy
 
ProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston
 
The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...ProductCamp Boston
 
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber ShandwickWeber Shandwick Korea
 
Engage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BEngage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BAnco Stuij
 
Cost effective azure
Cost effective azureCost effective azure
Cost effective azureGal Kogman
 
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Gina Montgomery, V-TSP
 

Destaque (13)

2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
Engage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - TechnicalEngage 2013 - Webtrends Streams - Technical
Engage 2013 - Webtrends Streams - Technical
 
SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.SharePoint 2013 and the Consumerization of I.T.
SharePoint 2013 and the Consumerization of I.T.
 
John saraguro diapositiva
John saraguro diapositivaJohn saraguro diapositiva
John saraguro diapositiva
 
Moments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer BehaviorMoments Matter - Technology Transforming Consumer Behavior
Moments Matter - Technology Transforming Consumer Behavior
 
ProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening SlidesProductCamp Boston 2016 Opening Slides
ProductCamp Boston 2016 Opening Slides
 
Internal, External and Digital Presence of the CEO is becoming more and more ...
Internal, External and Digital Presence of the CEO is becoming more and more ...Internal, External and Digital Presence of the CEO is becoming more and more ...
Internal, External and Digital Presence of the CEO is becoming more and more ...
 
The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...The Art and Science of Pricing: Simple tools to align price with value (Rober...
The Art and Science of Pricing: Simple tools to align price with value (Rober...
 
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
[Infographic Korea Edition] The CEO Reputation Premium - Weber Shandwick
 
Engage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2BEngage in effective collaboration with Azure AD B2B
Engage in effective collaboration with Azure AD B2B
 
Cost effective azure
Cost effective azureCost effective azure
Cost effective azure
 
actividad 1.4
actividad 1.4actividad 1.4
actividad 1.4
 
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
Unleash the Power of Video Communication - Office 365 Video vs. Azure Media S...
 

Semelhante a 2014 nyu-bio-talk

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talkc.titus.brown
 
Novel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityNovel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityQingpeng "Q.P." Zhang
 
2015 05 Scaling from seeds to ecosystems
2015 05 Scaling from seeds to ecosystems2015 05 Scaling from seeds to ecosystems
2015 05 Scaling from seeds to ecosystemsTimeScience
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
Biological Databases | Access to sequence data and related information
Biological Databases | Access to sequence data and related information Biological Databases | Access to sequence data and related information
Biological Databases | Access to sequence data and related information NahalMalik1
 
Trends In Genomics
Trends In GenomicsTrends In Genomics
Trends In GenomicsSaul Kravitz
 
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes
Apollo and i5K: Collaborative Curation and Interactive Analysis of GenomesApollo and i5K: Collaborative Curation and Interactive Analysis of Genomes
Apollo and i5K: Collaborative Curation and Interactive Analysis of GenomesMonica Munoz-Torres
 
ISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesAdina Chuang Howe
 
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...FOODCROPS
 

Semelhante a 2014 nyu-bio-talk (20)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
Big Data Field Museum
Big Data Field MuseumBig Data Field Museum
Big Data Field Museum
 
Sweden_eemis_big_data
Sweden_eemis_big_dataSweden_eemis_big_data
Sweden_eemis_big_data
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
Novel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial DiversityNovel Computational Approaches to Investigate Microbial Diversity
Novel Computational Approaches to Investigate Microbial Diversity
 
2015 05 Scaling from seeds to ecosystems
2015 05 Scaling from seeds to ecosystems2015 05 Scaling from seeds to ecosystems
2015 05 Scaling from seeds to ecosystems
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Biological Databases | Access to sequence data and related information
Biological Databases | Access to sequence data and related information Biological Databases | Access to sequence data and related information
Biological Databases | Access to sequence data and related information
 
Trends In Genomics
Trends In GenomicsTrends In Genomics
Trends In Genomics
 
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes
Apollo and i5K: Collaborative Curation and Interactive Analysis of GenomesApollo and i5K: Collaborative Curation and Interactive Analysis of Genomes
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes
 
ISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar Slides
 
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 

 

 

2014 nyu-bio-talk

  • 1. SCALABLE APPROACHES TO EXPLORING MICROBIAL DIVERSITY C. Titus Brown ctb@msu.edu Asst Professor, MMG / CSE; Michigan State University 1/15: Population Health & Reproduction, VetMed, UC Davis Talk slides on slideshare.net/c.titus.brown
  • 3. The central question of my lab -- How can we most effectively use computation to extract information from large sequence data sets, for the purpose of better understanding non- and semi-model organisms? Focus on environmental microbes, marine animals, & agricultural and veterinary animals.
  • 4. Biology is becoming data rich – and a rising tide lifts all boats! http://susieinfrance.blogspot.com/2010/06/rising-tide-lifts-all-boats.html
  • 5. …but sometimes the tide comes in a bit fast.
  • 6. Our foil for today: Investigating soil microbial communities Life on earth depends on soil microbes, but: • 95% or more of soil microbes cannot be cultured in lab. • Very little transport in soil and sediment => slow mixing rates. • Estimates of immense diversity: • Billions of microbial cells per gram of soil. • Million+ microbial species per gram of soil (Gans et al, 2005) • One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
  • 7. “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.” N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html Microbes live in & on: • Surfaces of aggregate particles; • Pores within microaggregates;
  • 8. Specific questions to address: • Role of soil microbes in nutrient cycling? • How does agricultural soil differ from native soil? • How do soil microbial communities respond to climate perturbation? • Genome-level questions: • What kind of strain-level heterogeneity is present in the population? • What are the phage and viral populations & dynamics thereof? • What species are where, and how much is shared between different geographical locations?
  • 9. Must use culture-independent and metagenomic approaches • Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc. • If you want to get at underlying function, 16S analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
  • 10. Shotgun metagenomics • Collect samples; • Extract DNA; • Feed into sequencer; • Computationally analyze. “Sequence it all and let the bioinformaticians sort it out” (Image: Wikipedia, Environmental shotgun sequencing.png)
  • 11. Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 12. Points: • Lots of fragments needed! (Deep sampling.) • Having read and understood some books will help quite a bit (Reference genomes.) • Rare books will be harder to reconstruct than common books. • Errors in OCR process matter quite a bit. (Sequencing error) • The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.) • A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.) • Understanding the language would help you validate & understand the books.
  • 13. Great Prairie Grand Challenge -- SAMPLING LOCATIONS 2008
  • 14. A “Grand Challenge” dataset (DOE/JGI). [Figure: bar chart of basepairs of sequencing (Gbp), GAII and HiSeq, axis 0 to 600, for Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Total: 1,846 Gbp soil metagenome. Reference points: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
  • 16. My algorithm research: 3 methods. 1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). (Zhang et al., PLoS One, 2014.) 2. An online streaming approach to lossy compression of sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.) 3. Compressible de Bruijn graph representation for assembly. (Pell et al., PNAS, 2012.)
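As a rough illustration of method 1, a Count-Min Sketch for approximate k-mer counting might look like the following. This is a minimal sketch, not the lab's implementation: the class name `CountMinSketch`, the table sizes, the helper `kmers`, and the use of salted `hashlib.blake2b` as the hash family are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates can be too high, never too low."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One hash per table, derived by salting blake2b with the table index.
        for i in range(self.depth):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=str(i).encode()).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for i, slot in self._slots(kmer):
            self.tables[i][slot] += 1

    def count(self, kmer):
        # The minimum over tables is the estimate least inflated by collisions.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ATGGCGTATGGCG", 5):
    sketch.add(kmer)
```

Memory is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the property that matters for very large data sets; the price is a one-sided counting error.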
  • 17. Method #2 - Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
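The dilution example on this slide can be sketched in a few lines. The code below is a toy version of the idea, assuming a read is kept only while the median count of its k-mers (among reads kept so far) is below a cutoff; the function names, the toy genomes, and the exact `Counter` are assumptions for illustration, and a probabilistic counter would replace the `Counter` at scale.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=10):
    """Single pass: keep a read only while its estimated coverage is below cutoff."""
    counts = Counter()  # exact counts; memory grows with distinct k-mers
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # The median k-mer count seen so far approximates this read's coverage.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


# Toy 10:1 mixture as on the slide: 100 copies of "A", 10 copies of "B".
genome_a = "ACGTTGCAAGCT"
genome_b = "TTAGGCCATAGG"
reads = [genome_a] * 100 + [genome_b] * 10
kept = list(digital_normalize(reads, k=4, cutoff=10))
```

With these inputs `kept` holds 10 copies of each toy genome: the excess coverage of A is discarded as it streams past, while every read of the rarer B is retained.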
  • 24. Assembling Iowa prairie and Iowa corn. Columns: Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding. Row 1: 2.5 bill | 4.5 mill | 19% | 5.3 mill. Row 2: 3.5 bill | 5.9 mill | 22% | 6.8 mill. Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. (Adina Howe)
  • 25. Resulting contigs are all low coverage. Howe et al., 2014. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  • 26. Iowa prairie & corn DNA abundances are very even. Corn Prairie Howe et al., 2014
  • 27. Assembly is a good idea: Howe et al., 2014
  • 28. Analyses of metabolic potential begin to illuminate differences. Howe et al., 2014
  • 29. We see little strain variation in sample. Can measure by read mapping. Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%. [Figure axes: top two allele frequencies vs. position within contig.]
  • 30. Biogeography: Iowa sample overlap? Corn and prairie content graphs have 51% nucleotide overlap. Suggests that at greater depth, samples may have similar genomic content. [Figure labels: Corn, Prairie.]
  • 31. Biogeography of genomic DNA in soil How much genomic richness is shared between different sites? Qingpeng Zhang
  • 32. So, for soil: • We really do need more data; • But at least now we can assemble what we already have. • Estimate required sequencing depth at 50 Tbp; • Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory. • …still not saturated coverage, but getting closer. Iowa soil work has been published: Howe et al., 2014, PNAS.
  • 33. So, for soil: Note! There are now much faster assembly approaches…! See: Megahit, http://arxiv.org/abs/1409.7208 (Technology marches on!)
  • 34. So, for soil: • We really do need more data; • But at least now we can assemble what we already have. • Estimate required sequencing depth at 50 Tbp; • Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory. • …still not saturated coverage, but getting closer. But, diginorm approach turns out to also be widely useful.
  • 35. Digital normalization is popular… Estimated ~1000 users of our software. Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users) Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
  • 36. The data problem: Looking forward 5 years… Navin et al., 2011
  • 37. Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp each cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month.
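The arithmetic on this slide can be checked directly. The sketch below assumes a haploid human genome of about 3 billion bp (the figure quoted elsewhere in the talk) and perfectly parallel variant calling across machines; the variable names are illustrative.

```python
HAPLOID_GENOME_BP = 3_000_000_000   # ~3 billion bp, as quoted in the talk
cells = 1000
coverage = 40

bp_per_cell = HAPLOID_GENOME_BP * coverage   # bases sequenced per cell
total_bp = bp_per_cell * cells               # bases sequenced for the whole tumor

sequencing_weeks = 3      # HiSeq X10 estimate from the slide
cpu_weeks = 2000          # variant-calling cost from the slide
computers = 2000
# Assumes the work divides evenly across machines with no overhead.
calling_weeks = cpu_weeks / computers
total_weeks = sequencing_weeks + calling_weeks

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell, {total_bp / 1e12:.0f} Tbp total")
print(f"{total_weeks:.0f} weeks end to end")
```

The totals match the slide: 120 Gbp per cell, 120 Tbp overall, and roughly four weeks when sequencing and variant calling run back to back.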
  • 38. Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: •Volume of data & compute infrastructure; • Latency for clinical applications.
  • 39. We face an infinite data problem. • For all intents and purposes • For example, Illumina estimates that 228,000 human genomes will be resequenced this year, primarily by researchers; this is only going to grow. • Similar stories across all of biology (although #s lower :)
  • 40. Current analysis approaches are multipass, e.g. variant calling: Data Mapping Sorting Calling Answer On infinite data, you really only want to look at the data once…
  • 41. Streaming algorithms can be very efficient Data 1-pass Answer See also eXpress, Roberts et al., 2013.
  • 42. Some key points -- • Digital normalization is streaming. • Digital normalizing is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass) • Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
  • 49. Error correction as the solution for our ills Current work: error correction (??) Errors in sequencing data are at the root of many problems: • Assembly is 100x lower memory in the absence of errors. • Mapping is computationally trivial when there are no errors. • Variant calling and genotyping become simple, as does species detection.
  • 50. We can error correct high-coverage shotgun data with k-mer spectra: Chaisson et al., 2009 True k-mers Erroneous k-mers
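The intuition behind k-mer spectra is that, at high coverage, true k-mers are seen many times while a sequencing error creates k-mers seen only once or twice. A minimal sketch of that separation follows; the function names, the toy reads, and the `min_count` threshold are assumptions for illustration, and this only flags suspect k-mers rather than correcting them.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_abundance(counts, min_count):
    """Partition k-mers into high-abundance (trusted) and low-abundance (suspect)."""
    trusted = {km for km, c in counts.items() if c >= min_count}
    suspect = set(counts) - trusted
    return trusted, suspect


# 20x coverage of one toy sequence, plus one read with a single substitution.
reads = ["ACGTACGGA"] * 20 + ["ACGTTCGGA"]
counts = kmer_spectrum(reads, k=5)
trusted, suspect = split_by_abundance(counts, min_count=3)
```

In this toy case a single substituted base in the middle of the read produces five novel 5-mers, all of which land in `suspect`.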
  • 51. Streaming error correction on E. coli data (Early days…) 1% error rate, 100x coverage. Error correction results: TP 3,494,631 (corrected); FP 3,865 (mistakes); TN 460,601,171 (OK); FN 5,533 (missed). Michael Crusoe, Jordan Fish, Jason Pell
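The four counts on this slide can be turned into the usual summary ratios. The sketch below uses only the numbers given on the slide and assumes the standard definitions of sensitivity and precision; the variable names are illustrative.

```python
# Counts from the slide (per base).
tp = 3_494_631     # errors corrected
fp = 3_865         # correct bases wrongly changed ("mistakes")
tn = 460_601_171   # correct bases left alone ("OK")
fn = 5_533         # errors missed

total_bases = tp + fp + tn + fn
sensitivity = tp / (tp + fn)   # fraction of real errors that were corrected
precision = tp / (tp + fp)     # fraction of corrections that were right

print(f"bases examined: {total_bases:,}")
print(f"sensitivity: {sensitivity:.4f}  precision: {precision:.4f}")
```

Both ratios come out above 0.998 for these counts.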
  • 54. Error correction  variant calling Single pass, reference free, tunable, streaming online variant calling.
  • 55. Streaming with reads… [Diagram: a stream of reads ("Sequence...") feeding a graph, producing variants.]
  • 56. Analysis is done after sequencing. Sequencing Analysis
  • 57. Streaming with bases [Diagram: windows of k bases, then bases k+1, k+2, ..., feeding a graph, producing variants.]
  • 58. Integrate sequencing and analysis Sequencing Analysis Are we done yet?
  • 59. What does the future hold? • More emphasis on training and infrastructure. • Data integration! • Identifying the function of unknown genes…
  • 60. Summer NGS workshop (2010-2017)
  • 61. The infrastructure challenge In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.
  • 62. Distributed graph database server [Diagram components: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]
  • 63. Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
  • 64. My charge: We don’t know what most genes do. Columns: Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding. Row 1: 2.5 bill | 4.5 mill | 19% | 5.3 mill. Row 2: 3.5 bill | 5.9 mill | 22% | 6.8 mill. Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Howe et al., 2014; pmid 24632729
  • 65. Data Intensive Biology Opportunities & challenges; how can we best support the biology? "I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won't last out the year." --The editor in charge of business books for Prentice Hall, 1957
  • 66. Thanks! Key points: • Facing nigh-infinite data situation; • The first stages of sequence analysis, assembly and variant calling, are computationally intensive (but we’re hoping to fix that); • Training in data intensive biology is critical to the future of biology. • Data sharing and data integration infrastructure is also critical.
  • 67. Graph alignment can detect read saturation
  • 68. Proposal: distributed graph database server [Diagram components: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]
  • 72. Graph queries across public & walled-garden data sets. [Diagram labels: assembled sequence; SIMILARITY TO; ALSO CONTAINS; nitrite reductase; ppaZ; raw sequence.] See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.

Editor's Notes

  1. Fly-over country (that I live in)
  2. Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.
  3. Update from Jordan
  4. Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!
  5. Analyze data in cloud; import and export important; connect to other databases.
  6. Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.