Infrastructure for Data Intensive Biology
“Better Science through Superior Software”
C. Titus Brown
Current research: Compressive algorithms for sequence analysis
[Diagram labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration]
Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?
Three super-awesome technologies…
1. Low-memory k-mer counting (Zhang et al., PLoS One, 2014)
2. Compressible assembly graphs (Pell et al., PNAS, 2012)
3. Streaming lossy compression of sequence data (Brown et al., arXiv, 2012)
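A minimal Python sketch of the first and third ideas, under stated assumptions: k-mers are counted approximately in a few fixed-size hash tables (a Count-Min Sketch), and a streamed read is kept only while its median k-mer count is still below a cutoff. The class and function names, table sizes, k, and cutoff are illustrative choices and not khmer's API or defaults; reverse complements are ignored for brevity.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Approximate k-mer counter backed by several fixed-size tables."""

    def __init__(self, table_sizes=(10007, 10009, 10037)):
        # Distinct prime sizes (an assumption of this sketch) make it unlikely
        # that two different k-mers collide in every table at once.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One hash value, reduced modulo each table size.
        digest = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        return [digest % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest available estimate (never an undercount).
        return min(t[s] for t, s in zip(self.tables, self._slots(kmer)))


def kmers(read, k):
    """Return all overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=5, cutoff=3):
    """Stream reads once; keep a read only if its median k-mer count < cutoff."""
    sketch = CountMinSketch()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            # Only retained reads contribute to the counts.
            for w in words:
                sketch.add(w)
            kept.append(read)
    return kept
```

Memory is fixed by the table sizes rather than by the number of distinct k-mers, and the redundant copies of high-coverage reads are the part that gets discarded, which is where the lossy compression would come from in this sketch.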
…implemented in one super-awesome software package…
github.com/ged-lab/khmer/
BSD licensed
Openly developed using good practice.
> 10 external contributors.
Thousands of downloads/month.
50 citations in 3 years.
We think > 1000 people are using it; have heard from dozens.
…enabling super-awesome biology.
1. Assembling soil metagenomes (Howe et al., PNAS, 2014)
2. Understanding bone-eating worm symbionts (Goffredi et al., ISME, 2014)
3. An ultra-deep look at the lamprey transcriptome (in preparation)
4. Understanding derived anural development in Molgulid ascidians (in preparation)
Early on, lack of replicability in pubs slowed us down =>
Strategy: “level up” the field
High quality & novel science, done openly, written up in reproducible and remixable papers, using IPython Notebook, and posted to preprint servers.
Expression-based clustering of 85 lamprey tissue samples (de novo assembly of 3 billion reads)
~ 1 month
Camille Scott
Open protocols for the cloud: ~$100/analysis
Read cleaning
Preprocessing
Assembly
Annotation
khmer-protocols.readthedocs.org/
Transcriptome and metagenome assembly protocols
The data challenge in biology
In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?)
We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.
Moreover, most data is unavailable until after publication…
…which, in practice, means it will be lost.
Proposal: distributed graph database server
[Diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI)]
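One way to picture the proposed query layer is as a dispatcher that sends the same question to every server a user may read and merges the answers. The Python sketch below is only an illustration under assumptions: the `Server` class, its visibility values, the membership rule, and the example data are all hypothetical and are not part of the proposal itself.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for one data server behind the query layer."""
    name: str
    visibility: str  # assumed values: "public", "walled-garden", "private"
    members: set = field(default_factory=set)      # users allowed on non-public servers
    datasets: dict = field(default_factory=dict)   # data set id -> set of gene names

    def can_read(self, user):
        # Assumed rule: public servers are open; others require membership.
        return self.visibility == "public" or user in self.members

    def find(self, gene):
        return sorted(ds for ds, genes in self.datasets.items() if gene in genes)


def federated_find(servers, user, gene):
    """Ask every readable server for data sets containing `gene`; merge the hits."""
    hits = {}
    for server in servers:
        if not server.can_read(user):
            continue  # skip servers this user cannot see
        found = server.find(gene)
        if found:
            hits[server.name] = found
    return hits


# Hypothetical example deployment mirroring the three server kinds on the slide.
SERVERS = [
    Server("public-1", "public", datasets={"soil": {"ppaZ"}}),
    Server("garden", "walled-garden", members={"alice"}, datasets={"marine": {"ppaZ"}}),
    Server("lab", "private", members={"bob"}, datasets={"draft": {"ppaZ"}}),
]
```

Under these assumptions, `federated_find(SERVERS, "alice", "ppaZ")` would return hits from the public and walled-garden servers but not from the private one, while an unknown user would see only public data.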
Graph queries across public & walled-garden data sets:
[Diagram labels: assembled sequence; nitrite reductase; ppaZ; raw sequence; relations: SIMILARITY TO, ALSO CONTAINS]
See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.
The larger vision
Enable and incentivize sharing by providing immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data mining approaches.
Plan for poverty with federated infrastructure built on open & cloud.
Solve people’s current problems, while remaining agile for the future.
Who needs this?
Everyone.
Environmental microbiology, evo devo, agriculture, VetMed...
How would I start?
1-2 pilot projects w/domain postdocs: drive computational infrastructure with biology problems.
Support postdocs with software engineer (infrastructure) and graduate student CS (research).
Cross-train postdocs in data-intensive research methods and software engineering.
Note: finding existing data is not a problem :)
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
Via Elizabeth Kujawinski
Education and training
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
~5-10 workshops / year, novice -> masterclass; open materials.
Deeply self-interested:
What problems does everyone have, now?
(Assembly)
What problems do leading-edge researchers have?
(Data integration)
Pre-answered Questions
Q: What will be open?
A: Everything; I succeed & fail publicly.
Q: How will you measure success?
A: By other people using & extending our “products” without talking to us.
Blog: ivory.idyll.org/blog/ - search for “moore”, “satire”
@ctitusbrown
Graph queries across public & walled-garden data sets:
“What data sets contain <this gene>?”
“Which reads match to <this gene>, but not in <conserved domain>?”
“Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
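The three example questions above can be sketched against a toy in-memory table of read-to-gene matches. Everything in this Python example is a hypothetical illustration: the data set names, metadata, read ids, coordinates, and function names are invented for the sketch, and relative abundance is assumed to mean matching reads divided by total reads, averaged within each group.

```python
from collections import defaultdict

# Hypothetical toy records; none of these values come from real data sets.
DATASETS = {
    "soil_A": {"nitrogen": "high", "total_reads": 1000},
    "soil_B": {"nitrogen": "high", "total_reads": 2000},
    "soil_C": {"nitrogen": "low", "total_reads": 500},
}

# One row per read-to-gene match: (data set, gene, read id, start, end),
# with start/end as half-open coordinates on the gene.
MATCHES = [
    ("soil_A", "nitrite reductase", "r1", 0, 90),
    ("soil_A", "nitrite reductase", "r2", 150, 240),
    ("soil_B", "nitrite reductase", "r3", 300, 390),
    ("soil_C", "ppaZ", "r4", 10, 100),
]


def datasets_containing(gene):
    """What data sets contain <this gene>?"""
    return sorted({ds for ds, g, *_ in MATCHES if g == gene})


def reads_outside_domain(gene, domain_start, domain_end):
    """Which reads match <this gene> but do not overlap the given domain?"""
    return [
        read
        for _, g, read, start, end in MATCHES
        if g == gene and (end <= domain_start or start >= domain_end)
    ]


def relative_abundance_by(gene, key):
    """Mean fraction of reads matching `gene`, grouped by a metadata key."""
    counts = defaultdict(int)
    for ds, g, *_ in MATCHES:
        if g == gene:
            counts[ds] += 1
    groups = defaultdict(list)
    for ds, meta in DATASETS.items():
        # Data sets with no matches contribute an abundance of zero.
        groups[meta[key]].append(counts[ds] / meta["total_reads"])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}
```

A real deployment would presumably answer these from indexed graph relationships rather than by scanning a list; the point here is only the shape of the questions and of the answers they return.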

2014 moore-ddd


Editor's Notes

  1. Squeeze information out of data; speed up downstream analyses; make impossible possible.
  2. Applicable to many basic sequence analysis problems: error removal, species sorting, and de novo sequence assembly.
  3. Hard to tell how many people are using it because it’s freely available in several locations.
  4. The point is to enable biology; volume and velocity of data from sequencers is blocking.
  5. Doing computational science with good software engineering approaches is enabling; scientist + soft eng grad students are super capable.
  6. 1000s of people want to do what we do, and we can’t collaborate with them all => open protocols. Forkable, citable, open, tested. This is your methods section for computational analysis.
  7. Analyze data in cloud; import and export important; connect to other databases.
  8. Analyze data in cloud; import and export important; connect to other databases.
  9. Analyze data in cloud; import and export important; connect to other databases.
  10. Analyze data in cloud; import and export important; connect to other databases.
  11. Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  12. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
  13. Drive with pilot projects; train domain postdocs in computation; e.g. 20+ sites with multi-omic sampling, clearly the future but no way to analyze the data.
  14. Passionate about training; necessary for advancement of the field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
  15. Mention moore science fiction project