Better science through
superior software
Michael R. Crusoe
Software Engineer & Bioinformatician
The GED Lab @ Michigan State
mcrusoe@msu.edu @biocrusoe
Open, online science
Much of the software and approaches talked
about today are available:
khmer software:
http://github.com/ged-lab/khmer/
Titus’s blog: http://ivory.idyll.org/blog/
Titus’s twitter: @ctitusbrown
Overview
● Next-gen sequencing data deluge
● ♫How do you solve a problem like big data?♫
● Impact of khmer software
● Future work
● Being a good F/OSS community member and
leading by example
● Acknowledgements
Problem
“The power of next-gen. sequencing: get 180x
coverage... and then watch your assemblies
never finish” - Erich Schwarz
“Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than
Moore’s Law.
2. Your data gathering rate matches Moore’s
Law.
3. Your data gathering rate exceeds Moore’s
Law.
“Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s
Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.
A software & algorithms approach: can we
develop lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed
data for later reanalysis. Short answer is: yes,
we can.
Digital normalization approach
A digital analog to cDNA library normalization,
diginorm:
● Reference free.
● Is single pass: looks at each read only once;
● Does not “collect” the majority of errors;
● Keeps all low-coverage reads & retains all
information.
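The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not khmer's implementation: it assumes the rule given in published descriptions of diginorm (estimate a read's coverage as the median count of its k-mers, and keep the read only if that estimate is below a cutoff), and it swaps in an exact in-memory Counter for a memory-efficient counting structure. The names kmers and normalize and the default values of k and cutoff are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=20, cutoff=20):
    """Single-pass filter: keep a read only while its estimated coverage is low.

    Assumption: coverage is estimated as the median count of the read's
    k-mers among the reads kept so far. No reference genome is consulted.
    """
    counts = Counter()  # exact counts; a stand-in chosen for clarity
    kept = []
    for read in reads:  # each read is examined exactly once
        words = list(kmers(read, k))
        if not words:
            continue  # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads add to the counts
    return kept


reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = normalize(reads, k=4, cutoff=3)
print(len(kept))  # 3 copies of the redundant read plus the rare read
```

Because discarded reads never update the counts, k-mers that appear only in redundant reads are never recorded, while a read from a low-coverage region always passes the cutoff.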
GED Lab’s approach: khmer
diginorm: ejects most data while retaining the
information content.
partitioning: split transcriptomic and
meta{transcript,gen}omic datasets
fast k-mer counting: for better preprocessing,
repeat detection, and sequencing coverage
estimates
Reference-free variant calling
- More to come -
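To make the k-mer counting item concrete, here is a small sketch of counting k-mers and summarizing them as an abundance histogram. It is illustrative only: the function names are hypothetical, the counts are exact rather than held in a memory-efficient structure, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


reads = ["ACGTA", "ACGTA", "CGTAC"]
counts = count_kmers(reads, k=4)
hist = abundance_histogram(counts)
print(sorted(hist.items()))
```

In a histogram like this, the typical abundance gives a rough coverage estimate, k-mers far above it suggest repeats, and k-mers seen only once in otherwise deep data are candidates for sequencing errors.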
The GED Lab at MSU:
Theoretical => applied solutions.
Impact
● Any biologist can use our tools on a rented
cloud computer, cheaply.
● Overcome complexity: Erich Schwarz
assembled H. contortus when it was
previously not possible.
● Overcome data excess: 5.1 billion reads from
50 different sea lamprey tissues -> the diginorm
technique removed 98.7% as redundant.
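As a quick check on the scale of the lamprey example, the arithmetic can be worked through directly. This assumes the 98.7% figure is a fraction of reads (not of bases), which the slide does not state; the variable names are hypothetical.

```python
total_reads = 5_100_000_000   # 5.1 billion reads, from the slide
removed_per_mille = 987       # 98.7% removed, written as parts per thousand

# Integer arithmetic avoids floating-point rounding in the result.
reads_kept = total_reads * (1000 - removed_per_mille) // 1000
fold_reduction = total_reads / reads_kept

print(reads_kept)
print(round(fold_reduction, 1))
```

Under that assumption roughly 66 million reads remain, about a 77-fold reduction in the data passed on to assembly.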
Future work
● targeted-gene assembly from short reads
(Fish et al., Ribosomal Database Project)
● rRNA search in shotgun data
● error-correction for mRNAseq &
metagenomic data
● strain variation collapse, assembly, and
recovery
● Goal: make most assembly easy and all
evaluation easy.
Interactions
khmer both builds upon existing Free and
Open-Source Software (F/OSS) and is itself
made under an open-source license.
Used in curricula: both Software Carpentry
ANGUS-based courses and the MSU NGS
summer course.
● BIG DATA grant reviewers specifically
mentioned the GED Lab’s “[...] long and
successful track-record and experience in
following rigorous but open software
development processes.” -> CTB received
3-year NIH R01 support
● Transparent and public software
development yielded participation from
others.
Personal Acknowledgments
C. Titus Brown for slides, employment
Acknowledgements
Lab members involved
● Adina Howe (w/Tiedje)
● Jason Pell
● Arend Hintze
● Rosangela Canino-Koning
● Qingpeng Zhang
● Elijah Lowe
● Likit Preeyanon
● Jiarong Guo
● Tim Brom
● Kanchan Pavangadkar
● Eric McDonald
● Chris Welcher
Collaborators
● Jim Tiedje, MSU
● Billie Swalla, UW
● Janet Jansson, LBNL
● Susannah Tringe, JGI
Funding
USDA NIFA; NSF
IOS; BEACON.
