Building khmer, a platform
for research in scalable
sequence analysis
C. Titus Brown
ctb@msu.edu
Hello!
Assistant Professor; Microbiology; Computer Science;
etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
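A minimal sketch of the decomposition above, in Python. The function name `kmers` is hypothetical, and k = 10 is read off the length of the k-mers on the slide. A read of length L contains L - k + 1 overlapping k-mers.

```python
def kmers(read, k):
    """Return every overlapping substring of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"
for offset, kmer in enumerate(kmers(read, 10)):
    # Indent each k-mer by its offset so it lines up under the read.
    print(" " * offset + kmer)
```

Reads shorter than k simply produce an empty list.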
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
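One way to make the "implicit alignment" concrete is to let shared k-mers vote on how far one read is shifted relative to another. This is an illustrative sketch only; the function names and the voting scheme are assumptions, not something shown in the talk, and as the slide says it does no accounting for mismatches or indels.

```python
from collections import Counter


def kmer_positions(read, k):
    """Map each k-mer to the list of positions where it starts."""
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], []).append(i)
    return positions


def implied_offset(read_a, read_b, k):
    """Vote on the shift of read_b relative to read_a using shared k-mers.

    Returns (offset, votes), or None if the reads share no k-mer.
    Exact matches only: mismatches and indels are not modeled.
    """
    pos_a = kmer_positions(read_a, k)
    votes = Counter()
    for j in range(len(read_b) - k + 1):
        for i in pos_a.get(read_b[j:j + k], []):
            votes[j - i] += 1
    return votes.most_common(1)[0] if votes else None


# The first two reads from the slide.
read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
offset, votes = implied_offset(read_a, read_b, 10)
print(offset, votes)
```

A positive offset means the shared sequence starts that many bases later in the second read than in the first.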
De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
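A small sketch of "assemble on overlaps", using one common formulation that is assumed here rather than taken from the cited figure: nodes are (k-1)-mers, and each k-mer adds an edge from its prefix to its suffix. Reads that overlap end up sharing nodes without any pairwise comparison. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads drawn from the sequence ACGTTGCA.
graph = de_bruijn(["ACGTTG", "GTTGCA"], 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from ACG spells out the full sequence, even though neither read contains it alone.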
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
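The claim can be checked directly on the reads from these slides. The sketch below assumes k = 10 and compares the erroneous read with its error-free version; the helper name `kmer_set` is hypothetical. An error near the end of a read touches fewer than k k-mers, so k is an upper bound.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


k = 10
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
errored = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one substitution (G -> C)

# k-mers that exist only because of the error.
novel = kmer_set(errored, k) - kmer_set(correct, k)
print(len(novel))
```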
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
Assembly graphs scale with data size, not
information.
Practical memory
measurements (soil)
Velvet measurements (Adina Howe)
Counting k-mers
efficiently (RAM)
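The k-mer counting paper cited later in this talk (Zhang et al.) describes a Count-Min Sketch. The code below is a generic, minimal illustration of that data structure and not khmer's implementation: the class name, the table sizes, and the use of salted BLAKE2 hashes are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; it can overcount, never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per table, derived by salting with the table index.
        for row in range(len(self.tables)):
            digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=1000, depth=4)
read = "CCGATTGCACTGGACCGA"
for _ in range(2):  # see the same read twice
    for i in range(len(read) - 10 + 1):
        sketch.add(read[i:i + 10])
print(sketch.count("CCGATTGCAC"))
```

Memory is fixed by `width` and `depth` regardless of how many distinct k-mers arrive, which is the property that makes counting feasible in limited RAM.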
This leads to good things.
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
Data structures &
algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.
Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not
intentional, but working out.
• Novel data structures and algorithms
• Implement at scale
• Apply to real biological problems
(khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
Current research
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes
How is this feasible?!
Representative half-arsed lab software development
Version that worked once, for some publication.
Grad student 1 research
Grad student 2 research
Incompatible and broken code
A not-insane way to do software development
Stable version
Grad student 1 research
Grad student 2 research
Run tests (repeated throughout)
Stable, tested code
Testing & version control
– the not so secret sauce
• High test coverage - grown over time.
• Stupidity-driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding (etc.) requirements?
o Note: spellchecking!!!
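A hypothetical illustration of writing the test after finding a bug and before fixing it. The function `count_kmers` and the bug (reads shorter than k) are invented for this example and are not taken from khmer.

```python
def count_kmers(read, k):
    """Count the k-mers in a read; reads shorter than k contain none."""
    # A buggy first version returned len(read) - k + 1, which is negative
    # for short reads. The max() is the fix, made after the test below existed.
    return max(0, len(read) - k + 1)


def test_read_shorter_than_k_has_no_kmers():
    # Written when the bug was found, before the fix: it failed then and
    # passes now, so the bug cannot silently return.
    assert count_kmers("ACGT", 10) == 0


def test_typical_read():
    assert count_kmers("CCGATTGCACTGGACCGA", 10) == 9


test_read_shorter_than_k_has_no_kmers()
test_typical_read()
```

Run under continuous integration, tests like these are what answer "does your proposed merge break tests?" on every pull request.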
Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added
acceptance tests to make sure that khmer works
OK with other packages.
• These acceptance tests are based on integration
tests, which in turn come from an education &
documentation effort…
khmer-protocols:
• Provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers).
• Open, versioned, forkable, citable…
Steps: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression
Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
Leigh Sheneman
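A rough sketch of the extraction idea, not the literate-resting tool itself. It assumes the tutorials are reStructuredText and that commands live in indented literal blocks introduced by a line ending in `::`. The function name, the sample tutorial, and the file name `reads.fa` are hypothetical.

```python
def extract_commands(rst_text):
    """Collect lines from indented literal blocks that follow a '::' line."""
    commands = []
    in_block = False      # inside a literal block
    expecting = False     # saw '::', waiting for the indented block to start
    for line in rst_text.splitlines():
        stripped = line.strip()
        indented = line.startswith((" ", "\t"))
        if in_block or expecting:
            if not stripped:
                continue                  # blank lines do not end a block
            if indented:
                in_block, expecting = True, False
                commands.append(stripped)
                continue
            in_block = expecting = False  # a dedented line ends the block
        if stripped.endswith("::"):
            expecting = True
    return commands


tutorial = """\
Count the input reads::

    wc -l reads.fa

Some prose that is not a command.

Then list the directory::

    ls -l
    echo done
"""

commands = extract_commands(tutorial)
# 'set -e' makes the generated script stop at the first failing command.
script = "\n".join(["#!/bin/bash", "set -e"] + commands)
print(script)
```

Because the script is generated from the tutorial text, a tutorial that drifts out of date fails the test run instead of failing a reader.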
Doing things right
=> #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests
Benchmarking protocols
• Data subset; AWS m1.xlarge: ~1 hour
• Complete data; AWS m1.xlarge: ~40 hours
(See PyCon 2014 talk; video and blog post.)
Genomic intervals shared
between data sets
Qingpeng Zhang
* Assembly free!
Error correction via graph
alignment
Jason Pell and Jordan Fish
Error correction on
simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
          TP (corrected)      FP (mistakes)   TN (OK)        FN (missed)
ideal     3,469,834 (99.1%)   8,186           460,655,449    31,731 (0.9%)
1-pass    2,827,839 (80.8%)   30,254          460,633,381    673,726 (19.2%)
1.2-pass  3,403,171 (97.2%)   8,764           460,654,871    98,394 (2.8%)
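The percentages next to TP and FN appear to be fractions of the true errors, TP + FN. That reading is an inference from the numbers rather than something the slide states; the short check below reproduces the reported figures under that assumption.

```python
# (TP, FN) per row, copied from the table above.
rows = {
    "ideal": (3_469_834, 31_731),
    "1-pass": (2_827_839, 673_726),
    "1.2-pass": (3_403_171, 98_394),
}


def sensitivity(tp, fn):
    """Fraction of true errors that were corrected."""
    return tp / (tp + fn)


for name, (tp, fn) in rows.items():
    corrected = 100 * sensitivity(tp, fn)
    print(f"{name}: {corrected:.1f}% corrected, {100 - corrected:.1f}% missed")
```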
Single pass, reference free, tunable, streaming online
variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
Novelty… to what power?
• “Novelty” requirements for “high impact
publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying
to develop and maintain a core set of functionality
in research software: novelty cubed? :)
Reproducibility
Scientific progress relies on reproducibility of analysis.
(Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang
Concluding thoughts
• API is destiny – without online counting, diginorm &
streaming approaches would not have been
possible.
• Tackle the hard problems – engineering
optimization would not have gotten us very far.
• Testing lets us scale development & process – which
means when something works, we can run with it.
Caveats
• Expense and effort – you can spend an infinite
amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good
software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>
Can we crowdsource
bioinformatics?
We already are! Bioinformatics is already a tremendously
open and collaborative endeavor. (Let’s take advantage
of it!)
“It’s as if somewhere, out there, is a collection of totally free
software that can do a far better job than ours can, with
open, published methods, great support networks and
fantastic tutorials. But that’s madness – who on Earth
would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
Thanks!
Prospective: sequencing
tumor cells
• Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of
sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate
data while retaining variant information.
Where are we taking this?
• Streaming online algorithms only look at data
~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
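A minimal sketch of digital normalization as described in Brown et al. (arXiv 1203.4802): a read is kept only if the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. Each read is examined once, in order. The exact dictionary counter, the function name, and the values k = 10 and cutoff = 5 are simplifying assumptions for illustration, not khmer's implementation.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k, cutoff):
    """Yield reads whose median k-mer count so far is below the cutoff."""
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# 50 identical reads stand in for a deeply covered region.
reads = ["CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"] * 50
kept = list(diginorm(reads, k=10, cutoff=5))
print(len(kept))
```

Once a region has reached the cutoff, further reads from it are discarded, which is how redundant data is eliminated without a reference.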

2014 toronto-torbug

  • 1. Building khmer, a platform for research in scalable sequence analysis C. Titus Brown ctb@msu.edu
  • 2. Hello! Assistant Professor; Microbiology; Computer Science; etc. More information at: • ged.msu.edu/ • github.com/ged-lab/ • ivory.idyll.org/blog/ • @ctitusbrown
  • 3. Introducing k-mers CCGATTGCACTGGACCGA (<- read) CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC ACTGGACCGA
  • 4. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG
  • 5. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG CATGGACCGATTGCACTGGACCGATGCACGGACCG (with no accounting for mismatches or indels)
  • 6. De Bruijn graphs – assemble on overlaps J.R. Miller et al. / Genomics (2010)
  • 7. The problem with k-mers CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTCGACCGATGCACGGTACCG Each sequencing error results in k novel k-mers!
  • 8. Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com Assembly graphs scale with data size, not information.
  • 11. This leads to good things. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization
  • 12. Data structures & algorithms papers • “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review. • “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012. • “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
  • 13. Data analysis papers • “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014. • Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep. • A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
  • 14. Lab approach – not intentional, but working out: Novel data structures and algorithms; Implement at scale; Apply to real biological problems.
  • 15. This leads to good things: Efficient online counting of k-mers; Trimming reads on abundance; Efficient De Bruijn graph representations; Read abundance normalization. (khmer software)
  • 16. Efficient online counting of k-mers; Trimming reads on abundance; Efficient De Bruijn graph representations; Read abundance normalization; Streaming algorithms for assembly, variant calling, and error correction; Cloud assembly protocols; Efficient graph labeling & exploration; Data set partitioning approaches; Assembly-free comparison of data sets; HMM-guided assembly; Efficient search for target genes. Current research. (khmer software)
  • 17. How is this feasible?! Representative half-arsed lab software development. Diagram labels: Version that worked once, for some publication; Grad student 1 research; Grad student 2 research; Incompatible and broken code.
  • 18. A not-insane way to do software development. Diagram labels: Stable version; Grad student 1 research; Grad student 2 research; Stable, tested code; Run tests (repeated).
  • 19. A not-insane way to do software development. Diagram labels: Stable version; Grad student 1 research; Grad student 2 research; Stable, tested code; Run tests (repeated).
  • 20. Testing & version control – the not so secret sauce • High test coverage - grown over time. • Stupidity driven testing – we write tests for bugs after we find them and before we fix them. • Pull requests & continuous integration – does your proposed merge break tests? • Pull requests & code review – does new code meet our minimal coding etc requirements? o Note: spellchecking!!!
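As an illustration of writing a test for a bug after finding it and before fixing it, here is a minimal sketch. The helper `reverse_complement` and the lowercase bug are hypothetical, invented for this example; they are not taken from khmer. The workflow is: reproduce the bug as a failing test, commit the test, then make it pass.

```python
# Hypothetical helper and bug, invented for illustration only.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def test_reverse_complement_lowercase_regression():
    # Written when the (hypothetical) bug was found: lowercase bases were
    # passed through without being complemented. The test failed first,
    # then the fix landed, and the test stays to catch any regression.
    assert reverse_complement("aacg") == "cgtt"
    assert reverse_complement("acgt") == "acgt"  # its own reverse complement


test_reverse_complement_lowercase_regression()
```

A test runner such as pytest would collect the `test_` function automatically; the direct call at the end just keeps the sketch self-contained.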
  • 21. Integration testing • khmer is designed to work with other packages. • For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages. • These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…
  • 23. khmer-protocols: • Provide standard “cheap” assembly protocols for the cloud. • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers). • Open, versioned, forkable, citable…. Stages: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression.
  • 24. Literate testing • Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts. • See: github.com/ged-lab/literate-resting/. • Tremendously improves peace of mind and confidence moving forward! Leigh Sheneman
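The idea of extracting commands from tutorials can be sketched as follows. This is not the literate-resting tool itself, whose rules may differ; it assumes tutorials mark shell commands with reStructuredText literal blocks (a paragraph ending in `::` followed by indented lines). The function name and the sample tutorial text are made up for illustration.

```python
def extract_commands(rst_text: str) -> list[str]:
    """Collect the lines of reST literal blocks (introduced by a line ending in '::')."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        if in_block:
            if not line.strip():
                continue  # blank lines are allowed inside a literal block
            if line[0] in " \t":
                commands.append(line.strip())
                continue
            in_block = False  # the first unindented text ends the block
        if line.rstrip().endswith("::"):
            in_block = True
    return commands


tutorial = """\
Getting set up
==============

First, make a working directory::

   mkdir work
   cd work

Then check the Python version::

   python --version

Done.
"""

# 'set -e' makes the generated script stop at the first failing command,
# which is what turns a tutorial into a test.
script = "\n".join(["#!/bin/bash", "set -e", *extract_commands(tutorial)])
print(script)
```

Running the generated script on a fresh machine then tests the tutorial end to end.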
  • 25. Doing things right => #awesomesauce. Diagram labels: Protocols in English for running analyses in the cloud; Literate reSTing => shell scripts; Tool competitions; Benchmarking; Education; Acceptance tests.
  • 26. Benchmarking protocols Data subset; AWS m1.xlarge ~1 hour (See PyCon 2014 talk; video and blog post.)
  • 27. Benchmarking protocols Complete data; AWS m1.xlarge ~40 hours (See PyCon 2014 talk; video and blog post.)
  • 28. Efficient online counting of k-mers; Trimming reads on abundance; Efficient De Bruijn graph representations; Read abundance normalization; Streaming algorithms for assembly, variant calling, and error correction; Cloud assembly protocols; Efficient graph labeling & exploration; Data set partitioning approaches; Assembly-free comparison of data sets; HMM-guided assembly; Efficient search for target genes. Current research.
  • 29. Genomic intervals shared between data sets Qingpeng Zhang * Assembly free!
  • 30. Error correction via graph alignment Jason Pell and Jordan Fish
  • 31. Error correction on simulated E. coli data. 1% error rate, 100x coverage. Jordan Fish and Jason Pell
                   TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)
        ideal      3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)
        1-pass     2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)
        1.2-pass   3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
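The percentages on this slide are consistent with TP / (TP + FN) and FN / (TP + FN), that is, the fraction of true errors that were corrected versus missed. The short sketch below recomputes them from the slide's counts. The precision figure, TP / (TP + FP), is an extra derived metric added here for illustration and does not appear on the slide.

```python
# Counts copied from the slide.
results = {
    "ideal":    {"TP": 3_469_834, "FP": 8_186,  "TN": 460_655_449, "FN": 31_731},
    "1-pass":   {"TP": 2_827_839, "FP": 30_254, "TN": 460_633_381, "FN": 673_726},
    "1.2-pass": {"TP": 3_403_171, "FP": 8_764,  "TN": 460_654_871, "FN": 98_394},
}


def sensitivity(c: dict) -> float:
    """Fraction of true errors that were corrected: TP / (TP + FN)."""
    return c["TP"] / (c["TP"] + c["FN"])


def precision(c: dict) -> float:
    """Fraction of attempted corrections that were right: TP / (TP + FP)."""
    return c["TP"] / (c["TP"] + c["FP"])


for name, c in results.items():
    print(f"{name:9s} sensitivity={sensitivity(c):.1%}  precision={precision(c):.1%}")
```

The sensitivity column reproduces the slide's 99.1%, 80.8%, and 97.2%, and the four counts sum to the same total in every row, which is a useful sanity check on the transcription.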
  • 32. Streaming, online variant calling: single pass, reference free, tunable. See NIH BIG DATA grant, http://ged.msu.edu/.
  • 33. Novelty… to what power? • “Novelty” requirements for “high impact publishing”: o Must do novel algorithm development o …and apply to novel and interesting data sets. o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662) • We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
  • 34. Reproducibility Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.) All our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Qingpeng Zhang
  • 35. Concluding thoughts • API is destiny – without online counting, diginorm & streaming approaches would not have been possible. • Tackle the hard problems – engineering optimization would not have gotten us very far. • Testing lets us scale development & process – which means when something works, we can run with it.
  • 36. Caveats • Expense and effort – you can spend an infinite amount of time on infrastructure & process! o Advice: choose techniques that address actual pain points. o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014) • Funders and reviewers just don’t care – adopt good software practices for yourself, not others. o Advice: briefly mention keywords in grants, papers. • Advisors just don’t care – see above. o These are 90% true statements :>
  • 37. Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!) “It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?” - http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
  • 39. Prospective: sequencing tumor cells • Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations. • 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence. • Most of this data will be redundant and not useful. • Developing diginorm-based algorithms to eliminate data while retaining variant information.
  • 40. Where are we taking this? • Streaming online algorithms only look at data ~once. • Diginorm is streaming, online… • Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
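The point that diginorm is streaming and online can be illustrated with a minimal sketch. It assumes the median-count rule from the normalization paper cited on slide 12: a read is kept only if the median count of its k-mers, among reads kept so far, is below a cutoff. For clarity it uses an exact `Counter` where khmer would use a fixed-memory probabilistic counting table. The function name and the `k` and `cutoff` values are illustrative assumptions, not khmer's API or defaults.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count so far is below the cutoff.

    Each read is examined once and the decision is made immediately,
    so this works on a stream without a second pass over the data.
    """
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            for km in kmers:
                counts[km] += 1  # only kept reads contribute to coverage
            yield read


# Ten copies of one read plus one novel read; small k and cutoff for the demo.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCC"]
kept = list(diginorm(reads, k=4, cutoff=3))
print(len(kept))  # prints 4: three copies of the repeated read, plus the novel one
```

Redundant reads are dropped once their region is covered to the cutoff, while reads from unseen regions always pass, which is how the data volume shrinks without losing the novel information.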

Editor's Notes

  1. Slow, but powerful.
  2. Acceptance testing other people’s software
  3. Update from Jordan
  4. More generally….