5. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
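A minimal sketch of the idea on this slide: if two reads share k-mers, the positions of those shared k-mers imply how the reads line up, with no explicit alignment step. The function names and the voting scheme are illustrative assumptions, not taken from the talk or from khmer, and (as the slide notes) nothing here accounts for mismatches or indels.

```python
from collections import Counter


def kmers(seq, k):
    """Return the k-mers of seq in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def implied_offset(read_a, read_b, k):
    """Guess where read_a starts within read_b from exact shared k-mers.

    Each shared k-mer votes for (position in read_b - position in read_a);
    the most common vote wins. Returns None if no k-mer is shared.
    """
    positions = {}
    for i, kmer in enumerate(kmers(read_a, k)):
        positions.setdefault(kmer, []).append(i)

    votes = Counter()
    for j, kmer in enumerate(kmers(read_b, k)):
        for i in positions.get(kmer, ()):
            votes[j - i] += 1

    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy reads (hypothetical): read_b is read_a with two extra bases in front.
print(implied_offset("ACGTTGCA", "GGACGTTGCA", k=4))
```

Voting across all shared k-mers is one simple way to tolerate short repeats; with a large enough k, a single shared k-mer is usually enough to fix the offset.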
6. De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
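As a rough illustration of "assemble on overlaps", the sketch below builds a small De Bruijn graph in which each k-mer becomes an edge between its (k-1)-base prefix and suffix. This is the textbook construction under simplifying assumptions (exact in-memory sets, no reverse complements, no error handling); the function name is hypothetical and this is not how khmer stores its graphs.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads (hypothetical) that overlap by three bases.
graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Reads that overlap contribute the same k-mers, so they merge into the same path without any pairwise read comparison.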
7. The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in up to k novel k-mers!
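The count can be checked directly: a single substituted base sits inside every k-mer window that overlaps it, so an error in the middle of a read typically creates k k-mers that never occur in the true sequence (fewer near a read end, or if a changed k-mer happens to exist elsewhere). The sequences and function names below are made up for illustration.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(true_seq, read, k):
    """K-mers present in the read but absent from the true sequence."""
    return kmer_set(read, k) - kmer_set(true_seq, k)


true_seq = "ACGGTCATTGCAAGCTTAGG"                    # hypothetical error-free sequence
read = true_seq[:10] + "T" + true_seq[11:]          # one substitution (C -> T) at index 10

print(len(novel_kmers(true_seq, read, k=5)))        # → 5, i.e. k novel k-mers
```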
11. This leads to good things.
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
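To give a feel for what "efficient online counting" can look like, here is a small Count-Min Sketch, a standard probabilistic counter that uses fixed memory and can be updated and queried as reads stream past. It is offered as a generic sketch under assumptions (class name, table sizes, and hash choice are all illustrative), not as khmer's actual data structure or API.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; may overcount, never undercounts."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One independent-looking hash per row, derived by salting blake2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._cells(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._cells(kmer))


sketch = CountMinSketch()
for _ in range(3):
    sketch.add("ACGTACGTACGT")
print(sketch.count("ACGTACGTACGT"))
```

Because the memory footprint is fixed up front, counts can be queried at any point in the stream, which is what makes abundance-based trimming and normalization possible without a second pass.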
12. Data structures & algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.
13. Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.
14. Lab approach – not intentional, but working out.
• Novel data structures and algorithms
• Implement at scale
• Apply to real biological problems
15. This leads to good things.
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
(khmer software)
16. Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
Current research:
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes
(khmer software)
17. How is this feasible?!
Representative half-arsed lab software development
Version that worked once, for some publication.
Grad student 1 research
Grad student 2 research
Incompatible and broken code
18. Stable version
Grad student 1 research
Grad student 2 research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests
A not-insane way to do software development
19. A not-insane way to do software development
Stable version
Grad student 1 research
Grad student 2 research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests
20. Testing & version control – the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding etc requirements?
o Note: spellchecking!!!
21. Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have added
acceptance tests to make sure that khmer works
OK with other packages.
• These acceptance tests are based on integration
tests, which in turn come from an education &
documentation effort…
23. khmer-protocols:
• Provide standard “cheap”
assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days
from raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers)
• Open, versioned, forkable,
citable….
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential expression
24. Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
Leigh Sheneman
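The extraction step described above can be pictured roughly as follows: walk a reStructuredText tutorial, collect the indented lines that follow a paragraph ending in "::", and emit them as shell commands. This is a simplified sketch under assumptions (the function name and the parsing rules are hypothetical); it is not the code in the literate-resting repository.

```python
def extract_commands(rst_text):
    """Collect indented lines from reST literal blocks introduced by '::'."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue                      # blank lines do not end a block
            if line[0] in " \t":
                commands.append(stripped)     # indented line: part of the block
                continue
            in_block = False                  # first unindented line ends it
        if stripped.endswith("::"):
            in_block = True
    return commands


# Hypothetical tutorial text with two literal blocks.
tutorial = """Say hello::

   echo hello

Then run::

   echo one
   echo two

Finished.
"""

print("\n".join(extract_commands(tutorial)))
```

Writing the result to a `.sh` file and running it under continuous integration is what turns a tutorial into a test.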
25. Doing things right => #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests
31. Error correction on simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
          TP (corrected)      FP (mistakes)  TN (OK)      FN (missed)
ideal     3,469,834 (99.1%)   8,186          460,655,449  31,731 (0.9%)
1-pass    2,827,839 (80.8%)   30,254         460,633,381  673,726 (19.2%)
1.2-pass  3,403,171 (97.2%)   8,764          460,654,871  98,394 (2.8%)
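The percentages in the table are consistent with TP and FN expressed as a share of all true errors (TP + FN = 3,501,565 in every row). That reading is inferred from the numbers rather than stated on the slide; the short check below recomputes it, with a hypothetical helper name.

```python
def fraction_corrected(tp, fn):
    """Share of all true errors (TP + FN) that were corrected."""
    return tp / (tp + fn)


# (TP, FN) pairs copied from the table above.
results = {
    "ideal": (3_469_834, 31_731),
    "1-pass": (2_827_839, 673_726),
    "1.2-pass": (3_403_171, 98_394),
}

for name, (tp, fn) in results.items():
    print(f"{name}: {100 * fraction_corrected(tp, fn):.1f}% corrected")
```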
32. Single pass, reference free, tunable, streaming online variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
33. Novelty… to what power?
• “Novelty” requirements for “high impact
publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying
to develop and maintain a core set of functionality
in research software: novelty cubed? :)
34. Reproducibility
Scientific progress relies on reproducibility of analysis.
(Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang
35. Concluding thoughts
• API is destiny – without online counting, diginorm &
streaming approaches would not have been
possible.
• Tackle the hard problems – engineering
optimization would not have gotten us very far.
• Testing lets us scale development & process – which
means when something works, we can run with it.
36. Caveats
• Expense and effort – you can spend an infinite
amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good
software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>
37. Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously
open and collaborative endeavor. (Let’s take advantage
of it!)
“It’s as if somewhere, out there, is a collection of totally free
software that can do a far better job than ours can, with
open, published methods, great support networks and
fantastic tutorials. But that’s madness – who on Earth
would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
39. Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate
data while retaining variant information.
40. Where are we taking this?
• Streaming online algorithms only look at data
~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
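A minimal sketch of why diginorm fits the streaming model: each read is examined once, its median k-mer count so far is estimated, and the read is kept (and counted) only if that median is still below a coverage cutoff. The sketch follows the published description of digital normalization in outline, but it uses an exact in-memory Counter instead of a probabilistic counting structure, and the function name, parameters, and defaults are illustrative assumptions rather than khmer's interface.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count seen so far is below cutoff."""
    counts = Counter()
    for read in reads:                        # single pass over the stream
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue                          # read shorter than k
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)              # only kept reads add coverage
            yield read


# Toy stream (hypothetical): ten copies of one read, then one new read.
stream = ["ACGGTCATTG"] * 10 + ["TTGACCAGTA"]
kept = list(digital_normalization(stream, k=4, cutoff=3))
print(len(kept), "of", len(stream), "reads kept")
```

Redundant reads from already well-covered regions are dropped on the fly, while the first reads from a new region always pass, which is the property the tumor-cell proposal relies on.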