1. C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
May 1, 2013
ctb@msu.edu
Streaming approaches to reference-free variant calling
2. Open, online science
Much of the software and many of the approaches I’m talking about today are available:
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
3. Outline & Overview
Motivation: lots of data; analyzed with “offline”
approaches.
Reference-based vs reference-free approaches.
Single-pass algorithms for lossy compression;
application to resequencing data.
4. Shotgun sequencing
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
5. Sequencers produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
9. Counting
We have a reference genome (or gene set) and want to know how much of it we have; think gene expression/microarrays or copy number variation.
10. Noisy observations <->
information
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
11. “Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore’s
Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
13. “Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.
14. Random sampling => deep sampling
needed
Typically 10-100x coverage is needed for robust recovery (~300 Gbp for human).
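As a rough check: assuming a ~3 Gbp human genome, 100x coverage works out to about 3 Gbp x 100 = 300 Gbp of raw sequence.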
15. Applications in cancer genomics
Single-cell cancer genomics will advance: e.g., ~60-300 Gbp of data for each of ~1000 tumor cells.
Infer phylogeny of tumor => mechanistic insight.
Current approaches are computationally intensive
and data-heavy.
16. Current variant calling approach.
Map reads to reference => "pileup" and variant calling => downstream diagnostics.
17. Drawbacks of reference-based
approaches
Fairly narrowly defined heuristics.
Allelic mapping bias: mapping biased towards
reference allele.
Ignorant of “unexpected” novelty:
Indels, especially large indels, are often ignored.
Structural variation is not easily retained or
recovered.
True novelty discarded.
Most implementations are multipass on big data.
18. Challenges
Considerable amounts of noise in data (0.1-1%
error)
Reference-based approaches have several
drawbacks.
Dependent on quality/applicability of reference.
Detection of true novelty (SNPs vs. indels; SVs) is problematic.
=> The first major data reduction step (variant
calling) is extremely lossy in terms of potential
information.
19. [Diagram: raw data (~10-100 GB) => analysis => several "information" products (~1 GB each) => database & integration; lossy compression yields ~2 GB.]
A software & algorithms approach: can we develop
lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed data for
later reanalysis.
The short answer is: yes, we can.
20. [Same diagram as slide 19, with two added annotations: "Save in cold storage" and "Save for reanalysis, investigation."]
21. My lab at MSU: theoretical => applied solutions.
Theoretical advances in data structures and algorithms =>
practically useful & usable implementations, at scale =>
demonstrated effectiveness on real data.
22. 1. Time- and space-efficient k-mer
counting
To add an element: increment the associated counter at all hash locales.
To get a count: retrieve the minimum counter across all hash locales.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
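As an illustration of the count-min idea described on this slide, here is a minimal Python sketch; the table sizes, the use of Python's built-in hash(), and the k-mer handling are assumptions for illustration, not khmer's actual implementation:

# Minimal count-min-style k-mer counter (illustrative sketch only).
class CountMinSketch:
    def __init__(self, table_sizes):
        # One counter table per size; sizes should be distinct (ideally coprime).
        self.tables = [[0] * size for size in table_sizes]

    def _locales(self, kmer):
        # One hash locale per table; Python's hash() stands in for a real hash.
        return [hash(kmer) % len(table) for table in self.tables]

    def add(self, kmer):
        # To add an element: increment the counter at every hash locale.
        for table, idx in zip(self.tables, self._locales(kmer)):
            table[idx] += 1

    def count(self, kmer):
        # To get a count: take the minimum counter across all hash locales.
        return min(table[idx] for table, idx in zip(self.tables, self._locales(kmer)))

def kmers(seq, k=20):
    # Yield every k-mer in a read.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Usage: count 20-mers in a read, then query the first one.
cms = CountMinSketch(table_sizes=[999983, 999979, 999961])
read = "ATGGCATTACGGATCCATGGCATTACGGATCC"
for km in kmers(read):
    cms.add(km)
print(cms.count(read[:20]))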
24. Transcriptomes, microbial genomes (including MDA-amplified samples), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
Core algorithm is single pass, “low” memory.
3. Online, streaming, lossy compression. (NOVEL)
(Brown et al., arXiv, 2012)
31. Digital normalization approach
A digital analog to cDNA library normalization, diginorm (sketched below):
Reference free.
Single pass: looks at each read only once.
Does not “collect” the majority of errors.
Keeps all low-coverage reads & retains all information.
Smooths out coverage across regions.
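A minimal Python sketch of this single-pass loop, assuming k=20 and a coverage cutoff C=20 (both illustrative) and any k-mer counter with add()/count() such as the count-min sketch above; khmer's normalize-by-median script is the real implementation:

# Illustrative single-pass digital normalization loop (not khmer's actual code).
from statistics import median

K = 20   # k-mer size (illustrative)
C = 20   # coverage cutoff (illustrative)

def kmers(seq, k=K):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def normalize(reads, counter):
    # 'counter' needs add(kmer) and count(kmer), e.g. the sketch above.
    for read in reads:
        counts = [counter.count(km) for km in kmers(read)]
        if not counts:
            continue  # read shorter than k
        if median(counts) < C:
            # Estimated coverage is still low: keep the read and its k-mers.
            for km in kmers(read):
                counter.add(km)
            yield read
        # Otherwise the read is redundant: drop it and never count its k-mers,
        # so errors in discarded reads are not "collected".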
32. Can we apply this algorithmically efficient
technique to variants? Yes.
Single pass, reference-free, tunable, streaming, online variant calling.
36. Reference-free variant calling
Streaming & online algorithm; single pass.
For real-time diagnostics, can be applied as bases are
emitted from sequencer.
Reference free: independent of reference bias.
Coverage of variants is adaptively adjusted to retain
all signal.
Parameters are easily tuned, although theory needs
to be developed.
High sensitivity (e.g. C=50 at 100x coverage) => poor compression.
Low sensitivity (C=20) => good compression.
Can “subtract” the reference => novel structural variants (see the sketch after this list).
(See: Cortex, Zam Iqbal.)
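One hedged reading of the "subtract reference" idea, as a Python sketch; the k size, the plain set of reference k-mers, and the keep-if-any-novel-k-mer rule are assumptions for illustration, not necessarily the exact method referred to in the talk:

# Illustrative reference "subtraction": pre-load all reference k-mers, then
# stream reads and keep only those containing at least one k-mer absent from
# the reference. Retained reads are candidates for novel variation
# (e.g. structural variants); everything else is discarded on the fly.

K = 20  # k-mer size (illustrative)

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def load_reference_kmers(reference_seq):
    return set(kmers(reference_seq))

def novel_reads(reads, ref_kmers):
    for read in reads:
        if any(km not in ref_kmers for km in kmers(read)):
            yield read  # contains non-reference sequence; keep for downstream calling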
37. Concluding thoughts
This approach could provide substantial practical and theoretical leverage on a challenging problem.
It also provides a path to the future:
Many-core implementation; distributable?
Decreased memory footprint => cloud/rental computing
can be used for many analyses.
Still early days, but funded…
Our other techniques are already in use; dozens of labs are using digital normalization.
38. References & reading list
Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 2012. (PubMed 22231483)
Nordström et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotech. 2013. (PubMed 23475072)
Brown et al., A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.
Note: this talk is online at slideshare.net, c.titus.brown.
39. Acknowledgements
Lab members involved:
Adina Howe (w/ Tiedje)
Jason Pell
Arend Hintze
Rosangela Canino-Koning
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Chris Welcher
Collaborators:
Jim Tiedje, MSU
Billie Swalla, UW
Janet Jansson, LBNL
Susannah Tringe, JGI
Funding:
USDA NIFA; NSF IOS; BEACON.
Thank you for the invitation!
Editor's Notes
Bad habit…
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.