1. Next-Gen Sequencing:
4 years in the trenches
C. Titus Brown
Asst Prof, CSE and Microbiology;
BEACON NSF STC
Michigan State University
ctb@msu.edu
2. These slides are available online.
“titus brown slideshare”
You can also e-mail me: ctb@msu.edu
Also note that these are my opinions and observations, culled
from personal experience, online material, and reading. I’m
happy to cite/explain further upon request, but:
Your Mileage May Vary
3. Things I won’t talk about
Don’t work on/with/have anything useful to say about:
Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)
Work on but you’re probably not interested in:
Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms
4. Overview
Shotgun sequencing basics
Things everyone wants to know: how much $$...
Various current problems & challenges
Technology, now and future
Some papers and projects worth looking at; & our own
experiences
7. Two specific concepts:
First, sequencing everything at random is very much easier
than sequencing a specific gene region. (For example, it will
soon be easier and cheaper to shotgun-sequence all of E. coli
than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or
surfaces, or whatnot) then any increase in density (smaller
wells, or better imaging) leads to a squared increase in the
number of sequences.
These two concepts underlie the recent stunning increases
in sequencing capacity.
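The second concept is plain geometry, and a tiny sketch makes it concrete. The square-grid layout, the pitch values and the function name below are illustrative assumptions, not properties of any real instrument.

```python
def features_per_area(pitch_um: float, area_um2: float = 1_000_000.0) -> float:
    """Number of wells/spots on a square grid with the given centre-to-centre pitch.

    Assumes an idealized square grid; real flow cells and plates differ.
    """
    return area_um2 / (pitch_um ** 2)


baseline = features_per_area(2.0)  # hypothetical 2 um pitch
denser = features_per_area(1.0)    # halve the pitch (2x linear density)
print(denser / baseline)           # 4.0: 2x linear density gives 4x the sequences
```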
12. What are current costs for Illumina?
Approximate costs from MSU sequencing center, a few
months ago, including labor:
RNAseq:
$200 prep / sample
Single-ended 1x50: $1100/lane, 100-150 million reads
Paired-end 2x100: $2500/lane, 200-300 million reads (divide by 2 for read pairs)
Barcoding samples, etc, gets complicated.
Discuss biology, etc with a sequencing geek before going
forward!
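To compare the two lane types, it can help to turn the numbers above into cost per million reads. This is only a rough sketch: the helper names are made up, the per-sample function assumes an equal split of one lane across samples, and it ignores the barcoding complications mentioned above.

```python
def cost_per_million_reads(lane_cost_usd: float, reads_millions: float) -> float:
    """Dollars per million reads for a single lane."""
    return lane_cost_usd / reads_millions


def cost_per_sample(lane_cost_usd: float, samples_per_lane: int,
                    prep_cost_usd: float = 200.0) -> float:
    """Prep cost plus an equal share of one lane (assumption: even multiplexing)."""
    return prep_cost_usd + lane_cost_usd / samples_per_lane


# Lane prices and read ranges are the approximate figures quoted on this slide.
lanes = [
    ("1x50 single-ended", 1100, (100, 150)),
    ("2x100 paired-end", 2500, (200, 300)),
]
for label, cost, (low, high) in lanes:
    best = cost_per_million_reads(cost, high)
    worst = cost_per_million_reads(cost, low)
    print(f"{label}: ${best:.2f} to ${worst:.2f} per million reads")

# Hypothetical example: four samples sharing one paired-end lane.
print(cost_per_sample(2500, samples_per_lane=4))
```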
13. What does this data really give you??
With RNAseq, you can do de novo (genome- and gene-annotation-independent)
gene & isoform discovery and quantification; 50-100 million reads/sample is
probably “enough”
(see: http://blog.fejes.ca/?p=607 for a good discussion)
With genome resequencing, you can do variant
analysis/discovery; I recommend 20x depth.
De novo assembly of complex vertebrate genomes is not casual:
Cheap short-read sequencing does not yet deliver good long-range
contiguity; repeats, heterozygosity get in the way.
Assembly & scaffolding process itself is still evolving.
14. Why so much data?
Why do we need 10-20x coverage (resequencing) or 50-100 million
reads (mRNAseq) with Illumina?
Two (linked) reasons:
Shotgun sequencing is random
Counting/sampling variation
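A back-of-the-envelope model shows how the two reasons interact. The sketch below assumes reads land uniformly at random, so per-base depth is roughly Poisson (the idealized Lander-Waterman picture); real data is more uneven, so treat the output as optimistic. The genome size and read numbers are hypothetical.

```python
import math


def mean_coverage(n_reads: int, read_len_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_len_bp / genome_size_bp


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


# Hypothetical 1 Gbp genome sequenced with 100 bp reads.
genome = 1_000_000_000
for n_reads in (10_000_000, 50_000_000, 100_000_000):
    c = mean_coverage(n_reads, 100, genome)
    print(f"{c:4.1f}x coverage, {fraction_uncovered(c):.2e} of bases never sampled")
```

Even at 1x average coverage, over a third of the bases would be missed by chance, which is why "one read per base" is nowhere near enough.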
17. Coverage conclusions
More coverage rarely hurts (you can always discard data, but
it is harder/more $$ to get more data from an old sample)
Your desired coverage numbers should be driven by
sensitivity considerations.
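One way to turn a sensitivity requirement into a coverage number is to ask how deep you must sequence so that almost every site is seen at least k times. The sketch below again assumes Poisson-distributed depth, and the thresholds (8 reads, 99% of sites) are purely illustrative, not recommendations.

```python
import math


def prob_at_least(k: int, mean_depth: float) -> float:
    """P(X >= k) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


def min_depth_for(k: int, target: float, step: float = 0.5,
                  max_depth: float = 200.0) -> float:
    """Smallest mean depth (on a grid of `step`) with P(depth >= k) >= target."""
    depth = step
    while depth <= max_depth:
        if prob_at_least(k, depth) >= target:
            return depth
        depth += step
    raise ValueError("target not reachable within max_depth")


# Illustrative: mean depth needed so 99% of sites have at least 8 reads.
print(min_depth_for(k=8, target=0.99))
```

Heterozygous variants, systematic bias and mapping problems all push the real requirement higher than this idealized estimate.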
18. Problems and challenges
Systematic bias in sequencing and software.
Genome assembly: scaffolding and sensitivity
Gene references
mRNAseq isoform construction
19. Resequencing: bias and error
Calling SNPs by mapping --
U. Colorado
http://genomics-course.jasondk.org/?p=395
20. Both sequencing and bioinformatics yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat
sequences.
Indels are handled badly by current tools (up to 60% false
positive rate?!)
Oxidation of DNA during library prep step (acoustic
shearing) generated 8-oxoguanine “lesions” responsible for
artifacts involving C>A/G>T triplets.
=> With any data set, especially big ones, there will be both
random and systematic error and bias.
http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
22. Genome assembly: scaffolding & sensitivity
Everyone wants two things from a genome assembly --
Long/correct scaffolds
See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon
Complete genome content
23. Sequence data
[Figure: reads are the sequenced ends of the original DNA fragments.]
http://www.cbcb.umd.edu/research/assembly_primer.shtml
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
25. Scaffolds
Ordered, oriented contigs
[Figure: mate pairs link contigs into a scaffold; each gap between contigs has a gap size estimate.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
http://dx.doi.org/10.6084/m9.figshare.100940
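As a data structure, a scaffold is just a list of contigs, each with an orientation and an estimated gap to the next one. The sketch below is a minimal illustration with made-up class and field names; it writes gaps as runs of N, which is a common convention, and is not how any particular assembler stores scaffolds.

```python
from dataclasses import dataclass
from typing import List

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class PlacedContig:
    name: str
    sequence: str
    reverse: bool = False  # orientation within the scaffold
    gap_after: int = 0     # estimated gap (bp) to the next contig


def scaffold_sequence(contigs: List[PlacedContig]) -> str:
    """Join ordered, oriented contigs, filling estimated gaps with N."""
    parts = []
    for i, contig in enumerate(contigs):
        seq = reverse_complement(contig.sequence) if contig.reverse else contig.sequence
        parts.append(seq)
        if i < len(contigs) - 1:  # no gap after the last contig
            parts.append("N" * contig.gap_after)
    return "".join(parts)


scaffold = [
    PlacedContig("contig1", "ACGT", gap_after=3),
    PlacedContig("contig2", "AAGG", reverse=True),
]
print(scaffold_sequence(scaffold))  # ACGTNNNCCTT
```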
26. Longer reads!
Long reads can span repeats and heterozygous regions
[Figure: a long read spanning repeat copy 1 and repeat copy 2; long reads spanning polymorphic contigs 2 and 3 between contig 1 and contig 4.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
27. Cod: PacBio results
Mapping to the published genome
11.4 kbp subread
10.6 kbp subread
10.9 kbp subread
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
28. Sensitivity – does your genome include everything?
Generally not!
For example, the chick genome is missing a substantial
number of genes from microchromosomes:
723 genes from HSA19q missing from chicken galGal4.
ESTs and RNAseq transcripts exist for many or most of them.
29. Approach - Digital normalization
(a computational version of library normalization)
Digital normalization “smooths
out” coverage from different
loci, and can “recover” low
coverage regions for assembly.
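The core idea can be sketched in a few lines: stream through the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is still below a cutoff. This is a simplified illustration, not the production implementation: it uses an exact in-memory counter instead of a memory-efficient probabilistic one, it does not treat a k-mer and its reverse complement as the same, and the default k and cutoff are illustrative.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers in a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads: Iterable[str], k: int = 20,
                          cutoff: int = 20) -> List[str]:
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts: Counter = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:  # read shorter than k
            continue
        # Median k-mer count approximates the coverage of this read's locus.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Toy data: one locus sampled 10 times, another sampled once.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalization(reads, k=4, cutoff=3)
print(len(kept), "of", len(reads), "reads kept")
```

High-coverage loci stop contributing reads once they pass the cutoff, while the rarely sampled read is always kept, which is the "smoothing" described above.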
30. Applying diginorm to increase sensitivity
Reassembled chick genome from 70x Illumina ->
normalized reads in ~24 hours.
Contig assembly contained partial or complete matches to
70% of previously unmappable transcripts assembled from
chick mRNAseq
Together with Wes Warren (WUSTL), Hans Cheng (USDA
ADOL), Jerry Dodgson (MSU) proposing to apply PacBio
and normalization to improve chick genome; should be
generalizable approach.
34. Gene Modeler Pipeline (“gimme”?)
Merge transcripts together based on transcript mapping to
genome; can include existing gene predictions, iterate.
Construct gene models
Remove redundant sequences
Predict strands and ORFs
Likit Preeyanon
35. Some thoughts on bioinfo
Software is evolving very fast. Don’t worry about using the
latest, but keep an eye on possible artifacts/problems with
what you do use.
In NGS, online information (seqanswers, biostar, Twitter) is
generally far more up to date than publications.
36. Technology – where next?
Most slides taken from Lex Nederbragt:
http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
37. High-throughput sequencing
Phase 1: more is better
Year  Instrument   Reads         Read length  Yield
2005  GS20         200 000       100 bp       0.02 Gb/run
2011  GS FLX+      1.2 million   750 bp       0.7 Gb/run
2006  GA           28 million    25 bp        0.7 Gb/run
2011  HiSeq 2000   3 billion     2x100 bp     600 Gb/run
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
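The yield figures on this slide follow roughly from reads times read length. The sketch below checks that arithmetic; the function name is made up, and treating the HiSeq "3 billion reads" as 3 billion read pairs is an assumption made here so that 2x100 bp gives 600 Gb. The GS FLX+ line comes out higher than the quoted 0.7 Gb, presumably because 750 bp is closer to a maximum than an average read length.

```python
def yield_gb(n_reads: float, read_len_bp: int, ends: int = 1) -> float:
    """Gigabases per run: reads (or pairs) x read length x number of ends."""
    return n_reads * read_len_bp * ends / 1e9


# Figures quoted on this slide; `ends=2` for paired-end runs.
runs = [
    ("GS20 (2005)", 200_000, 100, 1),
    ("GS FLX+ (2011)", 1_200_000, 750, 1),
    ("GA (2006)", 28_000_000, 25, 1),
    ("HiSeq 2000 (2011)", 3_000_000_000, 100, 2),
]
for name, reads, length, ends in runs:
    print(f"{name}: {yield_gb(reads, length, ends):g} Gb/run")
```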
38. High-throughput sequencing
Phase 2: smaller is better
Instrument                              Yield                  Read length
GS Junior (Roche/454)                   0.04 Gb/run            400 bp
  (full-size counterpart)               0.7 Gb/run             700 bp
MiSeq (Illumina)                        4.5 Gb/run             2x150 bp
  (full-size counterpart)               600 Gb/run             2x100 bp
PGM (Ion Torrent/Life Technologies)     0.01, 0.1 or 1 Gb/run  100 or 200 bp
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
39. High-throughput sequencing
Why benchtop sequencing instruments?
Diagnostics
Affordable price per instrument
Small projects
Fast turnaround time
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
40. Which instrument to choose?
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
41. High-throughput sequencing
Phase 3: single-molecule
C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 Mb per ‘run’
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
42. High-throughput sequencing
Real-time sequencing technology
[Figure: real-time sequencing. Pacific Biosciences’ four-colour real-time sequencing method is shown: phospholinked hexaphosphate nucleotides (G, A, T, C), the limit of the detection zone, and fluorescence pulse intensity over time. Source: Nature Reviews Genetics.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
43. Need to combine Illumina + PacBio still.
P_errorCorrection pipeline from
[Figure: error-correction workflow, raw reads in and error-corrected reads out. Labels: 2.7x, 23x.]
93% of reads recovered
Alignments of at least 1 kb to the published cod assembly
24 CPUs, 4.5 days, 100 GB RAM
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
44. My perspective on tech:
Illumina HiSeq + benchtop sequencers (MiSeq) currently
most reliable for data generation: data in hand, decent
quality.
PacBio data is an excellent add-on for situations where long
reads are needed (to bridge repeats or het regions).
45. Two final pieces of advice
Should you work with genome centers? Maybe.
Genome centers are good at large, well funded projects.
Their default pipelines are reliable but not always cutting edge.
“Weird” problems (high heterozygosity, or complex repeats)
may require more attention than they can give.
They also have their own schedules and incentives.
Where should you go for contract sequencing?
I get asked this a lot!
My best recommendation is UC Davis.
“Cheaper” is not always “better”; data quality can vary
immensely.
46. Advertisement: next-gen sequence course
http://bioinformatics.msu.edu/ngs-summer-course-2013
June 10-June 20, Kellogg Biological Station; < $500
Hands-on exposure to data, analysis tools.
47. Acknowledgements
I showed work from Likit Preeyanon and Alexis Black
Pyrkosz, in my lab
Hans Cheng is primary collaborator on chick work
USDA funded our technology development.
Lex Nederbragt for his slides :)