2013 pag-equine-workshop

Next-Gen Sequencing:
4 years in the trenches

C. Titus Brown
Asst Prof, CSE and Microbiology;
BEACON NSF STC
Michigan State University
ctb@msu.edu

These slides are available online.

“titus brown slideshare”

You can also e-mail me: ctb@msu.edu

Also note that these are my opinions and observations, culled
from personal experience, online material, and reading. I’m
happy to cite/explain further upon request, but:
Your Mileage May Vary

Things I won’t talk about
Don’t work on/with/have anything useful to say about:
Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)

Work on but you’re probably not interested in:
Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms

Overview
 Shotgun sequencing basics

 Things everyone wants to know: how much $$...

 Various current problems & challenges

 Technology, now and future

 Some papers and projects worth looking at; & our own
experiences

Two specific concepts:
First, sequencing everything at random is very much easier
than sequencing a specific gene region. (For example, it will
soon be easier and cheaper to shotgun-sequence all of E. coli
than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or
surfaces, or whatnot) then any increase in density (smaller
wells, or better imaging) leads to a squared increase in the
number of sequences.

These two concepts underlie the recent stunning increases
in sequencing capacity.

What are current costs for
Illumina?
Approximate costs from MSU sequencing center, a few
months ago, including labor:

RNAseq:
$200 prep / sample
Single-ended 1x50 -- $1100/lane -- 100-150 million reads
Paired-end 2x100 -- $2500/lane -- 200-300 million reads (i.e. 100-150 million pairs)

Barcoding samples, etc, gets complicated.
Discuss biology, etc with a sequencing geek before going
forward!
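A rough sketch of how the per-lane prices above turn into per-sample numbers once samples are barcoded and pooled. The function names are illustrative, and the even split of reads across barcodes (with no extra barcoding charge) is an assumption; real pools are never perfectly balanced.

```python
def cost_per_sample(lane_cost, prep_cost, samples_per_lane):
    """Library prep plus an equal share of one lane (assumes no barcoding surcharge)."""
    return prep_cost + lane_cost / samples_per_lane


def reads_per_sample(lane_reads, samples_per_lane):
    """Reads per sample, assuming the pool is perfectly balanced (it rarely is)."""
    return lane_reads / samples_per_lane


# Paired-end 2x100 lane from the slide: $2500/lane, low end of 200-300 million reads.
LANE_COST = 2500
PREP_COST = 200
LANE_READS = 200e6

for n in (1, 2, 4):
    cost = cost_per_sample(LANE_COST, PREP_COST, n)
    reads = reads_per_sample(LANE_READS, n)
    print(f"{n} sample(s)/lane: ${cost:.0f} per sample, {reads / 1e6:.0f}M reads per sample")
```

Pooling more samples per lane lowers the cost per sample but also lowers the reads per sample, which is why the pooling level should be chosen together with the read depth the experiment needs.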

What does this data really give
you??
 With RNAseq, you can do de novo (genome- and gene-annotation-independent)
gene & isoform discovery and quantification; 50-100m reads/sample is
probably “enough”
(see: http://blog.fejes.ca/?p=607 for a good discussion)

 With genome resequencing, you can do variant
analysis/discovery; I recommend 20x depth.

 De novo assembly of complex vertebrate genomes is not casual:
Cheap short-read sequencing does not yet deliver good long-range
contiguity; repeats, heterozygosity get in the way.
Assembly & scaffolding process itself is still evolving.
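To put the 20x recommendation in concrete terms, here is a small sketch of the usual depth arithmetic (depth = reads x read length / genome size). The 2.7 Gbp genome size is a hypothetical value for a large vertebrate genome, and the 200 million reads per lane is the low end of the lane yield quoted on the cost slide, assumed to hold for genomic libraries too.

```python
import math


def reads_needed(genome_size_bp, depth, read_length_bp):
    """Number of reads needed so that reads * read_length / genome_size >= depth."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 2.7e9   # assumption: hypothetical large vertebrate genome
TARGET_DEPTH = 20        # recommended resequencing depth from the slide
READ_LENGTH_BP = 100     # one end of a 2x100 run
READS_PER_LANE = 200e6   # assumption: low end of 200-300 million reads/lane

n_reads = reads_needed(GENOME_SIZE_BP, TARGET_DEPTH, READ_LENGTH_BP)
n_lanes = math.ceil(n_reads / READS_PER_LANE)
print(f"{n_reads / 1e6:.0f} million reads, about {n_lanes} lanes")
```

Under these assumptions the answer is a few lanes per individual, which is why resequencing depth is a budget decision as much as a technical one.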

Why so much data?
Why do we need 10-20x coverage (resequencing) or
50-100m reads (mRNAseq) with Illumina?

Two (linked) reasons:
Shotgun sequencing is random
Counting/sampling variation

1. Useful minimum coverage
depends on high average coverage

2. mRNAseq quantitation – must
overcome sampling variation
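Both points can be illustrated with a simple Poisson model. This is a sketch under the assumption that reads land uniformly at random; real data are more variable than this, so the numbers are best cases. The function names are illustrative.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def fraction_below(min_cov, avg_cov):
    """Expected fraction of bases covered fewer than min_cov times."""
    return poisson_cdf(min_cov - 1, avg_cov)


def count_cv(expected_reads):
    """Coefficient of variation of a Poisson read count (sampling noise only)."""
    return 1 / math.sqrt(expected_reads)


# 1. Useful minimum coverage depends on high average coverage.
for avg in (5, 10, 20):
    print(f"average {avg}x: {fraction_below(5, avg):.2%} of bases below 5x")

# 2. mRNAseq quantitation: relative noise shrinks only as 1/sqrt(count).
for reads in (10, 100, 1000):
    print(f"{reads} reads on a transcript: ~{count_cv(reads):.0%} sampling noise")
```

The first loop shows why the average has to sit well above the minimum you need at every base; the second shows why low-abundance transcripts need deep sequencing before their counts become stable.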

Coverage conclusions
More coverage rarely hurts (you can always discard data, but
it is harder/more $$ to get more data from an old sample)

Your desired coverage numbers should be driven by
sensitivity considerations.

Problems and challenges
Systematic bias in sequencing and software.

Genome assembly: scaffolding and sensitivity

Gene references

mRNAseq isoform construction

Resequencing: bias and error
Calling SNPs by mapping --

U. Colorado
http://genomics-course.jasondk.org/?p=395

Both sequencing and bioinformatics
yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat
sequences.
Indels are handled badly by current tools (up to 60% false
positive rate?!)
Oxidation of DNA during library prep step (acoustic
shearing) generated 8-oxoguanine “lesions” responsible for
artifacts involving C>A/G>T triplets.

=> With any data set, especially big ones, there will be both
random and systematic error and bias.
http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/

Suggestion: Cortex variant caller

Iqbal et al., Nat Genet. 2012, pmid 22231483

Genome assembly: scaffolding &
sensitivity
Everyone wants two things from a genome assembly --

Long/correct scaffolds

See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon

Complete genome content

Sequence data
[Figure: reads. Panel 1: original DNA -> fragments. Panel 2: original DNA -> fragments -> sequenced ends.]

http://www.cbcb.umd.edu/research/assembly_primer.shtml
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Contigs
Building contigs

Aligned reads:
ACGCGATTCAGGTTACCACG
  GCGATTCAGGTTACCACGCG
    GATTCAGGTTACCACGCGTA
      TTCAGGTTACCACGCGTAGC
        CAGGTTACCACGCGTAGCGC
          GGTTACCACGCGTAGCGCAT
            TTACCACGCGTAGCGCATTA
              ACCACGCGTAGCGCATTACA
                CACGCGTAGCGCATTACACA
                  CGCGTAGCGCATTACACAGA
                    CGTAGCGCATTACACAGATT
                      TAGCGCATTACACAGATTAG

Consensus contig:
ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
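The contig-building idea can be sketched in a few lines: merge each read onto the growing contig using the longest suffix/prefix overlap. This is a toy illustration only. It assumes error-free reads that are already in left-to-right order, which real assemblers cannot assume, and the function names are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_in_order(reads):
    """Extend a contig read by read; assumes reads are sorted and error-free."""
    contig = reads[0]
    for read in reads[1:]:
        n = overlap(contig, read)
        if n == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        contig += read[n:]  # append only the part that is new
    return contig


# The twelve reads from the slide, in order.
reads = [
    "ACGCGATTCAGGTTACCACG", "GCGATTCAGGTTACCACGCG", "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC", "CAGGTTACCACGCGTAGCGC", "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA", "ACCACGCGTAGCGCATTACA", "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA", "CGTAGCGCATTACACAGATT", "TAGCGCATTACACAGATTAG",
]

contig = merge_in_order(reads)
print(contig)
```

With sequencing errors, repeats, or heterozygosity the "longest overlap" is no longer unambiguous, which is where real assembly gets hard.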


Scaffolds
Ordered, oriented contigs
[Figure: mate pairs link contigs and give a gap size estimate; a scaffold is a series of contigs separated by gaps.]
http://dx.doi.org/10.6084/m9.figshare.100940


Longer reads!
Long reads can span repeats and heterozygous regions
[Figure: repeat copy 1 and repeat copy 2; contig 1 and contig 4 joined through polymorphic contigs 2 and 3.]

Cod: PacBio results
Mapping to the published genome
[Figure: three subreads mapped to the published genome: 11.4 kbp, 10.6 kbp and 10.9 kbp.]


Sensitivity – does your genome
include everything?
Generally not!

For example, the chick genome is missing a substantial
number of genes from microchromosomes:
723 genes from HSA19q missing from chicken galGal4.
ESTs and RNAseq transcripts for many or most.

Approach - Digital normalization
(a computational version of library normalization)

Digital normalization “smooths
out” coverage from different
loci, and can “recover” low
coverage regions for assembly.
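A minimal sketch of the idea behind digital normalization: estimate each read's coverage from the median count of its k-mers among the reads kept so far, and discard the read if that estimate already reaches a cutoff. This toy version uses exact counts in a Python Counter rather than a memory-efficient probabilistic counter, and the default k and cutoff values are illustrative assumptions, not settings taken from the slides.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in ks) < cutoff:
            counts.update(ks)  # only kept reads contribute to the counts
            kept.append(read)
    return kept


# Toy data: one locus sequenced 10 times, another sequenced once.
abundant = "ACGGTCATTG"
rare = "TTGACCGATA"
reads = [abundant] * 10 + [rare]

kept = digital_normalization(reads, k=4, cutoff=3)
print(len(kept), "reads kept;", kept.count(rare), "copy of the rare read")
```

High-coverage loci are capped at roughly the cutoff while low-coverage reads are always retained, which is the "smoothing" effect described above.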

Applying diginorm to increase
sensitivity
Reassembled chick genome from 70x Illumina ->
normalized reads in ~24 hours.
Contig assembly contained partial or complete matches to
70% of previously unmappable transcripts assembled from
chick mRNAseq

Together with Wes Warren (WUSTL), Hans Cheng (USDA
ADOL), and Jerry Dodgson (MSU), we are proposing to apply
PacBio and normalization to improve the chick genome; this
should be a generalizable approach.

Mapping => mRNAseq quantitation

Reference transcriptome required.

Existing chick gene models lack exons,
isoforms

[Figure: our data compared with existing gene models.]
*This gene contains at least 4 isoforms.
(Exon detection is pretty good.)

Likit Preeyanon

Gene Modeler Pipeline (“gimme”?)
Merge transcripts together based on transcript mapping to
genome; can include existing gene predictions, iterate.
Construct gene models
Remove redundant sequences
Predict strands and ORFs

Likit Preeyanon

Some thoughts on bioinfo
Software is evolving very fast. Don’t worry about using the
latest, but keep an eye on possible artifacts/problems with
what you do use.

In NGS, online information (seqanswers, biostar, Twitter) is
generally far more up to date than publications.

Technology – where next?
Most slides taken from Lex Nederbragt:

http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond

High-throughput sequencing
Phase 1: more is better

Year  Instrument   Reads          Read length  Output
2005  GS20         200 000        100 bp       0.02 Gb/run
2011  GS FLX+      1.2 million    750 bp       0.7 Gb/run
2006  GA           28 million     25 bp        0.7 Gb/run
2011  HiSeq 2000   3 billion      2x100 bp     600 Gb/run
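The output figures above are essentially reads x read length. A quick check, as a sketch only: the assumption that "3 billion reads" on the HiSeq means 3 billion read pairs is mine, chosen because it reproduces the quoted 600 Gb.

```python
def run_yield_gb(reads, read_length_bp, ends=1):
    """Sequence output in gigabases: reads x read length x ends per fragment."""
    return reads * read_length_bp * ends / 1e9


# (instrument, reads, read length in bp, ends); ends=2 assumes counts are read pairs
runs = [
    ("GS20", 200_000, 100, 1),
    ("GS FLX+", 1_200_000, 750, 1),
    ("GA", 28_000_000, 25, 1),
    ("HiSeq 2000", 3_000_000_000, 100, 2),
]

for name, reads, length, ends in runs:
    print(f"{name}: {run_yield_gb(reads, length, ends):g} Gb/run")
```

The GS FLX+ comes out at 0.9 Gb rather than the quoted 0.7 Gb, presumably because the quoted read length is closer to a maximum than an average.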

Phase 2: smaller is better
GS Junior from Roche/454: 0.04 GB/run, 400 bp reads
(full-size instrument: 0.7 GB/run, 700 bp reads)

MiSeq from Illumina: 4.5 GB/run, 2x150 bp reads
(full-size instrument: 600 GB/run, 2x100 bp reads)

PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 GB/run, 100 or 200 bp reads


Why benchtop sequencing instruments?

Diagnostics
Affordable price per instrument
Small projects
Fast turnaround time


Which instrument to choose?


Phase 3: single-molecule

C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 MB per ‘run’


Real-time sequencing technology
[Figure: Pacific Biosciences' four-colour real-time sequencing method (Nature Reviews Genetics, Figure 4). Labels: phospholinked hexaphosphate nucleotides (G, A, T, C); limit of detection zone; fluorescence pulse; intensity vs. time.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Need to combine Illumina + PacBio still.
P_errorCorrection pipeline from

[Figure: PacBio error-correction results for cod. Labels: raw reads; error-corrected reads; 93% of reads recovered; 2.7x; 23x; alignments of at least 1 kb to the published cod assembly.]
Compute: 24 CPUs, 4.5 days, 100 Gb RAM

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

My perspective on tech:
Illumina HiSeq + benchtop sequencers (MiSeq) currently
most reliable for data generation: data in hand, decent
quality.

PacBio data is an excellent add-on for situations where long
reads are needed (to bridge repeats or het regions).

Two final pieces of advice
Should you work with genome centers? Maybe.
Genome centers are good at large, well funded projects.
Their default pipelines are reliable but not always cutting edge.
“Weird” problems (high heterozygosity, or complex repeats)
may require more attention than they can give.
They also have their own schedules and incentives.

Where should you go for contract sequencing?
I get asked this a lot!
My best recommendation is UC Davis.
“Cheaper” is not always “better”; data quality can vary
immensely.

Advertisement: next-gen sequence
course http://bioinformatics.msu.edu/ngs-summer-course-2013

June 10-June 20, Kellogg Biological Station; < $500
Hands on exposure to data, analysis tools.

Acknowledgements
I showed work from Likit Preeyanon and Alexis Black
Pyrkosz, in my lab
Hans Cheng is primary collaborator on chick work

USDA funded our technology development.

Lex Nederbragt for his slides :)
