High Throughput Sequencing Technologies: On the path to the $0* genome

Brian Krueger, PhD
Duke University
Center for Human Genome Variation

Chromatin Basics
1) 1400nm - Metaphase Chromosome
2) 700nm - Condensed Chromosome
3) 300nm - Extended Condensed Chromosome
4) 30nm – Packed nucleosomes
5) 11nm – Nucleosome string
6) 2nm – DNA double Helix
Image credit: Nature Education
• Chromatin is the DNA packing material
• Two forms
– Euchromatin
• Open and actively transcribed
– Heterochromatin
• Packed and not producing RNA

DNA Basics
Credit: Wikimedia Commons
• DNA is made of a sugar-phosphate backbone and four nitrogenous bases
– Purines
• Adenine
• Guanine
– Pyrimidines
• Cytosine
• Thymine
• Sequence of bases determines when and what proteins are made

Gene Expression – Enhancers/Promoters
• DNA is converted into usable information in a process called transcription
– Enhancers
• Serve as accessory beacons that bind proteins involved in regulating gene expression
• Help the polymerase “find” where a gene is located in the chromatin
– Promoter
• Located just upstream of the transcription start site
• Staging site for RNA polymerase II and the transcription factors that help it create mRNA
– Transcription start site
• First transcribed base of the mRNA sequence

Gene Expression – Transcription/Translation
• Genes are composed of exons and introns
– Exons are the regions of a gene kept in the mature mRNA; most are protein coding
– Introns are noncoding regions that must be removed from the RNA transcript to produce mature mRNA
• Introns are removed from the transcript by the spliceosome, largely while transcription is still in progress
– Sequence dependent process
• Mature mRNA is capped (with a methylated guanine) and a poly-adenine tail is added for stability
• Sequence exported to the cytoplasm for translation and protein production
• Mutations to the DNA can negatively affect every step of this process!

Common DNA Mutations
Sequence variants
Reference: ATCGGGTCATGTCA
Single nucleotide variant: ATCGGGTCATATCA
Small insertion: ATCGGGTCATGACGTCA
Small deletion: ATCGGGTCAT
Structural variants (chromosome segments)
Reference: A B C D
Deletion: A C D
Translocation: A B G E / F
Duplication: A B C D C
Inversion: A B D C
Credit: Elizabeth Ruzzo, PhD, CHGV

Common DNA Mutations
• Effects
– No effect
– Too much protein
– Too little protein
– No protein
– Not the right protein
Image Credit: Cooper et al. Nat Rev Genet
• Site of Mutation Matters
– Exons
– RNA splice sites
– Enhancers
– Promoters
– 5’ and 3’ UTR regulatory regions
We’re all mutants! Your genome has 4 million single nucleotide variants and 700,000 insertions/deletions!
Luckily, the genome is 3 billion base pairs and only 2% of those bases code for protein
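As a rough back-of-the-envelope check on those numbers, the sketch below assumes variants are spread uniformly across the genome. That uniformity is an assumption made here for illustration only (real variants are not evenly distributed), and the constant and function names are hypothetical.

```python
# Figures quoted on the slide.
GENOME_SIZE_BP = 3_000_000_000
CODING_FRACTION = 0.02
SNVS_PER_GENOME = 4_000_000
INDELS_PER_GENOME = 700_000


def expected_in_coding(n_variants, coding_fraction=CODING_FRACTION):
    """Expected variants landing in coding sequence, ASSUMING uniform spread."""
    return n_variants * coding_fraction


# Average distance between neighbouring SNVs under the same assumption.
snv_spacing_bp = GENOME_SIZE_BP / SNVS_PER_GENOME

coding_snvs = expected_in_coding(SNVS_PER_GENOME)
coding_indels = expected_in_coding(INDELS_PER_GENOME)

print(f"One SNV every ~{snv_spacing_bp:.0f} bp on average")
print(f"~{coding_snvs:,.0f} SNVs and ~{coding_indels:,.0f} indels in coding sequence (uniform model)")
```

Even this crude model shows that the large majority of a person's variants fall outside protein coding sequence.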

DNA Sequencing
• Mutations/variations can be detected using DNA sequencing
– First invented in the mid 1970s
– Two methods developed, both read out by gel electrophoresis
– Maxam-Gilbert Sequencing
• Chemical modification and cleavage paired with gel electrophoresis
• DNA is 5’ labeled with radioactivity
• Exposed to chemical agents that cause base-specific DNA breaks
• Run on a gel and the pattern reveals which base is at each site
– Sanger Sequencing
• Dideoxy DNA sequencing paired with gel electrophoresis
• DNA is 5’ labeled with radioactivity
• Small amount of one dideoxy base added to each of 4 separate primer extension reactions
• Run on a gel to determine bases at each position by size
Figure: Maxam-Gilbert and Sanger gel readouts. A dideoxy base has no 3’-OH, so no extension!
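The read-by-size logic of Sanger sequencing can be sketched in a few lines. This is an idealized toy model, not a simulation of real chemistry: it assumes every possible termination product is present exactly once, and the function names are made up for this example.

```python
def sanger_lanes(synthesized_strand):
    """Return, for each dideoxy base, the lengths of the terminated fragments.

    Each of the 4 reactions ("lanes") contains fragments that stop wherever
    its dideoxy base was incorporated (no 3'-OH, so no further extension).
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from shortest to longest fragment across all lanes."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


lanes = sanger_lanes("ATCGGGTCATGTCA")
print(lanes["G"])       # fragment lengths ending in ddG
print(read_gel(lanes))  # sequence recovered by ordering fragments by size
```

Sorting the fragments by length is what the gel does physically: the shortest fragment runs furthest, and the lane it sits in names the base at that position.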

First Generation Sequencing Technology
• Sanger sequencing
– Beat Maxam-Gilbert Sequencing as the method of choice
– Became fully automated
• Dideoxy bases replaced with fluorescently labeled dideoxy bases (1 reaction now instead of 4)
• Capillary electrophoresis replaces slab gel electrophoresis
• Lasers and computers replace graduate students and postdocs
• By far the dominant sequencing method up until 2007 – 30 years!
– Still considered the gold standard for validating sequencing data
• Huge limitations for genome wide sequencing because Sanger can only be used to sequence one fragment per Sanger reaction

Human Genome Sequencing
• Done using Sanger sequencing…
• Took more than 10 years to complete (1990-2003)
• Cost $3 billion
• Used a technique called hierarchical whole genome shotgun sequencing
– Shotgun sequencing also invented by Frederick Sanger
– Genome fragmented into 200-400kb fragments
– Genome fragments cloned into a library of over 30,000 BACs (bacterial artificial chromosomes)
– Cloned fragments were then fragmented again
– Sanger sequencing performed
– Genome assembled using computers to line up overlapping sequences
• Most human genome sequencing today is done using whole genome shotgun sequencing!
Figure: Hierarchical Shotgun Sequencing
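The "line up overlapping sequences" step can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions (error-free fragments, no repeats, no fragment contained inside another); it is not the method used by the Human Genome Project, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; leave remaining contigs apart
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


contigs = greedy_assemble(["ATGTCA", "ATCGGGTC", "GGTCATGT"])
print(contigs)
```

Real assemblers must also cope with sequencing errors and repeated sequence, which is where the greedy approach breaks down.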

Second Generation Sequencing
• Developed to overcome the throughput limits of Sanger sequencing
• Can sequence many molecules in parallel
– Does not require homogeneous input
– DNA sequenced as clusters or in nanowells
– Single machine can sequence 3-10 billion independent DNA fragments AT THE SAME TIME!
– Single Sanger sequencer maxes out at 1152 reactions per machine
• Time from DNA to genome reduced from 10 years to 1 day!
Pictured: Illumina HiSeq (3-9 billion clusters, 600 Gb-1.8 Tb of sequence); Ion Torrent Proton (100-300 million nanowells, 20-60 Gb of sequence)

2nd Gen: Sequencing by Synthesis Overview
1) Genomic DNA
2) Fragmented DNA
3) Ligate adaptors
4) Bind library and create clusters
5) Sequencing cycle: Add bases → Wash → Image → Cleave → Wash
6) Repeat hundreds of times on billions of clusters
7) Align reads to a reference genome
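The final step, aligning reads to a reference genome, can be sketched with a brute-force search that places a read wherever it has the fewest mismatches. This is only an illustration: production aligners use indexed references and handle gaps, and `align_read` is a hypothetical helper written for this example.

```python
def align_read(read, reference):
    """Return (position, mismatches) of the best ungapped placement of `read`."""
    best_pos, best_mismatches = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ATCGGGTCATGTCA"
# A read carrying the single nucleotide variant G>A from the mutation examples.
position, mismatches = align_read("TCATAT", reference)
print(position, mismatches)
```

A mismatch that shows up consistently across many reads aligned to the same position is the raw signal a variant caller works from.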

Mutation Calling/Filtering
1) Variant calling
2) Visual inspection
3) Cross-checking public databases (Exome Variant Server: 6500 exome sequenced individuals)
4) Sanger sequencing confirmation
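The first and third steps can be sketched as a naive caller followed by a database lookup. The thresholds, the `KNOWN_VARIANTS` set, and the function name are all hypothetical values chosen for this example; real callers use statistical models of base and mapping quality.

```python
from collections import Counter

# Hypothetical stand-in for a public database: (position, ref, alt) tuples.
KNOWN_VARIANTS = {(11, "G", "A")}


def call_variant(ref_base, pileup, min_depth=10, min_alt_fraction=0.2):
    """Call the most common non-reference base at one position, or None.

    `pileup` is the string of bases that aligned reads show at the position.
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads to trust any call
    alt_counts = Counter(base for base in pileup if base != ref_base)
    if not alt_counts:
        return None
    alt_base, count = alt_counts.most_common(1)[0]
    fraction = count / depth
    if fraction < min_alt_fraction:
        return None  # likely sequencing error, not a real variant
    return alt_base, fraction


call = call_variant("G", "GGGGGAAAAA" * 2)  # half the reads show A
is_known = call is not None and (11, "G", call[0]) in KNOWN_VARIANTS
print(call, "known" if is_known else "novel")
```

An alternate fraction near 0.5 is what a heterozygous variant would look like; calls that survive these filters would still go on to visual inspection and Sanger confirmation.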

Detecting Copy Number Variants
Figure: read depth across windows, showing a heterozygous deletion, a homozygous deletion, and a duplication
ERDS (Estimation by Read Depth with SNVs)
The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired Hidden Markov model was then applied to infer the copy number of every window using both RD information and heterozygosity information.
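The read depth idea behind this approach can be shown with a much simpler estimator than ERDS itself. The sketch below is not the paired Hidden Markov model described above: it skips GC correction and heterozygosity, and simply scales each window's depth against the median, assuming most windows are at the normal copy number of 2. The function name and depth values are invented for the example.

```python
from statistics import median


def copy_number_by_window(window_depths, ploidy=2):
    """Estimate integer copy number per window from average read depth.

    Assumes the median window represents the normal (diploid) state.
    """
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Average read depth in six consecutive 2-kb windows (made-up values).
depths = [30, 31, 15, 0, 45, 29]
copy_numbers = copy_number_by_window(depths)
print(copy_numbers)
```

Read against the figure labels, a value of 1 corresponds to a heterozygous deletion, 0 to a homozygous deletion, and 3 to a duplication.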

Flavors of Sequencing
• Whole Genome Sequencing
– Obtain whole blood or tissue sample
– Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
– Utilizes a selection protocol
– Attach complementary RNA or DNA strands to beads
– Fish out ONLY coding DNA sequences
– Create sequencing libraries from enriched DNA
– Reduces cost and analysis time
• Custom Capture
– Same protocol as Exome sequencing
– Only target desired DNA sequences
• Amplicon Sequencing
– Use PCR to amplify target DNA
– Sequence amplified DNA (Amplicon)

Disadvantages of 2nd Generation Tech
• Rely on amplification to create libraries and clusters
– All polymerases have an inherent error rate (10^-6 to 10^-7 per base)
– Errors introduced roughly once every 1 million to 10 million bases copied
– Secondary validation of variants is key
• Short reads are poorly suited to de novo assembly of large genomes
– 2nd Generation sequencers have a maximum read length of 400bp
– This is too short to span long repeat regions
– Not good for detecting trinucleotide repeat expansions, e.g. fragile X, Huntington’s, spinocerebellar ataxias
• Short reads can miss large structural variations
– Genome translocations and inversions likely will be missed
– Require significant read depth at break points for these variations to be detected
• Trouble detecting small insertions and deletions
– Short reads computationally hard to align and call
• Very high quality single molecule long reads would fix many of these problems!
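Why reads shorter than a repeat cannot size it can be demonstrated directly. The sketch below builds two hypothetical alleles that differ only in the number of CAG repeats and compares the set of error-free reads each one can produce. The flanking sequences, repeat counts, and function names are invented for the example, and read depth is deliberately ignored.

```python
def make_allele(n_repeats, unit="CAG", left="ATTGCC", right="GGTACA"):
    """Build a toy allele: unique flank + tandem repeat + unique flank."""
    return left + unit * n_repeats + right


def read_set(sequence, read_len):
    """All distinct error-free reads of a given length from a sequence."""
    return {
        sequence[i:i + read_len]
        for i in range(len(sequence) - read_len + 1)
    }


short_allele = make_allele(10)  # 30 bp of repeat
long_allele = make_allele(40)   # 120 bp of repeat (an "expansion")

# Reads shorter than the repeat: both alleles yield exactly the same reads.
same_with_short_reads = read_set(short_allele, 12) == read_set(long_allele, 12)

# Reads long enough to span the shorter repeat plus flanks tell them apart.
same_with_long_reads = read_set(short_allele, 40) == read_set(long_allele, 40)

print(same_with_short_reads, same_with_long_reads)
```

Only a read that reaches from one unique flank to the other pins down the repeat count, which is the case for long single molecule reads.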

The Future: Third Generation Sequencing
• Defined as single molecule sequencing
• Less complex sample prep and much longer read length (1-100kb) compared to 200-400bp for 2nd Gen
• Two categories
– Sequencing by synthesis
• Pioneered by Pacific Biosciences (PacBio)
• Sequencer uses super microscopes and polymerase bound nanowells to WATCH DNA as it is sequenced in real time
• Nanowells filled with fluorescently labeled DNA bases
• Fluorescence of a base only detected at the polymerase
– Direct sequencing by passing DNA through a nanopore (Oxford Nanopore)
• DNA strand fed through a membrane bound nanopore
• Ionic difference maintained between both sides of the membrane
• Detect how ion flow changes at the pore as each base passes through
• Bleeding edge technology
– Many technical hurdles with very high error rates (10-25%)
– Very expensive technology
• Costs 3-10x as much as Illumina to do whole genome sequencing
– Short/long read hybrid proposed to leverage the base accuracy of 2nd gen sequencing and the length of 3rd gen
• Use long reads as a scaffold and correct the errors with short reads

Costs Associated with Clinical Sequencing
                          Whole Genome     Exome          Custom Capture   Amplicon
Size (GB)                 100              12.5           0.13-1           0.03-0.13
Preparation               $400             $200           $80              $40
Sequencing                $4,300           $400           $12-100          $1-12
Data Processing/Storage   $350             $200           $50              $25
Clinical Review           $5,000-10,000    $2,000-6,000   $700-2,000       $400-900
Total                     $10,000-15,000   $2,800-6,800   $1,000-2,000     $500-1,000
DNA sequencing costs are falling, but analysis and clinical review costs will likely remain stable for the foreseeable future
New sequencing technology announced this year should reduce the cost of preparing and sequencing a whole genome to $1000 starting in mid 2014 (does not include analysis and review)
How will we ever get to the $0* genome?!?!
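The totals in the table can be reproduced by summing the line items, which also shows how much of the bill is clinical review. The dictionary layout and function names below are choices made for this sketch; the dollar figures are copied from the table, with single values stored as equal low and high bounds.

```python
# (low, high) cost in dollars for each line item, per test type.
COSTS = {
    "Whole Genome": {
        "Preparation": (400, 400),
        "Sequencing": (4300, 4300),
        "Data Processing/Storage": (350, 350),
        "Clinical Review": (5000, 10000),
    },
    "Exome": {
        "Preparation": (200, 200),
        "Sequencing": (400, 400),
        "Data Processing/Storage": (200, 200),
        "Clinical Review": (2000, 6000),
    },
    "Custom Capture": {
        "Preparation": (80, 80),
        "Sequencing": (12, 100),
        "Data Processing/Storage": (50, 50),
        "Clinical Review": (700, 2000),
    },
    "Amplicon": {
        "Preparation": (40, 40),
        "Sequencing": (1, 12),
        "Data Processing/Storage": (25, 25),
        "Clinical Review": (400, 900),
    },
}


def total_range(test_name):
    """Sum the low and high bounds of every line item for one test type."""
    items = COSTS[test_name].values()
    return sum(low for low, _ in items), sum(high for _, high in items)


def review_share_low(test_name):
    """Fraction of the low-end total that is clinical review."""
    review_low, _ = COSTS[test_name]["Clinical Review"]
    total_low, _ = total_range(test_name)
    return review_low / total_low


for name in COSTS:
    low, high = total_range(name)
    print(f"{name}: ${low:,}-${high:,} (review is {review_share_low(name):.0%} of the low end)")
```

The summed values round to the Total row of the table, and clinical review is the largest single item for every test type, which is why cheaper sequencing alone does not get the all-in price near $0.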

Sequencing Costs in the Genome Era
Figure: NIH sequencing cost chart, built up over three slides. Annotations: HG Draft; HG Final; Sanger – HGP High; then Sanger, Roche/454, Illumina, ABI SOLiD, Helicos. Image credit: NIH

Sequencing in the Genome Era: 2008-2010
• The Dawn of the Second Generation Sequencers
– Roche 454 - 2007
• Imaging based pyrosequencing
• Camera detects the light produced when pyrophosphate is released as a base is added in each nanowell – Bright dot = Base present
– ABI SOLiD - 2007
• Dye tagged fragment ligation
• Imaging based
• Complicated detection scheme using “color space”
– Illumina - 2008
• Imaging based reversible dye termination sequencing
• Camera detects fluorescently labeled bases in each cluster – Color determines base
– Helicos (3rd Gen) - 2009
• First “single molecule” sequencer – Third generation sequencing
• Plagued with problems
• BUT the fear that it might work helped drive down costs
Pictured: 454, Illumina GAIIx, ABI SOLiD 3, Helicos

Figure: NIH sequencing cost chart, now also annotated with Illumina's rise. Image credit: NIH

• The death of the competition
– Illumina
• Release of the HiSeq (pictured: Illumina HiSeq 2000)
• Drastically increases output, 10x over the GAIIx
– Roche 454
• Releases 454 Titanium and 454 Junior
• Used primarily for microbes because it can sequence 400bp reads and do de novo assembly of these small genomes
• Expensive and error prone
• Roche will phase out the 454 family in 2014
– ABI SOLiD
• Never caught on
• Expensive, error prone, complicated sample prep
– Helicos
• Filed for bankruptcy in 2012
• Costs remain level because Illumina has no competition

Figure: NIH sequencing cost chart, now annotated with the new entrants Complete Genomics, Ion Torrent, PacBio, and Nanopore alongside Illumina. Image credit: NIH

• New Contenders
– Complete Genomics
• Proprietary tech; generates data in-house
• Competitive pricing with Illumina sequencing
– Pacific Biosciences (3rd Gen)
• Announces the PacBio RS
• Promises high base accuracy, single molecule sequencing with reads reaching up to 20kb
– Ion Torrent
• Same sequencing methodology as the Roche 454 system
• Difference is that it detects the release of H+ after bases are added
• Removes need for time consuming imaging steps
• Promises a $1000 genome
– Oxford Nanopore (3rd Gen)
• Announces MinION and GridION
• Promises very cheap single molecule sequencing that can be done on a device the size of a thumb drive
• Promising competition forces price reductions
Pictured: PacBio RS, Ion Torrent Proton, Nanopore MinION

Figure: NIH sequencing cost chart with all platforms annotated (Sanger, Roche/454, Illumina, ABI SOLiD, Helicos, Complete Genomics, Ion Torrent, PacBio, Nanopore). Image credit: NIH

Sequencing in the Genome Era: 2012-Present
• New Contenders Fail - Mostly
– Complete Genomics
• Not embraced by the research community; serves the diagnostic niche
– Pacific Biosciences
• Didn’t deliver on promises – 15% error rate, shorter reads (1-10kb)
• Slowly improving – reduced error rate to 5-10%, reads reaching 20-50kb
– Ion Torrent
• Didn’t deliver on promises - Low data output, expensive
• Serves niche diagnostic market where speed is more valuable than cost or amount of data output
• 60 Gb PII chip has been “coming” since 2012 – Slated for late 2014 release
– Oxford Nanopore
• Finally released first data in 2014
• Full of errors and looks like proof of concept tech
– Illumina
• Releases NextSeq 500 for the diagnostic market to kill Ion Torrent
• Releases the HiSeq X, which can sequence a human genome for $1000, to kill Complete Genomics (1.8 Tb of output in 3 days! – 16 genomes)
– HiSeq X MUST be purchased as a 10 pack ($10 million)
– Buyers contractually forced to ONLY use the HiSeq X for genomes
• Prices remain steady 2012-14 because the competition can’t deliver
Image caption: Releases $1000 genome sequencer, only lets rich people use it. Hat image: chasesocal, Deviant Art
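The HiSeq X figures quoted above (1.8 terabases per run, 16 genomes) imply a particular average coverage per genome. The sketch below works that out, assuming the output is measured in bases and using the 3 billion base pair genome size quoted earlier in the deck; the function name is hypothetical.

```python
def mean_coverage(total_bases, n_genomes, genome_size_bp):
    """Average number of times each base of each genome is sequenced."""
    return total_bases / (n_genomes * genome_size_bp)


RUN_OUTPUT_BASES = 1.8e12  # 1.8 terabases per 3 day run (from the slide)
GENOMES_PER_RUN = 16       # from the slide
GENOME_SIZE_BP = 3e9       # 3 billion base pairs (from the slide)

coverage = mean_coverage(RUN_OUTPUT_BASES, GENOMES_PER_RUN, GENOME_SIZE_BP)
print(f"~{coverage:.1f}x average coverage per genome")
```

Sequencing each position many times over is what lets variant callers separate real variants from random sequencing errors.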

The Promise of the $0* Genome
• HiSeq X brings clinical genome cost down to $6-10K (mid 2014)
• Hurdles for the $0* Genome
– *Relies on health insurance companies or governments paying most of the bill
– Clinician Education
• Many clinicians do not understand genetic data or how to use it to affect patient care
– Proof of widely applicable value
• Genome sequences for MOST people not very informative
• Need more population wide data to accurately predict how variants outside of coding regions contribute to disease
• Currently used in cancer, neonatal, fertility and undiagnosed disease diagnostics
– Cost reduction
• Cost of the all-in test needs to be <$5000
• Similar to other high diagnostic value, high tech tests such as PET, CT, and MRI scans
• Likely to happen with streamlined analysis pipelines
• Improvements over the next few years will cause more insurance companies to approve payment on whole genome diagnostics
