This document gives an overview of next-generation sequencing technologies and their applications. It discusses genome enrichment techniques for isolating targeted regions for sequencing, and describes template preparation methods such as emulsion PCR and solid-phase amplification. It then reviews sequencing platforms such as Illumina, SOLiD and 454, and details their sequencing and imaging processes. It closes with exercises on working with sequencing data files on the command line and in Galaxy.
16. What?
Only sequence the relevant parts of the genome instead of the whole genome, e.g.:
• specific Mb-scale regions known to be involved in particular disease (e.g.
based on GWAS)
• specific candidate genes belonging to disease pathway
• exome (= all exons)
=> how to isolate these from non-target sequence? “pulldown”
19. Performance metrics
• fold-enrichment: ratio of abundance of target sequences post-enrichment vs
pre-enrichment
• capture specificity: fraction of sequence reads that map to target
• uniformity: relative abundance of individual targets after enrichment
• completeness: fraction of target bases detectably captured
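The four metrics above can be computed from simple read and base counts. The following is a minimal sketch, not a standard implementation: the function name and its parameters are hypothetical, the pre-enrichment abundance is assumed to equal the target's share of the genome, uniformity is summarised as a coefficient of variation, and "detectably captured" is assumed to mean a depth of at least 1X.

```python
from statistics import mean, pstdev


def enrichment_metrics(on_target_reads, total_reads, target_bp, genome_bp,
                       per_target_depth, per_base_depth, min_depth=1):
    """Illustrative enrichment metrics (all names are assumptions)."""
    # Capture specificity: fraction of sequence reads that map to target.
    specificity = on_target_reads / total_reads
    # Assumed pre-enrichment abundance: target size as a share of the genome.
    expected_fraction = target_bp / genome_bp
    # Fold-enrichment: post-enrichment abundance vs pre-enrichment abundance.
    fold_enrichment = specificity / expected_fraction
    # Uniformity, summarised here as coefficient of variation (lower = more even).
    uniformity_cv = pstdev(per_target_depth) / mean(per_target_depth)
    # Completeness: fraction of target bases covered at min_depth or more.
    completeness = sum(d >= min_depth for d in per_base_depth) / len(per_base_depth)
    return {
        "specificity": specificity,
        "fold_enrichment": fold_enrichment,
        "uniformity_cv": uniformity_cv,
        "completeness": completeness,
    }


# Made-up numbers: a 30 Mb target in a 3 Gb genome, 60% of reads on target.
metrics = enrichment_metrics(
    on_target_reads=600_000, total_reads=1_000_000,
    target_bp=30_000_000, genome_bp=3_000_000_000,
    per_target_depth=[10, 20, 30], per_base_depth=[0, 5, 12, 3],
)
print(metrics)
```

With these made-up inputs the target makes up 1% of the genome, so 60% on-target reads corresponds to a 60-fold enrichment.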
21. Problem: most imaging systems not designed to detect single fluorescent event
=> need amplified templates
Aim: to produce a representative, non-biased source of nucleic acid material
from the genome under investigation => population of identical templates
Steps:
1. shear DNA
2. amplify templates
Options: emulsion PCR (emPCR) or solid phase amplification
22. Amplification by emulsion PCR
emulsion = mixture of two or more immiscible (unblendable) liquids; e.g.
mayonnaise, vinaigrette
emPCR: thousands of microreactors/micro-eppendorfs
one bead + one DNA molecule per microreactor => PCR to 1000s of copies
27. Cyclic reversible termination
DNA synthesis is terminated after adding single nucleotide
start/stop/start/stop/start/stop/...
Illumina: 4-colour
[Figure: sequencing steps and sequencing result. Source: Metzker, 2010]
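The start/stop cycle can be sketched as a toy simulation. This is an illustration only: it ignores strand orientation, errors and dephasing, the function names are hypothetical, and the flow order "CTAG" used for the one-colour variant is an assumption. In the 4-colour case all four labelled nucleotides are present in every cycle, so each cycle yields one base call; in the 1-colour case only one nucleotide type is offered per cycle, so a base is read only when the offered nucleotide matches.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def four_colour_crt(template, cycles):
    """One cycle = incorporate one terminated nucleotide, image, cleave."""
    calls = []
    for position in range(min(cycles, len(template))):
        # Start: the polymerase adds exactly one labelled nucleotide, then stops.
        incorporated = COMPLEMENT[template[position]]
        # Image: the dye colour identifies which of the four bases was added.
        calls.append(incorporated)
        # Cleave: dye and terminating group are removed before the next cycle.
    return "".join(calls)


def one_colour_crt(template, flow_order="CTAG", max_flows=100):
    """Single dye: one nucleotide type per flow, at most one base per flow."""
    position, flows, read = 0, 0, []
    while position < len(template) and flows < max_flows:
        offered = flow_order[flows % len(flow_order)]
        if COMPLEMENT[template[position]] == offered:
            read.append(offered)  # a signal is seen in this flow
            position += 1
        flows += 1
    return "".join(read), flows


print(four_colour_crt("ACGT", cycles=4))  # TGCA
print(one_colour_crt("ACGT"))             # ('TGCA', 7)
```

The one-colour run needs 7 flows to read 4 bases here, which illustrates why a one-colour scheme generally needs more cycles per base than a four-colour scheme.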
28. Helicos: 1-colour
[Figure: sequencing steps and sequencing result. Source: Metzker, 2010]
33. Run time and throughput (Gb/run)
Platform     Run time   Gb/run
Roche 454    8.5 hr     0.45
Illumina     9 days     35
SOLiD        14 days    50
Helicos      8 days     37
PacBio       ?          ?
34. Accuracy - base calling error
• base quality drops along the read (“dephasing” within clusters)
• base calling errors; accuracy ranking:
Sanger > SOLiD > Illumina > 454 > Helicos
35. Accuracy - homopolymer runs
Issue for Roche 454:
39% of errors occur in homopolymers
A5 motifs: 3.3% error rate
A8 motifs: 50% error rate
Reason: signal intensity is used as a measure of homopolymer length
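Why longer runs are harder to call can be illustrated with a toy noise model. The sketch below assumes (this is not from the slides) that the measured signal for a run of n identical bases is normally distributed around n with a standard deviation proportional to n, and that the length is called by rounding the signal. The function names and the 5% noise level are hypothetical, and the model is not expected to reproduce the error rates quoted above.

```python
import math


def call_length(signal):
    """Call the homopolymer length as the nearest whole number of bases."""
    return max(0, round(signal))


def miscall_probability(length, cv=0.05):
    """P(called length != true length) under the assumed noise model.

    Signal ~ Normal(length, (cv * length)^2); a miscall happens when the
    signal lands more than 0.5 away from the true length.
    """
    sd = cv * length
    return math.erfc(0.5 / (sd * math.sqrt(2)))


for n in (1, 5, 8):
    print(n, miscall_probability(n))

# A signal of 4.4 is called as 4, a signal of 4.6 as 5.
print(call_length(4.4), call_length(4.6))
```

Because the spacing between neighbouring lengths stays fixed at one unit while the noise grows with the run length, the miscall probability rises steeply for long runs.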
39. Is it 4? Is it 5? Is it 4?
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
40. Consensus accuracy
Increase accuracy for SNP calling by increasing coverage:
Illumina: 20X
SOLiD: 12X
454: 7.4X
Sanger: 3X
Factors: raw accuracy + read length
How deep do you have to sequence? => Poisson distribution: “If you sequence at
an average of 10X, how much of the genome will be covered at least 5X?”
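The Poisson question can be answered directly. The sketch below assumes that per-base coverage follows a Poisson distribution with the average coverage as its mean, which ignores real-world biases (GC content, mappability, capture efficiency); the function name is hypothetical.

```python
import math


def fraction_covered_at_least(mean_coverage, min_depth):
    """P(depth >= min_depth) when depth ~ Poisson(mean_coverage)."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


print(round(fraction_covered_at_least(10, 5), 3))  # 0.971
```

Under this idealised model, sequencing at an average of 10X leaves roughly 3% of the genome covered fewer than 5 times.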
42. FASTQ file format
Each FASTQ entry consists of four lines:
1. “@” + identifier
2. sequence
3. “+” + identifier (optional)
4. phred-based quality scores
[Figures: example FASTA entries (n=2); example FASTQ entries (n=2); phred quality score encoding. Source: Wikipedia]
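How a quality line maps to numbers can be sketched as follows. A phred score Q corresponds to an error probability of 10^(-Q/10), and each score is stored as a single character whose ASCII code is Q plus an offset (33 for Sanger-style files, 64 for older Illumina files). The record and the function names below are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality line into numeric phred scores."""
    return [ord(character) - offset for character in quality_string]


def error_probability(q):
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


# A hypothetical four-line FASTQ record (identifier "read1" is invented).
record = ["@read1", "ACGT", "+", "II5!"]
identifier, sequence, quality = record[0][1:], record[1], record[3]

scores = phred_scores(quality)  # Sanger-style offset of 33 assumed
print(identifier, sequence, scores)  # read1 ACGT [40, 40, 20, 0]
print(error_probability(20))         # 1 error in 100 calls
```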
43. Sequence quality control
Is this good sequence? (essential!)
E.g.: using FastQC tool (Babraham Institute, UK;
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
51. Online genome analysis
http://galaxy.psu.edu/
“Galaxy allows you to do analyses you cannot do anywhere else without the
need to install or download anything. You can analyze multiple alignments,
compare genomic annotations, profile metagenomic samples and much much
more...”
59. Try to log in to the server mentioned on Toledo with the username and password
provided there.
There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and
s_2_sequence.txt (= paired ends)
• How many sequences are in s_1_sequence.txt?
• What encoding was used for the quality score? Illumina? Sanger?
• What are the numerical quality scores for the first sequence in
s_1_sequence.txt (i.e. 7172283/1)?
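One possible way to approach these three questions is sketched below. It assumes that every record spans exactly four lines, and it guesses the quality offset from the lowest character seen, which is only a heuristic: a file containing nothing but high-quality Sanger-encoded bases would be misclassified. All function names are hypothetical.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(islice(stripped, 4))
        if len(record) < 4:
            return
        header, sequence, _plus, quality = record
        yield header[1:], sequence, quality


def guess_offset(qualities):
    """Heuristic: characters below ASCII 59 can only occur with offset 33."""
    lowest = min(ord(ch) for quality in qualities for ch in quality)
    return 33 if lowest < 59 else 64


def summarise(lines):
    """Return (record count, guessed offset, first id, first read's scores)."""
    records = list(read_fastq(lines))
    offset = guess_offset(quality for _, _, quality in records)
    first_id, _, first_quality = records[0]
    return len(records), offset, first_id, [ord(ch) - offset for ch in first_quality]


# Invented two-record example; for the exercise, pass an open file handle:
# with open("s_1_sequence.txt") as handle: print(summarise(handle))
example = ["@r1/1\n", "ACGT\n", "+\n", "hhhB\n",
           "@r2/1\n", "GGCC\n", "+\n", "ffff\n"]
print(summarise(example))  # (2, 64, 'r1/1', [40, 40, 40, 2])
```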
60. • Create an account on the Galaxy server
• Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload
them into Galaxy. These files are also available on the Linux server.
• Have a look at the contents of s_1_sequence.txt.
• Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ
Groomer”)
• Draw the quality score boxplot for s_1_sequence.txt
• Draw the nucleotide distribution chart for s_1_sequence.txt
61. References
Bentley DR et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456: 53-59 (2008)
Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could
sequence everything? Nature Biotechnology 26: 1125-1133 (2008)
Korbel JO et al. Paired-end mapping reveals extensive structural variation in the
human genome. Science 318: 420-426 (2007)
Mardis ER. A decade’s perspective on DNA sequencing technology. Nature
470: 198-203 (2011)
Metzker ML. Sequencing technologies - the next generation. Nature Reviews
Genetics 11: 31-46 (2010)
Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology
26: 1135-1145 (2008)
Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics
and Human Genetics 10 (2009)