1. Sample & Assay Technologies
Next Generation Sequencing:
Data analysis for genetic profiling
Ravi Vijaya Satya, Ph.D.
Senior Scientist, R&D
Ravi.VijayaSatya@QIAGEN.com
2. Sample & Assay Technologies
Welcome to the three-part webinar series
Next Generation Sequencing and its role in cancer biology
Webinar 1: Next-generation sequencing, an introduction to technology and
applications
Date:
April 4, 2013
Speaker:
Quan Peng, Ph.D.
Webinar 2:
Date:
Speaker:
Next-generation sequencing for cancer research
April 11, 2013
Vikram Devgan, Ph.D., MBA
Webinar 3:
Date:
Speaker:
Next-generation sequencing data analysis for genetic profiling
April 18, 2013
Ravi Vijaya Satya, Ph.D.
Title, Location, Date
2
3. Sample & Assay Technologies
Agenda
NGS Data Analysis
Read Mapping
Variant Calling
Variant Annotation
Targeted Enrichment
GeneRead Gene Panels
GeneRead Data Analysis Portal
Background
Workflow
Data Interpretation
3
4. Sample & Assay Technologies
Read Mapping
Reads mapped to a reference genome
Millions of reads
from a single run
Alignment
Mapping Quality
Programs for read-mapping
Hash-based
MAQ, ELAND, SOAP, Novoalign
Suffix array/Burrows Wheeler Transform based
BWA, BowTie, BowTie2, SOAP2
Title, Location, Date
4
5. Sample & Assay Technologies
Variant Calling
Determine if there is enough statistical support to call a variant
Reference sequence
ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC
Aligned reads
AACTAGACTAGGATCGTCCTAGATAGTCTCG
AACTAGACTAGGATCGTCCTACATAGTCTCG
AACTAGACTAGGATCGTCCTACATAGTCTCG
GATCGTCCTAGATAGTCTCGATAGCTCGAT
GATCGTCCTAGATAGTCTCGATAGCTCGAT
GATCGTCCTAGATAGTCTCGATAGCTCGAT
Multiple factors are considered in calling variants
No. of reads with the variant
Mapping qualities of the reads
Base qualities at the variant position
Strand bias (variant is seen in only one of the strands)
Variant Calling Software
GATK Unified Genotyper, Torrent Variant Caller, SamTools, Mutect, …
Title, Location, Date
5
6. Sample & Assay Technologies
Variant Representation
VCF – Variant Call Format
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Header lines
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been
filtered">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to
detect strand bias">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=OND,Number=1,Type=Float,Description="Overall non-diploid ratio (alleles/(alleles+nonalleles))">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
Column labels
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
#CHROM POS
ID
REF ALT
QUAL FILTER
INFO
FORMAT
Sample
chr1
11181327 rs11121691 C
T
100.0 PASS
DP=1000;MQ=87.67 GT:AD:DP 0/1:146,45:191
chr1
11190646 rs2275527
G
A
100.0 PASS
DP=1000;MQ=67.38 GT:AD:DP 0/1:462,121:583
chr1
11205058 rs1057079
C
T
100.0 PASS
DP=1000;MQ=79.57 GT:AD:DP 0/1:49,143:192
Variant calls
Title, Location, Date
6
7. Variant Annotation
Sample & Assay Technologies
dbSNP/COSMIC ID
Chro
m
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr2
R
ef
11181327 rs11121691 C
11190646 rs2275527 G
11205058 rs1057079 C
11288758 rs1064261 G
11300344 rs191073707 C
11301714 rs1135172 A
11322628 rs2295080 G
186641626 rs2853805 G
186642429 rs2206593 A
186643058
rs5275
A
186645927 rs2066826 C
29415792 rs1728828 G
Pos
ID
Alt
T
A
T
A
T
G
T
A
G
G
T
A
Actual change and position within
the codon or amino acid sequence
Gene Mutation
Name type
MTOR SNP
MTOR SNP
MTOR SNP
MTOR SNP
MTOR SNP
MTOR SNP
MTOR SNP
PTGS2 SNP
PTGS2 SNP
PTGS2 SNP
PTGS2 SNP
ALK
SNP
chr2 29416366
rs1881421 G C
ALK
SNP
chr2 29416481
rs1881420 T C
ALK
SNP
chr2 29416572
rs1670283 T C
ALK
SNP
chr2 29419591
chr2 29445458
chr2 29446184
rs1670284 G T
rs3795850 G T
rs2276550 C G
ALK
ALK
ALK
SNP
SNP
SNP
Effect of the variant
on protein coding
Codon
AA
Filtered
Variant
Allele Frequency
snpEff Effect
Change
Change Coverage
Frequency
c.6909C>T p.L2303
C=0.761 T=0.239
SYNONYMOUS_CODING
1,924
0.239
c.5553G>A p.S1851
G=0.791 A=0.208
SYNONYMOUS_CODING
5,842
0.208
c.4731C>T p.A1577
C=0.254 T=0.746
SYNONYMOUS_CODING
1,928
0.746
c.2997G>A p.N999
G=0.212 A=0.788
SYNONYMOUS_CODING
5,186
0.788
C=0.924 T=0.076
INTRON
210
0.076
c.1437A>G p.D479
A=0.248 G=0.752
SYNONYMOUS_CODING
3,965
0.752
G=0.239 T=0.755
UPSTREAM
339
0.755
G=0.0 A=1.0
UTR_3_PRIME
97
1
A=0.167 G=0.833
UTR_3_PRIME
3,552
0.833
A=0.759 G=0.241
UTR_3_PRIME
237
0.241
C=0.88 T=0.12
INTRON
209
0.12
G=0.0 A=1.0
UTR_3_PRIME
2,520
1
NON_SYNONYMOUS_CODI
c.4587G>C p.D1529E
G=0.907 C=0.093
NG
4,361
0.093
NON_SYNONYMOUS_CODI
c.4472T>C p.K1491R
T=0.954 C=0.045
NG
3,061
0.045
NON_SYNONYMOUS_CODI
c.4381T>C p.I1461V
T=0.0 C=0.999
NG
5,834
0.999
G=0.093 T=0.907
INTRON
739
0.907
c.3375G>T p.G1125
G=0.917 T=0.082
SYNONYMOUS_CODING
1,776
0.082
C=0.895 G=0.105
INTRON
475
0.105
SIFT score
Predicts the deleterious effect of an amino acid change based on how conserved the
sequence is among related species
Polyphen score
Predicts the impact of the variant on protein structure
Title, Location, Date
7
8. Sample & Assay Technologies
GeneRead DNAseq Gene Panel: Targeted Sequencing
What is targeted sequencing?
Sequencing a sub set of regions in the whole-genome
Why do we need targeted sequencing?
Not all regions in the genome are of interest or relevant to specific study
Exome Sequencing: sequencing most of the exonic regions of the genome (exome).
Protein-coding regions constitute less than 2% of the entire genome
Focused panel/hot spot sequencing: focused on the genes or regions of interest
What are the advantages of focused panel sequencing?
More coverage per sample, more sensitive mutation detection
More samples per run, lower cost per sample
Title, Location, Date
8
9. Sample & Assay Technologies
Target Enrichment - Methodology
Multiplex PCR
Small DNA input (< 100ng)
Short processing time
(several hrs)
Relatively small throughput
(KB - MB region)
Sample
preparation
(DNA
isolation)
PCR target
enrichment
(2 hours)
Title, Location, Date
Library
construction
Sequencing
Data analysis
9
10. Sample & Assay Technologies
Variants Identifiable through Multiplex PCR
SNPs – single nucleotide polymorphisms
Indels
Indels < 20 bp in length
Variants not callable
Structural variants
Large indels
Inversions
Copy Number Variants (CNVs)
Large insertion
Inversion
CNV
Title, Location, Date
10
11. Sample & Assay Technologies
GeneRead DNAseq Gene Panel
Multiplex PCR technology based targeted enrichment for DNA sequencing
Cover all human exons (coding region + UTR)
Division of gene primers sets into 4 tubes; up to 1200 plex in each tube
11
12. Sample & Assay Technologies
GeneRead DNAseq Gene Panel
Focus on your Disease of Interest
Comprehensive Cancer Panel (124 genes)
Disease Focused Gene Panels (20 genes)
Breast cancer
Colon Cancer
Gastric cancer
Leukemia
Liver cancer
Genes Involved in Disease
Lung Cancer
Ovarian Cancer
Prostate Cancer
Genes with High Relevance
12
14. Sample & Assay Technologies
Data Analysis for Targeted Sequencing
GeneRead data analysis work flow
Read
Mapping
Primer
Trimming
Variant
Calling
Variant
Annotation
Read mapping
Identify the possible position of the read within the reference
Align the read sequence to reference sequences
Primer trimming
Remove primer sequences from the reads
Variant calling
Identify differences between the reference and reads
Variant filtering and annotation
Functional information about the variant
Title, Location, Date
14
15. Sample & Assay Technologies
Reads from Targeted Sequencing
Typical NGS raw read from targeted sequencing
Adapter
Barcode
Primer
Insert sequence
Primer
Adapter
-3’
5’Removal of adapters and de-multiplexing
Primer
Insert sequence
Primer
-3’
5’-3’
5’Read length can vary:
only part of the insert
5’or the 3’ primer may
be present
5’-
-3’
-3’
Title, Location, Date
15
16. Sample & Assay Technologies
Read Mapping
Align reads to the reference genome
Reference sequence
Amplicon 1
Amplicon 2
Aligned reads
Title, Location, Date
16
17. Primer Trimming
Sample & Assay Technologies
Primer sequences must be trimmed for accurate variant calling
Reference sequence
Amplicon 1
Frequency of `C` without
primer trimming = 4/13 = 31%
C
C
C
C
Aligned reads
Amplicon 2
Title, Location, Date
Frequency of `C` after primer
trimming = 4/7 = 57%
17
18. Sample & Assay Technologies
GeneRead Variant Calling Overview
BAM file
(w/ flow space info)
BED file for
amplicons used
Run Parameters
Annotation
Variant calling and filtering
Torrent Variant
Caller (TVC)
vcf
GATK Variant
Annotator
IonTorrent
vcf
snpEff
(basic annotation)
vcf
Seq.
platform
vcf
GATK Unified
Genotyper
MiSeq
GATK Indel
Realigner
bam
bam
GATK Base
Quality
Recalibrator
GATK Variant
Filtration
vcf
Additional filtering
(based on
frequency and
coverage)
SnpSift
(links to dbSNP,
Cosmic and
computation of
Sift scores, etc.)
VCF to excel
dbSNP
Cosmic
dbNSFP
Variants in excel
format
Title, Location, Date
18
19. Sample & Assay Technologies
Indel realignment
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat
Genet. 2011 May; 13(5):191-8. PMID: 21178889
Eliminates some false-positive variant calls around indels
Read aligners can not eliminate these alignment errors since they align reads
independently
Multiple sequence alignment can identify these errors and correct them
Title, Location, Date
19
20. Sample & Assay Technologies
Base Quality Recalibration
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data.
Nat Genet. 2011 May; 13(5):191-8. PMID: 21178889
Eliminates sequencer-specific biases
Lane-specific/sample-specific biases
Instrument-specific under-reporting/over-reporting of quality scores
Systematic errors based on read position
Di-nucleotide-specific sequencing errors
Recalibrartion leads to improved variant calls
Title, Location, Date
20
21. Sample & Assay Technologies
Variant Filtration
Variant Frequency
Somatic mode
SNPs with frequency < 4% and indels with frequency < 20%
Germline mode
SNPs with frequency < 20% and indels with frequency < 25%
Strand Bias
SNPs with FS ≤ 60
Indels with FS ≤ 200
Mapping Quality
SNPs with MQ ≤ 40.0
C
C
C
Haplotype Score
SNPs with HaplotypeScore ≤ 13.0
Not applicable for pooled samples
Title, Location, Date
Strand Bias: variants that are
present in reads from only one of
the two strands
21
22. Sample & Assay Technologies
Specificity Analysis
Specificity: the percentage of sequences that map to the intended targets
region of interest
number of on-target reads / total number of reads
Reference
sequence
ROI 1
ROI 2
NGS
reads
Off-target reads
On-target reads
Title, Location, Date
On-target
reads
22
23. Sample & Assay Technologies
Sequencing Depth
Coverage depth (or depth of coverage): how many times each base has been sequenced
Unlike Sanger sequencing, in which each sample is sequenced 1-3 times to be confident of
its nucleotide identity, NGS generally needs to cover each position many times to make a
confident base call, due to relative high error rate (0.1 - 1% vs 0.001 – 0.01%)
Increasing coverage depth is also helpful to identify low frequent mutations in heterogenous
samples such as cancer sample
Reference
sequence
NGS
reads
coverage depth = 4
Title, Location, Date
coverage depth = 3
coverage depth = 2
23
24. Sample & Assay Technologies
NGS Data Analysis: Uniformity
Coverage uniformity: measure the evenness of the coverage depth of target
position
Reference
sequence
NGS
reads
coverage depth = 10
coverage depth = 3
coverage depth = 2
Title, Location, Date
24
25. Sample & Assay Technologies
GeneRead Data Analysis Web Portal
FREE Complete & Easy to use Data Analysis with Web-based Software
25
26. Sample & Assay Technologies
GeneRead Data Analysis Web Portal
Title, Location, Date
26
29. Sample & Assay Technologies
Note: Runtimes depend on the number of reads in the input files. Typical runtimes are 20-60 minutes.
Title, Location, Date
29
33. Sample & Assay Technologies
Summary
Run Summary
Specificity
Coverage
Uniformity
Numbers of SNPs and Indels
Summary By Gene
Specificity
Coverage
Uniformity
# of SNPs and Indels
33
34. Sample & Assay Technologies
Features of Variant Report
SNP detection
Indel detection
34
35. Sample & Assay Technologies
QIAGEN’s GeneRead DNAseq Gene Panel System
FOCUS ON YOUR RELEVANT GENES
Focused:
Biologically relevant content
selection enables deep sequencing
on relevant genes and identification
of rare mutations
Flexible:
Mix and match any gene of interest
NGS platform independent:
Functionally validated for PGM,
MiSeq/HiSeq
Integrated controls:
Enabling quality control of prepared
library before sequencing
Free, complete and easy of use data
analysis tool
36. Sample & Assay Technologies
Upcoming webinars
Next Generation Sequencing and its role in cancer biology
Webinar 3:
Date:
Speaker:
Next-generation sequencing data analysis for genetic profiling
April 18, 2013
Ravi Vijaya Satya, Ph.D.
Webinar 1: Next-generation sequencing, an introduction to technology and
applications
Date:
May 3, 2013
Speaker:
Quan Peng, Ph.D.
Webinar 2:
Date:
Speaker:
Next-generation sequencing for cancer research
May 10, 2013
Vikram Devgan, Ph.D., MBA
Title, Location, Date
36