Bioc4010 lectures 1 and 2

Next-Generation Sequence Analysis
for Biomedical Applications

BIOC 4010/5010
Lecture 1

Dr. Dan Gaston
Postdoctoral Fellow, Department of Pathology
Dr. Karen Bedard Lab
Bioinformatician, IGNITE Project

Introduction to Next-Gen Sequencing

LECTURE 1

Overview: Lecture 1
• Introduction AKA “Why does this matter?”
• “Next-Gen” Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
– http://www.slideshare.net/DanGaston

Major Areas in Human Disease
Genomics
• Complex diseases
– Genome Wide Association Studies (GWAS)
• Cancer
– Tumour genomics (Driver mutations)
– Transcriptomics
• Mendelian disease
– Whole Genome/Exome Sequencing
– Transcriptomics
– Genetic Linkage

Diagnosing Genetic Diseases
• Genetic Counselors/Physicians order
individual testing of genes based on patient
phenotype
• For rare diseases or unusual phenotypes may
run tens to hundreds of tests
• …..EXPENSIVE (Easily thousands of dollars)

Genetic Disease Research: Charcot-Marie-Tooth

Chromosome 9:
120,962,282-133,033,431

Charcot-Marie-Tooth
• Linked Genomic Region ~12Mb in size
• Contains 143 Genes
• Prioritize and select genes for individual
Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive

Human Genomics

• $5,000 - $10,000 to sequence whole genome
• $1000 to sequence only protein-coding
portion (exome, later)

Clinical Genomics

• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic
testing (traditional method)

Cancer Genomics

Welch JS, et al. JAMA, 2011;305, 1577

Cancer Chemotherapy Resistance

Human Disease Genomics at Dalhousie
• IGNITE: Identifying genetic mutations causing
rare mendelian diseases in Atlantic Canada
– 3 year, $2.5 million Genome Canada Project
– Currently working on >10 different diseases including
two inherited cancers
– Sequenced >20 individual exomes, 4 whole genomes,
and several transcriptomes
– More on Thursday…
• Dr. Graham Dellaire: Transcriptome sequencing
and analysis on multiple cancer cell lines

Short Reads

Millions of paired “short
reads”, 75-150bp each

FastQ Format

Read ID

Sequence

Quality line
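
The record layout above can be read with a few lines of Python. This is a minimal sketch, assuming the common four-line FASTQ layout (an `@`-prefixed read ID, the sequence, a `+` separator line, and the quality line); the names `FastqRecord` and `parse_fastq` are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)  # the '+' line
        quality = next(it, None)
        if quality is None or separator is None or not separator.startswith("+"):
            raise ValueError(f"truncated or malformed record: {header!r}")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = ["@read1", "GATTACA", "+", "IIIIIII"]
records = list(parse_fastq(example))
print(records[0].read_id, records[0].sequence, records[0].quality)
```

Each base in the sequence has exactly one character in the quality line, which is why the length check is part of the parser.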

FastQ Quality Scores

Quality Score (Q) | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1,000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%

Q = -10 log10 P
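
The relationship between Q and P can be checked numerically. The sketch below assumes quality characters are encoded as ASCII with an offset of 33 (the Sanger-style encoding); older files may use a different offset, so treat `offset=33` as an assumption. Function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_line(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into integer Q scores (assumed offset 33)."""
    return [ord(ch) - offset for ch in quality]


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```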

Quality Scores of Sequencing Reads

General Genomics Workflow
1) Raw Data Analysis: Quality control of raw data
2) Whole Genome Mapping: Alignment to reference genome
3) Variant Calling: Detection of genetic variation (SNPs, Indels, SV)
4) Annotation: Linking variants to biological information

Short Read Mapping
…CCAT CTATATGCG TCGGAAATT CGGTATAC
…CCAT GGCTATATG CTATCGGAAA GCGGTATA
…CCA AGGCTATAT CCTATCGGA TTGCGGTA C…
…CCA AGGCTATAT GCCCTATCG TTTGCGGT C…
…CC AGGCTATAT GCCCTATCG AAATTTGC ATAC…
…CC TAGGCTATA GCGCCCTA AAATTTGC GTATAC…
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…

1) Report location of genome where read matches best
2) Minimize mismatches
3) Mismatches with lower quality bases better than
mismatches with higher quality bases
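
The three criteria above can be sketched as a scoring rule: slide the read along the reference and, at each position, add up the quality scores of the mismatched bases, so that a mismatch at a low-quality base costs less than one at a high-quality base. This is an illustrative toy (ungapped, forward strand only), not how a production aligner works; `best_placement` is a made-up name.

```python
def mismatch_penalty(ref_window: str, read: str, quals: list) -> int:
    """Sum of base qualities at positions where read and reference disagree."""
    return sum(q for r, b, q in zip(ref_window, read, quals) if r != b)


def best_placement(reference: str, read: str, quals: list):
    """Return (0-based position, penalty) of the lowest-penalty placement."""
    best_pos, best_pen = None, None
    for pos in range(len(reference) - len(read) + 1):
        pen = mismatch_penalty(reference[pos:pos + len(read)], read, quals)
        if best_pen is None or pen < best_pen:
            best_pos, best_pen = pos, pen
    return best_pos, best_pen


reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC"
# Exact match
print(best_placement(reference, "GCGCCCTA", [30] * 8))
# One mismatch at a low-quality base (Q5) still places the read at the same spot
print(best_placement(reference, "GCGACCTA", [30, 30, 30, 5, 30, 30, 30, 30]))
```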

Discovering Genetic Variation
SNPs
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
reference genome TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT
TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC
TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA
AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA
TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG
TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG

INDELs

Next-Gen Sequencing Experiments
• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq

Composition of Human Genome
Size: 3.2 Gb

Genomic Content
Chromosome | Base pairs | Variations | Confirmed proteins | Putative proteins | Pseudogenes | miRNA | rRNA | Misc ncRNA
1 | 249,250,621 | 4,401,091 | 2,012 | 31 | 1,130 | 134 | 66 | 106
2 | 243,199,373 | 4,607,702 | 1,203 | 50 | 948 | 115 | 40 | 93
3 | 198,022,430 | 3,894,345 | 1,040 | 25 | 719 | 99 | 29 | 77
4 | 191,154,276 | 3,673,892 | 718 | 39 | 698 | 92 | 24 | 71
5 | 180,915,260 | 3,436,667 | 849 | 24 | 676 | 83 | 25 | 68
6 | 171,115,067 | 3,360,890 | 1,002 | 39 | 731 | 81 | 26 | 67
7 | 159,138,663 | 3,045,992 | 866 | 34 | 803 | 90 | 24 | 70
8 | 146,364,022 | 2,890,692 | 659 | 39 | 568 | 80 | 28 | 42
9 | 141,213,431 | 2,581,827 | 785 | 15 | 714 | 69 | 19 | 55
10 | 135,534,747 | 2,609,802 | 745 | 18 | 500 | 64 | 32 | 56
11 | 135,006,516 | 2,607,254 | 1,258 | 48 | 775 | 63 | 24 | 53
12 | 133,851,895 | 2,482,194 | 1,003 | 47 | 582 | 72 | 27 | 69
13 | 115,169,878 | 1,814,242 | 318 | 8 | 323 | 42 | 16 | 36
14 | 107,349,540 | 1,712,799 | 601 | 50 | 472 | 92 | 10 | 46
15 | 102,531,392 | 1,577,346 | 562 | 43 | 473 | 78 | 13 | 39
16 | 90,354,753 | 1,747,136 | 805 | 65 | 429 | 52 | 32 | 34
17 | 81,195,210 | 1,491,841 | 1,158 | 44 | 300 | 61 | 15 | 46
18 | 78,077,248 | 1,448,602 | 268 | 20 | 59 | 32 | 13 | 25
19 | 59,128,983 | 1,171,356 | 1,399 | 26 | 181 | 110 | 13 | 15
20 | 63,025,520 | 1,206,753 | 533 | 13 | 213 | 57 | 15 | 34
21 | 48,129,895 | 787,784 | 225 | 8 | 150 | 16 | 5 | 8
22 | 51,304,566 | 745,778 | 431 | 21 | 308 | 31 | 5 | 23
X | 155,270,560 | 2,174,952 | 815 | 23 | 780 | 128 | 22 | 52
Y | 59,373,566 | 286,812 | 45 | 8 | 327 | 15 | 7 | 2
mtDNA | 16,569 | 929 | 13 | 0 | 0 | 0 | 2 | 22

Transcriptomics: RNA-Seq
• Sequence the actively transcribed genes in a
cell line or tissue
– Only about 20% of genes are transcribed in
any particular cell type
• Two types:
– Poly-A selection
– Total RNA + ribodepletion
• Many experimental questions can be
addressed

RNA-Seq: Gene Expression
Condition 1

Condition 2

RNA-Seq: Differential Splicing

Exon 1 Exon 2 Exon 3

RNA-Seq: Novel/Non-Canonical Exon
Discovery

Exon 1 Exon 2 Exon X Exon 3

RNA-Seq: Gene Fusion Events

Exon 1 Exon 2 Exon 3

Gene 2 Exon 4

RNA-Seq
• Important to take into account biological
variability. A sample of cells is a mixed population
– Replicates!
• Not suited for discovering polymorphisms due to
higher error rates introduced by reverse
transcription step (RNA -> cDNA)
• High false-positive rates for fusion gene and novel
exon discovery when expression levels are low

Short Read Mapping: Placing Millions
of Reads on Human Reference

• Problem: Efficiently place millions of reads
(75bp – 200bp) accurately within 3.2Gb of
reference genome
• Problem: Read may match equally well at
more than one location (pseudogenes, copy
number variation, repetitive elements)
• Problem: Sequencing reads may be paired

Short Read Mapping: Brute Force
Method

Simple conceptually: Compare each query k-mer to all
k-mers of genome

Genome Size (N): 3.2 billion bases
K-mer length (M): 7
Number of comparisons ((N - M + 1) * M): about 22 billion
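
A brute-force search is easy to write down, which makes the cost visible. The sketch below counts character comparisons while scanning a toy sequence, and separately evaluates the worst-case expression (N - M + 1) * M for the genome size and k-mer length given above. Function names are illustrative.

```python
def brute_force_find(genome: str, query: str):
    """Return (0-based hit positions, number of character comparisons made)."""
    hits, comparisons = [], 0
    for start in range(len(genome) - len(query) + 1):
        matched = True
        for offset in range(len(query)):
            comparisons += 1
            if genome[start + offset] != query[offset]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            hits.append(start)
    return hits, comparisons


def worst_case_comparisons(n: int, m: int) -> int:
    """Upper bound: every one of the N - M + 1 windows is compared in full."""
    return (n - m + 1) * m


hits, comparisons = brute_force_find("TGATTACAGATTACC", "GATTACA")
print(hits, comparisons)
print(worst_case_comparisons(3_200_000_000, 7))
```

That is the cost for a single query; repeating it for millions of reads is what makes the brute-force approach impractical.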

Solution

Index the Reference Genome

Indexing the reference is like constructing a phone
book: you can move quickly to the relevant portion of the
genome and ignore the rest.

Short Read Alignment: Suffix Array
Split genome into all suffixes (substrings) and sort
alphabetically

Allows query to be searched against an alphabetical
reference, skipping 96% of the genome

Ex: banana

Suffixes | Sorted
banana | a
anana | ana
nana | anana
ana | banana
na | na
a | nana
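
The banana example can be reproduced in a couple of lines. This is a naive sketch that sorts the suffixes directly, which is fine for a toy string but far too slow and memory-hungry for a genome; `build_suffix_array` is an illustrative name. The array stores only the start position of each suffix, not the suffix text itself.

```python
def build_suffix_array(text: str) -> list:
    """Return the start positions of all suffixes, in alphabetical order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
suffix_array = build_suffix_array(text)
for position in suffix_array:
    print(position, text[position:])
```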

Short Read Alignment: Binary Search
• Searching the index efficiently is still a
problem…
Search for GATTACA…

Index # | Sequence | Pos
1 | ACAGATTACC… | 6
2 | ACC… | 13
3 | AGATTACC… | 8
4 | ATTACAGATTACC… | 3
5 | ATTACC… | 10
6 | C… | 15
7 | CAGATTACC… | 7
8 | CC… | 14
9 | GATTACAGATTACC… | 2
10 | GATTACC… | 9
11 | TACAGATTACC… | 5
12 | TACC… | 12
13 | TGATTACAGATTACC… | 1
14 | TTACAGATTACC… | 4
15 | TTACC… | 11

Binary Search

• Initialize search range to entire list
– mid = (hi+lo)/2; middle = suffix[mid]
– if query matches middle: done
– else if query < middle: pick low range
– else if query > middle: pick hi range
• Repeat until done or empty range
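
The procedure above translates almost line for line into code. This sketch builds a naive suffix array for the toy sequence from the GATTACA example and searches it; positions are 0-based here, whereas the slide's table is 1-based. The names are illustrative.

```python
def build_suffix_array(text: str) -> list:
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text: str, suffix_array: list, query: str):
    """Binary search for a suffix that starts with `query`; return its start or None."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]  # compare only the first len(query) bases
        if middle == query:
            return start
        elif query < middle:
            hi = mid - 1  # pick low range
        else:
            lo = mid + 1  # pick high range
    return None  # empty range: query does not occur


text = "TGATTACAGATTACC"
suffix_array = build_suffix_array(text)
print(find_occurrence(text, suffix_array, "GATTACA"))
print(find_occurrence(text, suffix_array, "TTT"))
```

Each step halves the search range, so a lookup takes on the order of log2(N) suffix comparisons instead of a scan of the whole genome.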

Applied to Human Genome
• In practice simple methods of indexing the
genome can create very large data structures
– Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow
you to index and compress the data:
– Burrows-Wheeler Transform
– FM-Index
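
To give a feel for the Burrows-Wheeler Transform, here is a toy sketch that builds it from sorted rotations and then inverts it. It assumes a `$` terminator that sorts before every letter, and it uses the simplest possible (quadratic) construction; real aligners build the transform and FM-index far more efficiently. For scale, a plain suffix array storing one 4-byte integer per base of a 3.2 Gb genome would already need about 12.8 GB, assuming that layout.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Last column of the sorted rotations of text + terminator."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes the indexed genome compressible.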

Short Read Mapping: Mapping Quality
• So far we have ignored the quality scores of reads
• Mapping Quality (for a read): Sum the quality
scores at mismatched bases for alignment
(SUM_BASE_Q(best)), also consider all other
possible alignments

MQ = -\log_{10}\left(1 - \frac{10^{-SUM\_BASE\_Q(best)}}{\sum_i 10^{-SUM\_BASE\_Q(i)}}\right)
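
The mapping quality formula can be evaluated directly. The sketch below follows the formula as written on the slide, with no additional scaling, and takes the list of SUM_BASE_Q values for every candidate alignment of a read. To avoid floating-point cancellation it computes the bracketed term as (weight of all other alignments) / (total weight), which is algebraically the same quantity. `mapping_quality` is an illustrative name.

```python
import math


def mapping_quality(sum_base_q: list) -> float:
    """MQ for the best alignment, given SUM_BASE_Q for every candidate alignment."""
    weights = [10.0 ** (-s) for s in sum_base_q]
    best_index = min(range(len(sum_base_q)), key=sum_base_q.__getitem__)
    others = sum(w for i, w in enumerate(weights) if i != best_index)
    if others == 0.0:
        return math.inf  # no competing alignment: placement is unambiguous
    p_wrong = others / (others + weights[best_index])
    return -math.log10(p_wrong)


print(mapping_quality([2, 2]))   # two equally good placements: low MQ
print(mapping_quality([0, 30]))  # second-best placement is far worse: high MQ
print(mapping_quality([0]))      # unique placement
```

A read that fits two locations equally well gets a mapping quality near zero, which is the situation the "mapping quality zero" filter in variant callers targets.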

Short Read Aligners
• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• BWA: First to use Burrows-Wheeler Transform
• Bowtie: Ungapped alignment only
• Bowtie2: Allows indels
• … and many more

Identifying and Annotating Genomic Variation for Disease Gene Discovery

LECTURE 2

Genetic Variation
• dbSNP (NCBI) catalogues > 53 million Single
Nucleotide Variations (SNVs) in humans
– 38 million validated
– 22 million in genes
– 36 million with frequencies
• 50-80% of mutations involved in inherited
disease caused by SNVs

SNP vs SNV
• Technically a polymorphism is a variation that
doesn’t cause disease and is common in a
population
• What is common?
– Greater than 5% in a population is a typical
definition
– Definition for rare ranges from < 0.5% to < 1.5%

Discovering Genetic Variation

SNPs
reference genome TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT
TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC
TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA
AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA
TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG
TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG

INDELs

Variant Calling: The Absurdly Simple
Way
Read depth at base: 10 T: 4 A: 6

Genotype: Heterozygous A/T

TTCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
reference genome

Variant Calling: The Absurdly Simple
Way
• Algorithm:
– Count all aligned bases that pass quality threshold
(e.g. >Q20)
– If #reads with alternative base > lower bound (20%)
and < upper bound (80%) call heterozygous alt
– Else if > upper bound call homozygous alternative
– Else call homozygous reference
• …But can base qualities be used for more than
deciding which reads to keep?
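
The threshold algorithm above fits in one small function. This is a sketch under a few assumptions: the pileup is a list of (base, quality) pairs, only the most frequent non-reference base is considered, and a fraction exactly at the upper bound is treated as homozygous alternative. `call_genotype_simple` is a made-up name.

```python
from collections import Counter


def call_genotype_simple(pileup, ref_base, min_base_q=20, lower=0.2, upper=0.8):
    """Return (label, genotype) using fixed allele-fraction thresholds."""
    bases = [base for base, qual in pileup if qual > min_base_q]
    if not bases:
        return ("no call", None)
    counts = Counter(bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return ("homozygous reference", f"{ref_base}/{ref_base}")
    alt_base = max(alt_counts, key=alt_counts.get)
    alt_fraction = alt_counts[alt_base] / len(bases)
    if lower < alt_fraction < upper:
        return ("heterozygous", f"{ref_base}/{alt_base}")
    if alt_fraction >= upper:
        return ("homozygous alternative", f"{alt_base}/{alt_base}")
    return ("homozygous reference", f"{ref_base}/{ref_base}")


# Read depth 10 with T: 4 and A: 6, assuming A is the reference base
pileup = [("A", 30)] * 6 + [("T", 30)] * 4
print(call_genotype_simple(pileup, ref_base="A"))
```

Note that base quality is used only as a keep-or-discard filter here; a Q21 base and a Q40 base count the same once they pass.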

Improving Variant Calling

• MAQ (Mapping and Assembling with Quality):
– Short Read Mapper and Genotype Caller
– First to use base qualities for either
– Introduced mapping quality


① Base quality cannot be more reliable than the
mapping quality of the read
② An individual can have at most two real
nucleotides at a position (two alleles)
– Only consider the two most frequent nucleotides
– Simplify to two states: A and B

• Three Possible Genotypes:
– AA, BB, AB
• Construct a model that includes base quality
to estimate the probability of error
• Calculate the probability of each genotype
given the data and error rate
• Genotype with highest probability is called

The Model

g = genotype
e = error probability
m = ploidy (2)
k = number of reads

[Equation shown on slide; annotations mark the terms for reads that match the reference and for reads that don't match the reference]
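
One way to make the model concrete is the sketch below. It is an assumption, not a transcription of the slide: it uses a commonly cited form of the genotype likelihood in which g counts reference alleles (0, 1, or 2 for a diploid), each read that matches the reference contributes ((m - g) * e + g * (1 - e)) / m, and each read that doesn't match contributes ((m - g) * (1 - e) + g * e) / m, where e is that read's base error probability (assumed strictly between 0 and 1). Function names are illustrative.

```python
import math


def genotype_log_likelihood(g, ref_errors, alt_errors, m=2):
    """Log-likelihood of the reads given g reference alleles out of m."""
    k = len(ref_errors) + len(alt_errors)
    log_lik = -k * math.log(m)
    for e in ref_errors:  # reads that match the reference
        log_lik += math.log((m - g) * e + g * (1 - e))
    for e in alt_errors:  # reads that don't match the reference
        log_lik += math.log((m - g) * (1 - e) + g * e)
    return log_lik


def most_likely_genotype(ref_errors, alt_errors, m=2):
    """Return the g in 0..m with the highest likelihood."""
    return max(range(m + 1),
               key=lambda g: genotype_log_likelihood(g, ref_errors, alt_errors, m))


# 6 reads match the reference, 4 do not, all with error probability 0.01 (Q20)
print(most_likely_genotype([0.01] * 6, [0.01] * 4))   # g = 1, i.e. genotype AB
print(most_likely_genotype([0.01] * 10, []))          # g = 2, homozygous reference
```

Unlike a fixed allele-fraction threshold, low-quality mismatches are down-weighted automatically because their error probability is large.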

• Two widely used tool sets for calling variants
– samtools (uses MAQ-type calculation)
– Genome Analysis Toolkit (GATK)
UnifiedGenotyper
• UnifiedGenotyper: Capable of calling both
indels and single nucleotide polymorphisms
(SNPs) and allele frequencies given multiple
samples

UnifiedGenotyper
Apply filters to discard poor reads and remove
biases:
① Duplicate reads
② Malformed reads (i.e. number of bases and number of
base qualities differ)
③ Bad mate (paired-end sequencing, paired reads map
to different chromosomes)
④ Mapping quality zero (maps to multiple locations
equally well)
⑤ Fewer than 10% mismatch on read in 20bp to either
side of position

Remove Duplicate Reads
Application | Avg #Molecules/Library | Read Length | Avg #Molecules Sampled | Molecules Sampled > 1
30x Genome | 5bn | 2x100 | 450m | 4.4%
4x Genome | 5bn | 2x100 | 60m | 0.6%
100x Exome | 500m | 2x75 | 20m | 2.0%

Duplicate reads break the assumption of
independent sampling from the library

Identify reads with identical start/stop positions
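
Duplicate detection by identical start/stop positions can be sketched with a dictionary keyed on the alignment coordinates. The `AlignedRead` fields are hypothetical, and keeping the copy with the highest mean base quality is an assumed tie-break rule, not something stated on the slide.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mean_quality: float


def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand); prefer the highest mean quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return list(best.values())


reads = [
    AlignedRead("r1", "chr9", 100, 175, "+", 30.0),
    AlignedRead("r2", "chr9", 100, 175, "+", 35.0),  # duplicate of r1, better quality
    AlignedRead("r3", "chr9", 101, 176, "+", 32.0),
]
kept = remove_duplicates(reads)
print([read.name for read in kept])
```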

Sequencer-Specific Error Models

If a base was miscalled, what is it most likely to be called
as instead?

Actual base (rows) vs predicted base (columns), % of miscalls:

Actual | A | C | G | T
A | - | 57.7 | 17.1 | 25.2
C | 34.9 | - | 11.3 | 53.9
G | 31.9 | 5.1 | - | 63.0
T | 45.9 | 22.1 | 32.0 | -

Variant Calling

• SNP Calls infested with False Positives
– Machine artifacts
– Mis-mapped reads
– Mis-aligned indels
• 5 – 20% false positive rate

Decisions and Trade-Offs
• Option 1: Use stringent program options for
calling variants and hard filtering early to
produce only highly-confident call set.
– Pro: Few false positives
– Con: Will miss real variants
• Option 2: Use less stringent (but reasonable)
options and filtering. Produce high-confidence
call set. Progressive filtering at later stage
– Pro: Won’t miss real variants
– Con: Many more false positives

How Good Are My Calls?

• How many called SNPs?
– Human average of 1 heterozygous SNP / 1000
bases
• Fraction of variants already in dbSNP
• Transition/Transversion ratio
– Transitions 2x as common
• 2.8x when looking only at exons
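
The transition/transversion ratio is a quick sanity check to compute on a call set. A transition swaps a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T); everything else is a transversion. The sketch below takes SNVs as (ref, alt) pairs; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = sum(1 for ref, alt in snvs
                        if ref != alt and not is_transition(ref, alt))
    if transversions == 0:
        return float("inf")
    return transitions / transversions


calls = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```

A call set whose ratio is well below the expected value is a hint that it contains many false positives, since random errors have no preference for transitions.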

Identifying Genetic Variation Causing
Genetic Disease

Discovering Genetic Variants Causing
Mendelian Disease

4 million genetic variants
-> 2 million associated with protein-coding genes
-> 10,000 possibly of disease-causing type
-> 1,500 with <1% frequency in population
-> Single causal genetic variant

If a problem cannot be
solved, enlarge it.
--Dwight D. Eisenhower

TYPES OF SINGLE NUCLEOTIDE
VARIANTS

Disease Genomics: Hunting Down
Pathogenic Genetic Variation

[Figure: reference gene (Exon 1, Intron 1, Exon 2) with start codon, splice sites, and TAA stop codon, and the mRNA coding for protein, compared with the patient's gene. Variant types illustrated: splice site loss, missense, missense/frameshift, and stop gain (TAC, Tyr, versus TAA, stop)]

Identifying Genetic Regions of Interest

Number of Genes in Genomic Regions
of Interest

Frequency of Polymorphisms:
Common vs Rare
• Mendelian disorders are caused by rare
variation, < 1-2% frequency in the relevant
population
• Leverage large projects aimed at assessing
genetic diversity in populations around the
world
– 1000 Genomes
– NHLBI Exome Sequencing Project

Population Matters
• Most variations in protein-coding genes occurred
fairly recently (last 20,000 years)
– Adaptation to agriculture and diet changes, pathogen
exposure and urban living
• Monogenic diseases have different prevalence in
different populations
– Cystic fibrosis in European populations
– Hereditary hemochromatosis in Northern Europeans
– Tay-Sachs in Ashkenazi Jews
– Sickle-cell anemia in sub-Saharan African populations
• Polygenic disorders

Exome Sequencing Project
• Multi-Institutional
• Total possible patient pool of > 250,000
individuals, well phenotyped
– Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
– 4420 European descent
– 2312 African American
• 1.2 million coding variations
– Most extremely rare/unique
– Many population specific

IGNITE Project: Local Controls
• IGNITE: Tasked with studying rare monogenic
diseases identified in Atlantic Canada
• Atlantic Canada harbours several non-represented
population groups and sub-groups…
– Acadians
– Native American
– Non-Acadian/European Descent

Population Frequency

• Mendelian disorders are rare
• If variation is in database, is it associated with
disease?
• Causal variation also needs to be rare
– Cutoff somewhere between < 0.5% and < 1.5%
– Should appear rarely or not at all in local controls
– Track with disease in family members under study
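
The three criteria above can be combined into a simple filter. This is a sketch with hypothetical field names (`population_freq`, `local_control_carriers`, `affected_carriers`, `affected_total`); the 1% default cutoff is one choice within the range quoted above, and "tracks with disease" is reduced here to "every affected family member under study carries the variant".

```python
def is_candidate(variant, max_population_freq=0.01, max_control_carriers=0):
    """Apply rarity, local-control, and family-segregation checks to one variant."""
    freq = variant.get("population_freq")  # None means absent from the databases
    if freq is not None and freq >= max_population_freq:
        return False  # too common in the reference population
    if variant.get("local_control_carriers", 0) > max_control_carriers:
        return False  # seen in local controls
    # must be carried by all affected family members under study
    return variant["affected_carriers"] == variant["affected_total"]


variants = [
    {"id": "v1", "population_freq": None, "local_control_carriers": 0,
     "affected_carriers": 3, "affected_total": 3},
    {"id": "v2", "population_freq": 0.12, "local_control_carriers": 0,
     "affected_carriers": 3, "affected_total": 3},
    {"id": "v3", "population_freq": 0.001, "local_control_carriers": 4,
     "affected_carriers": 3, "affected_total": 3},
    {"id": "v4", "population_freq": 0.001, "local_control_carriers": 0,
     "affected_carriers": 2, "affected_total": 3},
]
candidates = [v["id"] for v in variants if is_candidate(v)]
print(candidates)
```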

Predicting the Impact of Missense
Mutations
• Most use some level of evolutionary
conservation to determine how severe a
mutation is
– SIFT
– PolyPhen
– GERP++
– EvoD

Example: SIFT Algorithm

Input Query Sequence -> PSI-BLAST -> Homologs -> Multiple Sequence Alignment -> PSSM -> Normalize by most frequent AA -> Score
Predicting Impact
• Other approaches include additional features:
– Protein structure information
– Site level annotation (active sites, binding sites,
etc)
– Protein domain information
– Biophysical properties of amino acids in that
position and of the substituted amino acid

Prediction Take-Away

The more conserved a site is the more likely
any substitution is to be deleterious

However: Current methods have pretty poor
performance, not suitable for clinical-level
diagnosis

Classifying Genetic Variants

[Figure: decision tree, listed level by level]
4 million variants
-> Intronic | Exonic | Intergenic
-> Unknown | Splice Site | Silent Mutation | Splice Site | Amino Acid Changing
-> Potential Disease Causing | Potential Disease Causing
-> Known Polymorphism in Population | Known Genetic Disease Variant | Stop Loss / Stop Gain | Missense Mutation

Annotating Genes and Variants
• Is variant in a known protein-coding gene?
– What does the gene do?
– What molecular pathways?
– What protein-protein interactions?
– What tissues is it expressed in?
– When in development?

ADDING ANNOTATIONS TO
VARIANTS

Genomic Intervals, Searching, and
Annotation
• Most common way of describing genomic
features is as an interval
• Multiple formats (BED, WIG, VCF, etc)
• In common for all is location:
– Chromosome
– Start Position of Feature
– End Position of Feature
– Annotations/Info (Optional)

Searching and Annotating: Interval
Trees

• Interval Trees allow efficient searching of all
overlapping intervals
• Easiest to make one tree per chromosome
• Given a set of intervals (n) on a number line
(chromosome) construct a tree

Interval Trees

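Node Contains:
- Centre point
- Intervals overlapping the centre point, sorted by start
- The same intervals, sorted by end
Left subtree: all intervals to left of the centre point
Right subtree: all intervals to right of the centre point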
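
A centred interval tree along these lines can be sketched compactly. Intervals are (start, end, label) tuples with inclusive ends; the coordinates and gene labels in the example are invented, and the dictionary of trees reflects the "one tree per chromosome" suggestion. This is a minimal illustration that answers point queries (for example, the position of a variant), not a tuned implementation.

```python
class IntervalNode:
    def __init__(self, intervals):
        # Centre point: median of all interval endpoints
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query_point(self, pos):
        """Return every interval that contains pos."""
        hits = []
        if pos < self.center:
            for iv in self.by_start:      # ends are all >= centre > pos
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query_point(pos)
        elif pos > self.center:
            for iv in self.by_end:        # starts are all <= centre < pos
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query_point(pos)
        else:
            hits.extend(self.by_start)
        return hits


# One tree per chromosome (hypothetical features)
features = {"chr9": [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC")]}
trees = {chrom: IntervalNode(ivs) for chrom, ivs in features.items()}
print(sorted(label for _, _, label in trees["chr9"].query_point(175)))
print(trees["chr9"].query_point(350))
```

Keeping the node's intervals sorted both by start and by end lets a query stop scanning as soon as it passes the last interval that can contain the position.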

IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa

CASE STUDIES

IGNITE Data Pipeline and Integration
[Figure: gene annotations, annotated genomic variants, mapped region(s), gene definitions, known genes, and pathway and interaction data are integrated, then variants are filtered, sorted, and prioritized]

Brain Calcification
• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous
variants within region shared between two
patients
• 29 genes with at least one targeted region with
little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for
possible copy-number variations of targeted
regions using exome sequencing data

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:
120,962,282 -133,033,431

Cutis Laxa: Genetic Mapping

Chromosome 17:
79,596,811-81,041,077

Charcot-Marie-Tooth
• 143 genes in region
• 13 known causative genes
– MPZ
– PMP22
– GDAP1
– KIF1B
– MFN2
– SOX
– EGR2
– DNM2
– RAB7
– LITAF (SIMPLE)
– GARS
– YARS
– LMNA

Cutis Laxa
• 52 genes in region
• 5 known causative genes
– ATP6V0A2
– ELN
– FBLN5
– EFEMP2
– SCYL1BP1
– ALDH18A1

Pathway and Interaction Data
Charcot-Marie-Tooth
• 37 pathways
– Clathrin-derived vesicle budding
– Lysosome vesicle biogenesis
– Endocytosis
– Golgi-associated vesicle biogenesis
– Membrane trafficking
– Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2

Cutis Laxa
• 10 pathways
– Phagosome
– Collecting duct acid secretion
– Lysosome
– Protein digestion and absorption
– Metabolic pathways
– Oxidative phosphorylation
– Arginine and proline metabolism
• Primarily ATP6V0A2

Results: Charcot-Marie-Tooth
• 8 Genes Prioritized
Gene | Interactions | Pathway
LRSAM1 | Multiple | Endocytosis
DNM1 | DNM2 | -
FNBP1 | DNM2 | -
TOR1A | LMNA | -
STXBP1 | Multiple | Five
SH3GLB2 | - | Endocytosis
PIP5KL1 | - | Endocytosis
FAM125B | - | Endocytosis

• For more information
– Guernsey et al (2010) PLoS Genetics. 6(8): e1001081

Results: Cutis Laxa
• 10 genes prioritized
Gene | Interactions | Pathway
HEXDC | Multiple | Phagosome
HG5 | - | Phagosome
HG5 | Multiple | Lysosome, Protein digestion
SIRT7 | Multiple | Metabolic Pathways
FASN | - | Metabolic Pathways
DCXR | - | Metabolic Pathways
PYCR1 | - | Metabolic Pathways, Arginine/Proline
PCYT2 | - | Metabolic Pathways
ARHGDIA | - | Oxidative Phosphorylation
• For more information
– Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
