Next-Generation Sequence Analysis
   for Biomedical Applications

                BIOC 4010/5010
                   Lecture 1

                  Dr. Dan Gaston
   Postdoctoral Fellow Department of Pathology
               Dr. Karen Bedard Lab
         Bioinformatician, IGNITE Project
Introduction to Next-Gen Sequencing

LECTURE 1
Overview: Lecture 1
•   Introduction AKA “Why does this matter?”
•   “Next-Gen” Sequencing
•   Bioinformatics Workflows
•   Types of Next-Gen Experiments
•   Working with the Human Genome
•   Slides available on slideshare:
    – http://www.slideshare.net/DanGaston
Major Areas in Human Disease
             Genomics
• Complex diseases
  – Genome Wide Association Studies (GWAS)
• Cancer
  – Tumour genomics (Driver mutations)
  – Transcriptomics
• Mendelian disease
  – Whole Genome/Exome Sequencing
  – Transcriptomics
  – Genetic Linkage
Diagnosing Genetic Diseases
• Genetic Counselors/Physicians order
  individual testing of genes based on patient
  phenotype
• For rare diseases or unusual phenotypes may
  run tens to hundreds of tests
• …..EXPENSIVE (Easily thousands of dollars)
Genetic Disease Research
Genetic Disease Research: Cutis Laxa



                       Chromosome 9:
                       120,962,282 -133,033,431
Cutis Laxa
• Linked Genomic Region ~13Mb in size
• Contains 143 Genes
• Prioritize and select genes for individual
  Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive
Personalized Medicine
Human Genomics


• $5,000 - $10,000 to sequence whole genome
• $1000 to sequence only protein-coding
  portion (exome, later)
Clinical Genomics




• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic
  testing (traditional method)
Cancer Genomics




          Welch JS, et al. JAMA, 2011;305, 1577
Cancer Chemotherapy Resistance
Human Disease Genomics at Dalhousie
• IGNITE: Identifying genetic mutations causing
  rare mendelian diseases in Atlantic Canada
  – 3 year, $2.5 million Genome Canada Project
  – Currently working on >10 different diseases including
    two inherited cancers
  – Sequenced >20 individual exomes, 4 whole genomes,
    and several transcriptomes
  – More on Thursday…
• Dr. Graham Dellaire: Transcriptome sequencing
  and analysis on multiple cancer cell lines
Short Reads




Millions of paired “short
reads”, 75-150bp each
FastQ Format


[Figure: an example FASTQ record with its lines labelled: Read ID, Sequence, Quality line]
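The record layout named on this slide can be read with a few lines of Python. This is a minimal sketch, assuming the common four-line FASTQ layout (an `@`-prefixed read ID, the sequence, a `+` separator, and the quality line); the names `FastqRecord` and `parse_fastq` are illustrative and not from the lecture.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        # Every base must have exactly one quality character.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A made-up single-read example.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII5\n")
records = list(parse_fastq(example))
```

Reading lazily with a generator keeps memory use flat, which matters when a file holds millions of reads.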
FastQ Quality Scores

Quality Score (Q)   Probability of incorrect base call   Base call accuracy

       10                      1 in 10                         90%
       20                      1 in 100                        99%
       30                      1 in 1,000                      99.9%
       40                      1 in 10,000                     99.99%
       50                      1 in 100,000                    99.999%

Q = -10 log10(P), where P is the probability that the base call is incorrect
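The relationship between Q and P can be checked directly. The sketch below converts in both directions and decodes a quality line into scores; the ASCII offset of 33 is an assumption (the common Phred+33 encoding), and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_line(line: str, offset: int = 33) -> list[int]:
    """Turn each quality character into its integer score (offset assumed to be 33)."""
    return [ord(ch) - offset for ch in line]


scores = decode_quality_line("I5")
```

Under this encoding the character `I` stands for Q40 and `5` for Q20.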
Quality Scores of Sequencing Reads
General Genomics Workflow
  Step                   Description
  Raw Data Analysis      Quality control of raw data
  Whole Genome Mapping   Alignment to reference genome
  Variant Calling        Detection of genetic variation (SNPs, Indels, SV)
  Annotation             Linking variants to biological information
Short Read Mapping
    …CCAT   CTATATGCG       TCGGAAATT  CGGTATAC
    …CCAT GGCTATATG     CTATCGGAAA    GCGGTATA
    …CCA AGGCTATAT     CCTATCGGA    TTGCGGTA   C…
    …CCA AGGCTATAT   GCCCTATCG     TTTGCGGT    C…
    …CC AGGCTATAT    GCCCTATCG AAATTTGC     ATAC…
    …CC TAGGCTATA GCGCCCTA      AAATTTGC GTATAC…
    …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…




1) Report location of genome where read matches best
2) Minimize mismatches
3) Mismatches at lower quality bases are penalized less than
   mismatches at higher quality bases
Discovering Genetic Variation
SNPs
       ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
       ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
                CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
                 GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
                   TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
                   TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
                   TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
                        GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
                         TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
       ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
       reference genome   TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
                             TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
                                   ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT
                                    TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC
                                    TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
                                       GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA
                                            AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA
                                               TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG
                                                 TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG

                                                                        INDELs
Next-Gen Sequencing Experiments
•   Whole Genome Sequencing
•   Targeted Exome Sequencing
•   RNA-Seq
•   ChIP-Seq
•   CLIP-Seq
Composition of Human Genome
                Size: 3.2 Gb
Genomic Content
Chromosome   Base pairs    Variations   Confirmed proteins   Putative proteins   Pseudogenes   miRNA   rRNA   Misc ncRNA
    1        249,250,621   4,401,091          2,012                 31              1,130       134     66        106
    2        243,199,373   4,607,702          1,203                 50               948        115     40         93
    3        198,022,430   3,894,345          1,040                 25               719         99     29         77
    4        191,154,276   3,673,892           718                  39               698         92     24         71
    5        180,915,260   3,436,667           849                  24               676         83     25         68
    6        171,115,067   3,360,890          1,002                 39               731         81     26         67
    7        159,138,663   3,045,992           866                  34               803         90     24         70
    8        146,364,022   2,890,692           659                  39               568         80     28         42
    9        141,213,431   2,581,827           785                  15               714         69     19         55
    10       135,534,747   2,609,802           745                  18               500         64     32         56
    11       135,006,516   2,607,254          1,258                 48               775         63     24         53
    12       133,851,895   2,482,194          1,003                 47               582         72     27         69
    13       115,169,878   1,814,242           318                   8               323         42     16         36
    14       107,349,540   1,712,799           601                  50               472         92     10         46
    15       102,531,392   1,577,346           562                  43               473         78     13         39
    16       90,354,753    1,747,136           805                  65               429         52     32         34
    17       81,195,210    1,491,841          1,158                 44               300         61     15         46
    18       78,077,248    1,448,602           268                  20                59         32     13         25
    19       59,128,983    1,171,356          1,399                 26               181        110     13         15
    20       63,025,520    1,206,753           533                  13               213         57     15         34
    21       48,129,895     787,784            225                   8               150         16      5         8
    22       51,304,566     745,778            431                  21               308         31      5         23
    X        155,270,560   2,174,952           815                  23               780        128     22         52
    Y        59,373,566     286,812             45                   8               327         15      7         2
  mtDNA        16,569         929               13                   0                0          0       2         22
Exome Sequencing
Transcriptomics: RNA-Seq
• Sequence the actively transcribed genes in a
  cell line or tissue
  – Only about 20% of genes are transcribed in
    particular cell types
• Two types:
  – Poly-A selection
  – Total RNA + ribodepletion
• Many experimental questions can be
  addressed
RNA-Seq: Gene Expression
         Condition 1




         Condition 2
RNA-Seq: Differential Splicing




Exon1             Exon 2             Exon 3
RNA-Seq: Novel/Non-Canonical Exon
            Discovery




Exon1          Exon 2   Exon X   Exon 3
RNA-Seq: Gene Fusion Events




Exon1              Exon 2         Exon 3




              Gene 2 Exon 4
RNA-Seq
• Important to take into account biological
  variability. A sample of cells is a mixed population
   – Replicates!
• Not suited for discovering polymorphisms due to
  higher error rates introduced by the reverse
  transcription step (RNA -> cDNA)
• High false positive rates for fusion gene and
  novel exon discovery when expression levels are low
ChIP-Seq
Short Read Mapping: Placing Millions
      of Reads on Human Reference

• Problem: Efficiently place millions of reads
  (75bp – 200bp) accurately within 3.2Gb of
  reference genome
• Problem: Read may match equally well at
  more than one location (pseudogenes, copy
  number variation, repetitive elements)
• Problem: Sequencing reads may be paired
Short Read Mapping: Brute Force
                Method




Simple conceptually: Compare each query k-mer to all k-mers of genome

Genome Size (N): 3.2 billion bases
K-mer length (M): 7
Number of comparisons ((N - M + 1) * M): ~22 billion
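The comparison count on this slide is a worst-case bound for the naive scan. A small sketch makes it concrete: the function below slides the query along the genome and counts character comparisons as it goes. The toy genome and the name `brute_force_find` are illustrative only.

```python
def brute_force_find(genome: str, query: str) -> tuple[list[int], int]:
    """Return (0-based match positions, number of character comparisons)."""
    hits = []
    comparisons = 0
    m = len(query)
    for start in range(len(genome) - m + 1):
        matched = True
        for offset in range(m):
            comparisons += 1
            if genome[start + offset] != query[offset]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            hits.append(start)
    return hits, comparisons


hits, comparisons = brute_force_find("TGATTACAGATTACC", "GATTACA")
worst_hits, worst_comparisons = brute_force_find("AAAAAAAA", "AAA")
```

Stopping at the first mismatch usually does far fewer than (N - M + 1) * M comparisons, but a repetitive sequence such as the second example reaches the bound exactly.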
Solution


      Index the Reference Genome

Indexing the reference is like constructing a phone
book: it lets you move quickly to the relevant portion of the
genome and ignore the rest.
Short Read Alignment: Suffix Array
Split genome into all suffixes (substrings) and sort
alphabetically

Allows query to be searched against an alphabetical
reference, skipping 96% of the genome

Ex: banana
Suffixes:                              Sorted:
banana                                 a
anana                                  ana
nana                                   anana
ana                                    banana
na                                     na
a                                      nana
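The banana example can be reproduced in a couple of lines. This is a naive sketch that sorts suffix start positions by the suffix text; it is fine for a toy string, but it is an assumption-laden illustration and not how a genome-scale index would be built.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions (0-based) of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


word = "banana"
suffix_array = build_suffix_array(word)
sorted_suffixes = [word[i:] for i in suffix_array]
```

Storing positions rather than the suffixes themselves is what makes the structure an array of integers into the original text.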
Short Read Alignment: Binary Search
 • Searching the index efficiently is still a
   problem…
                                #   Suffix                  Pos
Search for GATTACA…             1   ACAGATTACC…               6
                                2   ACC…                     13
                                3   AGATTACC…                 8
                                4   ATTACAGATTACC…            3
                                5   ATTACC…                  10
                                6   C…                       15
                                7   CAGATTACC…                7
                                8   CC…                      14
                                9   GATTACAGATTACC…           2
                               10   GATTACC…                  9
                               11   TACAGATTACC…              5
                               12   TACC…                    12
                               13   TGATTACAGATTACC…          1
                               14   TTACAGATTACC…             4
                               15   TTACC…                   11
Binary Search

• Initialize search range to entire list (lo = first entry, hi = last entry)
   – mid = (hi+lo)/2, rounded down; middle = suffix[mid]
   – if query matches middle: done
   – else if query < middle: continue in the lower range (hi = mid - 1)
   – else if query > middle: continue in the upper range (lo = mid + 1)
• Repeat until done or empty range
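The steps above can be sketched in Python against the toy text from the earlier table. The suffix array is built naively here, the function name is illustrative, and the search returns one matching position (0-based) rather than all of them.

```python
from typing import Optional


def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_in_suffix_array(text: str, suffix_array: list[int], query: str) -> Optional[int]:
    """Binary search for a suffix that starts with `query`."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare the query only against the first len(query) characters.
        prefix = text[start:start + len(query)]
        if prefix == query:
            return start
        if query < prefix:
            hi = mid - 1  # continue in the lower range
        else:
            lo = mid + 1  # continue in the upper range
    return None  # empty range: no match


text = "TGATTACAGATTACC"
sa = build_suffix_array(text)
position = find_in_suffix_array(text, sa, "GATTACA")
missing = find_in_suffix_array(text, sa, "CCC")
```

The match is found at 0-based position 1, which is position 2 in the 1-based table. Each step halves the range, so the search takes roughly log2 of the number of suffixes.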
Applied to Human Genome
• In practice simple methods of indexing the
  genome can create very large data structures
  – Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow
  you to index and compress the data:
  – Burrows-Wheeler Transform
  – FM-Index
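To give a feel for the Burrows-Wheeler Transform, here is a toy sketch that builds it from sorted rotations and inverts it again. This is the textbook construction on a tiny string, assuming a `$` sentinel that sorts before every letter; real aligners build the transform and the FM-Index with far more memory-efficient methods.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Last column of the sorted rotations of text + sentinel."""
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(bwt("GATTACA"))
```

The transform tends to group identical characters into runs, which is why it compresses well, and it is reversible, so no information about the genome is lost.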
Short Read Mapping: Mapping Quality
• Have also ignored quality scores of reads
• Mapping Quality (for a read): Sum the quality
  scores at mismatched bases for alignment
  (SUM_BASE_Q(best)), also consider all other
  possible alignments

     MQ = -\log_{10}\left(1 - \frac{10^{-\mathrm{SUM\_BASE\_Q}(\mathrm{best})}}{\sum_i 10^{-\mathrm{SUM\_BASE\_Q}(i)}}\right)
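The expression above can be evaluated with a short function. This is a sketch under stated assumptions: with `scale=1.0` it follows the slide's formula literally, while `scale=10.0` would apply the Phred convention (Q = -10 log10 P) to both the exponents and the result. The `cap` value and the function name are hypothetical.

```python
import math


def mapping_quality(mismatch_quality_sums: list[float],
                    scale: float = 1.0,
                    cap: float = 60.0) -> float:
    """Score the best alignment against all candidate alignments.

    mismatch_quality_sums holds SUM_BASE_Q(i) for every candidate i;
    the best alignment is the one with the smallest sum.
    """
    weights = [10 ** (-s / scale) for s in mismatch_quality_sums]
    best = min(range(len(weights)), key=lambda i: mismatch_quality_sums[i])
    # 1 - best/total equals others/total; summing the others directly
    # avoids losing precision when the best alignment dominates.
    others = sum(w for i, w in enumerate(weights) if i != best)
    p_wrong = others / sum(weights)
    if p_wrong == 0:
        return cap  # unique alignment: the formula would be infinite
    return min(-scale * math.log10(p_wrong), cap)


two_candidates = mapping_quality([2, 4])
equal_candidates = mapping_quality([3, 3])
unique = mapping_quality([5])
```

Two equally good placements give a score near zero, which matches the idea that a read mapping to multiple locations equally well carries no positional information.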
Short Read Aligners
•   BLAT: BLAST-Like Alignment Tool
•   MAQ: First to take into account quality scores
•   BWA: Among the first to use Burrows-Wheeler Transform
•   Bowtie: Ungapped alignment only
•   Bowtie2: Allows indels
•   … and many more
Identifying and Annotating Genomic Variation for Disease Gene Discovery

LECTURE 2
Genetic Variation
• dbSNP (NCBI) catalogues > 53 million Single
  Nucleotide Variations (SNVs) in humans
  – 38 million validated
  – 22 million in genes
  – 36 million with frequencies
• 50-80% of mutations involved in inherited
  disease caused by SNVs
SNP vs SNV
• Technically a polymorphism is a variation that
  doesn’t cause disease and is common in a
  population
• What is common?
  – Greater than 5% in a population is a typical
    definition
  – Definitions of rare range from < 0.5% to < 1.5%
Variant Calling: The Absurdly Simple
                 Way
 Read depth at base: 10                     T: 4                     A: 6

                       Genotype: Heterozygous A/T


                             TTCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
                       TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
                    TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
  ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
  ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
           CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
            GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
              TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
              TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
              TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
                   GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
                    TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
  ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
  reference genome
Variant Calling: The Absurdly Simple
                   Way
• Algorithm:
  – Count all aligned bases that pass quality threshold
    (e.g. >Q20)
  – If #reads with alternative base > lower bound (20%)
    and < upper bound (80%) call heterozygous alt
  – Else if > upper bound call homozygous alternative
  – Else call homozygous reference
• …But can base qualities be used for more than
  deciding which reads to keep?
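The counting algorithm above fits in a short function. This is a minimal sketch, assuming the pileup at one position is given as (base, quality) pairs; the function name, the handling of a fraction exactly on the upper bound, and the "no call" result are assumptions not stated on the slide.

```python
from collections import Counter


def call_genotype(pileup: list[tuple[str, int]], ref_base: str,
                  min_quality: int = 20,
                  lower: float = 0.2, upper: float = 0.8) -> str:
    """Call a genotype from base counts at a single position."""
    # Keep only bases that pass the quality threshold (> Q20).
    bases = [base for base, quality in pileup if quality > min_quality]
    if not bases:
        return "no call"
    alt_counts = Counter(base for base in bases if base != ref_base)
    if not alt_counts:
        return f"{ref_base}/{ref_base}"
    alt_base, alt_n = alt_counts.most_common(1)[0]
    fraction = alt_n / len(bases)
    if lower < fraction < upper:
        return f"{ref_base}/{alt_base}"   # heterozygous
    if fraction >= upper:                  # boundary handling is an assumption
        return f"{alt_base}/{alt_base}"   # homozygous alternative
    return f"{ref_base}/{ref_base}"       # homozygous reference


# The slide's example: depth 10, A: 6, T: 4, reference base A.
slide_example = call_genotype([("A", 30)] * 6 + [("T", 30)] * 4, "A")
low_quality_alt = call_genotype([("A", 30)] * 6 + [("T", 10)] * 4, "A")
```

Note that base quality only acts as a keep-or-discard filter here; a Q21 base and a Q40 base count equally, which is the weakness the last bullet points at.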
Improving Variant Calling


• MAQ (Mapping and Assembling with Quality):
  – Short Read Mapper and Genotype Caller
  – First to use base qualities for either
  – Introduced mapping quality
Improving Variant Calling

① Base quality cannot be more reliable than
  mapping quality of read
② An individual can have at most two real
  nucleotides at a position (two alleles)
  ① Only consider two most frequent nucleotides
  ② Simplify to two states: A and B
Improving Variant Calling
• Three Possible Genotypes:
  – AA, BB, AB
• Construct a model that includes base quality
  to estimate the probability of error
• Calculate the probability of each genotype
  given the data and error rate
• Genotype with highest probability is called
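The Model
[Equation image not preserved: the probability of the observed data given the genotype, with one term for reads that match the reference and one term for reads that don't match the reference]

g = genotype    e = error probability
m = ploidy (2)
k = number of reads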
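The slide's equation did not survive extraction, so the sketch below is an assumption: it uses the genotype likelihood in the form published for samtools (Li, 2011), which uses the same symbols. There, g counts the reference alleles in the genotype (0, 1 or 2 for a diploid), each read contributes one factor depending on whether it matches the reference, and the product is divided by m^k. Priors are left out, so the values are likelihoods of the data given g, not posterior probabilities.

```python
def genotype_likelihoods(observations: list[tuple[bool, float]],
                         ploidy: int = 2) -> dict[int, float]:
    """Likelihood of the reads for each genotype g (number of reference alleles).

    observations: one (matches_reference, error_probability) pair per read.
    """
    k = len(observations)
    likelihoods = {}
    for g in range(ploidy + 1):
        value = 1.0
        for matches_reference, e in observations:
            if matches_reference:
                # Reads that match the reference.
                value *= (ploidy - g) * e + g * (1 - e)
            else:
                # Reads that don't match the reference.
                value *= (ploidy - g) * (1 - e) + g * e
        likelihoods[g] = value / ploidy ** k
    return likelihoods


# Six reads matching the reference and four that do not, all at Q30 (e = 0.001).
reads = [(True, 0.001)] * 6 + [(False, 0.001)] * 4
likelihoods = genotype_likelihoods(reads)
best_genotype = max(likelihoods, key=likelihoods.get)
```

For this 6/4 split the heterozygous genotype (g = 1) has by far the highest likelihood, because either homozygous genotype would require several independent high-quality errors.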
Improving Variant Calling
• Two widely used tool sets for calling variants
  – samtools (uses MAQ-type calculation)
  – Genome Analysis Toolkit (GATK)
    UnifiedGenotyper
• UnifiedGenotyper: Capable of calling both
  indels and single nucleotide polymorphisms
  (SNPs) and allele frequencies given multiple
  samples
UnifiedGenotyper
Apply filters to discard poor reads and remove
biases:
  ① Duplicate reads
  ② Malformed reads (i.e. mismatch in #bases and base
    qualities)
  ③ Bad mate (paired-end sequencing, paired reads map
    to different chromosomes)
  ④ Mapping quality zero (maps to multiple locations
    equally well)
  ⑤ Reads with 10% or more mismatches within 20bp to
    either side of position
Remove Duplicate Reads
Application    Avg #Molecules/Library   Read Length   Avg #Molecules Sampled   Molecules Sampled > 1
30X Genome     5bn                      2x100         450m                     4.4%
4X Genome      5bn                      2x100         60m                      0.6%
100X Exome     500m                     2x75          20m                      2.0%




Duplicate reads break the assumption of
independent sampling from the library

Identify reads with identical start/stop positions
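Identifying duplicates by position can be sketched as a grouping step. The record type, its field names, and the rule of keeping the read with the highest mean quality from each group are assumptions made for illustration; the slide only says that reads with identical start/stop positions are identified.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    mean_quality: float


def remove_duplicates(reads: list[AlignedRead]) -> list[AlignedRead]:
    """Keep one read per (chrom, start, end); here the highest-quality one."""
    best: dict[tuple[str, int, int], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.start, read.end)
        kept = best.get(key)
        if kept is None or read.mean_quality > kept.mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))


reads = [
    AlignedRead("r1", "chr1", 100, 200, 30.0),
    AlignedRead("r2", "chr1", 100, 200, 35.0),  # same coordinates as r1
    AlignedRead("r3", "chr1", 150, 250, 20.0),
]
unique_reads = remove_duplicates(reads)
```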
Sequencer-Specific Error Models

  If a base was miscalled, what is it most likely to be called
  as instead?

                       Predicted Base (%)
Actual Base        A        C        G        T
     A             -       57.7     17.1     25.2
     C            34.9      -       11.3     53.9
     G            31.9     5.1       -       63.0
     T            45.9     22.1     32.0      -
Variant Calling


• SNP Calls infested with False Positives
  – Machine artifacts
  – Mis-mapped reads
  – Mis-aligned indels
• 5 – 20% false positive rate
Decisions and Trade-Offs
• Option 1: Use stringent program options for
  calling variants and hard filtering early to produce
  only highly-confident call set.
   – Pro: Few false positives
   – Con: Will miss real variants
• Option 2: Use less stringent (but reasonable)
  options and filtering. Produce high-confidence
  call set. Progressive filtering at later stage
   – Pro: Won’t miss real variants
   – Con: Many more false positives
How Good Are My Calls?

• How many called SNPs?
  – Human average of 1 heterozygous SNP / 1000
    bases
• Fraction of variants already in dbSNP
• Transition/Transversion ratio
  – Transitions are about 2x as common as transversions genome-wide
     • About 2.8x when looking only at exons
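The transition/transversion check is easy to compute from a list of called SNVs. A minimal sketch, assuming each call is a (reference base, alternative base) pair with the two bases different; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines (A<->G) or within pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

A call set whose ratio falls well below the expected value suggests many false positives, since random errors have no bias toward transitions.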
ANNOTATING VARIANTS
Identifying Genetic Variation Causing
           Genetic Disease
Discovering Genetic Variants Causing Mendelian Disease

4 million genetic variants
  → 2 million associated with protein-coding genes
    → 10,000 possibly of disease-causing type
      → 1,500 with <1% frequency in population
        → Single causal genetic variant
If a problem cannot be
solved, enlarge it.
       --Dwight D. Eisenhower
TYPES OF SINGLE NUCLEOTIDE
VARIANTS
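Disease Genomics: Hunting Down Pathogenic Genetic Variation
[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) running from Start to the TAA stop codon, with splice sites marked and the mRNA coding for protein. Patient variants shown: Missense/Frameshift, Splice Site Loss, Stop Gain, and loss of the stop codon (TAA changed to TAC, Tyr).]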
GENETIC REGIONS OF INTEREST
Identifying Genetic Regions of Interest
Number of Genes in Genomic Regions
            of Interest
FREQUENCY OF GENETIC VARIANTS
Frequency of Polymorphisms:
          Common vs Rare
• Mendelian disorders are caused by rare
  variation, < 1-2% frequency in the relevant
  population
• Leverage large projects aimed at assessing
  genetic diversity in populations around the
  world
  – 1000 Genomes
  – NHLBI Exome Sequencing Project
Human Populations
Population Matters
• Most variations in protein-coding genes occurred
  fairly recently (last 20,000 years)
  – Adaptation to agriculture and diet changes, pathogen
    exposure and urban living
• Monogenic diseases have different prevalence in
  different populations
  –   Cystic fibrosis in European populations
  –   Hereditary hemochromatosis in Northern Europeans
  –   Tay-Sachs in Ashkenazi Jews
  –   Sickle-cell anemia in Sub-Saharan African populations
• Polygenic disorders
1000 Genomes Project
Exome Sequencing Project
• Multi-Institutional
• Total possible patient pool of > 250,000
  individuals, well phenotyped
  – Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
  – 4420 European descent
  – 2312 African American
• 1.2 million coding variations
  – Most extremely rare/unique
  – Many population specific
IGNITE Project: Local Controls
• IGNITE: Tasked with studying rare monogenic
  diseases identified in Atlantic Canada
• Atlantic Canada harbours several unrepresented
  population groups and subgroups…
  – Acadians
  – Native American
  – Non-Acadian/European Descent
Population Frequency

• Mendelian disorders are rare
• If a variant is already in a database, is it associated with
  disease?
• A causal variant also needs to be rare
   – Frequency cutoff typically somewhere between 0.5% and 1.5%
   – Should appear rarely or not at all in local controls
   – Should track with disease in the family members under study
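The three criteria above can be sketched as a simple filter. This is only an illustration, not the IGNITE pipeline: the `Variant` record, its field names, the example variants, and the 1% default cutoff are all assumptions made for the example, and segregation with disease is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical variant record; field names are illustrative only."""
    name: str
    population_freq: float     # allele frequency in a reference database (0.0 if absent)
    local_control_count: int   # number of local control exomes carrying the variant
    tracks_with_disease: bool  # consistent with inheritance in the family under study


def is_candidate(v, max_freq=0.01, max_local=0):
    """Keep variants that are rare, absent from local controls, and track with disease."""
    rare = v.population_freq < max_freq
    rare_locally = v.local_control_count <= max_local
    return rare and rare_locally and v.tracks_with_disease


# Made-up variants for demonstration.
variants = [
    Variant("var_A", 0.0, 0, True),     # novel, segregates: kept
    Variant("var_B", 0.12, 3, True),    # common polymorphism: dropped
    Variant("var_C", 0.002, 0, False),  # rare but does not segregate: dropped
]
candidates = [v.name for v in variants if is_candidate(v)]
print(candidates)
```

Raising `max_local` above zero would tolerate a causal variant that appears "rarely" in local controls, as a recessive carrier might.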
Predicting the Impact of Missense
               Mutations
• Most use some level of evolutionary
  conservation to determine how severe a
  mutation is
  – SIFT
  – PolyPhen
  – GERP++
  – EvoD
Example: SIFT Algorithm

Input Query Sequence
  → (Psi-BLAST) → Homologs
  → (Alignment) → Multiple Sequence Alignment

Multiple Sequence Alignment
  → PSSM
  → (Normalize by most frequent AA) → Score
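The normalization step in the flow above can be illustrated with a toy calculation. This is a deliberately simplified sketch and not the SIFT implementation: it uses raw column frequencies where the real tool builds a weighted, pseudocount-corrected PSSM, and the four-sequence alignment and function names are hypothetical.

```python
from collections import Counter


def normalized_scores(column):
    """Score each amino acid observed in one alignment column.

    Each residue's count is divided by the count of the most frequent
    residue, so the most frequent residue always scores 1.0.
    """
    residues = [aa for aa in column if aa != "-"]  # ignore gap characters
    counts = Counter(residues)
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def score_substitution(alignment, position, new_aa):
    """Normalized score of new_aa at a 0-based alignment column (0.0 if never observed)."""
    column = [seq[position] for seq in alignment]
    return normalized_scores(column).get(new_aa, 0.0)


# Toy multiple sequence alignment (made-up sequences, query first).
alignment = [
    "MKLV",
    "MKIV",
    "MRLV",
    "MKLA",
]

print(score_substitution(alignment, 0, "M"))  # fully conserved residue
print(score_substitution(alignment, 0, "W"))  # never observed at a conserved site
print(score_substitution(alignment, 2, "I"))  # tolerated in one homolog
```

A substitution to a residue never seen at a fully conserved column gets the lowest score, which matches the intuition that changes at conserved sites are more likely to be deleterious.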
Predicting Impact
• Other approaches include additional features:
  – Protein structure information
  – Site-level annotation (active sites, binding sites,
    etc.)
  – Protein domain information
  – Biophysical properties of amino acids in that
    position and of the substituted amino acid
Prediction Take-Away

The more conserved a site is, the more likely
any substitution is to be deleterious

However: current methods perform poorly and are
not suitable for clinical-level diagnosis
Classifying Genetic Variants

[Diagram: decision tree for classifying 4 million variants]
  Level 1 (location): Intronic | Exonic | Intergenic
  Level 2 (effect): Unknown | Splice Site | Amino Acid Changing | Silent Mutation | Splice Site
  Level 3: Potential Disease Causing (two branches)
  Level 4: Known Genetic Disease Variant | Stop Loss / Stop Gain | Missense Mutation | Known Polymorphism in Population
GENE LEVEL ANNOTATION
Annotating Genes and Variants
• Is variant in a known protein-coding gene?
  – What does the gene do?
  – What molecular pathways?
  – What protein-protein interactions?
  – What tissues is it expressed in?
  – When in development?

Filtering funnel:
  4 million genetic variants
  → 2 million associated with protein-coding genes
  → 10,000 possibly of disease-causing type
  → 1,500 with < 1% frequency in population
Gene Level Annotations
ADDING ANNOTATIONS TO
VARIANTS
Genomic Intervals, Searching, and
            Annotation
• Most common way of describing genomic
  features is as an interval
• Multiple formats (BED, WIG, VCF, etc.)
• Common to all formats is a location:
  – Chromosome
  – Start Position of Feature
  – End Position of Feature
  – Annotations/Info (Optional)
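As a concrete illustration of the interval representation, the sketch below parses tab-separated BED-style lines and tests two features for overlap. The `Interval` type, function names, and example coordinates are assumptions made for the example. It relies on the BED convention of 0-based, half-open coordinates; VCF positions are 1-based, so they would need converting before being compared this way.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    info: str   # optional annotation, "" if absent


def parse_bed_line(line):
    """Parse the three required BED columns plus an optional name column."""
    fields = line.rstrip("\n").split("\t")
    info = fields[3] if len(fields) > 3 else ""
    return Interval(fields[0], int(fields[1]), int(fields[2]), info)


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


gene = parse_bed_line("chr17\t100\t200\tgene_X")  # hypothetical feature
variant = parse_bed_line("chr17\t150\t151")       # single-base variant
elsewhere = parse_bed_line("chr9\t150\t151")      # same position, other chromosome

print(overlaps(gene, variant))
print(overlaps(gene, elsewhere))
```

Checking every variant against every feature this way is quadratic, which is why an indexed structure is used for whole-genome annotation.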
Searching and Annotating: Interval
                 Trees


• Interval Trees allow efficient searching of all
  overlapping intervals
• Easiest to make one tree per chromosome
• Given a set of n intervals on a number line
  (a chromosome), construct a tree
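Interval Trees

Each node contains:
  - Centre point
  - Intervals overlapping the centre point, sorted by start
  - The same intervals, sorted by end
Left subtree:  all intervals entirely to the left of the centre point
Right subtree: all intervals entirely to the right of the centre point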
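A minimal sketch of such a tree, assuming closed intervals given as `(start, end, label)` tuples and the median endpoint as the centre point. The class name, gene labels, and coordinates are hypothetical, and only point queries are shown, for example looking up which features contain a variant position.

```python
class IntervalNode:
    """Node of a centered interval tree over closed intervals (start, end, label)."""

    def __init__(self, intervals):
        # Centre point: median of all interval endpoints in this subtree.
        points = sorted(p for iv in intervals for p in iv[:2])
        self.centre = points[len(points) // 2]

        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]

        # Intervals spanning the centre are kept in both sort orders.
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, pos):
        """Return every stored interval that contains pos."""
        hits = []
        if pos < self.centre:
            # All intervals here end at or after the centre, so only starts matter.
            for iv in self.by_start:
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query(pos)
        elif pos > self.centre:
            # All intervals here start at or before the centre, so only ends matter.
            for iv in self.by_end:
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query(pos)
        else:
            hits.extend(self.by_start)
        return hits


# One tree per chromosome (made-up features).
features = {
    "chr9": [(100, 200, "gene_A"), (150, 300, "gene_B"),
             (400, 500, "gene_C"), (50, 60, "gene_D")],
}
trees = {chrom: IntervalNode(ivs) for chrom, ivs in features.items()}

hits = sorted(iv[2] for iv in trees["chr9"].query(175))
print(hits)
```

Keeping the two sorted lists is what lets a query stop scanning a node as soon as the first non-matching interval is reached.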
IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa

CASE STUDIES
IGNITE Data Pipeline and Integration

[Diagram] Inputs: Mapped Region(s), Known Genes, Gene Definitions,
Gene Annotations, Pathway and Interactions, Annotated Genomic Variants
  → Filter → Sort → Prioritize
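The filter, sort, and prioritize stage can be sketched as follows. This is not the IGNITE code: the scoring rule (one point per interaction with a known disease gene, one point per shared pathway), the function name, and all gene and pathway names in the example are assumptions chosen for illustration.

```python
def prioritize(genes_in_region, known_genes, interactions, pathways):
    """Rank candidate genes in a mapped region.

    A candidate earns one point for each known disease gene it interacts
    with and one point for each pathway it shares with a known disease gene.
    """
    known_pathways = set()
    for g in known_genes:
        known_pathways |= pathways.get(g, set())

    ranked = []
    for gene in genes_in_region:
        if gene in known_genes:
            continue
        partners = interactions.get(gene, set()) & known_genes
        shared = pathways.get(gene, set()) & known_pathways
        score = len(partners) + len(shared)
        if score > 0:  # filter: drop genes with no link to known disease genes
            ranked.append((score, gene, sorted(partners), sorted(shared)))

    # sort: highest score first, ties broken alphabetically by gene name
    ranked.sort(key=lambda r: (-r[0], r[1]))
    return ranked


# Made-up inputs for demonstration.
genes_in_region = ["GENE1", "GENE2", "GENE3", "GENE4"]
known_genes = {"KNOWN_A", "KNOWN_B"}
interactions = {
    "GENE1": {"KNOWN_A"},
    "GENE2": {"OTHER"},
    "GENE3": {"KNOWN_A", "KNOWN_B"},
}
pathways = {
    "KNOWN_A": {"Endocytosis"},
    "GENE1": {"Endocytosis"},
    "GENE4": {"Lysosome"},
}

ranked = prioritize(genes_in_region, known_genes, interactions, pathways)
for score, gene, partners, shared in ranked:
    print(gene, score, partners, shared)
```

In practice the candidate list would first be restricted to genes carrying rare, potentially damaging variants, so this ranking only orders what survives the earlier filters.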
Brain Calcification
Brain Calcification
• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous
  variants within region shared between two
  patients
• 29 genes with at least one targeted region with
  little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for
  possible copy-number variations of targeted
  regions using exome sequencing data
Brain Calcification
Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9: 120,962,282-133,033,431
Cutis Laxa: Genetic Mapping

Chromosome 17: 79,596,811-81,041,077
Charcot-Marie-Tooth          Cutis Laxa
• 143 genes in region        • 52 genes in region
• 13 known causative genes   • 5 known causative genes
   – MPZ                        – ATP6V0A2
   – PMP22                      – ELN
   – GDAP1                      – FBLN5
   – KIF1B                      – EFEMP2
   – MFN2                       – SCYL1BP1
   – SOX                        – ALDH18A1
   – EGR2
   – DNM2
   – RAB7
   – LITAF (SIMPLE)
   – GARS
   – YARS
   – LMNA
Pathway and Interaction Data
Charcot-Marie-Tooth
• 37 pathways
  – Clathrin-derived vesicle budding
  – Lysosome vesicle biogenesis
  – Endocytosis
  – Golgi-associated vesicle biogenesis
  – Membrane trafficking
  – Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2

Cutis Laxa
• 10 pathways
  – Phagosome
  – Collecting duct acid secretion
  – Lysosome
  – Protein digestion and absorption
  – Metabolic pathways
  – Oxidative phosphorylation
  – Arginine and proline metabolism
• Primarily ATP6V0A2
Results: Charcot-Marie-Tooth
• 8 genes prioritized

Gene       Interactions   Pathway
LRSAM1     Multiple       Endocytosis
DNM1       DNM2           -
FNBP1      DNM2           -
TOR1A      LMNA           -
STXBP1     Multiple       Five
SH3GLB2    -              Endocytosis
PIP5KL1    -              Endocytosis
FAM125B    -              Endocytosis

• For more information
   – Guernsey et al. (2010) PLoS Genetics. 6(8): e1001081
Results: Cutis Laxa
• 10 genes prioritized
Gene       Interactions   Pathway
HEXDC      Multiple       Phagosome
HG5        -              Phagosome
HG5        Multiple       Lysosome, Protein digestion
SIRT7      Multiple       Metabolic Pathways
FASN       -              Metabolic Pathways
DCXR       -              Metabolic Pathways
PYCR1      -              Metabolic Pathways, Arginine/Proline
PCYT2      -              Metabolic Pathways
ARHGDIA    -              Oxidative Phosphorylation

• For more information
    – Guernsey et al. (2009) Am J Hum Genet. 85(1): 120-9

Bioc4010 lectures 1 and 2

  • 1. Next-Generation Sequence Analysis for Biomedical Applications BIOC 4010/5010 Lecture 1 Dr. Dan Gaston Postdoctoral Fellow Department of Pathology Dr. Karen Bedard Lab Bioinformatician, IGNITE Project
  • 2. Introduction to Next-Gen Sequencing LECTURE 1
  • 3. Overview: Lecture 1 • Introduction AKA “Why does this matter?” • “Next-Gen” Sequencing • Bioinformatics Workflows • Types of Next-Gen Experiments • Working with the Human Genome • Slides available on slideshare: – http://www.slideshare.net/DanGaston
  • 4. Major Areas in Human Disease Genomics • Complex diseases – Genome Wide Association Studies (GWAS) • Cancer – Tumour genomics (Driver mutations) – Transcriptomics • Mendelian disease – Whole Genome/Exome Sequencing – Transcriptomics – Genetic Linkage
  • 5. Diagnosing Genetic Diseases • Genetic Counselors/Physicians order individual testing of genes based on patient phenotype • For rare diseases or unusual phenotypes may run tens to hundreds of tests • …..EXPENSIVE (Easily thousands of dollars)
  • 7. Genetic Disease Research: Cutis Laxa Chromosome 9: 120,962,282 -133,033,431
  • 8. Cutis Laxa • Linked Genomic Region ~13Mb in size • Contains 143 Genes • Prioritize and select genes for individual sanger sequencing • …Slow • …Laborious • …Can be expensive
  • 9.
  • 11. Human Genomics • $5,000 - $10,000 to sequence whole genome • $1000 to sequence only protein-coding portion (exome, later)
  • 12. Clinical Genomics • Rapid diagnosis of genetic disease in NICU cases • Quicker and cheaper than sequential genetic testing (traditional method)
  • 13. Cancer Genomics Welch JS, et al. JAMA, 2011;305, 1577
  • 15. Human Disease Genomics at Dalhousie • IGNITE: Identifying genetic mutations causing rare mendelian diseases in Atlantic Canada – 3 year, $2.5 million Genome Canada Project – Currently working on >10 different diseases including two inherited cancer’s – Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes – More on Thursday… • Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines
  • 16.
  • 17.
  • 18. Short Reads Millions of paired “short reads”, 75-150bp each
  • 19. FastQ Format Read ID Sequence Quality line
  • 20. FastQ Quality Scores Quality Score (Q) Probability of incorrect base call Base call accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.90% 40 1 in 10000 99.99% 50 1 in 100000 100.00% Q = -10 log10 P
  • 21. Quality Scores of Sequencing Reads
  • 22. General Genomics Workflow Raw Data Quality Control of Raw Analysis Data Whole Genome Alignment to reference Mapping genome Variant Calling Detection of genetic variation (SNPs, Indels, SV) Linking variants to biological Annotation information
  • 23. Short Read Mapping …CCAT CTATATGCG TCGGAAATT CGGTATAC …CCAT GGCTATATG CTATCGGAAA GCGGTATA …CCA AGGCTATAT CCTATCGGA TTGCGGTA C… …CCA AGGCTATAT GCCCTATCG TTTGCGGT C… …CC AGGCTATAT GCCCTATCG AAATTTGC ATAC… …CC TAGGCTATA GCGCCCTA AAATTTGC GTATAC… …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… 1) Report location of genome where read matches best 2) Minimize mismatches 3) Mismatches with lower quality bases better than mismatches with higher quality bases
  • 24. Discovering Genetic Variation SNPs ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG reference genome TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG INDELs
  • 25. Next-Gen Sequencing Experiments • Whole Genome Sequencing • Targeted Exome Sequencing • RNA-Seq • ChIP-Seq • CLIP-Seq
  • 26. Next-Gen Sequencing Experiments • Whole Genome Sequencing • Targeted Exome Sequencing • RNA-Seq • ChIP-Seq • CLIP-Seq
  • 27. Composition of Human Genome Size: 3.2 Gb
  • 28. Genomic Content Chromosome Base pairs Variations Confirmed proteins Putative proteins Pseudogenes miRNA rRNA Misc ncRNA 1 249,250,621 4,401,091 2,012 31 1,130 134 66 106 2 243,199,373 4,607,702 1,203 50 948 115 40 93 3 198,022,430 3,894,345 1,040 25 719 99 29 77 4 191,154,276 3,673,892 718 39 698 92 24 71 5 180,915,260 3,436,667 849 24 676 83 25 68 6 171,115,067 3,360,890 1,002 39 731 81 26 67 7 159,138,663 3,045,992 866 34 803 90 24 70 8 146,364,022 2,890,692 659 39 568 80 28 42 9 141,213,431 2,581,827 785 15 714 69 19 55 10 135,534,747 2,609,802 745 18 500 64 32 56 11 135,006,516 2,607,254 1,258 48 775 63 24 53 12 133,851,895 2,482,194 1,003 47 582 72 27 69 13 115,169,878 1,814,242 318 8 323 42 16 36 14 107,349,540 1,712,799 601 50 472 92 10 46 15 102,531,392 1,577,346 562 43 473 78 13 39 16 90,354,753 1,747,136 805 65 429 52 32 34 17 81,195,210 1,491,841 1,158 44 300 61 15 46 18 78,077,248 1,448,602 268 20 59 32 13 25 19 59,128,983 1,171,356 1,399 26 181 110 13 15 20 63,025,520 1,206,753 533 13 213 57 15 34 21 48,129,895 787,784 225 8 150 16 5 8 22 51,304,566 745,778 431 21 308 31 5 23 X 155,270,560 2,174,952 815 23 780 128 22 52 Y 59,373,566 286,812 45 8 327 15 7 2 mtDNA 16,569 929 13 0 0 0 2 22
  • 30. Transcriptomics: RNA-Seq • Sequence the actively transcribed genes in a cell line or tissue – Only about 20% of genes are transcribed in particular cell types • Two types: – Poly-A selection – Total RNA + ribodepletion • Many experimental questions can be addressed
  • 31. RNA-Seq: Gene Expression Condition 1 Condition 2
  • 33. RNA-Seq: Novel/Non-Canonical Exon Discovery Exon1 Exon 2 Exon X Exon 3
  • 34. RNA-Seq: Gene Fusion Events Exon1 Exon 2 Exon 3 Gene 2 Exon 4
  • 35. RNA-Seq • Important to take in to account biological variability. A sample of cells is a mixed population – Replicates! • Not suited for discovering polymorphisms due to higher error rates introduced by reverse transcription step (RNA -> cDNA) • High false positive rates for fusion gene discovery, novel exons, when low expression levels
  • 38. Short Read Mapping: Placing Millions of Reads on Human Reference • Problem: Efficiently place millions of reads (75bp – 200bp) accurately within 3.2Gb of reference genome • Problem: Read may match equally well at more than one location (pseudogenes, copy number variation, repetititve elements) • Problem: Sequencing reads may be paired
  • 39. Short Read Mapping: Brute Force Method Simple conceptually: Compare each query k-mer to all k- mers of genome Genome Size (N): 3.2 billion bases K-mer length (M): 7 Number of comparisons((N-M + 1) * M): 21 billion
  • 40. Solution Index the Reference Genome Indexing the reference is like constructing a phone book, quickly move towards the relevant portion of the genome and ignore the rest.
  • 41. Short Read Alignment: Suffix Array Split genome into all suffixes (substrings) and sort alphabetically Allows query to be searched against an alphabetical reference, skipping 96% of the genome Ex: banana Sorted: banana a anana ana nana anana ana banana na nana a na
  • 42. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… Index # Sequence Pos Pos Search for GATTACA… 1 ACAGATTACC… 6 2 ACC… 13 3 AGATTACC… 8 4 ATTACAGATTACC… 3 5 ATTACC… 10 6 C… 15 7 CAGATTACC… 7 8 CC… 14 9 GATTACAGATTACC… 2 10 GATTACC… 9 11 TACAGATTACC… 5 12 TACC… 12 13 TGATTACAGATTACC… 1 14 TTACAGATTACC… 4 15 TTACC… 11
  • 43. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… Index # Sequence Pos Pos Search for GATTACA… 1 ACAGATTACC… 6 2 ACC… 13 3 AGATTACC… 8 4 ATTACAGATTACC… 3 5 ATTACC… 10 6 C… 15 7 CAGATTACC… 7 8 CC… 14 9 GATTACAGATTACC… 2 10 GATTACC… 9 11 TACAGATTACC… 5 12 TACC… 12 13 TGATTACAGATTACC… 1 14 TTACAGATTACC… 4 15 TTACC… 11
  • 44. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… Index # Sequence Pos Pos Search for GATTACA… 1 ACAGATTACC… 6 2 ACC… 13 3 AGATTACC… 8 4 ATTACAGATTACC… 3 5 ATTACC… 10 6 C… 15 7 CAGATTACC… 7 8 CC… 14 9 GATTACAGATTACC… 2 10 GATTACC… 9 11 TACAGATTACC… 5 12 TACC… 12 13 TGATTACAGATTACC… 1 14 TTACAGATTACC… 4 15 TTACC… 11
  • 45. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… Index # Sequence Pos Pos Search for GATTACA… 1 ACAGATTACC… 6 2 ACC… 13 3 AGATTACC… 8 4 ATTACAGATTACC… 3 5 ATTACC… 10 6 C… 15 7 CAGATTACC… 7 8 CC… 14 9 GATTACAGATTACC… 2 10 GATTACC… 9 11 TACAGATTACC… 5 12 TACC… 12 13 TGATTACAGATTACC… 1 14 TTACAGATTACC… 4 15 TTACC… 11
  • 46. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… Index # Sequence Pos Pos Search for GATTACA… 1 ACAGATTACC… 6 2 ACC… 13 3 AGATTACC… 8 4 ATTACAGATTACC… 3 5 ATTACC… 10 6 C… 15 7 CAGATTACC… 7 8 CC… 14 9 GATTACAGATTACC… 2 10 GATTACC… 9 11 TACAGATTACC… 5 12 TACC… 12 13 TGATTACAGATTACC… 1 14 TTACAGATTACC… 4 15 TTACC… 11
  • 47. Binary Search • Initialize search range to entire list – mid = (hi+lo)/2; middle = suffix[mid] – if query matches middle: done – else if query < middle: pick low range – else if query > middle: pick hi range • Repeat until done or empty range
  • 48. Applied to Human Genome • In practice simple methods of indexing the genome can create very large data structures – Suffix Array: > 12 GB • Solution: Apply complex procedures that allow you to index and compress the data: – Burrows-Wheeler Transform – FM-Index
  • 49. Short Read Mapping: Mapping Quality • Have also ignored quality scores of reads • Mapping Quality (for a read): Sum the quality scores at mismatched bases for alignment (SUM_BASE_Q(best)), also consider all other possible alignments MQ = -log10 (1 – (10-SUM_BASE_Q(best) /SUMi(10- SUM_BASE_Q(i))) )
  • 50. Short Read Aligners • BLAT: BLAST-Like Alignment Tool • MAQ: First to take in to account quality scores • BWA: First to use Burrows-Wheeler Transform • Bowtie: Ungapped alignment only • Bowtie2: Allows indels • … and many more
  • 51. Identifying and Annotating Genomic Variation for Disease Gene Discovery LECTURE 2
  • 52. Genetic Variation • dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans – 38 million validated – 22 million in genes – 36 million with frequencies • 50-80% of mutations involved in inherited disease caused by SNVs
  • 53. SNP vs SNV • Technically a polymorphism is a variation that doesn’t cause disease and is common in a population • What is common? – Greater than 5% in a population a typical definition – Definition for rare ranges from < 0.5% to < 1.5%
  • 54. Discovering Genetic Variation SNPs ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG reference genome TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG INDELs
  • 55. Variant Calling: The Absurdly Simple Way Read depth at base: 10 T: 4 A: 6 Genotype: Heterozygous A/T TTCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG reference genome
  • 56. Variant Calling: The Absurdly Simple Way • Algorithm: – Count all aligned bases that pass quality threshold (e.g. >Q20) – If #reads with alternative base > lower bound (20%) and < upper bound (80%) call heterozygous alt – Else if > upper bound call homozygous alternative – Else call homozygous reference • …But what about base qualities for more than keeping reads?
  • 57. Improving Variant Calling • MAQ (Mapping and Assembling with Quality): – Short read mapper and genotype caller – First to use base qualities for both mapping and genotype calling – Introduced mapping quality
  • 58. Improving Variant Calling ① Base quality cannot be more reliable than the mapping quality of the read ② An individual can have at most two real nucleotides at a position (two alleles) – ① Only consider the two most frequent nucleotides – ② Simplify to two states: A and B
  • 59. Improving Variant Calling • Three Possible Genotypes: – AA, BB, AB • Construct a model that includes base quality to estimate the probability of error • Calculate the probability of each genotype given the data and error rate • Genotype with highest probability is called
  • 61. The Model – g = genotype – e = error probability – m = ploidy (2) – k = number of reads
  • 62. The Model Reads that match reference
  • 63. The Model Reads that don’t match reference
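The equation on these slides did not survive extraction. As a hedged illustration only, the sketch below assumes the likelihood form commonly described for samtools-style callers, with g counted as the number of reference alleles in the genotype: L(g) = (1/m^k) × Π over reference-matching reads of [(m−g)e + g(1−e)] × Π over non-matching reads of [(m−g)(1−e) + ge]. Whether this is exactly the slide's formula is an assumption, and the function names are hypothetical. A flat prior over genotypes is also assumed, so the most likely genotype is simply the one with the highest likelihood.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability (assumes q > 0)."""
    return 10 ** (-q / 10)

def genotype_log_likelihoods(matches_ref, quals, ploidy=2):
    """Return {g: log L(g)}, where g is the number of reference alleles.

    matches_ref: one boolean per read, True if the base matches the reference
    quals:       Phred base quality per read (assumed already capped by
                 the read's mapping quality)
    """
    result = {}
    for g in range(ploidy + 1):
        log_l = 0.0
        for is_ref, q in zip(matches_ref, quals):
            e = phred_to_error(q)
            if is_ref:
                # Reads that match the reference
                p = ((ploidy - g) * e + g * (1 - e)) / ploidy
            else:
                # Reads that don't match the reference
                p = ((ploidy - g) * (1 - e) + g * e) / ploidy
            log_l += math.log(p)
        result[g] = log_l
    return result

def most_likely_genotype(matches_ref, quals):
    """Genotype (number of reference alleles) with the highest likelihood."""
    ll = genotype_log_likelihoods(matches_ref, quals)
    return max(ll, key=ll.get)
```

For ploidy 2, g = 2 corresponds to AA (homozygous reference), g = 1 to AB, and g = 0 to BB.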
  • 64. Improving Variant Calling • Two widely used tool sets for calling variants – samtools (uses MAQ-type calculation) – Genome Analysis Toolkit (GATK) UnifiedGenotyper • UnifiedGenotyper: Capable of calling both indels and single nucleotide polymorphisms (SNPs) and allele frequencies given multiple samples
  • 65. UnifiedGenotyper • Apply filters to discard poor reads and remove biases: ① Duplicate reads ② Malformed reads (i.e. mismatch between number of bases and number of base qualities) ③ Bad mate (paired-end sequencing, paired reads map to different chromosomes) ④ Mapping quality zero (read maps to multiple locations equally well) ⑤ Reads kept must have fewer than 10% mismatches within 20 bp on either side of the position
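The five filters can be read as a single predicate applied to each read. The sketch below is not GATK code: the `Read` record and all field names are hypothetical, and the handling of the mismatch window (counting mismatches within 20 bp on either side and dividing by the 40 bp window) is one possible reading of the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Read:
    # Hypothetical minimal read record; a real pipeline would use BAM records.
    chrom: str
    mate_chrom: str
    mapq: int
    seq: str
    quals: list
    is_duplicate: bool = False
    # Reference coordinates at which this read disagrees with the reference.
    mismatch_positions: set = field(default_factory=set)

def passes_filters(read, pos, flank=20, max_mismatch_frac=0.10):
    """Return True if the read may be used for calling at position pos."""
    if read.is_duplicate:                      # 1. duplicate read
        return False
    if len(read.seq) != len(read.quals):       # 2. malformed read
        return False
    if read.mate_chrom != read.chrom:          # 3. bad mate
        return False
    if read.mapq == 0:                         # 4. mapping quality zero
        return False
    # 5. too many mismatches near the site (the site itself is not counted)
    nearby = sum(1 for p in read.mismatch_positions
                 if p != pos and abs(p - pos) <= flank)
    if nearby / (2 * flank) >= max_mismatch_frac:
        return False
    return True
```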
  • 66. Remove Duplicate Reads
      Application   #Molecules/Library   Avg Read Length   Avg Molecules Sampled   #Molecules Sampled > 1
      30X Genome    5bn                  2x100             450m                    4.4%
      4x Genome     5bn                  2x100             60m                     0.6%
      100x Exome    500m                 2x75              20m                     2.0%
    • Duplicate reads break the assumption of independent sampling from the library • Identify reads with identical start/stop positions
  • 67. Sequencer-Specific Error Models • If a base was miscalled, what is it most likely to be called as instead? (values in %)
                      Predicted Base
      Actual Base     A       C       G       T
      A               -       57.7    17.1    25.2
      C               34.9    -       11.3    53.9
      G               31.9    5.1     -       63.0
      T               45.9    22.1    32.0    -
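The percentages in the last column are consistent with a simple Poisson sampling model: if molecules are drawn independently from the library, the number of times any one molecule is sampled is roughly Poisson with mean (molecules sampled / library size). The sketch below assumes that model and assumes the column reports the fraction of sampled molecules that were sampled more than once; the slide does not state how its numbers were derived.

```python
import math

def fraction_sampled_more_than_once(library_size, molecules_sampled):
    """Among molecules sampled at least once, the fraction sampled 2+ times,
    assuming Poisson sampling from the library."""
    lam = molecules_sampled / library_size   # mean samples per molecule
    p_zero = math.exp(-lam)                  # never sampled
    p_one = lam * p_zero                     # sampled exactly once
    return (1 - p_zero - p_one) / (1 - p_zero)

for name, lib, sampled in [("30X Genome", 5e9, 450e6),
                           ("4x Genome", 5e9, 60e6),
                           ("100x Exome", 500e6, 20e6)]:
    pct = 100 * fraction_sampled_more_than_once(lib, sampled)
    print(f"{name}: {pct:.1f}%")
```

Under these assumptions the three rows come out at about 4.4%, 0.6%, and 2.0%, matching the table.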
  • 68. Variant Calling • SNP Calls infested with False Positives – Machine artifacts – Mis-mapped reads – Mis-aligned indels • 5 – 20% false positive rate
  • 73. Decisions and Trade-Offs • Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set. – Pro: Few false positives – Con: Will miss real variants • Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage – Con: False positives – Pro: Won’t miss real variants
  • 74. How Good Are My Calls? • How many called SNPs? – Human average of 1 heterozygous SNP / 1000 bases • Fraction of variants already in dbSNP • Transition/Transversion ratio – Transitions about 2x as common as transversions – About 2.8x when looking only at exons
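The transition/transversion check is easy to compute from a list of called SNVs. A minimal sketch, assuming each call is given as a (reference base, alternative base) pair with the two bases different; the function names are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")
```

A call set whose ratio falls well below the expected value is a hint that it contains many false positives, since random errors produce transversions about twice as often as transitions.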
  • 76. Identifying Genetic Variation Causing Genetic Disease
  • 78. Discovering Genetic Variants Causing Mendelian Disease [Figure: filtering funnel] 4 million genetic variants → 2 million associated with protein-coding genes → 10,000 possibly of disease-causing type → 1,500 at <1% frequency in population → single causal genetic variant
  • 79. If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower
  • 80. TYPES OF SINGLE NUCLEOTIDE VARIANTS
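  • 87. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene model (Start, Exon 1, splice sites, Intron 1, Exon 2, stop codon TAA) producing mRNA coding for protein, compared with the patient's gene. Variant types annotated on the patient copy: splice site loss, missense/frameshift, stop gain, and TAA changed to TAC (Tyr).]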
  • 88. GENETIC REGIONS OF INTEREST
  • 90. Number of Genes in Genomic Regions of Interest
  • 92. Frequency of Polymorphisms: Common vs Rare • Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population • Leverage large projects aimed at assessing genetic diversity in populations around the world – 1000 Genomes – NHLBI Exome Sequencing Project
  • 96. Population Matters • Most variations in protein-coding genes occurred fairly recently (last 20,000 years) – Adaptation to agriculture and diet changes, pathogen exposure and urban living • Monogenic diseases have different prevalence in different populations – Cystic fibrosis in European populations – Hereditary hemochromatosis in Northern Europeans – Tay-Sachs in Ashkenazi Jews – Sickle-cell anemia in sub-Saharan African populations • Polygenic disorders
  • 98. Exome Sequencing Project • Multi-Institutional • Total possible patient pool of > 250,000 individuals, well phenotyped – Includes healthy individuals and diseased • Currently 6700 exomes sequenced – 4420 European descent – 2312 African American • 1.2 million coding variations – Most extremely rare/unique – Many population specific
  • 100. IGNITE Project: Local Controls • IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada • Atlantic Canada harbours several non-represented population groups and sub-groups… – Acadians – Native American – Non-Acadian/European Descent
  • 101. Population Frequency • Mendelian disorders are rare • If variation is in database, is it associated with disease? • Causal variation also needs to be rare – Cutoff somewhere in the < 0.5 - < 1.5% range – Should appear rarely or not at all in local controls – Track with disease in family members under study
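These criteria combine naturally into one filter. The sketch below is a hypothetical illustration, not the IGNITE pipeline: the function and parameter names are invented, the 1% cutoff is just one value inside the range quoted on the slide, and "carriers" is taken to mean family members who have the disease genotype (for a recessive disorder, that would be homozygotes or compound heterozygotes, not heterozygous carriers).

```python
def is_candidate(pop_freq, n_local_controls, carriers, affected, unaffected,
                 max_pop_freq=0.01, max_local_controls=1):
    """Decide whether a variant remains a candidate causal variant.

    pop_freq:          frequency in population databases, or None if absent
    n_local_controls:  number of local control individuals with the variant
    carriers:          set of family members with the disease genotype
    affected:          set of affected family members
    unaffected:        set of unaffected family members
    """
    # Rare (or unseen) in reference populations.
    rare = pop_freq is None or pop_freq < max_pop_freq
    # Appears rarely or not at all in local controls.
    rare_locally = n_local_controls <= max_local_controls
    # Tracks with disease: every affected member has it, no unaffected member does.
    segregates = affected <= carriers and not (unaffected & carriers)
    return rare and rare_locally and segregates
```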
  • 102. Predicting the Impact of Missense Mutations • Most use some level of evolutionary conservation to determine how severe a mutation is – SIFT – PolyPhen – GERP++ – EvoD
  • 103. Example: SIFT Algorithm [Figure: flowchart] Input query sequence → PSI-BLAST → homologous sequences → multiple sequence alignment → PSSM → score, normalized by the most frequent amino acid
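The last step of the flowchart, normalizing by the most frequent amino acid, can be illustrated on a single alignment column. This is a deliberately simplified sketch and not SIFT's implementation: it uses raw counts from one column instead of a PSI-BLAST PSSM with pseudocounts, and the function names are hypothetical.

```python
from collections import Counter

def normalized_column_scores(column):
    """Score each amino acid seen in an alignment column, scaled so that
    the most frequent amino acid scores 1.0. Gaps ('-') are ignored.
    Assumes the column contains at least one non-gap residue."""
    counts = Counter(aa for aa in column if aa != "-")
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}

def score_substitution(column, new_aa):
    """Normalized score of a substituted amino acid; 0.0 if never observed."""
    return normalized_column_scores(column).get(new_aa, 0.0)
```

For a column such as `"LLLLLLLLIV"`, leucine scores 1.0, isoleucine and valine score 0.125, and an unobserved residue such as proline scores 0.0, so a low score flags a substitution that is rarely or never tolerated at a conserved site.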
  • 104. Predicting Impact • Other approaches include additional features: – Protein structure information – Site level annotation (active sites, binding sites, etc) – Protein domain information – Biophysical properties of amino acids in that position and of the substituted amino acid
  • 105. Prediction Take-Away • The more conserved a site is, the more likely any substitution is to be deleterious • However: current methods perform poorly and are not suitable for clinical-level diagnosis
  • 106. Classifying Genetic Variants [Figure: decision tree starting from 4 million variants. Top-level classes: intronic, exonic, intergenic. Other node labels: splice site, unknown, silent mutation, amino acid changing, potential disease causing, known genetic disease variant, stop loss / stop gain, missense mutation, known polymorphism in population.]
  • 108. Annotating Genes and Variants • Is variant in a known protein-coding gene? – What does the gene do? – What molecular pathways? – What protein-protein interactions? – What tissues is it expressed in? – When in development? [Figure: filtering funnel: 4 million genetic variants; 2 million associated with protein-coding genes; 10,000 possibly of disease-causing type; 1,500 at <1% frequency in population.]
  • 111. Genomic Intervals, Searching, and Annotation • Most common way of describing genomic features is as an interval • Multiple formats (BED, WIG, VCF, etc) • In common for all is location: – Chromosome – Start Position of Feature – End Position of Feature – Annotations/Info (Optional)
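A genomic interval and an overlap test can be sketched as follows. The example assumes BED-style input, where fields are tab-separated and coordinates are 0-based with an exclusive end (VCF positions, by contrast, are 1-based). The `Interval` class and function names are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int      # 0-based, inclusive (BED convention)
    end: int        # exclusive
    info: str = ""  # optional annotation

def parse_bed_line(line):
    """Parse the first three (optionally four) columns of a BED line."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    info = fields[3] if len(fields) > 3 else ""
    return Interval(chrom, start, end, info)

def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end
```

Two intervals that merely touch, for example `[100, 200)` and `[200, 300)`, do not overlap under the half-open convention.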
  • 112. Searching and Annotating: Interval Trees • Interval Trees allow efficient searching of all overlapping intervals • Easiest to make one tree per chromosome • Given a set of intervals (n) on a number line (chromosome) construct a tree
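  • 113. Interval Trees [Figure: tree node] • Node contains: – Centre point – Intervals sorted by start – Intervals sorted by end • Left subtree: all intervals to the left of the centre point • Right subtree: all intervals to the right of the centre point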
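The node layout described here corresponds to a centred interval tree. The sketch below is one minimal way to build it and query a single position, assuming closed `(start, end)` integer intervals on one chromosome; the choice of the median endpoint as the centre point is an assumption, since the slide does not say how the centre is picked.

```python
class IntervalNode:
    """Centred interval tree node for a non-empty list of closed (start, end) intervals."""

    def __init__(self, intervals):
        # Use the median endpoint as the centre, so at least one interval
        # always contains it and both subtrees are strictly smaller.
        points = sorted(p for iv in intervals for p in iv)
        self.centre = points[len(points) // 2]
        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]
        self.by_start = sorted(here)                                      # ascending start
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)    # descending end
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query_point(self, x):
        """Return all intervals that contain position x."""
        hits = []
        if x < self.centre:
            # Intervals here end at or after the centre, so only the start matters.
            for iv in self.by_start:
                if iv[0] > x:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query_point(x)
        elif x > self.centre:
            # Intervals here start at or before the centre, so only the end matters.
            for iv in self.by_end:
                if iv[1] < x:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query_point(x)
        else:
            hits = list(self.by_start)
        return hits
```

Keeping the node's intervals sorted both by start and by end is what lets a query stop scanning as soon as one interval fails, instead of checking every interval stored at the node. In practice one such tree would be built per chromosome.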
  • 114. IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa CASE STUDIES
  • 115. IGNITE Data Pipeline and Integration [Figure: data pipeline diagram combining genomic variants, gene annotations and definitions, mapped region(s), known genes, and pathway and interaction data, followed by filter, sort, and prioritize steps.]
  • 117. Brain Calcification • 84 genes in chromosome 5 region • No likely homozygous or compound heterozygous variants within region shared between two patients • 29 genes with at least one targeted region with little or no sequencing coverage • Many only lacked coverage in 5’ and 3’ UTRs • Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data
  • 119. Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282-133,033,431
  • 120. Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811-81,041,077
  • 121. Known Causative Genes
    Charcot-Marie-Tooth • 143 genes in region • 13 known causative genes – MPZ – PMP22 – GDAP1 – KIF1B – MFN2 – SOX – EGR2 – DNM2 – RAB7 – LITAF (SIMPLE) – GARS – YARS – LMNA
    Cutis Laxa • 52 genes in region • 5 known causative genes – ATP6V0A2 – ELN – FBLN5 – EFEMP2 – SCYL1BP1 – ALDH18A1
  • 122. Pathway and Interaction Data
    Charcot-Marie-Tooth • 37 pathways – Clathrin-derived vesicle budding – Lysosome vesicle biogenesis – Endocytosis – Golgi-associated vesicle biogenesis – Membrane trafficking – Trans-Golgi network vesicle budding • Primarily LMNA or DNM2
    Cutis Laxa • 10 pathways – Phagosome – Collecting duct acid secretion – Lysosome – Protein digestion and absorption – Metabolic pathways – Oxidative phosphorylation – Arginine and proline metabolism • Primarily ATP6V0A2
  • 123. Results: Charcot-Marie-Tooth • 8 genes prioritized
      Gene       Interactions   Pathway
      LRSAM1     Multiple       Endocytosis
      DNM1       DNM2           -
      FNBP1      DNM2           -
      TOR1A      LMNA           -
      STXBP1     Multiple       Five
      SH3GLB2    -              Endocytosis
      PIP5KL1    -              Endocytosis
      FAM125B    -              Endocytosis
    • For more information – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  • 124. Results: Cutis Laxa • 10 genes prioritized
      Gene       Interactions   Pathway
      HEXDC      Multiple       Phagosome
      HG5        -              Phagosome
      HG5        Multiple       Lysosome, Protein digestion
      SIRT7      Multiple       Metabolic Pathways
      FASN       -              Metabolic Pathways
      DCXR       -              Metabolic Pathways
      PYCR1      -              Metabolic Pathways, Arginine/Proline
      PCYT2      -              Metabolic Pathways
      ARHGDIA    -              Oxidative Phosphorylation
    • For more information – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9