MixSIH: a mixture model for
single individual haplotyping
    Hirotaka Matsumoto, Hisanori Kiryu
    Department of Computational Biology
    Graduate School of Frontier Sciences
    The University of Tokyo
• Introduction
   – What is a haplotype?
   – What is Single Individual Haplotyping (SIH)?
   – An unnoticed problem in SIH (No confidence score)
• Methods
   – Probabilistic model for SIH (Simplified version)
   – Confidence score based on our model
   – Actual model and optimization procedure
• Results
   – Dataset
   – Comparison of accuracies
   – Other analyses

                                                         2
Introduction
• Human somatic cells are diploid and contain
  two homologous copies of chromosomes.



                    homologous
• The two chromosomes differ at a number of
  loci, such as SNPs.
                           A     A
                           G     A   SNP
                           T     T
                           T     T
                           C     C              3
What is a haplotype?
• A haplotype is the combination of alleles on a
  single chromosome.

• If there are two heterozygous loci, there are
  two possible pairs of haplotypes.

                               haplotype
                             ---AAATGGCT---
       genotype

                      ?
                             ---AGATGTCT---
        A   G
    ---A ATG CT---
        G   T
                             ---AAATGTCT---
                             ---AGATGGCT---
                                                  4
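The importance of haplotypes
• Haplotype information is valuable for GWAS and for
  analyzing genetic structure, cancer evolution, and so
  on.
• A simple example
   – A gene coding region contains two SNP loci, each of which
     carries an independent deleterious mutation. Whether the two
     mutations lie on the same chromosome copy or on different copies
     determines whether one intact copy of the gene remains.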
                                          haplotype
                                      ---AAATGGCT---
     genotype

                            ?
                                      ---AGATGTCT---
      A   G
  ---A ATG CT---
      G   T
                                      ---AAATGTCT---
                                      ---AGATGGCT---
                                                                            5
Approaches for haplotype inference
• It is difficult to determine haplotypes experimentally,
  so several computational approaches to haplotype
  inference have been developed.

1. Statistically construct a set of haplotypes from population
  genotypes (statistical haplotype phasing).


2. Reconstruct haplotypes from the genotypes of a pedigree.

3. Infer an individual’s haplotypes from sequenced DNA
  fragments (single individual haplotyping, SIH).

                                                                 6
Infer haplotypes from sequenced DNA fragments
      Single Individual Haplotyping (SIH)

aligned reads




   reference


 • (A)→(B)
    – Extract the fragments from the heterozygous
      alleles in aligned reads.

                                                      7
Infer haplotypes from sequenced DNA fragments
   Single Individual Haplotyping (SIH)




• (B)→(D) (i)
   – Co-occurrence of alleles in the same read (intra).



                                                          8
Infer haplotypes from sequenced DNA fragments
   Single Individual Haplotyping (SIH)




• (B)→(D) (ii)
   – Overlap between the reads (inter).



                                                   9
The problem and the outlook for SIH
• SIH needs reads that span multiple
  heterozygous loci.
  – Next-generation sequencing reads are not long enough
  – Sanger sequencing is too expensive


• This situation is changing rapidly with the
  advent of new experimental techniques.
  – Real-time single-molecule sequencing
  – Fosmid pool-based next-generation sequencing

                                                     10
Important point in haplotype inference
• Haplotype information that contains errors
  is likely to lead to wrong results in downstream
  analyses.
  – For example, when detecting recombination events


• To use haplotype information in downstream
  analyses while avoiding the harmful influence of
  such errors, it is important not only to assemble
  haplotypes that are as long as possible but also to provide
  a means of extracting highly reliable haplotype
  regions.

                                                       11
The problem of SIH algorithms
• In statistical haplotype phasing, reliable
  haplotype regions are determined by selecting
  blocks according to their limited haplotype diversity and their level of
  linkage disequilibrium (LD).

• Although there are many algorithms for SIH, none of
  them provides confidence scores for
  extracting reliable haplotype regions.


• We developed an algorithm that provides
  confidence scores for haplotype regions.
                                                        12
• Introduction
   – What is a haplotype?
   – What is Single Individual Haplotyping (SIH)?
   – An unnoticed problem in SIH (No confidence score)
• Methods
   – Probabilistic model for SIH (Simplified version)
   – Confidence score based on our model
   – Actual model and optimization procedure
• Results
   – Dataset
   – Comparison of accuracies
   – Other analyses

                                                         13
Notation
• We consider only the heterozygous sites and
  represent them as binary values.

• We extract the alleles at the heterozygous sites from the
  sequenced reads and refer to the results as
  “fragments”.

    mapped reads




       reference


                                                     14
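The notation above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `encode_fragment`, the dictionary layout, and the rule for which allele is coded 0 or 1 are all assumptions made for the example.

```python
def encode_fragment(read_alleles, het_sites):
    """Turn one read into a binary fragment over heterozygous sites.

    read_alleles: {genomic position: base observed in the read}
    het_sites:    {genomic position: (allele coded 0, allele coded 1)}
                  (which allele gets 0 or 1 is an assumption of this sketch)
    Returns {site index: 0 or 1} for the heterozygous sites the read covers.
    """
    fragment = {}
    for site_index, position in enumerate(sorted(het_sites)):
        base = read_alleles.get(position)
        if base is None:
            continue  # the read does not cover this heterozygous site
        allele0, allele1 = het_sites[position]
        if base == allele0:
            fragment[site_index] = 0
        elif base == allele1:
            fragment[site_index] = 1
        # any other base is skipped (treated here as a sequencing error)
    return fragment


# Hypothetical input: three heterozygous sites and one read covering two of them.
het_sites = {101: ("A", "G"), 250: ("G", "T"), 300: ("C", "T")}
read = {101: "G", 250: "G", 275: "A"}
fragment = encode_fragment(read, het_sites)
print(fragment)
```

Position 275 is not heterozygous and is ignored, and site 300 is not covered by the read, so the fragment holds values only for the first two sites.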
Process behind the sequencing
• Each sequenced fragment is derived from
  one of the haplotypes.


         Haplotypes            Fragments
                              001011
                               010110

hap 0 0 0 1 0 1 1 0 0 0 0      10100
                                 0110000
hap 1 1 1 0 1 0 0 1 1 1 1        1001111
                                   1000
                                       1111
                                              15
Observed data
• Only the contents of the fragments are observed; the
  haplotype states and the derivation of the fragments
  are not observed.
                                    Observed data
                                      (input data)
                                     001011
                                      010110

            ?
                           ?
hap 0                                 10100
                                        0110000
hap 1       ?                           1001111
                                          1000
                                              1111
                                                         16
Parameters to represent unobserved data

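• We introduce parameters θ and latent variables Z to
  represent the unobserved processes.

[Figure: hap 0 and hap 1 with site j marked; the latent variable Z_i links the haplotypes to the fragment f_i = 1001111]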



                                                 17
Parameters to represent unobserved data
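[Figure: hap 0, hap 1, site j, and fragment f_i = 1001111]

1. Z_i is the latent variable that represents the derivation of f_i:

$$Z_i = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

[Figure: the two cases, with f_i derived from hap 0 or from hap 1]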
                                                              18
Parameters to represent unobserved data
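[Figure: hap 0, hap 1, site j, and fragment f_i = 1001111]

2. θ_j is the set of parameters that represents the state of
   site j:

$$\theta_j = (\theta_{j,a},\ \theta_{j,b}), \qquad \theta_{j,a} + \theta_{j,b} = 1$$

   state a: hap 0 carries 0 and hap 1 carries 1 at site j
   state b: hap 0 carries 1 and hap 1 carries 0 at site j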
                                                                   19
Parameters to represent unobserved data
[Figure: hap 0, hap 1, site j, and fragment f_i = 1001111]

  The probability P(f_i | θ) is as follows:

$$P(f_i \mid \theta) = \sum_{Z} \prod_{j=0}^{1} \bigl[ P(h_i = j)\, P(f_i \mid h_i, \theta) \bigr]^{Z_{i,j}} = \sum_{Z} \prod_{j=0}^{1} \Bigl[ 0.5 \prod_{k \in X(f_i)} \theta_{k,\sigma(f_{i,k},h_i)} \Bigr]^{Z_{i,j}}$$

  where
  ・ h_i is the derivation of f_i, and P(h_i = 0) = P(h_i = 1) = 0.5
  ・ k ∈ X(f_i) ranges over the sites that f_i covers
  ・ σ is defined as

$$\sigma(f_{i,k}, h_i) = \begin{cases} a & \text{if } (f_{i,k}, h_i) = (0,0) \text{ or } (1,1) \\ b & \text{if } (f_{i,k}, h_i) = (0,1) \text{ or } (1,0) \end{cases}$$

                                                                                                  20
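The fragment probability of the simplified model can be evaluated directly, as in the following sketch. It assumes a fragment is stored as a dictionary `{site: bit}` and θ as a list of `{"a": ..., "b": ...}` dictionaries; these containers and the function names are illustrative choices, not part of the slides.

```python
def sigma(bit, hap):
    """State implied by observing `bit` on a fragment derived from haplotype `hap`."""
    return "a" if bit == hap else "b"


def fragment_likelihood(fragment, theta):
    """P(f_i | theta): sum over both source haplotypes, each with prior 0.5."""
    total = 0.0
    for hap in (0, 1):
        prob = 0.5  # P(h_i = hap)
        for site, bit in fragment.items():
            prob *= theta[site][sigma(bit, hap)]
        total += prob
    return total


# Hypothetical parameters for three sites and one fragment covering all of them.
theta = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 0.9, "b": 0.1}]
fragment = {0: 0, 1: 1, 2: 0}
likelihood = fragment_likelihood(fragment, theta)
print(likelihood)  # 0.5*0.9*0.8*0.9 + 0.5*0.1*0.2*0.1, up to rounding
```

The sum over Z in the slide reduces to the two terms of the loop, because Z_i selects exactly one of the two haplotypes.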
Parameter optimization and
              haplotype inference

$$P(F \mid \theta) = \prod_{i=1}^{N} P(f_i \mid \theta) = \prod_{i=1}^{N} \sum_{Z} \prod_{j=0}^{1} \bigl[ P(h_i = j)\, P(f_i \mid h_i, \theta) \bigr]^{Z_{i,j}}$$

• Z and θ cannot be optimized simultaneously, so
  we use the EM algorithm for optimization.


• The haplotype at site j is the state whose probability
  is higher than that of the other.
   – If θ_{j,a} > θ_{j,b}, site j is in state a (hap 0 carries 0, hap 1 carries 1).

                                                                         21
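A plain EM loop for the simplified model might look like the sketch below. It is an illustration under stated assumptions rather than the authors' code: θ is stored as a list of θ_{k,a} values (θ_{k,b} is one minus that), a small pseudocount keeps parameters away from 0 and 1, the fragment offsets are read off the layout of the "Process behind the sequencing" slide, and the initial value 0.6 at site 0 is an arbitrary way to break the symmetry between the two labelings.

```python
def run_em(fragments, n_sites, theta_a_init, n_iter=50, pseudo=1e-6):
    """EM for theta_{k,a}; `pseudo` is an assumed pseudocount, not from the slides."""
    theta_a = list(theta_a_init)
    for _ in range(n_iter):
        count_a = [0.0] * n_sites
        count_b = [0.0] * n_sites
        for frag in fragments:
            # E-step: posterior probability that the fragment came from hap 0 or hap 1.
            like = []
            for hap in (0, 1):
                p = 0.5
                for site, bit in frag.items():
                    p *= theta_a[site] if bit == hap else 1.0 - theta_a[site]
                like.append(p)
            total = like[0] + like[1]
            for hap in (0, 1):
                resp = like[hap] / total
                for site, bit in frag.items():
                    if bit == hap:          # sigma = a
                        count_a[site] += resp
                    else:                   # sigma = b
                        count_b[site] += resp
        # M-step: normalized expected counts per site.
        theta_a = [(count_a[k] + pseudo) / (count_a[k] + count_b[k] + 2 * pseudo)
                   for k in range(n_sites)]
    return theta_a


def make_fragment(offset, bits):
    """Helper for the example: a fragment starting at site `offset`."""
    return {offset + i: int(ch) for i, ch in enumerate(bits)}


# Fragments from the earlier slide; the offsets are inferred from the figure layout.
fragments = [make_fragment(0, "001011"), make_fragment(1, "010110"),
             make_fragment(1, "10100"), make_fragment(3, "0110000"),
             make_fragment(3, "1001111"), make_fragment(5, "1000"),
             make_fragment(6, "1111")]
theta_a = run_em(fragments, 10, [0.6] + [0.5] * 9)
# Haplotype call: state a (hap 0 carries 0) when theta_{j,a} > theta_{j,b}.
hap0 = [0 if t > 0.5 else 1 for t in theta_a]
print(hap0)
```

Starting from exactly 0.5 at every site would leave EM stuck at a symmetric fixed point, which is why the sketch perturbs one site.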
Confidence score of a site
• The confidence in the connection of the haplotypes at site j
  is calculated from the optimized parameters.

$$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$

where θ' is θ twisted at site j:

$$\theta'_{k,a} = \theta_{k,a},\ \ \theta'_{k,b} = \theta_{k,b} \quad (k < j), \qquad \theta'_{k,a} = \theta_{k,b},\ \ \theta'_{k,b} = \theta_{k,a} \quad (k \ge j)$$



                                                                      22
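The connectivity score can be computed as a difference of two log-likelihoods, as sketched below for the simplified model. The comparison operators in the definition of θ' were lost in the extracted slide, so the convention used here (sites k ≥ j are swapped, sites k < j are left alone) is an assumption; the list representation of θ_{k,a} and the example values 0.99/0.01 are also illustrative.

```python
import math


def fragment_likelihood(fragment, theta_a):
    """P(f | theta) in the simplified model; theta_a[k] is theta_{k,a}."""
    total = 0.0
    for hap in (0, 1):
        p = 0.5
        for site, bit in fragment.items():
            p *= theta_a[site] if bit == hap else 1.0 - theta_a[site]
        total += p
    return total


def log_likelihood(fragments, theta_a):
    return sum(math.log(fragment_likelihood(f, theta_a)) for f in fragments)


def twist(theta_a, j):
    """Swap states a and b at every site k >= j (assumed convention)."""
    return [t if k < j else 1.0 - t for k, t in enumerate(theta_a)]


def connectivity(fragments, theta_a, j):
    return log_likelihood(fragments, theta_a) - log_likelihood(fragments, twist(theta_a, j))


def make_fragment(offset, bits):
    return {offset + i: int(ch) for i, ch in enumerate(bits)}


# Fragments from the earlier slide (offsets inferred from the figure layout).
fragments = [make_fragment(0, "001011"), make_fragment(1, "010110"),
             make_fragment(1, "10100"), make_fragment(3, "0110000"),
             make_fragment(3, "1001111"), make_fragment(5, "1000"),
             make_fragment(6, "1111")]
true_hap0 = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]
theta_a = [0.99 if bit == 0 else 0.01 for bit in true_hap0]  # hypothetical fitted values
scores = [connectivity(fragments, theta_a, j) for j in range(len(theta_a))]
for j, score in enumerate(scores):
    print(j, round(score, 2))
```

Twisting at j = 0 swaps every site and only relabels the two haplotypes, so that score is zero up to rounding. Fragments lying entirely on one side of j contribute nothing to the difference, which is presumably why the second expression on the slide restricts the likelihood to F^{(j)}.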
Confidence score of a site
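 • Illustration of “connectivity”:

$$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$

[Figure: the fragments F^{(j)} drawn against hap 0 and hap 1 for the two parameter sets around site j, with the corresponding likelihoods P(F^{(j)} | θ) and P(F^{(j)} | θ')]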
                                                                            23
Confidence score of a region
• We extend “connectivity” to a confidence score for
  regions, the minimal connectivity (MC).

$$\mathrm{MC}(j_1, j_2) = \min_{j_1 < j \le j_2} \mathrm{connectivity}(j)$$

• MC is the minimum “connectivity” in the region.

• We tested whether reliable regions could be
  extracted by using MC values.


                                                         24
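Given per-site connectivity values, MC and the extraction of reliable regions reduce to a few lines. The sketch below assumes the range j1 < j ≤ j2 (the comparison operators were lost in the extracted slide), treats connectivity(j) as the score of the link between sites j-1 and j, and uses made-up connectivity values; `reliable_blocks` is a hypothetical helper, not a function from MixSIH.

```python
def minimal_connectivity(connectivity, j1, j2):
    """MC(j1, j2): smallest connectivity over j1 < j <= j2 (assumed range)."""
    return min(connectivity[j] for j in range(j1 + 1, j2 + 1))


def reliable_blocks(connectivity, threshold):
    """Split sites into maximal blocks in which every pair of sites has MC >= threshold."""
    blocks = []
    start = 0
    for j in range(1, len(connectivity)):
        if connectivity[j] < threshold:
            blocks.append((start, j - 1))  # the link between j-1 and j is too weak
            start = j
    blocks.append((start, len(connectivity) - 1))
    return blocks


# Hypothetical connectivity values for six sites; index 0 has no link to its left.
connectivity = [0.0, 12.1, 9.4, 3.2, 8.8, 15.0]
mc_value = minimal_connectivity(connectivity, 1, 4)
blocks = reliable_blocks(connectivity, threshold=6.0)
print(mc_value, blocks)
```

With a threshold of 6.0, the weak link at site 3 splits the six sites into two blocks, (0, 2) and (3, 5).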
Actual model
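    • Add a sequencing error term α:

$$P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \theta_{k,\sigma(f_{i,k},h_i)} \;\longrightarrow\; P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \bigl[ (1-\alpha)\,\theta_{k,\sigma(f_{i,k},h_i)} + \alpha\,\theta_{k,\bar{\sigma}(f_{i,k},h_i)} \bigr]$$

      where σ̄ denotes the state other than σ.


    • Define a prior distribution and optimize the parameters
      with the Variational Bayes EM (VBEM) algorithm.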




                                                                                                                                           25
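The effect of the error term on the per-site factor can be sketched as follows. This covers only the likelihood term, not the prior or the VBEM updates; the function names and the list representation of θ_{k,a} are assumptions, and α = 0.028 is borrowed from the empirical error rate quoted on the chimerity slide purely as an example value.

```python
def site_probability(theta_a, bit, hap, alpha):
    """(1 - alpha) * theta_sigma + alpha * theta_sigma_bar for one site."""
    theta_sigma = theta_a if bit == hap else 1.0 - theta_a
    theta_sigma_bar = 1.0 - theta_sigma  # the other state
    return (1.0 - alpha) * theta_sigma + alpha * theta_sigma_bar


def fragment_given_hap(fragment, theta_a, hap, alpha):
    """P(f_i | h_i, theta) with the sequencing error term."""
    prob = 1.0
    for site, bit in fragment.items():
        prob *= site_probability(theta_a[site], bit, hap, alpha)
    return prob


# Hypothetical parameters for two sites and one fragment covering both.
theta_a = [0.9, 0.2]
fragment = {0: 0, 1: 1}
with_error = fragment_given_hap(fragment, theta_a, hap=0, alpha=0.028)
without_error = fragment_given_hap(fragment, theta_a, hap=0, alpha=0.0)
print(with_error, without_error)
```

Setting α = 0 recovers the simplified model, and for any α the probabilities of observing 0 and 1 at a site still sum to one.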
Actual parameter optimization
• Iterative twist operations to avoid sub-optimal
  solutions.

[Flow diagram: ① → ② → ③ → ④, where ④ compares P(F | θ) with P(F | θ'), and ⑤ loops back to ②]

①   Run Variational Bayes EM from the initial parameters.
②   Select the site with the smallest “connectivity”.
③   Run VBEM from the parameters twisted at that site.
④   Compare the probabilities and keep the better parameters.
⑤   Repeat from step ② until the smallest connectivity exceeds 7.0.
                                                                        26
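The five steps can be sketched as a loop around an optimizer. To keep the example short and runnable, plain EM on the simplified model stands in for VBEM, which is a deliberate substitution and not the method on the slide. The pseudocount, the `max_rounds` cap, the rule of stopping when a twist does not help, and the twist convention (sites k ≥ j swapped) are all assumptions of this sketch.

```python
import math


def frag_like(frag, theta_a):
    total = 0.0
    for hap in (0, 1):
        p = 0.5
        for site, bit in frag.items():
            p *= theta_a[site] if bit == hap else 1.0 - theta_a[site]
        total += p
    return total


def log_like(frags, theta_a):
    return sum(math.log(frag_like(f, theta_a)) for f in frags)


def em(frags, theta_a, n_iter=50, pseudo=1e-6):
    """Plain EM used here in place of VBEM (assumption of this sketch)."""
    theta_a = list(theta_a)
    n = len(theta_a)
    for _ in range(n_iter):
        ca, cb = [0.0] * n, [0.0] * n
        for f in frags:
            total = frag_like(f, theta_a)
            for hap in (0, 1):
                p = 0.5
                for site, bit in f.items():
                    p *= theta_a[site] if bit == hap else 1.0 - theta_a[site]
                for site, bit in f.items():
                    if bit == hap:
                        ca[site] += p / total
                    else:
                        cb[site] += p / total
        theta_a = [(ca[k] + pseudo) / (ca[k] + cb[k] + 2 * pseudo) for k in range(n)]
    return theta_a


def twist(theta_a, j):
    return [t if k < j else 1.0 - t for k, t in enumerate(theta_a)]


def connectivity(frags, theta_a, j):
    return log_like(frags, theta_a) - log_like(frags, twist(theta_a, j))


def optimize_with_twists(frags, theta_init, threshold=7.0, max_rounds=20):
    theta = em(frags, theta_init)                               # step 1
    for _ in range(max_rounds):
        j = min(range(1, len(theta)),
                key=lambda s: connectivity(frags, theta, s))    # step 2
        if connectivity(frags, theta, j) > threshold:           # step 5 stop rule
            break
        candidate = em(frags, twist(theta, j))                  # step 3
        if log_like(frags, candidate) > log_like(frags, theta): # step 4
            theta = candidate
        else:
            break  # assumed: stop when the twist does not improve the fit
    return theta


def make_fragment(offset, bits):
    return {offset + i: int(ch) for i, ch in enumerate(bits)}


# Fragments from the earlier slide (offsets inferred from the figure layout).
fragments = [make_fragment(0, "001011"), make_fragment(1, "010110"),
             make_fragment(1, "10100"), make_fragment(3, "0110000"),
             make_fragment(3, "1001111"), make_fragment(5, "1000"),
             make_fragment(6, "1111")]
theta = optimize_with_twists(fragments, [0.6] + [0.5] * 9)
hap0 = [0 if t > 0.5 else 1 for t in theta]
print(hap0)
```

The labels hap 0 and hap 1 are interchangeable, so a result equal to the complement of the true haplotype describes the same pair.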
• Introduction
   – What is a haplotype?
   – What is Single Individual Haplotyping (SIH)?
   – An unnoticed problem in SIH (No confidence score)
• Methods
   – Probabilistic model for SIH (Simplified version)
   – Confidence score based on our model
   – Actual model and optimization procedure
• Results
   – Dataset
   – Comparison of accuracies
   – Other analyses

                                                         27
Dataset (Simulation data)
• True data
  – Generate M binary heterozygous loci randomly.


• Input data
  – Replicate each true haplotype c times and
    randomly divide the copies into subsequences of
    length between l1 and l2. Then randomly flip
    each binary value of a fragment (0 to 1, or 1 to 0)
    with probability e.


                                                          28
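The simulation procedure can be sketched as below. The function name `simulate`, the seed handling, the dictionary layout of fragments, and the treatment of the last piece of each copy (which may be shorter than l1) are assumptions; the slide does not specify them.

```python
import random


def simulate(M, c, l1, l2, e, seed=0):
    """Generate a random haplotype pair and noisy fragments cut from c copies of each."""
    rng = random.Random(seed)
    hap0 = [rng.randint(0, 1) for _ in range(M)]
    hap1 = [1 - b for b in hap0]  # every site is heterozygous, so hap 1 is the complement
    fragments = []
    for hap in (hap0, hap1):
        for _ in range(c):
            start = 0
            while start < M:
                length = rng.randint(l1, l2)
                end = min(start + length, M)  # the final piece may be shorter than l1
                frag = {}
                for k in range(start, end):
                    flipped = rng.random() < e  # sequencing error with probability e
                    frag[k] = 1 - hap[k] if flipped else hap[k]
                fragments.append(frag)
                start = end
    return hap0, fragments


hap0, fragments = simulate(M=20, c=3, l1=3, l2=6, e=0.0, seed=1)
print(hap0)
print(len(fragments))
```

With e = 0 every fragment agrees with one of the two haplotypes at all of its sites, and each site is covered exactly 2c times.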
Dataset (Real data)
• Input data
  – Data from Duitama’s study, which performed fosmid pool-based
    NGS on the HapMap trio child NA12878 from
    the CEU population.

• True data
  – The haplotypes at about 82% (1.36 × 10^6 of 1.65 × 10^6) of the
    sites were determined by a trio-based statistical
    phasing method as part of the 1000
    Genomes Project.

                                                     29
Fosmid pool-based NGS (1)


              ①                         ②




① Genomic DNA is fragmented into pieces about 40 kilobases long,
  and a fosmid library is constructed.

② Fosmid clones are randomly partitioned into 32 barcoded pools.

                                                                   30
Fosmid pool-based NGS (2)
[Figure: ③ reads mapped to the reference, ④ the block formed by the reads, ⑤ the fragment obtained from the block]

③ Sequencing and mapping.

④ The mapped reads form a block, which corresponds to a fosmid library.

⑤ Convert the block into a fragment.
                                                        31
True data (Trio-based data)

A|C|A|G             A|C|T|G
C|G|T|G             A|G|T|T


          A|C|A|G
          C|G|T|T



                              32
True data (Trio-based data)

A|C|A|G             A|C|T|G
C|G|T|G             A|G|T|T


          A|C|A|G
          C|G|T|T

          C ? A T
          A ? T G
                              33
Accuracy measures
• Pairwise accuracy
True       0000000000        Inferred   0000111000
           1111111111                   1111000111

[Figure: pairs of sites marked as CP or IP]

                          Consistent pair (CP)
                          Inconsistent pair (IP)
                          Precision = CPs / (IPs + CPs)

                                                   34
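Pairwise precision can be computed by checking, for each pair of sites, whether the inferred haplotype links the two sites the same way as the true haplotype. The sketch below counts all pairs of the given sites; which pairs the authors actually evaluate (for example, only pairs above an MC threshold) is not stated in the extracted text, so that choice is an assumption.

```python
from itertools import combinations


def pairwise_precision(true_hap, inferred_hap, sites=None):
    """Precision = CP / (CP + IP) over all pairs of the given sites.

    Needs at least two sites; otherwise there are no pairs to count.
    """
    if sites is None:
        sites = range(len(true_hap))
    consistent = 0
    inconsistent = 0
    for i, j in combinations(sites, 2):
        same_in_truth = true_hap[i] == true_hap[j]
        same_in_inferred = inferred_hap[i] == inferred_hap[j]
        if same_in_truth == same_in_inferred:
            consistent += 1    # consistent pair (CP)
        else:
            inconsistent += 1  # inconsistent pair (IP)
    return consistent / (consistent + inconsistent)


true_hap = [0] * 10
inferred_hap = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
precision = pairwise_precision(true_hap, inferred_hap)
print(precision)
```

For the haplotypes on the slide, 24 of the 45 site pairs are consistent, so the precision is a little above 0.5. Only one haplotype of each pair is needed because the other is its complement.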
Accuracy measures
• Pairwise accuracy


Inferred
         0000111000    Inferred
                                0000111000
         1111000111             1111000111



             ↓                      ↓
     Precision ≒ 0.5         Precision = 1.0


                                               35
Comparison of Pairwise Accuracies
           Simulation data        Real data




• The arrows indicate the MC threshold.


                                              36
Comparison of Pairwise Accuracies
           Simulation data       Real data




• Without an MC threshold, the precisions are almost the same.


                                                        37
Comparison of Pairwise Accuracies
           Simulation data       Real data




• The precision of MixSIH increases as the MC
  threshold is raised.
                                                   38
Comparison of Pairwise Accuracies
           Simulation data        Real data




• On the real data, the precision of MixSIH does not saturate
  even at high MC thresholds.
                                                        39
Problem of fosmid pool-based NGS
                            ①      ②         ③④
             ②   ①
                     ③
             ④

homologous



 • Fosmid pool-based NGS can accidentally
   produce chimeric fragments.


                                                  40
Remove the chimeric fragments
• Calculate the “chimerity” of each fragment by
  comparing it with the true haplotype data.

[Equation: definition of chimerity; not recoverable from the extracted slide text]

          where n(f,h) is the number of sites at which the fragment f matches the true haplotype h,
          f<=j and f>j represent the left and right parts of the fragment f divided at site j, and α0=0.028
          is the empirical sequencing error rate.



• Remove the fragments whose chimerity exceeds 10,
  which corresponds to removing 1.65% of the
  fragments.

                                                                                                      41
Pairwise accuracy on the real data in which
   the chimeric fragments are removed




• The precision of MixSIH reaches that of unassembled prediction.

                                                               42
Dependency of MC values on
            the fragment parameters
• The minimal MC threshold that achieves precision
  >= 0.95 is plotted for different fragment length ranges,
  coverages, and error rates.




• The MC thresholds vary only moderately around
  MC = 6.0, so we set the default MC threshold
  to 6.0.
                                                              43
Spatial distribution of MC values
• This is an example of the spatial distribution of the
  precision of MixSIH and linkage disequilibrium (LD).




• SIH can accurately infer the haplotypes in regions with low LD,
  but there are also regions with reduced precision and high LD
  values.

            Unify SIH and statistical phasing methods.
                                                                    44
Optimality of inferred parameters
• We tested whether our iterative optimization method
  succeeds in avoiding sub-optimal solutions.




• The parameters converge to the global optimum
  upon repeating the twist operation.

                                                   45
Running time




• Our method applies the VBEM algorithm repeatedly and
  hence is rather slow.

• Considering the number of heterozygous sites, it is roughly estimated that
  MixSIH takes about 15 days to finish haplotyping
  chromosome-wide data.
                                                                               46
Conclusion
• We have developed a probabilistic model for SIH and
  defined the minimal connectivity (MC) score.

• Our algorithm can extract highly accurate haplotype
  regions by using MC values.

• We have also found evidence that there are a small number of
  chimeric fragments in an existing dataset of fosmid pool-
  based NGS, and these fragments considerably reduce the
  quality of the SIH.



                                                                 47
Acknowledgement
• Department of Computational Biology, the University of Tokyo
   – Hisanori Kiryu
   – Kiryu Lab. Members
       • Tsukasa Fukunaga
       • Yuki Kashihara
       • Risa Kawaguchi




                                                                 48

MixSIH: a mixture model for single individual haplotyping

  • 1. MixSIH: a mixture model for single individual haplotyping Hirotaka Matsumoto , Hisanori Kiryu Department of Computational Biology Graduate School of Frontier Sciences The University of Tokyo
  • 2. • Introduction – What is haplotype ? – What is Single Individual Haplotyping (SIH)? – An unnoticed problem in SIH (No confidence score) • Methods – Probabilistic model for SIH (Simplified version) – Confidence score based on our model – Actual model and optimization procedure • Results – Dataset – Comparison of accuracies – Other analyses 2
  • 3. Introduction • Human somatic cells are diploid and contain two homologous copies of chromosomes. homologous • The two chromosomes differ at a number of loci such as SNP. A A G A SNP T T T T C C 3
  • 4. What is haplotype ? • Haplotype is the combination of alleles on a single chromosome. • If there are two heterozygous loci, there are two possible haplotypes. haplotype ---AAATGGCT--- genotype ? ---AGATGTCT--- A G ---A ATG CT--- G T ---AAATGTCT--- ---AGATGGCT--- 4
  • 5. The importance of haplotype • Haplotype information is valuable for GWAS, analyzing genetic structures, cancer evolution and so on. • A simple example – There are two SNP loci, each of which has an independent deleterious mutation, in a gene coding region. haplotype ---AAATGGCT--- genotype ? ---AGATGTCT--- A G ---A ATG CT--- G T ---AAATGTCT--- ---AGATGGCT--- 5
  • 6. Approaches for haplotype inference • It is difficult to determine haplotypes experimentally, and there are several computational approaches for haplotype inference. 1. Statistically construct a set of haplotypes from population genotypes. (statistical haplotype phasing) 2. Reconstruct haplotypes by using genotypes of pedigree. 3. Infer an individual’s haplotypes from sequenced DNA fragments. (single individual haplotyping (SIH)) 6
  • 7. Infer haplotypes from sequenced DNA fragments Single Individual Haplotyping (SIH) aligned reads reference • (A)→(B) – Extract the fragments from the heterozygous alleles in aligned reads. 7
  • 8. Infer haplotypes from sequenced DNA fragments Single Individual Haplotyping (SIH) • (B)→(D) (i) – Co-occurrence of alleles in the same read (intra). 8
  • 9. Infer haplotypes from sequenced DNA fragments Single Individual Haplotyping (SIH) • (B)→(D) (ii) – Overlap between the reads (inter). 9
  • 10. The problem and the view of SIH • SIH uses the reads which span multiple heterozygous loci. – next-generations sequencing is not long enough – Sanger sequencing is too expensive • This situation is changing rapidly with the advent of experimental techniques. – real-time single molecule sequencing – fosmid pool-based next generation sequencing 10
  • 11. Important point in haplotype inference • The haplotype information which contains errors is likely to lead to wrong results in downstream analyses. – In detecting the recombination events • To use haplotype information in downstream analyses while avoiding such harmful influence of errors, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. 11
  • 12. The problem of SIH algorithms • In the statistical haplotype phasing, reliable haplotype regions are determined by selecting the blocks of limited haplotype diversity and level of linkage disequilibrium (LD). • Although there are many algorithms for SIH, none of these algorithms can provide confidence scores to extract reliable haplotype regions. • We developed an algorithm which provides the confidence scores of the regions. 12
  • 13. • Introduction – What is haplotype ? – What is Single Individual Haplotyping (SIH)? – An unnoticed problem in SIH (No confidence score) • Methods – Probabilistic model for SIH (Simplified version) – Confidence score based on our model – Actual model and optimization procedure • Results – Dataset – Comparison of accuracies – Other analyses 13
  • 14. Notation • We only consider the heterozygous sites and represent them as binary. • We extract the heterozygous alleles from the DNA sequenced reads and describe these reads as “fragments”. mapped reads reference 14
  • 15. Process behind the sequencing • Each sequenced fragments are derived from one of the haplotypes. Haplotypes Fragments 001011 010110 hap 0 0 0 1 0 1 1 0 0 0 0 10100 0110000 hap 1 1 1 0 1 0 0 1 1 1 1 1001111 1000 1111 15
  • 16. Observed data • The contents of the fragments are only observed and haplotype states and the derivation of the fragments are not observed. Observed data (input data) 001011 010110 ? ? hap 0 10100 0110000 hap 1 ? 1001111 1000 1111 16
  • 17. Parameters to represent unobserved data • We set parameter θ and latent value Z to represent unobserved processes. j hap 0 hap 1 j Zi fi 1001111 17
  • 18. Parameters to represent unobserved data j hap 0 hap 1 j fi 1001111 1. Zi is the latent value to represent the derivation of fi 1  0 Z i    or     0   1     hap 0 hap 0 fi hap 1 fi hap 1 18
  • 19. Parameters to represent unobserved data j hap 0 hap 1 j fi 1001111 2. θj is the set of parameters to represent the state of site j 0 a:  j , a 1 j     j , a   j ,b  1  j ,b j 1 b: 0 19
  • 20. Parameters to represent unobserved data j hap 0 hap 1 j fi 1001111 The probability P( f i |  ) is as follows; Zi , j 1 1   P( f i |  )   P(hi  j ) P( f i | hi , )     0.5   k , ( fi ,k ,hi )  Zi , j  Z j 0   Z j 0 k X ( f i )  where ・ hi   derivation of f i   P(hi  0)  P(hi  1)  0.5 is the and ・k  X ( fi ) means the site which f i   covers a if ( f i ,k , hi )  (0,0)  or  1) (1, ・ ( f i ,k , hi )   b if ( f i ,k , hi )  (0,1)  or  0) (1, 20
  • 21. Parameters optimization and haplotype inference
    $$P(F \mid \theta) = \prod_{i=1}^{N} P(f_i \mid \theta) = \prod_{i=1}^{N} \sum_{Z_i} \prod_{j=0}^{1} \left[ P(h_i = j)\, P(f_i \mid h_i = j, \theta) \right]^{Z_{i,j}}$$
    • Optimizing Z and θ simultaneously is not feasible, so we use the EM algorithm for optimization. • The haplotype at site j is inferred as the state with the higher probability: if $\theta_{j,a} > \theta_{j,b}$, state a is chosen (hap 0 carries 0 and hap 1 carries 1 at site j); otherwise state b is chosen.
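A minimal EM sketch for the simplified model is given below, assuming maximum-likelihood updates rather than the variational Bayes procedure used by the actual method. The E-step computes the posterior probability that each fragment derives from hap 0 or hap 1, and the M-step sets theta_{k,a} to the expected fraction of covering fragments that support state a. The seeding from the first fragment, the clipping constant `eps`, and the fragment offsets in the example are assumptions made for illustration. The example fragments are the ones shown on the "Process behind the sequencing" slide, with offsets chosen here so that each fragment matches one haplotype.

```python
def make_fragment(start, bits):
    """Build a fragment dict {site: allele} from a start offset and a bit string."""
    return {start + i: int(b) for i, b in enumerate(bits)}


def em_phase(fragments, n_sites, n_iter=100, eps=1e-6):
    # Seed (assumption): weakly assign the first fragment to hap 0.
    theta_a = [0.5] * n_sites
    for k, allele in fragments[0].items():
        theta_a[k] = 0.6 if allele == 0 else 0.4
    for _ in range(n_iter):
        num = [0.0] * n_sites   # expected support for state a at each site
        den = [0.0] * n_sites   # number of fragments covering each site
        for frag in fragments:
            # E-step: unnormalized posterior of the derivation h_i = 0, 1.
            w = []
            for hap in (0, 1):
                p = 0.5
                for k, allele in frag.items():
                    p *= theta_a[k] if allele == hap else 1.0 - theta_a[k]
                w.append(p)
            z = w[0] + w[1]
            r = [w[0] / z, w[1] / z]
            for k, allele in frag.items():
                num[k] += r[allele]   # sigma = a exactly when allele == hap
                den[k] += 1.0
        # M-step: weighted fraction, clipped away from 0 and 1.
        theta_a = [
            min(1.0 - eps, max(eps, num[k] / den[k])) if den[k] else 0.5
            for k in range(n_sites)
        ]
    hap0 = [0 if t > 0.5 else 1 for t in theta_a]
    return hap0, theta_a


fragments = [
    make_fragment(0, "001011"), make_fragment(1, "010110"),
    make_fragment(1, "10100"), make_fragment(3, "0110000"),
    make_fragment(3, "1001111"), make_fragment(5, "1000"),
    make_fragment(6, "1111"),
]
hap0, theta_a = em_phase(fragments, n_sites=10)
```

Starting from all parameters equal to 0.5 would be a fixed point of these updates, which is why the sketch breaks the symmetry with a seed. On noisy data plain EM can still stop at a sub-optimal solution.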
  • 22. Confidence score of a site • The confidence of the connection of the haplotypes at site j is calculated from the optimized parameters:
    $$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$
    where $\theta'_{k,a} = \theta_{k,a},\ \theta'_{k,b} = \theta_{k,b}$ for $k < j$, and $\theta'_{k,a} = \theta_{k,b},\ \theta'_{k,b} = \theta_{k,a}$ for $k \ge j$.
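  • 23. Confidence score of a site • This is the illustration of “connectivity”:
    $$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$
    [Figure: the fragments $F^{(j)}$ around site j, scored against the optimized haplotypes, $P(F^{(j)} \mid \theta)$, and against the haplotypes twisted at site j, $P(F^{(j)} \mid \theta')$]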
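The connectivity score can be sketched as a difference of two log-likelihoods under the simplified model. This is an illustration only, assuming the twist keeps the parameters at sites k < j and swaps states a and b at sites k ≥ j. The function names and the dict-based fragment layout are hypothetical.

```python
import math


def log_likelihood(fragments, theta_a):
    """log P(F | theta) for the simplified (error-free) mixture model."""
    total = 0.0
    for frag in fragments:
        p = 0.0
        for hap in (0, 1):
            q = 0.5
            for k, allele in frag.items():
                q *= theta_a[k] if allele == hap else 1.0 - theta_a[k]
            p += q
        total += math.log(p)
    return total


def connectivity(fragments, theta_a, j):
    """log P(F | theta) - log P(F | theta'), with theta' twisted at site j."""
    twisted = [t if k < j else 1.0 - t for k, t in enumerate(theta_a)]
    return log_likelihood(fragments, theta_a) - log_likelihood(fragments, twisted)


theta_a = [0.99, 0.99, 0.99, 0.99]
left_and_right = [{0: 0, 1: 0}, {2: 1, 3: 1}]       # nothing links sites 1 and 2
linked = left_and_right + [{1: 0, 2: 0}]            # one fragment spans the link
score_unlinked = connectivity(left_and_right, theta_a, 2)
score_linked = connectivity(linked, theta_a, 2)
```

Fragments that lie entirely on one side of the twist contribute nothing, because swapping a and b at every site a fragment covers only exchanges the roles of hap 0 and hap 1. Only the fragments that span the twist point change the score, which is why the ratio over F reduces to a ratio over a subset of the fragments.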
  • 24. Confidence score of a region • We extend “connectivity” to a confidence score for regions, the minimal connectivity (MC):
    $$\mathrm{MC}(j_1, j_2) = \min_{j_1 < j \le j_2} \mathrm{connectivity}(j)$$
    • MC is the minimum “connectivity” in the region. • We tested whether reliable regions could be extracted by using MC values.
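Given per-site connectivity values, MC and the extraction of reliable regions can be sketched as follows. The index convention (conn[j] scores the link between sites j-1 and j, conn[0] is unused) and both function names are assumptions for this sketch. The connectivity values in the example are made up for illustration.

```python
def minimal_connectivity(conn, j1, j2):
    """MC(j1, j2): minimum connectivity over j1 < j <= j2 (requires j1 < j2)."""
    return min(conn[j] for j in range(j1 + 1, j2 + 1))


def confident_blocks(conn, threshold):
    """Split sites 0..M-1 into maximal blocks whose internal MC >= threshold."""
    blocks, start = [], 0
    for j in range(1, len(conn)):
        if conn[j] < threshold:       # weak link: close the current block
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, len(conn) - 1))
    return blocks


conn = [0.0, 9.1, 8.4, 2.0, 7.5, 12.3]   # illustrative values only
blocks = confident_blocks(conn, threshold=6.0)
```

Cutting at every link below the threshold yields blocks in which any pair of sites has an MC value at or above the threshold, since the MC of a pair is the weakest link between them.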
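  • 25. Actual model • Add a sequencing error term α. The simplified likelihood
    $$P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \theta_{k,\sigma(f_{i,k}, h_i)}$$
    becomes
    $$P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \left[ (1 - \alpha)\, \theta_{k,\sigma(f_{i,k}, h_i)} + \alpha\, \theta_{k,\bar{\sigma}(f_{i,k}, h_i)} \right]$$
    where $\bar{\sigma}$ denotes the state opposite to $\sigma$. • Define the prior distribution and optimize the parameters with the Variational Bayes EM (VBEM) algorithm.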
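The effect of the error term can be sketched by extending the per-site factor so that each observed allele is read correctly with probability 1 - α and flipped with probability α. This is a hedged illustration of the likelihood only. It does not include the prior or the VBEM updates, and the function name and fragment layout are assumptions.

```python
def fragment_likelihood_with_error(fragment, theta_a, alpha):
    """P(f_i | theta) with a per-site sequencing error rate alpha.

    fragment: dict {site index k: observed allele 0/1} (hypothetical layout)
    theta_a:  list with theta_a[k] = theta_{k,a}; theta_{k,b} = 1 - theta_a[k]
    """
    total = 0.0
    for hap in (0, 1):
        prob = 0.5
        for k, allele in fragment.items():
            # Probability of the state that agrees with the observed allele.
            match = theta_a[k] if allele == hap else 1.0 - theta_a[k]
            prob *= (1.0 - alpha) * match + alpha * (1.0 - match)
        total += prob
    return total


# A fragment that contradicts fully certain haplotypes at one site:
certain = [1.0, 1.0]
p_no_error = fragment_likelihood_with_error({0: 0, 1: 1}, certain, alpha=0.0)
p_with_error = fragment_likelihood_with_error({0: 0, 1: 1}, certain, alpha=0.028)
```

Without the error term such a fragment has probability zero, which would make the likelihood of the whole data set zero. With α > 0 it is merely unlikely.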
  • 26. Actual parameter optimization • Iterative twist operations to avoid sub-optimal solutions. ① Run Variational Bayes EM (VBEM) from the initial parameters. ② Select the site that has the smallest “connectivity”. ③ Run VBEM from the parameters twisted at that site. ④ Compare the probabilities $P(F \mid \theta)$ and $P(F \mid \theta')$ and keep the better parameters. ⑤ Repeat from step ② until the smallest connectivity exceeds 7.0.
  • 28. Dataset (Simulation data) • True data – Generate M binary heterozygous loci randomly. • Input data – Replicate each true haplotype c times and randomly divide each copy into subsequences of length between l1 and l2. Then flip each binary value in the fragments (0 to 1, or 1 to 0) with probability e.
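The simulation procedure above can be sketched as follows. The parameter names M, c, l1, l2 and e follow the slide. The function name, the `seed` argument, the dict-based fragment layout, and the handling of the last piece of each copy (it may be shorter than l1) are assumptions for this sketch.

```python
import random


def simulate(M, c, l1, l2, e, seed=0):
    """Return (hap0, fragments) for M heterozygous sites.

    Each haplotype is replicated c times, cut into consecutive pieces of
    length between l1 and l2, and each allele is flipped with probability e.
    """
    rng = random.Random(seed)
    hap0 = [rng.randint(0, 1) for _ in range(M)]
    hap1 = [1 - x for x in hap0]          # heterozygous: hap 1 is the complement
    fragments = []
    for hap in (hap0, hap1):
        for _ in range(c):
            start = 0
            while start < M:
                end = min(start + rng.randint(l1, l2), M)
                frag = {
                    k: (hap[k] ^ 1 if rng.random() < e else hap[k])
                    for k in range(start, end)
                }
                fragments.append(frag)
                start = end
    return hap0, fragments


true_hap0, sim_fragments = simulate(M=50, c=3, l1=3, l2=7, e=0.0, seed=1)
```

Because every copy is cut end to end, each site is covered by exactly 2c fragments, so c controls the coverage directly.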
  • 29. Dataset (Real data) • Input data – Data from Duitama’s work, in which fosmid pool-based NGS was performed on the HapMap trio child NA12878 from the CEU population. • True data – Haplotypes at about 82% of the sites (1.36×10^6 of 1.65×10^6) were determined by the trio-based statistical phasing carried out by the 1000 Genomes Project.
  • 30. Fosmid pool-based NGS (1) ① Genomic DNA is fragmented into pieces about 40 kilobases long, and a fosmid library is constructed. ② The fosmid clones are randomly partitioned into 32 barcoded pools.
  • 31. Fosmid pool-based NGS (2) ③ Sequencing and mapping to the reference. ④ The mapped reads form a block that corresponds to a fosmid clone. ⑤ Each block is converted into a fragment.
  • 32. True data (Trio-based data) [Figure: haplotypes of the trio: A|C|A|G, A|C|T|G, C|G|T|G, A|G|T|T, A|C|A|G, C|G|T|T]
  • 33. True data (Trio-based data) [Figure: the same trio haplotypes, together with C ? A T and A ? T G, where “?” marks an undetermined site]
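  • 34. Accuracy measures • Pairwise accuracy: a pair of sites is a consistent pair (CP) if the inferred haplotypes agree with the true haplotypes on that pair, and an inconsistent pair (IP) otherwise. [Figure: true haplotypes 0000000000 / 1111111111; inferred haplotypes 0000111000 / 1111000111] • Precision = CPs / (CPs + IPs)
  • 35. Accuracy measures • Pairwise accuracy [Figure: two panels, each showing the inferred haplotypes 0000111000 / 1111000111; left panel: precision ≈ 0.5; right panel: precision = 1.0]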
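The pairwise precision of the example on these slides can be checked with a short sketch. It assumes that a pair of sites counts as consistent when the inferred haplotype puts the two alleles in the same relative phase as the true haplotype. Only one haplotype of each pair is needed because the other is its complement. The function name is hypothetical.

```python
from itertools import combinations


def pairwise_precision(true_hap, inferred_hap):
    """Precision = CPs / (CPs + IPs) over all pairs of sites."""
    cp = ip = 0
    for j1, j2 in combinations(range(len(true_hap)), 2):
        same_in_truth = true_hap[j1] == true_hap[j2]
        same_in_inferred = inferred_hap[j1] == inferred_hap[j2]
        if same_in_truth == same_in_inferred:
            cp += 1     # consistent pair: relative phase agrees
        else:
            ip += 1     # inconsistent pair
    return cp / (cp + ip)


precision_all_pairs = pairwise_precision("0000000000", "0000111000")
```

For the slide example, 24 of the 45 site pairs are consistent, giving a precision of about 0.53, in line with the "≈ 0.5" shown for the case where all pairs are evaluated.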
  • 36. Comparison of Pairwise Accuracies [Figure panels: Simulation data; Real data] • The arrows indicate the MC thresholds.
  • 37. Comparison of Pairwise Accuracies [Figure panels: Simulation data; Real data] • Without an MC threshold, the precisions are almost the same.
  • 38. Comparison of Pairwise Accuracies [Figure panels: Simulation data; Real data] • The precision of MixSIH increases as the MC threshold is raised.
  • 39. Comparison of Pairwise Accuracies [Figure panels: Simulation data; Real data] • On the real data, the precision of MixSIH does not saturate even at high MC thresholds.
  • 40. Problem of fosmid pool-based NGS • Fosmid pool-based NGS can accidentally produce chimeric fragments. [Figure: numbered pieces ① ② ③ ④ on the homologous chromosomes]
  • 41. Remove the chimeric fragments • Calculate the “chimerity” of each fragment by comparing it with the true haplotype data. In the chimerity formula, n(f, h) is the number of sites at which the fragment f matches the true haplotype h, $f_{\le j}$ and $f_{>j}$ are the left and right parts of the fragment f divided at site j, and α0 = 0.028 is the empirical sequencing error rate. • Remove the fragments whose chimerity exceeds 10; this removes 1.65% of the fragments.
  • 42. Pairwise accuracy on the real data with the chimeric fragments removed • The precision of MixSIH reaches that of the unassembled prediction.
  • 43. Dependency of MC values on the fragment parameters • The minimal MC threshold that achieves precision ≥ 0.95 is plotted for different fragment length ranges, coverages, and error rates. • The overall scale of the MC thresholds changes only moderately around MC = 6.0, so we set the default MC threshold to 6.0.
  • 44. Spatial distribution of MC values • This is an example of the spatial distribution of the precision of MixSIH and linkage disequilibrium (LD). • SIH can accurately infer the haplotypes in regions with low LD, but there are also regions with reduced precision and high LD values. • This suggests unifying SIH with statistical phasing methods.
  • 45. Optimality of inferred parameters • Test whether our iterative optimization method succeeds in avoiding sub-optimal solutions. • The parameters converge to the global optimum upon repeating the twist operation.
  • 46. Running time • Our method applies the VBEM algorithm repeatedly and hence is rather slow. • Considering the number of heterozygous sites, we roughly estimate that MixSIH takes about 15 days to finish haplotyping chromosome-wide data.
  • 47. Conclusion • We have developed a probabilistic model for SIH and defined the minimal connectivity (MC) score. • Our algorithm can extract highly accurate haplotype regions by using MC values. • We have also found evidence that there are a small number of chimeric fragments in an existing dataset of fosmid pool-based NGS, and these fragments considerably reduce the quality of the SIH.
  • 48. Acknowledgement • Department of Computational Biology, the University of Tokyo – Hisanori Kiryu – Kiryu Lab members • Tsukasa Fukunaga • Yuki Kashihara • Risa Kawaguchi