MixSIH: a mixture model for single individual haplotyping
1. MixSIH: a mixture model for
single individual haplotyping
Hirotaka Matsumoto, Hisanori Kiryu
Department of Computational Biology
Graduate School of Frontier Sciences
The University of Tokyo
2. • Introduction
– What is a haplotype?
– What is Single Individual Haplotyping (SIH)?
– An unnoticed problem in SIH (No confidence score)
• Methods
– Probabilistic model for SIH (Simplified version)
– Confidence score based on our model
– Actual model and optimization procedure
• Results
– Dataset
– Comparison of accuracies
– Other analyses
3. Introduction
• Human somatic cells are diploid and contain
two homologous copies of each chromosome.
• The two homologous chromosomes differ at a number of
loci, such as SNPs.
[Figure: two homologous chromosomes carrying the sequences AGTTC and AATTC, which differ at one SNP site (G/A).]
4. What is a haplotype?
• A haplotype is the combination of alleles on a
single chromosome.
• If there are two heterozygous loci, there are
two possible pairs of haplotypes.
[Figure: the genotype ---A[A/G]ATG[G/T]CT--- has two heterozygous sites, so its haplotypes are either (---AAATGGCT---, ---AGATGTCT---) or (---AAATGTCT---, ---AGATGGCT---).]
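The ambiguity described on this slide can be made concrete with a small sketch that enumerates every haplotype pair consistent with an unphased genotype. The function name `haplotype_pairs` and the input format (one allele pair per heterozygous site) are assumptions made for this illustration and are not part of MixSIH.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele1, allele2) tuples, one per heterozygous site
    (hypothetical input format chosen for this sketch).
    """
    pairs = set()
    # Try both orientations of every heterozygous site.
    for flips in product([False, True], repeat=len(genotype)):
        hap0 = "".join(b if flip else a for (a, b), flip in zip(genotype, flips))
        hap1 = "".join(a if flip else b for (a, b), flip in zip(genotype, flips))
        # Sorting makes (hap0, hap1) and (hap1, hap0) count as the same pair.
        pairs.add(tuple(sorted((hap0, hap1))))
    return sorted(pairs)

# The two heterozygous sites of the slide: A/G and G/T.
genotype = [("A", "G"), ("G", "T")]
pairs = haplotype_pairs(genotype)
```

With n heterozygous sites the number of candidate pairs grows as 2^(n-1), which is why additional information such as reads spanning several sites is needed to choose among them.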
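5. The importance of haplotypes
• Haplotype information is valuable for GWAS, for
analyzing genetic structure, for studying cancer evolution, and so
on.
• A simple example
– There are two SNP loci in a gene coding region, each of which carries an independent deleterious
mutation. If both mutations lie on the same haplotype, one intact copy of the gene remains; if they lie on different haplotypes, both copies are affected.
[Figure: the genotype ---A[A/G]ATG[G/T]CT--- and its two candidate haplotype pairs, (---AAATGGCT---, ---AGATGTCT---) and (---AAATGTCT---, ---AGATGGCT---).]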
6. Approaches for haplotype inference
• It is difficult to determine haplotypes experimentally,
and there are several computational approaches for
haplotype inference.
1. Statistically construct a set of haplotypes from population
genotypes. (statistical haplotype phasing)
2. Reconstruct haplotypes from the genotypes of a pedigree.
3. Infer an individual’s haplotypes from sequenced DNA
fragments. (single individual haplotyping (SIH))
7. Infer haplotypes from sequenced DNA fragments
Single Individual Haplotyping (SIH)
[Figure: reads aligned to the reference.]
• (A)→(B)
– Extract the alleles at heterozygous sites from the
aligned reads to form fragments.
8. Infer haplotypes from sequenced DNA fragments
Single Individual Haplotyping (SIH)
• (B)→(D) (i)
– Co-occurrence of alleles in the same read (intra).
9. Infer haplotypes from sequenced DNA fragments
Single Individual Haplotyping (SIH)
• (B)→(D) (ii)
– Overlap between the reads (inter).
10. The problem and outlook of SIH
• SIH requires reads that span multiple
heterozygous loci.
– Next-generation sequencing reads are not long enough.
– Sanger sequencing is too expensive.
• This situation is changing rapidly with the
advent of new experimental techniques.
– real-time single-molecule sequencing
– fosmid pool-based next-generation sequencing
11. Important point in haplotype inference
• Haplotype information that contains errors
is likely to lead to wrong results in downstream
analyses.
– For example, in detecting recombination events.
• To use haplotype information in downstream
analyses while avoiding the harmful influence of such
errors, it is important not only to assemble
haplotypes that are as long as possible but also to provide
a means of extracting highly reliable haplotype
regions.
12. The problem of SIH algorithms
• In statistical haplotype phasing, reliable
haplotype regions are determined by selecting
blocks according to their limited haplotype diversity and their level of
linkage disequilibrium (LD).
• Although there are many algorithms for SIH, none of
them provides confidence scores for
extracting reliable haplotype regions.
• We developed an algorithm that provides
confidence scores for haplotype regions.
14. Notation
• We consider only the heterozygous sites and
represent their alleles as binary values.
• We extract the alleles at heterozygous sites from the
sequenced DNA reads and refer to the resulting allele sequences as
“fragments”.
[Figure: reads mapped to the reference.]
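The notation above can be sketched in a few lines: one aligned read is reduced to the binary alleles it shows at the heterozygous sites. This is only an illustration under stated assumptions: the alignment is ungapped, 0 encodes the reference allele and 1 the alternative allele (the slides do not specify the encoding), and the name `read_to_fragment` and the `het_sites` format are hypothetical.

```python
def read_to_fragment(read_start, read_seq, het_sites):
    """Convert one aligned read into a binary fragment.

    het_sites: list of (position, ref_allele, alt_allele), sorted by position.
    Returns {site_index: 0 or 1} for the heterozygous sites the read covers.
    Assumed encoding for this sketch: 0 = reference allele, 1 = alternative allele.
    Assumes an ungapped alignment starting at read_start.
    """
    fragment = {}
    for j, (pos, ref, alt) in enumerate(het_sites):
        offset = pos - read_start
        if 0 <= offset < len(read_seq):
            base = read_seq[offset]
            if base == ref:
                fragment[j] = 0
            elif base == alt:
                fragment[j] = 1
            # Any other base (for example a sequencing error) is skipped.
    return fragment

het_sites = [(2, "A", "G"), (7, "G", "T"), (15, "C", "A")]
frag = read_to_fragment(0, "AGATGGCTCT", het_sites)
```

Here the read covers the first two heterozygous sites but not the third, so the resulting fragment has two entries.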
15. Process behind the sequencing
• Each sequenced fragment is derived from
one of the two haplotypes.
[Figure: haplotypes and the fragments derived from them]
hap 0: 0010110000
hap 1: 1101001111
Fragments: 001011, 010110, 10100, 0110000, 1001111, 1000, 1111
16. Observed data
• Only the contents of the fragments are observed; the
haplotype states and the derivation of each fragment
are not observed.
[Figure: the observed (input) data are the fragments 001011, 010110, 10100, 0110000, 1001111, 1000 and 1111; hap 0, hap 1 and the haplotype from which each fragment is derived are unknown (?).]
17. Parameters to represent unobserved data
• We introduce the parameter θ and the latent variable Z to
represent the unobserved processes.
[Figure: θ_j describes site j of hap 0 and hap 1; Z_i describes the derivation of fragment f_i = 1001111.]
18. Parameters to represent unobserved data
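[Figure: hap 0 and hap 1 with parameter θ_j at site j, and fragment f_i = 1001111.]
1. Z_i is the latent variable that represents the derivation of f_i:
$Z_i = (1, 0)^{\top}$ if f_i is derived from hap 0, or $Z_i = (0, 1)^{\top}$ if f_i is derived from hap 1.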
19. Parameters to represent unobserved data
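[Figure: hap 0 and hap 1 with parameter θ_j at site j, and fragment f_i = 1001111.]
2. θ_j is the set of parameters that represents the state of
site j:
$$\theta_j = (\theta_{j,a}, \theta_{j,b}), \qquad \theta_{j,a} + \theta_{j,b} = 1$$
a: hap 0 carries 0 and hap 1 carries 1 at site j
b: hap 0 carries 1 and hap 1 carries 0 at site j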
20. Parameters to represent unobserved data
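[Figure: hap 0 and hap 1 with parameter θ_j at site j, and fragment f_i = 1001111.]
The probability $P(f_i \mid \theta)$ is as follows:
$$P(f_i \mid \theta) = \sum_{j=0}^{1} P(h_i = j)\, P(f_i \mid h_i = j, \theta) = \sum_{j=0}^{1} 0.5 \prod_{k \in X(f_i)} \theta_{k,\sigma(f_{i,k}, h_i)}$$
where
・ $h_i$ is the derivation of $f_i$, and $P(h_i = 0) = P(h_i = 1) = 0.5$
・ $k \in X(f_i)$ ranges over the sites that $f_i$ covers
・ $\sigma(f_{i,k}, h_i) = a$ if $(f_{i,k}, h_i) = (0,0)$ or $(1,1)$, and $\sigma(f_{i,k}, h_i) = b$ if $(f_{i,k}, h_i) = (0,1)$ or $(1,0)$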
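The fragment probability of the simplified model can be evaluated directly, as in the following minimal sketch. The dictionary-based data layout and the function names are assumptions chosen for readability, not the authors' implementation.

```python
def sigma(allele, hap):
    """Return 'a' if (allele, hap) is (0,0) or (1,1), and 'b' if it is (0,1) or (1,0)."""
    return "a" if allele == hap else "b"

def fragment_likelihood(fragment, theta):
    """P(f_i | theta) in the simplified model (no sequencing-error term).

    fragment: {site k: observed allele 0/1}
    theta:    {site k: (theta_ka, theta_kb)} with theta_ka + theta_kb = 1
    """
    total = 0.0
    for h in (0, 1):          # the fragment derives from hap 0 or hap 1
        p = 0.5               # P(h_i = h)
        for k, allele in fragment.items():
            theta_a, theta_b = theta[k]
            p *= theta_a if sigma(allele, h) == "a" else theta_b
        total += p
    return total

# Two sites that both favour state a (hap 0 carries 0, hap 1 carries 1).
theta = {0: (0.9, 0.1), 1: (0.9, 0.1)}
p_consistent = fragment_likelihood({0: 0, 1: 0}, theta)    # agrees with hap 0
p_conflicting = fragment_likelihood({0: 0, 1: 1}, theta)   # agrees with neither haplotype
```

A fragment that agrees with one of the two implied haplotypes receives a much higher probability than one that mixes them, which is the signal the optimization exploits.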
21. Parameters optimization and
haplotype inference
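$$P(F \mid \theta) = \prod_{i=1}^{N} P(f_i \mid \theta) = \prod_{i=1}^{N} \sum_{j=0}^{1} P(h_i = j)\, P(f_i \mid h_i = j, \theta)$$
• Optimizing Z and θ simultaneously is infeasible, so
we use the EM algorithm for optimization.
• The haplotype state at site j is the state whose probability
is higher than that of the other.
– If $\theta_{j,a} > \theta_{j,b}$, site j is in state a (hap 0 carries 0 and hap 1 carries 1); otherwise it is in state b.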
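The EM procedure for the simplified model can be sketched as follows. This is an illustration, not the authors' code: `theta[k]` stands for θ_{k,a}, the small pseudo-count and the way the hap 0 / hap 1 symmetry is broken (nudging the first site toward state a) are assumptions of this sketch, and the names `em_phase` and `to_fragment` are hypothetical.

```python
def em_phase(fragments, n_sites, n_iter=50, pseudo=0.01):
    """EM for the simplified model; theta[k] is theta_{k,a} and theta_{k,b} = 1 - theta[k]."""
    # Assumption: break the global hap 0 / hap 1 symmetry at site 0.
    theta = [0.9] + [0.5] * (n_sites - 1)
    for _ in range(n_iter):
        count_a = [pseudo] * n_sites
        count_b = [pseudo] * n_sites
        for frag in fragments:
            # E-step: posterior probability that the fragment derives from hap 0 or hap 1.
            weight = [0.5, 0.5]
            for h in (0, 1):
                for k, allele in frag.items():
                    weight[h] *= theta[k] if allele == h else 1.0 - theta[k]
            norm = weight[0] + weight[1]
            for h in (0, 1):
                q = weight[h] / norm
                for k, allele in frag.items():
                    if allele == h:      # (allele, h) is (0,0) or (1,1): state a
                        count_a[k] += q
                    else:                # (allele, h) is (0,1) or (1,0): state b
                        count_b[k] += q
        # M-step: renormalise the expected counts at every site.
        theta = [a / (a + b) for a, b in zip(count_a, count_b)]
    hap0 = [0 if t > 0.5 else 1 for t in theta]
    hap1 = [1 - x for x in hap0]
    return theta, hap0, hap1

def to_fragment(start, bits):
    """Helper for the example: a bit string placed at site index `start`."""
    return {start + i: int(b) for i, b in enumerate(bits)}

# Error-free fragments drawn from hap 0 = 001011 and hap 1 = 110100.
fragments = [to_fragment(0, "001"), to_fragment(2, "101"), to_fragment(3, "011"),
             to_fragment(0, "110"), to_fragment(1, "101"), to_fragment(3, "100")]
theta, hap0, hap1 = em_phase(fragments, n_sites=6)
```

Because the labels hap 0 and hap 1 are arbitrary, a different symmetry-breaking choice could return the same pair with the two haplotypes swapped.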
22. Confidence score of a site
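• The confidence of the connection of the haplotypes at site j
is calculated from the optimized parameters.
$$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$
where
$$\theta'_{k,a} = \theta_{k,a}, \quad \theta'_{k,b} = \theta_{k,b} \qquad (k < j)$$
$$\theta'_{k,a} = \theta_{k,b}, \quad \theta'_{k,b} = \theta_{k,a} \qquad (k \ge j)$$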
23. Confidence score of a site
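• This is an illustration of “connectivity”.
$$\mathrm{connectivity}(j) = \log \frac{P(F \mid \theta)}{P(F \mid \theta')} = \log \frac{P(F^{(j)} \mid \theta)}{P(F^{(j)} \mid \theta')}$$
[Figure: hap 0 and hap 1 as inferred (θ) and with the two haplotypes exchanged from site j onward (θ'); the fragments F^{(j)} are scored under both, giving P(F^{(j)} | θ) and P(F^{(j)} | θ').]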
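A minimal sketch of the connectivity score under the simplified model is given below. It assumes that the twisted parameters θ' swap states a and b at every site k ≥ j (the comparison operators are an assumption of this sketch), and it represents θ_{k,a} by `theta[k]`; the function names are hypothetical.

```python
import math

def log_likelihood(fragments, theta):
    """log P(F | theta) for the simplified model; theta[k] stands for theta_{k,a}."""
    total = 0.0
    for frag in fragments:
        p = 0.0
        for h in (0, 1):
            w = 0.5
            for k, allele in frag.items():
                w *= theta[k] if allele == h else 1.0 - theta[k]
            p += w
        total += math.log(p)
    return total

def connectivity(fragments, theta, j):
    """log P(F | theta) - log P(F | theta'), with a and b swapped at sites k >= j."""
    twisted = [t if k < j else 1.0 - t for k, t in enumerate(theta)]
    return log_likelihood(fragments, theta) - log_likelihood(fragments, twisted)

# Four sites in states a, a, b, b, i.e. hap 0 = 0011 and hap 1 = 1100.
theta = [0.99, 0.99, 0.01, 0.01]
spanning = [{0: 0, 1: 0, 2: 1}, {1: 1, 2: 0, 3: 0}]
score = connectivity(spanning, theta, 2)
# Fragments that lie entirely on one side of the twist do not change the score.
score_with_extra = connectivity(spanning + [{0: 0, 1: 0}, {2: 1, 3: 1}], theta, 2)
```

Only fragments that cover sites on both sides of the twist contribute, since the likelihood of every other fragment is identical under θ and θ'.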
24. Confidence score of a region
• We extend “connectivity” to a confidence score for
regions, the minimal connectivity (MC).
$$\mathrm{MC}(j_1, j_2) = \min_{j_1 < j \le j_2} \mathrm{connectivity}(j)$$
• MC is the minimum “connectivity” in the region.
• We tested whether reliable regions could be
extracted by using MC values.
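Given per-site connectivity values, MC and the extraction of reliable regions can be sketched as below. The index convention (`conn[j]` scores the link between sites j-1 and j, so the minimum runs over j1 < j ≤ j2) and the helper `split_blocks` are assumptions of this illustration; the connectivity values are made up for the example.

```python
def minimum_connectivity(conn, j1, j2):
    """MC(j1, j2): the smallest connectivity(j) for j1 < j <= j2."""
    return min(conn[j] for j in range(j1 + 1, j2 + 1))

def split_blocks(conn, threshold):
    """Cut the sites into blocks wherever connectivity(j) falls below the threshold.

    Every pair of sites inside a returned block (start, end) has MC >= threshold.
    conn[0] is unused because there is no link to the left of site 0.
    """
    blocks, start = [], 0
    for j in range(1, len(conn)):
        if conn[j] < threshold:
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, len(conn) - 1))
    return blocks

# Hypothetical connectivity values for five sites.
conn = [None, 12.0, 8.5, 2.1, 9.7]
mc_left = minimum_connectivity(conn, 0, 2)
mc_across = minimum_connectivity(conn, 1, 4)
blocks = split_blocks(conn, threshold=6.0)
```

Raising the threshold yields shorter but more reliable blocks, which is the trade-off examined in the results.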
25. Actual model
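• Add a sequencing error term α:
$$P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \theta_{k,\sigma(f_{i,k}, h_i)} \;\longrightarrow\; P(f_i \mid h_i, \theta) = \prod_{k \in X(f_i)} \left[ (1 - \alpha)\, \theta_{k,\sigma(f_{i,k}, h_i)} + \alpha\, \theta_{k,\bar{\sigma}(f_{i,k}, h_i)} \right]$$
where $\bar{\sigma}$ denotes the state opposite to $\sigma$.
• Define a prior distribution and optimize the parameters
with the Variational Bayes EM (VBEM) algorithm.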
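The effect of the error term can be seen in a short sketch of the conditional fragment probability. It assumes the error term mixes the matching state with the opposite state with weight α, represents θ_{k,a} by `theta_a[k]`, and uses α = 0.1 purely as an illustrative value; the prior and the VBEM updates are not shown.

```python
def fragment_likelihood_given_h(fragment, theta_a, alpha, h):
    """P(f_i | h_i = h, theta) with a sequencing-error term alpha.

    fragment: {site k: observed allele 0/1}
    theta_a:  list where theta_a[k] stands for theta_{k,a}; theta_{k,b} = 1 - theta_a[k]
    """
    p = 1.0
    for k, allele in fragment.items():
        # Probability of the state sigma(f_ik, h) selected by the observation.
        match = theta_a[k] if allele == h else 1.0 - theta_a[k]
        # With probability alpha the allele was misread, so the opposite state applies.
        p *= (1.0 - alpha) * match + alpha * (1.0 - match)
    return p

theta_a = [1.0, 1.0]             # both sites fully in state a
fragment = {0: 0, 1: 1}          # conflicts with both haplotypes at one site
p_with_error = fragment_likelihood_given_h(fragment, theta_a, alpha=0.1, h=0)
p_without_error = fragment_likelihood_given_h(fragment, theta_a, alpha=0.0, h=0)
```

Without the error term a single mismatching allele drives the probability to zero, whereas with it the fragment is merely penalised.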
26. Actual parameter optimization
• Iterative twist operations are used to avoid sub-optimal
solutions.
[Figure: flow ① → ② → ③ → ④, where the twisted parameters θ' are adopted if $P(F \mid \theta) < P(F \mid \theta')$, followed by ⑤.]
① Run Variational Bayes EM (VBEM) from the initial parameters.
② Select the site that has the smallest “connectivity”.
③ Run VBEM from the parameters twisted at that site.
④ Compare the probabilities and keep the better parameters.
⑤ Repeat from step ② until the smallest connectivity exceeds 7.0.
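The five steps can be sketched as a loop. Several simplifications are assumed here and should not be read as the authors' implementation: plain EM on the simplified model stands in for VBEM, the twist swaps states a and b at sites k ≥ j, a `max_rounds` guard is added so the loop always terminates, and all function names are hypothetical.

```python
import math

def loglik(fragments, theta):
    """log P(F | theta) for the simplified model; theta[k] stands for theta_{k,a}."""
    total = 0.0
    for frag in fragments:
        p = 0.0
        for h in (0, 1):
            w = 0.5
            for k, allele in frag.items():
                w *= theta[k] if allele == h else 1.0 - theta[k]
            p += w
        total += math.log(p)
    return total

def em(fragments, theta, n_iter=30, pseudo=0.01):
    """Plain EM (a stand-in for VBEM); pseudo-counts keep theta away from 0 and 1."""
    theta = list(theta)
    for _ in range(n_iter):
        count_a = [pseudo] * len(theta)
        count_b = [pseudo] * len(theta)
        for frag in fragments:
            w = [0.5, 0.5]
            for h in (0, 1):
                for k, allele in frag.items():
                    w[h] *= theta[k] if allele == h else 1.0 - theta[k]
            for h in (0, 1):
                q = w[h] / (w[0] + w[1])
                for k, allele in frag.items():
                    if allele == h:
                        count_a[k] += q
                    else:
                        count_b[k] += q
        theta = [a / (a + b) for a, b in zip(count_a, count_b)]
    return theta

def twist(theta, j):
    """Swap states a and b at every site k >= j (assumed twist convention)."""
    return [t if k < j else 1.0 - t for k, t in enumerate(theta)]

def optimize_with_twists(fragments, theta0, stop=7.0, max_rounds=20):
    theta = em(fragments, theta0)                                # step 1
    for _ in range(max_rounds):
        base = loglik(fragments, theta)
        conn = {j: base - loglik(fragments, twist(theta, j))
                for j in range(1, len(theta))}
        j_min = min(conn, key=conn.get)                          # step 2
        if conn[j_min] > stop:                                   # step 5
            break
        candidate = em(fragments, twist(theta, j_min))           # step 3
        if loglik(fragments, candidate) > base:                  # step 4
            theta = candidate
    return theta

# Error-free fragments from hap 0 = 001011 and hap 1 = 110100.
fragments = [{0: 0, 1: 0, 2: 1}, {2: 1, 3: 0, 4: 1}, {3: 0, 4: 1, 5: 1},
             {0: 1, 1: 1, 2: 0}, {1: 1, 2: 0, 3: 1}, {3: 1, 4: 0, 5: 0}]
theta0 = [0.9, 0.5, 0.5, 0.5, 0.5, 0.5]
theta = optimize_with_twists(fragments, theta0)
hap0 = [0 if t > 0.5 else 1 for t in theta]
```

Because a twisted candidate is accepted only when it improves the likelihood, the result is never worse than the first EM run.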
28. Dataset (Simulation data)
• True data
– Randomly generate a pair of haplotypes with M binary heterozygous loci.
• Input data
– Replicate each true haplotype c times and
randomly divide each copy into subsequences of
length between l1 and l2. Then randomly flip
each binary value of the fragments from 0 (1) to 1 (0)
with probability e.
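The simulation procedure can be sketched as follows. The parameter names M, c, l1, l2 and e follow the slide; the function name, the `seed` argument and the handling of the last piece of each copy (it may be shorter than l1) are assumptions of this illustration.

```python
import random

def simulate(M, c, l1, l2, e, seed=0):
    """Generate a true haplotype pair and noisy fragments.

    M: number of heterozygous loci, c: copies per haplotype (coverage),
    l1, l2: minimum and maximum fragment length, e: flip probability per allele.
    """
    rng = random.Random(seed)
    hap0 = [rng.randint(0, 1) for _ in range(M)]
    hap1 = [1 - x for x in hap0]          # heterozygous sites: hap 1 is the complement
    fragments = []
    for hap in (hap0, hap1):
        for _ in range(c):
            start = 0
            while start < M:
                length = rng.randint(l1, l2)
                end = min(start + length, M)   # the last piece may be shorter than l1
                frag = {}
                for k in range(start, end):
                    allele = hap[k]
                    if rng.random() < e:       # sequencing error: flip 0 <-> 1
                        allele = 1 - allele
                    frag[k] = allele
                fragments.append(frag)
                start = end
    return hap0, hap1, fragments

hap0, hap1, frags = simulate(M=50, c=3, l1=3, l2=7, e=0.0, seed=1)
```

With e = 0 every fragment matches one of the two true haplotypes exactly, which gives a convenient sanity check before adding noise.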
29. Dataset (Real data)
• Input data
– Data from Duitama et al., who conducted fosmid pool-based
NGS for the HapMap trio child NA12878 from
the CEU population.
• True data
– Haplotypes at about 82% (1.36*10^6/1.65*10^6) of the
sites have been determined by a trio-based statistical
phasing method in the 1000
Genomes Project.
30. Fosmid pool-based NGS (1)
[Figure: steps ① and ②.]
① Genomic DNA is fragmented into pieces of about 40 kilobases, from which a fosmid library is constructed.
② Fosmid clones are randomly partitioned into 32 barcoded pools.
31. Fosmid pool-based NGS (2)
[Figure: steps ③ to ⑤; reads mapped to the reference form a block, which is converted into a fragment of alleles.]
③ Sequencing and mapping.
④ The mapped reads form a block that corresponds to a fosmid clone.
⑤ Convert each block into a fragment.
36. Comparison of Pairwise Accuracies
[Figure: pairwise accuracies on the simulation data and on the real data.]
• The arrows indicate the MC thresholds.
37. Comparison of Pairwise Accuracies
[Figure: pairwise accuracies on the simulation data and on the real data.]
• Without an MC threshold, the precisions are almost the same.
38. Comparison of Pairwise Accuracies
[Figure: pairwise accuracies on the simulation data and on the real data.]
• The precision of MixSIH increases as the MC
threshold is raised.
39. Comparison of Pairwise Accuracies
[Figure: pairwise accuracies on the simulation data and on the real data.]
• On the real data, the precision of MixSIH does not saturate even at high
MC thresholds.
40. Problem of fosmid pool-based NGS
[Figure: segments ① to ④ of a pair of homologous chromosomes and a fragment that combines them in a different order.]
• Fosmid pool-based NGS can
accidentally produce chimeric fragments.
41. Remove the chimeric fragments
• Calculate the “chimerity” of each fragment by
comparing it with the true haplotype data.
where n(f,h) is the number of sites at which the fragment f matches the true haplotype h,
f<=j and f>j represent the left and right parts of the fragment f divided at site j, and α0=0.028
is the empirical sequencing error rate.
• Remove the fragments whose chimerity is over 10;
with this threshold, 1.65% of the
fragments were removed.
42. Pairwise accuracy on the real data in which
the chimeric fragments are removed
• The precision of MixSIH reaches that of unassembled prediction.
43. Dependency of MC values on
the fragment parameters
• The minimal MC threshold that achieves precision
>= 0.95 is plotted for different fragment length ranges,
coverages, and error rates.
• The MC thresholds change only moderately around MC = 6.0 across these settings, so we set the default MC threshold
to 6.0.
44. Spatial distribution of MC values
• This is an example of the spatial distribution of the
precision of MixSIH and linkage disequilibrium (LD).
• SIH can accurately infer the haplotypes in regions with low LD,
but there are also regions with reduced precision and high LD
values.
→ Unify SIH and statistical phasing methods.
45. Optimality of inferred parameters
• Test whether our iterative optimization method
succeeds in avoiding sub-optimal solutions.
• The parameters converge to the global optimum
upon repeating the twist operation.
46. Running time
• Our method applies the VBEM algorithm repeatedly and
hence is rather slow.
• Considering the number of heterozygous sites, it is roughly estimated that
MixSIH takes about 15 days to finish haplotyping for chromosome-wide
data.
47. Conclusion
• We have developed a probabilistic model for SIH and
defined the minimal connectivity (MC) score.
• Our algorithm can extract highly accurate haplotype
regions by using MC values.
• We have also found evidence that there are a small number of
chimeric fragments in an existing dataset of fosmid pool-
based NGS, and these fragments considerably reduce the
quality of the SIH.
48. Acknowledgement
• Department of Computational Biology, the University of Tokyo
– Hisanori Kiryu
– Kiryu Lab. Members
• Tsukasa Fukunaga
• Yuki Kashihara
• Risa Kawaguchi