6. Dilution-based sequencing
• SIH needs long DNA sequencing reads
• Dilution-based sequencing can produce long reads
– Fosmid pool-based NGS
– Long fragment technology
– Dilution-amplification-based sequencing
7. Process of dilution-based sequencing
(i) DNA fragments are separated into multiple low-concentration dilutions.
(ii) After sequencing and mapping an aliquot, mapped reads form clusters, which correspond to DNA fragments.
(iii) Clusters are merged into read fragments (SNP fragments).
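Step (ii) can be illustrated with a minimal Python sketch. It assumes the reads of one aliquot are already mapped to a single chromosome and are given as (start, end) pairs. The function name `cluster_reads` and the `max_gap` threshold are hypothetical and are not taken from the slides.

```python
from typing import List, Tuple

def cluster_reads(reads: List[Tuple[int, int]], max_gap: int = 1000) -> List[Tuple[int, int]]:
    """Group mapped reads (start, end) into clusters on one chromosome.

    A read joins the current cluster when it starts at most `max_gap` bases
    after the cluster's current end; otherwise it opens a new cluster.
    `max_gap` is an assumed tuning parameter.
    """
    clusters: List[Tuple[int, int]] = []
    for start, end in sorted(reads):
        if clusters and start - clusters[-1][1] <= max_gap:
            prev_start, prev_end = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end))  # extend the cluster
        else:
            clusters.append((start, end))  # open a new cluster
    return clusters

reads = [(100, 200), (150, 250), (900, 1000), (5000, 5100), (5050, 5200)]
print(cluster_reads(reads))  # [(100, 1000), (5000, 5200)]
```

Each returned interval stands for one inferred DNA fragment. Position-based clustering of this kind cannot tell whether the reads inside an interval come from one fragment or from several overlapping ones.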
8. Chimeric fragment (CF)
• Problem of producing chimeric fragments (CFs)
– Reads with different chromosomal origins are regarded as one cluster and merged into a single fragment when an aliquot happens to contain multiple long DNA fragments derived from the same region.
– CFs significantly decrease the accuracy of SIH.
10. Detection of CFs
• Basis of our strategy
– CFs correspond to artificially recombinant haplotypes and differ from the biological haplotypes in the population.
11. PHASE
• Statistical phasing method
– Infers haplotypes from a population.
– The diversity of haplotypes is limited, and there are conserved haplotypes.
• We use PHASE to obtain the haplotype candidates.
– Example of output: a haplotype candidate and its probability.
12. CF detection model
• We model the probabilities that a SNP fragment is a normal fragment and that it is a chimeric fragment.
• With these probabilities, we develop an indicator, “CSP”, which evaluates the chimerity of a SNP fragment.
13. NF probability
• NF probability
– The probability that a SNP fragment is a normal fragment (NF).
– Calculate the consistency between the statistically phased haplotypes and the fragment.
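The slides do not give the formula, so the following Python sketch shows only one plausible reading of "consistency": add up the probabilities of the haplotype candidates that contain a haplotype agreeing with the fragment at every covered site. The name `nf_score`, the `-` symbol for uncovered sites, and the candidate list with its probabilities are illustrative assumptions, not PHASE output or the authors' definition.

```python
from typing import List, Tuple

Candidate = Tuple[Tuple[str, str], float]  # ((haplotype 1, haplotype 2), probability)

def is_consistent(fragment: str, haplotype: str) -> bool:
    """True if the fragment agrees with the haplotype at every covered site ('-' = uncovered)."""
    return all(f == "-" or f == h for f, h in zip(fragment, haplotype))

def nf_score(fragment: str, candidates: List[Candidate]) -> float:
    """Total probability of candidates in which one haplotype matches the whole fragment (assumed score)."""
    return sum(
        prob
        for (hap1, hap2), prob in candidates
        if is_consistent(fragment, hap1) or is_consistent(fragment, hap2)
    )

# Made-up candidates for one window of five SNP sites.
candidates = [
    (("00000", "11111"), 0.7),
    (("00011", "11100"), 0.2),
    (("00111", "11000"), 0.1),
]
print(nf_score("00000", candidates))  # 0.7
print(nf_score("00011", candidates))  # 0.2
```

Under this assumed scoring, a fragment that matches only improbable candidates receives a low score.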
14. CF probability
• CF probability
– The probability that a SNP fragment is a chimeric fragment.
– Left and right parts are derived from different haplotypes.
15. CSP
• Chimerity based on statistical phasing (CSP)
• A low CSP value means
– the fragment corresponds to a recombinant of statistically phased haplotypes.
– the fragment is suspected of being a CF.
16. Sliding-window approach
• The running time of PHASE increases with SNP fragment size.
– The complexity of population haplotypes increases exponentially.
• We use a sliding-window approach (W=5).
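The exponential growth follows from the fact that n biallelic SNP sites admit up to 2^n distinct haplotypes, so a fixed window keeps the search space bounded (2^5 = 32 for W=5). A minimal Python sketch of the window enumeration is given below. The step size of 1 and the handling of fragments shorter than W are assumptions, since the slides state only the window width.

```python
from typing import List, Tuple

def sliding_windows(n_sites: int, width: int = 5, step: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of `width` consecutive SNP sites.

    Fragments with at most `width` sites yield one window covering everything
    (an assumed convention). With step > 1 the last sites may stay uncovered.
    """
    if n_sites <= width:
        return [(0, n_sites)]
    return [(s, s + width) for s in range(0, n_sites - width + 1, step)]

fragment = "0001101"  # a made-up SNP fragment with seven sites
for start, end in sliding_windows(len(fragment)):
    print(start, end, fragment[start:end])
# 0 5 00011
# 1 6 00110
# 2 7 01101

print(2 ** 5)  # 32 possible haplotypes per window of five biallelic sites
```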
18. Dataset
• Dilution-based sequencing
– Kaper’s data
– Duitama’s data
• True haplotypes
– Trio-based haplotypes
• True NFs and CFs
– Defined by true haplotypes
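How a fragment might be labelled against the true haplotypes can be sketched as follows. The slides do not state the exact rule, so this Python example assumes binary alleles, heterozygous sites only, and a label of CF whenever the haplotype of origin changes along the fragment. `label_fragment` and the `-` symbol for uncovered sites are hypothetical.

```python
from typing import List

def site_origins(fragment: str, hap_a: str, hap_b: str) -> List[str]:
    """For each covered site, report which true haplotype ('A' or 'B') carries the observed allele."""
    origins = []
    for f, a, b in zip(fragment, hap_a, hap_b):
        if f == "-":  # site not covered by the fragment
            continue
        if a == b:
            raise ValueError("only heterozygous sites are supported in this sketch")
        if f == a:
            origins.append("A")
        elif f == b:
            origins.append("B")
        else:
            raise ValueError("allele matches neither true haplotype")
    return origins

def count_switches(fragment: str, hap_a: str, hap_b: str) -> int:
    """Number of changes of haplotype origin between consecutive covered sites."""
    origins = site_origins(fragment, hap_a, hap_b)
    return sum(1 for x, y in zip(origins, origins[1:]) if x != y)

def label_fragment(fragment: str, hap_a: str, hap_b: str) -> str:
    """'NF' if one true haplotype explains every covered site, otherwise 'CF' (assumed rule)."""
    return "NF" if count_switches(fragment, hap_a, hap_b) == 0 else "CF"

print(label_fragment("11111", "00000", "11111"))  # NF
print(label_fragment("00011", "00000", "11111"))  # CF
print(label_fragment("0-0-0", "00000", "11111"))  # NF
```

A single sequencing error inside a fragment also produces switches under this rule, so a real labelling procedure would probably need to distinguish isolated mismatches from a sustained change of origin.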
19. CSP distribution
• The CSP of CFs is lower than that of NFs.
Theoretical lowest value (W=5): the haplotype origin changes at the second or third site.
Fragment: 00011
Haplotypes: 00000 / 11111
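The example can be reproduced by listing every single-switch recombinant of the two haplotypes in a five-site window. In this Python sketch, k is the number of leading sites taken from one haplotype before the origin changes. This indexing convention and the function name are assumptions, and the sketch does not compute CSP itself.

```python
from typing import Dict, Tuple

def single_switch_recombinants(hap_a: str, hap_b: str) -> Dict[int, Tuple[str, str]]:
    """Map each switch position k to the two recombinants that change origin after k sites."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return {
        k: (hap_a[:k] + hap_b[k:], hap_b[:k] + hap_a[k:])
        for k in range(1, len(hap_a))
    }

recombinants = single_switch_recombinants("00000", "11111")
for k, pair in recombinants.items():
    print(k, pair)
# 1 ('01111', '10000')
# 2 ('00111', '11000')
# 3 ('00011', '11100')
# 4 ('00001', '11110')
```

The fragment 00011 from the slide appears at k=3: its first three sites come from 00000 and its last two from 11111.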