3. Introduction
Polymorphism: a locus (one or more base pairs) at
which different alleles exist among the individuals
of a population
– The second most frequent allele must appear in
at least 1% of the individuals
SNP: polymorphism in a single base pair
position
SNP discovery is important for
understanding complex diseases
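The 1% rule above can be written as a small check. This is a minimal sketch; the function name `is_polymorphic` and the sample data are hypothetical and only illustrate the definition.

```python
from collections import Counter

def is_polymorphic(alleles, min_minor_freq=0.01):
    """True if the second most frequent allele reaches min_minor_freq (1% by default)."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return False  # a single allele: the locus is not polymorphic
    second_count = counts[1][1]
    return second_count / len(alleles) >= min_minor_freq

# Hypothetical sample: the base seen at one position in 200 individuals.
observed = ["A"] * 197 + ["G"] * 3
print(is_polymorphic(observed))
```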
4. HIV Dataset
HIV genetic sequences:
– 1302 bp
– Well-conserved region
35 batches from 35 individuals:
– 6 PCR reads, with average size of 690bp
– 1 validated sequence, with manually annotated
SNPs
HIV Reference Sequence
6. Trimming Procedure
Low Quality Ends filtering
Converts phred's quality sequence into an error
probability sequence:
⇒ Q = -10 × log10(p), i.e. p = 10^(-Q/10)
Offsets every error probability by the 0.05 cutoff (equivalent to Q = 13)
Maximum Score Subsequence Algorithm
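The trimming steps can be sketched as follows. The slide does not state the sign convention, so this sketch assumes each base is scored as 0.05 - p (bases better than Q = 13 score positive) and keeps the contiguous run with the highest total score. The names `phred_to_error_prob` and `trim_read` and the sample qualities are hypothetical.

```python
def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def trim_read(qualities, cutoff=0.05):
    """Return (start, end), the half-open interval of the read to keep.

    Assumption: score = cutoff - p per base, then a maximum score
    subsequence scan (Kadane-style) over those scores.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(qualities):
        run_sum += cutoff - phred_to_error_prob(q)
        if run_sum <= 0:
            # The running segment is no longer worth extending: restart after i.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end

quals = [8, 9, 30, 35, 40, 38, 12, 33, 36, 7, 6]  # hypothetical phred qualities
start, end = trim_read(quals)
print(start, end, quals[start:end])
```

A single mediocre base inside a good region (the Q = 12 above) does not split the kept segment, because its small negative score is outweighed by the surrounding bases.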
7. Base Calling: Area Ratio
Base calling is performed in 5 steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks area
5. Calculation of the polymorphic/reference peak area ratio
If the calculated ratio is above a certain threshold, the
point is considered a polymorphism.
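A rough sketch of the five steps, under several assumptions not stated on the slide: traces are lists of non-negative intensities for one channel each, a peak is a local maximum, a peak's area is the sum of the samples between its two neighbouring valleys, and the window size and threshold are hypothetical. All function names are illustrative.

```python
def find_peaks(trace, lo, hi):
    """Step 2: indices of local maxima of trace within [lo, hi)."""
    return [i for i in range(max(lo, 1), min(hi, len(trace) - 1))
            if trace[i - 1] < trace[i] >= trace[i + 1]]

def nearest(peaks, position):
    """Step 3: the peak closest to the expected base position."""
    return min(peaks, key=lambda i: abs(i - position))

def peak_area(trace, peak):
    """Step 4: sum the samples from the valley left of the peak to the valley right of it."""
    left = peak
    while left > 0 and trace[left - 1] < trace[left]:
        left -= 1
    right = peak
    while right < len(trace) - 1 and trace[right + 1] < trace[right]:
        right += 1
    return sum(trace[left:right + 1])

def area_ratio(ref_trace, alt_trace, position, half_window=5):
    """Ratio of the candidate (polymorphic) peak area to the reference peak area."""
    lo, hi = position - half_window, position + half_window + 1  # step 1
    ref_peaks = find_peaks(ref_trace, lo, hi)
    alt_peaks = find_peaks(alt_trace, lo, hi)
    if not ref_peaks or not alt_peaks:
        return 0.0  # no competing peak in the window
    ref_area = peak_area(ref_trace, nearest(ref_peaks, position))
    alt_area = peak_area(alt_trace, nearest(alt_peaks, position))
    return alt_area / ref_area  # step 5

# Hypothetical traces for the reference base and a second base at the same position.
ref = [0, 1, 4, 9, 12, 9, 4, 1, 0, 0, 0]
alt = [0, 0, 1, 3, 6, 3, 1, 0, 0, 0, 0]
THRESHOLD = 0.3  # hypothetical; the slide does not give the value used
ratio = area_ratio(ref, alt, position=4, half_window=3)
print(ratio, ratio > THRESHOLD)
```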
10. Base Calling: Average Height Ratio
Almost the same steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks average height
5. Calculation of the polymorphic/reference peak average
height ratio.
Again, if the calculated ratio is above a certain
threshold, the point is considered a polymorphism.
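The height-based variant can be sketched in the same spirit. The slide does not define "average height", so this sketch assumes it is the mean of the samples within `half_width` of the peak apex; that definition, the function names, and the sample traces are all hypothetical.

```python
def nearest_peak(trace, position, half_window):
    """Local maximum closest to position within the window, or None if there is none."""
    lo = max(position - half_window, 1)
    hi = min(position + half_window, len(trace) - 2)
    peaks = [i for i in range(lo, hi + 1)
             if trace[i - 1] < trace[i] >= trace[i + 1]]
    return min(peaks, key=lambda i: abs(i - position)) if peaks else None

def average_height(trace, peak, half_width=1):
    """Mean intensity of the samples around the apex (assumed definition)."""
    samples = trace[max(peak - half_width, 0):peak + half_width + 1]
    return sum(samples) / len(samples)

def average_height_ratio(ref_trace, alt_trace, position, half_window=5, half_width=1):
    """Ratio of the candidate peak's average height to the reference peak's."""
    ref_peak = nearest_peak(ref_trace, position, half_window)
    alt_peak = nearest_peak(alt_trace, position, half_window)
    if ref_peak is None or alt_peak is None:
        return 0.0
    return (average_height(alt_trace, alt_peak, half_width)
            / average_height(ref_trace, ref_peak, half_width))

ref = [0, 1, 4, 9, 12, 9, 4, 1, 0, 0, 0]  # hypothetical reference-base trace
alt = [0, 0, 1, 3, 6, 3, 1, 0, 0, 0, 0]   # hypothetical second-base trace
print(average_height_ratio(ref, alt, position=4, half_window=3))
```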
12. Filter Algorithm
Analyzes each sequence
Uses a window-based algorithm to eliminate
adjacent SNPs
– Window size: 11 bases
– Empirical scoring system applied to the polymorphisms
in the window
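The empirical scores are not given on the slide, so the sketch below substitutes a deliberately simple stand-in: a candidate's score is the number of candidates inside the 11-base window centred on it, and candidates in crowded windows are dropped. The function name and the `max_per_window` limit are hypothetical.

```python
def filter_adjacent_snps(snp_positions, window=11, max_per_window=2):
    """Drop candidate SNPs that sit in a window crowded with other candidates.

    Stand-in scoring (an assumption): score = number of candidates,
    including the candidate itself, within window // 2 bases of it.
    """
    half = window // 2
    positions = sorted(snp_positions)
    kept = []
    for p in positions:
        score = sum(1 for q in positions if abs(q - p) <= half)
        if score <= max_per_window:
            kept.append(p)
    return kept

candidates = [40, 100, 102, 104, 105, 230, 234]  # hypothetical read positions
print(filter_adjacent_snps(candidates))
```

Clusters of calls within a few bases of each other are typical of low-quality stretches of a read, which is the usual motivation for this kind of filter.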
13. Consensus Algorithm
Rule-based algorithm
– Empirical rules
Analyzes the whole cross section to define a
consensus
– Takes into account nucleotide frequencies and
qualities
Does not create N symbols or tri-allelic
polymorphisms.
14. Consensus Algorithm: Example
Sequence 1   A25   C30   C18   C30   A21
Sequence 2   A30   C25   C15   C25   A16
Sequence 3   -     M18   A9    C30   -
Sequence 4   -     -     S12   G17   T18
Consensus    A     M     S     S     W
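The authors' empirical rules are not listed, so the following is only one simple rule set that happens to be consistent with the example: calls below a hypothetical quality of 10 are ignored, IUPAC two-base codes are expanded, qualities are summed per base, and at most the two best-supported bases are reported. The names `consensus_column` and `min_quality` are assumptions.

```python
# IUPAC codes limited to one or two bases, so no N and no tri-allelic result can be emitted.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT"}
CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def consensus_column(calls, min_quality=10):
    """calls: list of (symbol, quality) for one alignment column, gaps omitted."""
    def support(threshold):
        totals = {}
        for symbol, quality in calls:
            if quality >= threshold:
                for base in IUPAC[symbol]:
                    totals[base] = totals.get(base, 0) + quality
        return totals

    # Fall back to all calls if none passes the threshold, so no N is needed.
    totals = support(min_quality) or support(0)
    if not totals:
        return "-"  # column contains only gaps
    top_two = sorted(totals, key=lambda b: (-totals[b], b))[:2]
    return CODE[frozenset(top_two)]

rows = [  # the four sequences of the example; None marks a gap
    [("A", 25), ("C", 30), ("C", 18), ("C", 30), ("A", 21)],
    [("A", 30), ("C", 25), ("C", 15), ("C", 25), ("A", 16)],
    [None, ("M", 18), ("A", 9), ("C", 30), None],
    [None, None, ("S", 12), ("G", 17), ("T", 18)],
]
columns = [[call for call in col if call is not None] for col in zip(*rows)]
print("".join(consensus_column(col) for col in columns))
```

With these assumed rules the A9 call in the third column is discarded as low quality, and the result matches the consensus row of the example.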
15. Tests Protocol: Third Party Packages
Two external packages were used for comparison with our results:
– Polybayes: SNP detection tool based on Bayesian
methods
– Polyphred: SNP detection tool based on chromatogram
analysis
ACE file (contig and consensus) created for each
batch using phrap
ACE file analyzed by Polyphred and Polybayes
Results viewed with consed
16. Tests Protocol: Our strategy
Reads trimmed using the Maximum Score
Subsequence Algorithm
Base-calling analysis and correction using the
algorithms described previously
SNP filtering
Multiple alignment
– Reference sequence as anchor
Consensus creation
17. Third Party Results: Polybayes
Polybayes detected SNPs in only 2 batches out of 35
Batch      Existing SNPs   Detected SNPs   Correct SNPs   False Positives   False Negatives
Batch 13   12              1               1              0                 11
Batch 15   5               1               0              1                 5
18. Third Party Results: Polyphred
Polyphred detected SNPs in only 4 batches out of 35
Batch      Existing SNPs   Detected SNPs   Correct SNPs   False Positives   False Negatives
Batch 07   10              1               0              1                 10
Batch 14   4               3               0              3                 4
Batch 32   26              1               0              1                 26
Batch 35   15              8               1              7                 14
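The counts in the Polybayes and Polyphred tables can be summarised as precision (correct / detected) and recall (correct / existing). The sketch below uses only the rows shown on the slides, so the pooled figures cover just the batches in which each tool reported any SNP; the helper name `precision_recall` is hypothetical.

```python
def precision_recall(existing, detected, correct):
    """Precision = correct / detected, recall = correct / existing."""
    precision = correct / detected if detected else 0.0
    recall = correct / existing if existing else 0.0
    return precision, recall

# (existing, detected, correct) per batch, copied from the two tables.
results = {
    "Polybayes": {"Batch 13": (12, 1, 1), "Batch 15": (5, 1, 0)},
    "Polyphred": {"Batch 07": (10, 1, 0), "Batch 14": (4, 3, 0),
                  "Batch 32": (26, 1, 0), "Batch 35": (15, 8, 1)},
}

pooled = {}
for tool, rows in results.items():
    # Sum each column over the listed batches, then compute the pooled metrics.
    existing, detected, correct = (sum(col) for col in zip(*rows.values()))
    pooled[tool] = precision_recall(existing, detected, correct)
    print(f"{tool}: precision {pooled[tool][0]:.2f}, recall {pooled[tool][1]:.2f}")
```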
19. Trimming Results
Reads average size:
– Before trimming: 690.15bp
– After trimming: 374.74bp
– Reduction of 45%
Reference sequence average base coverage
– Before trimming: 2.69
– After trimming: 1.77
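The reductions implied by these figures can be checked with a one-line helper (the name `relative_reduction` is hypothetical):

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks from before to after."""
    return (before - after) / before

read_length = relative_reduction(690.15, 374.74)  # average read size in bp
coverage = relative_reduction(2.69, 1.77)         # average base coverage
print(f"read length: {read_length:.1%}, coverage: {coverage:.1%}")
```

The read-length reduction comes out at about 45.7%, in line with the 45% quoted above, and the average coverage drops by about 34%.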
24. Discussion
Polybayes and Polyphred need large data sets to
produce good results
Our algorithm produces satisfactory results
given the characteristics of the data:
– Low average coverage
– Large number of low-quality bases
– Large number of polymorphisms (viral DNA)
The Area Ratio strategy produces better results than
the Average Height Ratio strategy
25. Future Work
Test the algorithms with larger batches
and higher average coverage, to improve
the consensus algorithm
Reproduce the experiments using genetic
sequences from organisms with more conserved
genomes, such as mammals