New Strategy to detect SNPs

New Strategy to detect SNPs
Miguel Galves
José Augusto Quitzau
Zanoni Dias
Scylla Bioinformatics –Brazil
{miguel,jquitzau,zanoni}@scylla.com.br

Agenda
 Introduction
 HIV Dataset
 Detection Strategy
 Trimming Procedure
 Base-Calling Strategies
 Filter Algorithm
 Consensus Algorithm
 Tests Protocol
 Results
 Discussion

Introduction
 Polymorphism: set of base pair locus at
which different alleles exists in individuals in
some population
– The second most frequent allele must appear in
at least 1% of the individuals
 SNP: polymorphism in a single base pair
position
 SNP discovery is very important to
understand complex diseases

HIV Dataset
 HIV genetic sequences:
– 1302 bp
– Well-conserved region
 35 batches from 35 individuals:
– 6 PCR reads, with average size of 690bp
– 1 validated sequence, with manually annotated
SNPs
 HIV Reference Sequence

Detection Strategy: Survey
 Trimming Procedure
 Base-Calling Correction
 SNPs Filter
 Batch Consensus Algorithm

Trimming Procedure
 Low Quality Ends filtering
 Converts phred’s quality sequence to error
probability sequence:
⇒ Q = -10 x log10(p)
 Subtract 0.05 from all values (Q=13)
 Maximum Score Subsequence Algorithm

Base Calling: Area Ratio
 The base calling is made in 5 Steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks area
5. Calculation of the polymorphic/reference peak area
 If the calculated ratio is above a certain threshold, the
point is considered a polymorphism.

Base Calling: Area Delimitation

Base Calling: Peak Identification

Base Calling: Average Height Ratio
 Almost the same steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks average height
5. Calculation of the polymorphic/reference peak average
height.
 Again, if the calculated ratio is above a certain
threshold, the point is considered a polymorphism.

Filter Algorithm
 Analyzes each sequence
 Uses a window based algorithm to eliminate
adjacents SNPs
– Window size: 11 bases
– Empirical score system assigned to polymorphism
in the window

Consensus Algorithm
 Rule-based algorithm
– Empirical rules
 Analyzes the whole cross section to define a
consensus
– Take account of nucleotide frequencies and
qualities
 Do not create N symbols, nor tri-allelic
polymorphisms.

Consensus Algorithm: Example
Sequence 1 A25 C30 C18 C30 A21
Sequence 2 A30 C25 C15 C25 A16
Sequence 3 - M18 A9 C30 -
Sequence 4 - - S12 G17 T18
Consensus A M S S W

Tests Protocol: Third Party Packages
 Two external packages used to compare our results:
– Polybayes: SNP detection tool based on Bayesian
Methods
– Polyphred: SNP detection tool based on chromatogram
analysis
 ACE file (contig and consensus) created for each
batch using phrap
 ACE file analyzed by Polyphred and Polybayes
 Results viewed with consed

Tests Protocol: Our strategy
 Reads trimmed using Maximum
Subsequence Algorithm
 Base-calling analysis and correction using
algorithms describe previously
 SNP filtering
 Multiple alignment
– Reference sequence as anchor
 Consensus creation

Third Party Results: Polybayes
 Polybayes detected SNPs in only 2 batches out of 35
Batch Existing
SNPs
Detected
SNPs
Correct
SNPs
False
Positives
False
Negatives
Batch 13 12 1 1 0 11
Batch 15 5 1 0 1 5

Third Party Results: Polyphred
 Polyphred detected SNPs in only 4 batches out of 35
Batch Existing
SNPs
Detected
SNPs
Correct
SNPs
False
Positives
False
Negatives
Batch 07 10 1 0 1 10
Batch 14 4 3 0 3 4
Batch 32 26 1 0 1 26
Batch 35 15 8 1 7 14

Trimming Results
 Reads average size:
– Before trimming: 690.15bp
– After trimming: 374.74bp
– Reduction of 45%
 Reference sequence average base coverage
– Before trimming: 2.69
– After trimming: 1.77

Results: True Positive (%) x batch

Results: False Negative (%) x batch

Results: False Positive (%) x batch

Results: Summary
Polybayes Polyphred Area Avg. Height
Avg SD Avg SD Avg SD Avg SD
TP 0.3 1.4 0.2 1.1 75.4 19.2 52.6 21.5
FN 99.7 1.4 99.8 1.1 23.2 18.4 45.6 21.7
DP 0.0 0.0 0.0 0.0 1.4 4.3 1.8 4.0
FP 2.9 16.9 11.1 31.3 393.9 312.3 554.4 511.3
TP + FN + DP = 100%

Discussion
 Polybayes and Polyphred need large sets of data to
produces good results
 Our algorithm produces quite satisfactory results
taking into account data characteristics:
– Low average coverage
– High amount of low quality bases
– High amount of polymorphisms (virus DNA)
 Area Ratio strategy produces better results than
Average Height strategy

Future Work
 Test the algorithms whith larger batches,
whith higher average coverage, to improve
consensus algorithm
 Reproduce the experiments using genetic
sequences of more conserved life forms,
such as mammals

New Strategy to detect SNPs

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (15)

Semelhante a New Strategy to detect SNPs

Semelhante a New Strategy to detect SNPs (20)

Mais de Miguel Galves

Mais de Miguel Galves (9)

Último

Último (20)

New Strategy to detect SNPs