Milko stat seq_toulouse
1. Detection and correction of errors in metagenomic 16S RNA parallel sequencing
Milko Krachunov (2), Ivan Popov (1), Valeria Simeonova (2), Irena Avdjieva (1), Paweł Szczęsny (3), Urszula Zelenkiewicz (3), Piotr Zelenkiewicz (3), Dimitar Vassilev (1)
(1) Bioinformatics group, AgroBioInstitute, Bulgaria
(2) Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria
(3) Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
2. NGS errors – common problems
Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin;
The same sample cannot be re-sequenced in metagenomic studies;
The error rate tends to increase at every step of the process;
There is no easy way to differentiate between a “sequencing error” and a “rare variant”;
Many existing methods and algorithms address different aspects of the problem, but no unified solution is available;
Large amounts of data are difficult to process with common software.
3. Significance of 16S RNA sequencing
Highly conserved between different species of bacteria and archaea;
Sequence analysis is done with universal PCR primers;
Contains hypervariable regions that can provide species-specific signature sequences;
Suitable for phylogenetic studies;
Suitable for metagenomic studies.
4. General approach in metagenomic biodiversity studies
454 Sequencing
Filtering / Denoising
Multiple alignment
Distance matrix
OTU clusters with abundance count
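The last two stages of this pipeline (distance matrix, then OTU clusters with abundance counts) can be illustrated with a minimal sketch. It assumes the reads are already aligned to equal length, uses a simple p-distance (fraction of differing columns), and groups reads with a greedy seed-based clustering. The function names, the greedy strategy and the 3% default threshold are assumptions made for illustration, not the tools or settings used in this project.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Fraction of differing columns between two aligned sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: List[str]) -> List[List[float]]:
    """Full pairwise p-distance matrix for a list of aligned reads."""
    n = len(aligned)
    return [[p_distance(aligned[i], aligned[j]) for j in range(n)]
            for i in range(n)]


def greedy_otus(aligned: List[str], threshold: float = 0.03) -> Dict[str, int]:
    """Greedy clustering: a read joins the first OTU whose seed is within
    `threshold`; otherwise it becomes the seed of a new OTU.

    Returns a mapping from seed sequence to abundance count.
    The 0.03 default is an assumed, conventional value.
    """
    seeds: List[int] = []
    abundance: Dict[int, int] = {}
    for i, seq in enumerate(aligned):
        for s in seeds:
            if p_distance(seq, aligned[s]) <= threshold:
                abundance[s] += 1
                break
        else:
            seeds.append(i)
            abundance[i] = 1
    return {aligned[s]: abundance[s] for s in seeds}
```

The greedy variant depends on input order, which is one reason dedicated clustering tools are normally preferred for real data.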
6. A. Raw data characteristics and processing
Two separate runs of metagenomic 16S RNA fragments, sequenced on the 454 platform and converted to FASTA format:
run 02 – 46429 short reads
run 04 – 41386 short reads
Our task – extract, denoise and correct only the high-quality reads.
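As a rough illustration of the extraction step, the sketch below parses FASTA records and keeps only reads that pass a simple filter. The slides do not state the quality criteria, so the minimum length and the "only A, C, G, T" rule are purely hypothetical placeholders, as are the function names.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_quality_reads(records: Iterable[Tuple[str, str]],
                       min_length: int = 50) -> Iterator[Tuple[str, str]]:
    """Keep reads that are long enough and contain only A, C, G, T.

    Both criteria are assumed for illustration only.
    """
    for header, seq in records:
        if len(seq) >= min_length and set(seq) <= set("ACGT"):
            yield header, seq
```

With a real file, the records could be read with `parse_fasta(open(path))`, where `path` points to one of the converted runs.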
11. Aim of the method – idea outline
To deal with the heterogeneous nature of the data, similar or related sequences are given more importance in the error evaluation.
The naïve approach: if a base occurs less often than the sequencer error rate, assume it is likely an error and replace it with the most common base.
Our modification: calculate the occurrence of the base in reads that are similar in the given region, and either assign those reads bigger weights or use them exclusively.
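The naïve approach described on this slide can be sketched as follows. The sketch assumes the reads are already aligned to equal length and treats the gap character like any other symbol; the function name and the way the error rate is passed in are illustrative assumptions, not the project's implementation.

```python
from collections import Counter
from typing import List


def naive_correct(reads: List[str], error_rate: float) -> List[str]:
    """Column-wise correction over aligned, equal-length reads.

    A base whose relative frequency in its column is below `error_rate`
    is replaced with the most common base in that column.
    """
    if not reads:
        return []
    corrected = [list(r) for r in reads]
    for col in range(len(reads[0])):
        counts = Counter(r[col] for r in reads)
        total = sum(counts.values())
        majority = counts.most_common(1)[0][0]
        for row in corrected:
            if counts[row[col]] / total < error_rate:
                row[col] = majority
    return ["".join(r) for r in corrected]
```

Because every read in the column counts equally, a genuine rare variant from a low-abundance organism is overwritten just like a sequencing error, which is the weakness the modification targets.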
12. Progress so far
Calculate the occurrence rates of every base in reads that are identical to the evaluated read in a window with a radius of n bases
Preliminary results: the first basic implementation leads to an increase in the number of OTUs found with ClaMS
Under development:
Good choice(s) of approach for alignment of the reads
Empirical evaluation of the parameters
Comparative evaluation of the variants of the approach
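The window-based calculation from this slide can be sketched as below, in the "use them exclusively" variant: for each position, only reads that are identical to the evaluated read within a window of radius n contribute to the occurrence rate. Several details are assumptions made for illustration: the evaluated position itself is excluded from the identity check, the reads are aligned to equal length, and the brute-force comparison over all reads is used only for clarity.

```python
from collections import Counter
from typing import List


def window_correct(reads: List[str], error_rate: float, radius: int) -> List[str]:
    """Correct each base using only reads that match its surrounding window.

    For position `pos`, a read is considered similar if it is identical to
    the evaluated read on positions pos-radius .. pos+radius, excluding
    `pos` itself (an assumption of this sketch).
    """
    corrected = []
    for read in reads:
        new_read = list(read)
        for pos in range(len(read)):
            lo = max(0, pos - radius)
            hi = min(len(read), pos + radius + 1)
            left, right = read[lo:pos], read[pos + 1:hi]
            # Bases observed at `pos` in reads with identical flanks.
            counts = Counter(
                other[pos] for other in reads
                if other[lo:pos] == left and other[pos + 1:hi] == right
            )
            total = sum(counts.values())  # at least 1: the read matches itself
            if counts[read[pos]] / total < error_rate:
                new_read[pos] = counts.most_common(1)[0][0]
        corrected.append("".join(new_read))
    return corrected
```

In this sketch a lone read from a distinct organism is left untouched, because no other read shares its flanking bases, while an isolated mismatch inside an otherwise well-supported group is still corrected.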
13. Software used in this project:
Python: http://www.python.org/
Cython: http://cython.org/
MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/
Muscle: http://www.drive5.com/muscle/
SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/
ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/
NINJA (modified): http://nimbletwist.com/software/ninja/index.html
R-package: http://www.r-project.org/