This document discusses some key errors and limitations of next-generation sequencing (NGS). It notes that while NGS has significantly reduced costs and improved throughput, it also has some drawbacks compared to previous sequencing technologies. Specifically, it outlines issues related to low quality bases, PCR errors during amplification, and high error rates that can make rare mutations difficult to detect. Limitations include short read lengths that hamper assembly of repetitive regions, contamination risks, incomplete representation of repeats, difficulties assembling segmental duplications and genes fragmented across scaffolds. The document emphasizes the need for validation of genome assemblies and development of hybrid approaches combining long and short reads to overcome these challenges.
2. Introduction
High-throughput sequencing technologies have made whole-genome
sequencing and resequencing available to many more researchers and
projects.
Cost and time have been greatly reduced.
The error profiles and limitations of the new platforms differ significantly
from those of previous sequencing technologies.
Selecting an appropriate sequencing platform for a particular type of
experiment is an important consideration.
It requires a detailed understanding of the available technologies,
including their sources of error and error rates, as well as the speed
and cost of sequencing.
4. Errors in NGS
NGS sequencing errors fall mainly into the following
categories:
1. Low quality bases
2. PCR errors
3. High error rate
5. 1. Low quality bases
1. All the NGS companies have made big strides in improving the raw
accuracy of the bases.
2. Read lengths have increased as a result.
3. The number of reads has also increased to the point where coverage is
high enough to rule out most issues with low quality base calls.
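The relationship between base quality and coverage can be made concrete with a small sketch. It uses the standard Phred definition (Q = -10 log10 p) and the usual mean-coverage estimate (reads × read length / genome size). The function names and the example numbers are illustrative assumptions, not taken from the slides.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate mean depth of coverage as total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

if __name__ == "__main__":
    # Hypothetical run: one million 100 bp reads over a 5 Mb genome.
    print(phred_to_error_prob(20))                  # error probability at Q20
    print(mean_coverage(1_000_000, 100, 5_000_000)) # mean depth of coverage
```

With enough depth, an individual low quality call at a position is outvoted by the other reads covering it, which is the point made in item 3.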
6. 2. PCR errors
All of the current NGS systems use PCR in some form to amplify the
initial nucleic acid and to add adapters for sequencing.
1. The amount of amplification can be very high, with multiple rounds
of PCR for exome and/or amplicon applications.
2. As a result, some of the base differences observed are artefacts
generated by the PCR rather than true variants.
3. Several groups have published improved methods that reduce the
amount of PCR or use alternative enzymes to increase the fidelity of
the reaction, e.g. Quail et al.
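To see why the amount of amplification matters, the sketch below estimates the fraction of amplified molecules that carry at least one polymerase error. It rests on a deliberately simple assumption: every final molecule is treated as having been copied once per doubling, so the result is an upper bound rather than an exact model. The function name and all parameter values are hypothetical and are not attributed to any particular enzyme or protocol.

```python
def max_fraction_with_pcr_error(error_rate: float,
                                amplicon_length: int,
                                doublings: int) -> float:
    """Upper-bound fraction of product molecules with >= 1 PCR error.

    Assumes (simplification) that each base of a final molecule has been
    copied `doublings` times, each copy failing with probability
    `error_rate` independently.
    """
    copy_events = amplicon_length * doublings
    p_error_free = (1 - error_rate) ** copy_events
    return 1 - p_error_free

if __name__ == "__main__":
    # Illustrative values only: 1e-5 errors per base per doubling, 200 bp amplicon.
    for cycles in (10, 20, 30):
        print(cycles, round(max_fraction_with_pcr_error(1e-5, 200, cycles), 4))
```

Under this model, both reducing the number of cycles and using a higher-fidelity enzyme (a lower `error_rate`) shrink the fraction of artefact-bearing molecules, which matches the two strategies listed above.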
7. 3. High error rate
1. A high error rate prevents the accurate detection of rare mutations in
heterogeneous populations such as tumors and microbiomes.
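A rough way to quantify this is to ask how often sequencing errors alone would produce as many variant-supporting reads as a rare mutation would. The sketch below assumes that errors at a position are independent and follow a binomial distribution; the function name, depth, threshold and error rates are illustrative assumptions.

```python
from math import comb

def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    X is the number of reads at one position that show the variant base
    purely because of sequencing error.
    """
    p_fewer = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1 - p_fewer)

if __name__ == "__main__":
    # A 1% variant at 1000x depth is expected in about 10 reads.
    # How often do errors alone reach that count?
    print(prob_at_least_k_errors(1000, 10, 0.01))    # error rate equal to the variant frequency
    print(prob_at_least_k_errors(1000, 10, 0.0001))  # a much lower error rate
```

When the per-base error rate is comparable to the variant frequency, errors reach the same read count about as often as not, so the true mutation cannot be told apart from noise.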
9. Limitations of NGS
NGS has inherent limitations, which are as follows:
1. Sequence properties and algorithmic challenges
2. Contamination or new insertions
3. Repeat content
4. Segmental duplications
5. Missing and fragmented genes
6. Reference index
10. 1. Sequence properties and algorithmic challenges
NGS technologies typically generate shorter sequences with higher
error rates from relatively short insert libraries.
For example, Illumina’s sequencing by synthesis routinely produces read
lengths of 75–100 base pairs (bp) from libraries with insert sizes of
200–500 bp.
Short read lengths of NGS prevent the assembly of genomes with long
stretches of repetitive DNA.
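The effect of read length on repeats can be illustrated with a toy example. A read that lies entirely inside a repeat matches every copy, while a read long enough to reach unique flanking sequence matches only one. The sequences and the function name below are made up for illustration.

```python
def count_placements(genome: str, read: str) -> int:
    """Count exact-match start positions of `read` in `genome` (overlaps allowed)."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count

if __name__ == "__main__":
    repeat_unit = "ACGTACGGTCAT"  # hypothetical 12 bp repeat, present twice
    genome = "TTGACC" + repeat_unit + "GGATCA" + repeat_unit + "CCTAGT"

    short_read = "GTACGGTC"             # lies entirely inside the repeat
    long_read = "GACCACGTACGGTCATGGAT"  # spans the repeat plus both flanks

    print(count_placements(genome, short_read))  # 2: placement is ambiguous
    print(count_placements(genome, long_read))   # 1: placement is unique
```

When the repeat is longer than both the read and the insert size, no read or read pair can anchor it in unique sequence, and the assembler cannot resolve it.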
11. 2. Contamination or new insertions
An important consideration of any sequencing project is DNA
contamination from other organisms.
Before analysis, genomes are searched for possible contaminants
by comparing them against the NCBI nucleotide (nt) database.
De novo sequence assemblies may be an important source for the
discovery of insertion polymorphisms, but such sequences require
particular scrutiny and additional validation because of their tendency
to enrich for contamination artifacts.
Discriminating such sequences before sequence assembly becomes
particularly problematic when the underlying sequence read data are
short.
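A contaminant screen of this kind could be post-processed along the following lines. The sketch assumes that the search against nt was run with BLAST+ and a custom tabular format with the columns `qseqid sseqid pident length evalue bitscore staxids`. That column layout, the function name and the example rows are assumptions made for illustration, not details given in the slides.

```python
def flag_contaminants(blast_lines, expected_taxids):
    """Return query IDs whose best hit (highest bit score) is to an unexpected taxon.

    Assumed tab-separated columns per line:
    qseqid, sseqid, pident, length, evalue, bitscore, staxids
    (staxids may hold several IDs separated by ';').
    """
    best = {}  # query id -> (bit score, set of taxids of that hit)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, bitscore = fields[0], float(fields[5])
        taxids = set(fields[6].split(";"))
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, taxids)
    expected = set(expected_taxids)
    return sorted(q for q, (_, taxids) in best.items() if not taxids & expected)

if __name__ == "__main__":
    # Made-up hits: taxid 9606 is Homo sapiens, 562 is Escherichia coli.
    hits = [
        "contig1\tsubjA\t99.1\t500\t1e-150\t900\t9606",
        "contig2\tsubjB\t98.7\t450\t1e-120\t800\t562",
        "contig2\tsubjC\t85.0\t200\t1e-20\t150\t9606",
    ]
    print(flag_contaminants(hits, {"9606"}))  # ['contig2']
```

Flagged contigs are only candidates: as noted above, a sequence that looks foreign may also be a genuine novel insertion, so it needs further validation before it is discarded.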
12. 3. Repeat content
Any WGS-based sequence assembly algorithm will collapse identical
repeats, resulting in reduced or lost genomic complexity.
Most Alu subfamilies were underrepresented because of the shorter
sequence length of the Alu repeat class.
Most common repeat classes showed reduced representation in the YH
genome.
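Repeat collapse can be demonstrated with k-mers. Assemblers that build their graph from k-mers store each distinct k-mer only once, so identical repeat copies contribute no additional sequence. The sketch below compares the number of distinct k-mers with the total number of k-mers in a sequence. The function name and the toy sequence are illustrative assumptions.

```python
def kmer_complexity(sequence: str, k: int):
    """Return (distinct k-mers, total k-mers) for a sequence."""
    total = max(0, len(sequence) - k + 1)
    distinct = len({sequence[i:i + k] for i in range(total)})
    return distinct, total

if __name__ == "__main__":
    tandem_repeat = "ACGTTGCA" * 5  # hypothetical 8 bp unit repeated five times
    distinct, total = kmer_complexity(tandem_repeat, 8)
    # Only the 8 rotations of the unit are distinct, out of 33 k-mer positions.
    print(distinct, total)
```

The 40 bp region is represented by the same k-mers as a much shorter one, so the copy number of the repeat cannot be recovered from the k-mer set alone.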
13. 4. Segmental duplications
The Whole-Genome Assembly Comparison (WGAC) method is used to analyse
segmental duplications.
If we limit our analysis to those duplications commonly present in the
human reference genome and to duplications detected through read-depth
analysis of a capillary sequencing–based WGS dataset (Celera) and
YH, we conclude that 99.4% of true pairwise segmental duplications were
absent from the de novo assembly.
We predict that 95.6% of the duplications in the YH de novo assembly are
likely false because they did not correspond to duplications predicted by
read depth.
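Read-depth prediction of duplications rests on a simple idea: a region present in two copies attracts roughly twice as many reads as single-copy sequence. The sketch below flags windows whose depth is well above the median. It illustrates the principle only and is not the method used in the analysis above; the function name, the 1.5× threshold and the depth values are assumptions.

```python
from statistics import median

def flag_high_depth_windows(depths, ratio_threshold: float = 1.5):
    """Return indices of windows whose depth is >= ratio_threshold x the median depth."""
    baseline = median(depths)
    if baseline == 0:
        return []  # no usable baseline
    return [i for i, d in enumerate(depths) if d / baseline >= ratio_threshold]

if __name__ == "__main__":
    # Hypothetical mean depth per window along one scaffold.
    window_depths = [30, 31, 29, 30, 62, 60, 30, 28]
    print(flag_high_depth_windows(window_depths))  # [4, 5]
```

Because it counts reads rather than relying on assembled sequence, a check like this can reveal duplications that the assembly has collapsed, and can cast doubt on assembled duplications that show no excess depth.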
14. 5. Missing and fragmented genes
Genomic reduction affected both gene coverage and the
fragmentation of genes across multiple scaffolds.
The presence of duplicated and repetitive sequences in introns
complicates complete gene assembly and annotation, leading to genes
being broken among multiple sequence scaffolds.
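Gene fragmentation can be measured once exons have been placed on the assembly. A minimal sketch is to count how many scaffolds each gene's exons fall on. The input format (pairs of gene ID and scaffold ID, one per placed exon), the function names and the example identifiers are hypothetical.

```python
def scaffolds_per_gene(exon_placements):
    """Map each gene to the number of distinct scaffolds its exons were placed on.

    exon_placements: iterable of (gene_id, scaffold_id) pairs, one per exon.
    """
    by_gene = {}
    for gene, scaffold in exon_placements:
        by_gene.setdefault(gene, set()).add(scaffold)
    return {gene: len(scaffolds) for gene, scaffolds in by_gene.items()}

def fragmented_genes(exon_placements):
    """Return the genes whose exons are split across more than one scaffold."""
    counts = scaffolds_per_gene(exon_placements)
    return sorted(gene for gene, n in counts.items() if n > 1)

if __name__ == "__main__":
    placements = [
        ("geneA", "scaffold_1"), ("geneA", "scaffold_1"),
        ("geneB", "scaffold_2"), ("geneB", "scaffold_7"), ("geneB", "scaffold_9"),
    ]
    print(scaffolds_per_gene(placements))  # {'geneA': 1, 'geneB': 3}
    print(fragmented_genes(placements))    # ['geneB']
```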
15. 6. Reference index
Another problem is analysing genomes for which no reference (index)
genome is available.
The portions that are missing or misassembled cannot be readily
inferred and are invisible to the biologist.
Biases against duplications and repeats, as well as fragmentation,
raise questions related to the accuracy and completeness of similarly
assembled genomes.
16. Overcoming the Limitations
It is the responsibility of the scientific community to enforce
standards of quality that can be measured and assessed.
It is critical to develop new hybrid sequencing approaches, such as
multiplatform strategies including third-generation long-read
technologies, high-quality finished long-insert clones and new
assembly algorithms that can accommodate these heterogeneous
datasets.
The genome assemblies themselves must be experimentally validated.
Large-molecule, high-quality sequencing should not be abandoned
until the balance between quantity and quality of genomes has been
re-established.
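One widely used, measurable assembly statistic is N50: the length of the contig or scaffold at which the cumulative length, counted from the largest down, first reaches half of the total assembly size. The slides do not name a specific metric, so the sketch below is offered only as an example of a quantity that can be measured and compared; the function name and the lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (0 if empty)."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input

if __name__ == "__main__":
    # Hypothetical contig lengths (total 54, half 27): 10 + 9 + 8 reaches 27.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

N50 describes contiguity only. It says nothing about collapsed repeats, missing duplications or contamination, which is why the assemblies still need experimental validation.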