1. De novo assembly, a
multi-technology approach:
Illumina, PacBio, and OpGen
PhD. Francesco Vezzi
Senior Bioinformatician, NGI-Stockholm
2. Both Stockholm and Uppsala nodes
Illumina HiSeq 2000/2500 16
Illumina MiSeq 3
Life Technologies SOLiD 5500xl 4
Life Technologies SOLiD 5500wildfire 2
Life Technologies Ion Torrent 2
Life Technologies Ion Proton 6
Life Technologies Sanger ABI3730 2
Pacific Biosciences RSII 1
Argus Whole Genome Mapping System 1
One of 3 best-equipped sequencing sites in Europe
3. In this talk
Illumina (Stockholm):
• 100/150 bp paired reads (low error rate)
• 900/200 Gbp in 6/2 day(s)
PacBio (Uppsala):
• 8.5 Kbp reads (max 30 Kbp, high error rate)
• 375 Mbp (1 SMRT Cell) in 10 hours
OpGen Argus System (Stockholm):
• ~300 Kbp maps
• 10 Gbp in ~1 day
4. Optical Maps
• Restriction Map
◦ Representation of the cut sites on a
given DNA molecule to provide spatial
information of genetic loci
• An enzyme is selected and used
to cut the molecules. This
provides a 2D representation of
the molecule structure
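The idea of a restriction map can be illustrated in silico. The sketch below is only an illustration under stated assumptions: it scans a known sequence for an enzyme recognition site and reports the cut positions and the ordered fragment lengths. The function name `restriction_map` is hypothetical, and BamHI (site GGATCC, cutting after the first G) is used purely as an example; it is not necessarily the enzyme used with the Argus system in this talk.

```python
def restriction_map(sequence, site, cut_offset=0):
    """Return (cut_positions, fragment_lengths) for a linear DNA molecule.

    sequence   -- DNA string (assumed linear, forward strand only)
    site       -- enzyme recognition sequence, e.g. "GGATCC"
    cut_offset -- position of the cut inside the site (0 = before the site)
    """
    sequence = sequence.upper()
    site = site.upper()

    # Collect every occurrence of the recognition site (overlaps allowed).
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)

    # The ordered fragment lengths are what an optical map measures.
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


if __name__ == "__main__":
    toy = "AAGGATCCTTTTGGATCCAA"  # toy sequence, not real data
    cuts, fragments = restriction_map(toy, "GGATCC", cut_offset=1)
    print("cut positions:", cuts)
    print("fragment lengths:", fragments)
```

An optical map records the same kind of ordered fragment-length pattern, but measured on real single molecules rather than computed from a known sequence.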
5. Optical Maps: workflow
Steps (time):
• DNA extraction directly from culture (3-8h)
• Quality control of extracted material (1h)
• Prepare a chip (1.5h)
• Run Argus System (1h)
• Data assembly (2-8h)
6. Closing genomes with Optical Maps
• De novo reconstruction of parts missing in the reference strain
• Correct assembly of long tandem repeats
De novo assembly (Illumina, PacBio): set of unordered and unoriented contigs
Figure labels: Optical Map, Contigs
7. Case Study: Combining all the technologies
~15 Mbp genome sequenced at High Coverage with:
• Illumina HiSeq:
• 500X PE libraries (180bp and 650bp insert)
• 150X MP library (3Kbp)
• 150X MP library (7Kbp)
• PacBio
• 50/60X with reads longer than 2Kbp
• OpGen
• 3 chips (only one worked really well)
• 300X coverage
• Average map length 320Kbp
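The coverage figures above follow from the usual relation coverage = sequenced bases / genome size. The sketch below applies it to the numbers on these slides (a ~15 Mbp genome, 375 Mbp per SMRT Cell, 320 Kbp average map length). The helper names are hypothetical, and the calculation ignores filtering: for instance, the 50/60X PacBio figure counts only reads longer than 2 Kbp, so the raw yield needed would be higher.

```python
GENOME_SIZE_BP = 15_000_000  # ~15 Mbp genome from the case study


def bases_needed(genome_size_bp, coverage):
    """Total bases required to reach a given average coverage."""
    return genome_size_bp * coverage


def units_needed(genome_size_bp, coverage, bases_per_unit):
    """Number of reads, maps, or SMRT Cells needed, rounded up."""
    total = bases_needed(genome_size_bp, coverage)
    return -(-total // bases_per_unit)  # integer ceiling division


if __name__ == "__main__":
    # 500X of paired-end Illumina data
    print("Illumina PE bases:", bases_needed(GENOME_SIZE_BP, 500))
    # 50X PacBio at 375 Mbp per SMRT Cell (usable bases only, see lead-in)
    print("SMRT Cells:", units_needed(GENOME_SIZE_BP, 50, 375_000_000))
    # 300X of optical maps averaging 320 Kbp each
    print("Optical maps:", units_needed(GENOME_SIZE_BP, 300, 320_000))
```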
8. Assembly Strategy
https://github.com/vezzi/de_novo_scilife
Semi-automated pipeline for de novo assembly:
• Global configuration file: tools and system configuration
• Sample configuration file: samples description
3 modules:
1. QC-module (Illumina only):
• Adaptor removal, kmer-analysis, fastqc, (insert size estimation)
2. Assemble-module (Illumina only):
• Runs specified assemblers and outputs executed commands
3. Validation-module:
• FRCbam, coverage analysis, GC-analysis, (N50)
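Of the validation metrics listed, N50 is the simplest to compute: it is the contig length L such that contigs of length L or longer contain at least half of the assembled bases. The following is a minimal standalone sketch, not the code used in the pipeline; the function name `n50` is an assumption made for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")

    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # toy values, not real data
    print("N50:", n50(lengths))
```

N50 only measures contiguity, not correctness, which is presumably why it appears in parentheses next to FRCbam and the coverage and GC analyses.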
I NEED USERS/FEEDBACK/CONTRIBUTIONS
13. Optical Maps
PacBio produces the best assembly; however, 290 contigs are produced.
Optical Maps made it possible to obtain
the 2D representation of the 7
chromosomes.
N.B. chromosome number was
one of the biological questions of
this project!!!
But much more can be done!!!
14. Incredible tool to finish (or almost finish) genomes
Assembly                       % contigs placed   Total size of placed contigs (bp)   % size of placed contigs   % genome covered
PacBio+OpGen                   94.12              11578995                            97                         77.05
Allpaths+OpGen                 71.88              10692027                            84                         52.88
Allpaths+Masurca+OpGen         80.65              27506424                            92                         69.64
Allpaths+PacBio+OpGen          82.32              22271022                            91                         83.05
Masurca+PacBio+OpGen           94.44              28393392                            98                         83.79
Allpaths+Masurca+PacBio+OpGen  85.42              39085419                            94                         87.39
Combining all the technologies
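The slide does not define how the table columns are computed. The sketch below shows one plausible set of definitions, all of them assumptions: the share of contigs placed on the map, the share of assembled bases in placed contigs, and the share of the genome covered by the union of placed intervals. The function name and input layout are hypothetical.

```python
def placement_stats(contig_lengths, placements, genome_size):
    """Summarise how well contigs were placed on an optical map.

    contig_lengths -- dict: contig name -> length in bp
    placements     -- dict: placed contig name -> (start, end) on the map,
                      with non-negative coordinates (assumed layout)
    genome_size    -- map/genome length in bp
    """
    placed_size = sum(contig_lengths[name] for name in placements)
    total_size = sum(contig_lengths.values())

    # Union of placed intervals: overlapping placements are counted once.
    covered = 0
    current_end = 0
    for start, end in sorted(placements.values()):
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end

    return {
        "pct_contigs_placed": 100 * len(placements) / len(contig_lengths),
        "placed_size": placed_size,
        "pct_size_placed": 100 * placed_size / total_size,
        "pct_genome_covered": 100 * covered / genome_size,
    }


if __name__ == "__main__":
    # Toy values, not data from the talk.
    lengths = {"c1": 400, "c2": 300, "c3": 200, "c4": 100}
    placed = {"c1": (0, 400), "c2": (300, 600), "c3": (700, 900)}
    print(placement_stats(lengths, placed, genome_size=1000))
```

Under these definitions, overlapping placements add to the total placed size but not to the genome coverage, so the two columns can diverge.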
15. Conclusions – Take home message
Attempt to automate de novo assembly process:
• https://github.com/vezzi/de_novo_scilife
• Not 100% automated
Illumina, PacBio, Hybrid assemblies:
• PacBio alone seems to produce the best assemblies
• Hybrid assembly does not seem able to correct merged-assembly
problems
Mixing technologies is always a good idea:
• Makes it possible to compensate for technology-specific biases
• Makes it possible to produce better assemblies