An update version of the genome assembly including the mention of techniques such as HiC and Bionano. Also include the QC. These are the same slides used in the course for the UNL in Argentina.
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
Genomics Assembly Techniques and Technologies
1. CURSO DE BIOINFORMATICA APLICADA:
Secuenciacion y Ensamblado de Genomas
Aureliano Bombarely
School of Plant and Environmental Sciences (SPES)
Virginia Tech
aurebg@vt.edu
2. 1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
3. 1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
4. 1. A brief history of the sequence assembly.
http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
5. Genome = N x Sequence of DNA
Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
DNA
sequencing
???
1. A brief history of the sequence assembly.
6. DNA Sequencing:
“Process of determining the precise order of nucleotides within a DNA molecule.”
-Wikipedia
ddATP
ddGTP
ddCTP
ddTTP
Taq-Polymerase
STOP
time
2) Chromatographic
Separation
GTCACCCTGAAT
1) PCR with ddNTPs
3) Chromatogram Read
DNA Sanger Sequencing
1. A brief history of the sequence assembly.
7. Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACGATTTAAATGACGTTGTCGA
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACCCCGACGTTGTCGA
DNA
sequencing
ACGCGTTTTGTTGTGG
ACCCCTGGGGGGGTTGTCGA
Fragments
Sequence
assembly
ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG
1. A brief history of the sequence assembly.
8. Sequence assembly
=
Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
Image courtesy of iStock photo
1. A brief history of the sequence assembly.
9. Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
BACs BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
1. A brief history of the sequence assembly.
11. 1977 MS2 Bacteriophage (3.658 Kb) 1977
N
O
SO
FTW
A
RE
U
SED
1. A brief history of the sequence assembly.
12. Haemophilus influenzae (1.83 Mb) 1995
Software:
TIGR ASSEMBLER
Hardware:
SPARCenter 2000 (512 Mb RAM)
1. A brief history of the sequence assembly.
13. Haemophilus influenzae (1.83 Mb) 1995
Software:
TIGR ASSEMBLER Smith-Waterman alignments
Sutton GG. et al.TIGR Assembler:A New Tool to Assembly Large Shotgun Sequencing Projects
Genome Science and Technology, 1995,1:9-19
O
VERLA
PPIN
G
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
14. 1. A brief history of the sequence assembly.
ATGCGTGTGCGTGAGTG
GCGTGTGCGTGAGTGCC
GTGTGCGTGAGTGCCTA
TGCGCGTGCGTGAGTGC
GCGTGTGCGTGAGTGCG
TGCGTGAGCGTGAGTGCC
OVERLAPPING
METHODOLOGY
Sequence
homology
match
ATGCGTGTGCGTGAGTG
GTGTGCGTGAGTGCCTA
|||||||||||||
Multiple
sequence
alignment
GCGTGTGCGTGAGTGCC
TGCGCGTGCGTGAGTGC
GCGTGTGCGTGAGTGCG
TGCGTGAGCGTGAGTGCC
ATGCGTGTGCGTGAGTG
GTGTGCGTGAGTGCCTA
Consensus
call
overlap
ATGCGTGCGYGTGAGTGAGTGCC
Overlap
clustering
15. Homo sapiens (3.2 Gb) 2001
BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome.
Nature. 2001. 409:860-921
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
O
VERLA
PPIN
G
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
16. Homo sapiens (3.2 Gb) 2001
Whole Genome Shotgun (WGS) sequencing
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
Software:
WGAASSEMBLER (CABOG)
Hardware:
40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and
160 cores); 5 days.
1. A brief history of the sequence assembly.
17. Ailuropoda melanoleura (2.3 Gb) 2009
Whole Genome Shotgun (WGS) sequencing
Li R. et al.The Sequence and the Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317
Software:
SOAPdenovo
Hardware:
Supercomputer with 32 cores and 512 Gb RAM.
BRU
IJN
G
RA
PH
S
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
18. 1. A brief history of the sequence assembly.
What is a Kmer ?
ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 25 nt
Specific n-tuple or n-gram of nucleic acid or amino acid sequences.
-Wikipedia
ordered list
of elements
contiguous sequence
of n items from a given
sequence of text
5 Kmers of 20-mer
ATGCGCAGTGGAGAGAGAGC
TGCGCAGTGGAGAGAGAGCG
GCGCAGTGGAGAGAGAGCGA
CGCAGTGGAGAGAGAGCGAT
GCAGTGGAGAGAGAGCGATG
N_kmers = L_read - Kmer_size
20. Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291
1. A brief history of the sequence assembly.
21. OLC (Overlap-layout-consensus) algorithm is more
suitable for the low-coverage long reads, whereas the
DBG (De-Bruijn-Graph) algorithm is more suitable for
high-coverage short reads and especially for large
genome assembly
1. A brief history of the sequence assembly.
22. 1. A brief history of the sequence assembly.
What is a Scaffold?
https://genome.jgi.doe.gov/help/scaffolds.jsf
A scaffold is a portion of the genome sequence reconstructed from end-
sequenced whole-genome shotgun clones. Scaffolds are composed of contigs
and gaps. A contig is a contiguous length of genomic sequence in which the order
of bases is known to a high confidence level. Gaps occur where reads from the
two sequenced ends of at least one fragment overlap with other reads in two
different contigs (as long as the arrangement is otherwise consistent with the
contigs being adjacent). Since the lengths of the fragments are roughly known,
the number of bases between contigs can be estimated.
23. 1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
24. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Fastq raw
Fastq Processed
Contigs
Scaffolds
Decisions during the
Experimental Design
Software
Identity %, Kmer ...
Post-assembly Filtering
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
Chromosomes
Long distance
scaffolding
25. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Decisions during the
Experimental Design
26. 2. Sequencing, tools and computers.
2.1 Technologies
Technology
Read length
(bp)
Accuracy Reads/Run Time/Run Cost/Mb
Applied Bio 3730XL
(Sanger)
400 - 900 99.9% 384
4 h
(12 runs/day)
$2,400
Roche 454 GS FLX
(Pyrosequencing)
700
Single/Pairs
99.9% 1,000,000 24h $10
Illumina HiSeq2500 (Seq.
by synthesis)
50-250
Single/Pairs
99% 4,000,000,000 24 to 120 h $0.05 to $0.15
Ilumina MiSeq
(Seq. by synthesis)
50-300
Single/Pairs
99% 44,000,000 24 to 72 h $0.17
SOLiD 4
(Seq. by ligation)
25-50
Single/Pairs
99.9% 1,400,000,000 168 h $0.13
ION Torrent
(Seq. by semiconductor)
170-400
Single
98% 80,000,000 2 h $2
Pacific Biosciences RSII
(SMRT)
10,000
Single
85%
(99.9%)
750,000 4 h $0.6
Oxford N. Minion
(Nanopore sequencing)
10,000
Single
62%
(96%)
4,400,000 48 h $0.02
10X Genomics
20,000
Single
99% 1,000,000,000 NA NA
27. 2. Sequencing, tools and computers.
★ Library types (orientations):
• Single reads
• Pair ends (PE) (150-800 bp insert size)
• Mate pairs (MP) (2-40 Kb insert size)
F
F
F
F
R
R
R 454/Roche
Illumina
Illumina
2.2 Libraries
28. 2. Sequencing, tools and computers.
• Why is important the pair information ?
F
- novo assembly:
Consensus sequence
(Contig)
Reads
Scaffold
(or Supercontig)
Pair Read information
NNNNN
Genetic information (markers)
Pseudomolecule
(or ultracontig)
NNNNN NN
2.2 Libraries
29. 2. Sequencing, tools and computers.
2.3 Sequencing Amount
* Slide from MC. Schatz
30. 2. Sequencing, tools and computers.
2.3 Sequencing Amount
Depending of the genome complexity, technology used and assembler:
• More is better (if you have enough computational resources).
- Sanger > 10X (less for BACs-by-BACs approaches).
- 454 > 20X
- Illumina > 100X
- PacBio > 20X (corrected by Illumina) or 50X (corrected PacBio)
- 10X Genomics > 20X
• Polyploidy or high heterozygosity increase the amount of reads needed.
• The use of different library types (pair ends and mate pairs with different
insert sizes is essential).
• Longer reads is preferable.
31. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
Chromosomes
Long distance
scaffolding
32. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
33. 2. Sequencing, tools and computers.
2.4 Tools: Read Quality Evaluation
a) Length of the read.
b) Bases with qscore > 20 or 30.
•FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
35. 2. Sequencing, tools and computers.
2.4 Tools: Contaminations
5. Contaminations
Contaminations can be removed mapping the reads against a reference with the
contaminants such as E. coli and human genomes.The most common tools are Bowtie
or BWA (for short reads) and Blast (for long reads).
36. 2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections
Read corrections are generally based in the Kmer analysis.
Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage
Bioinformatics. 2011 27 (13):i137-i141
38. 2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections: PacBio
Three options:
- Self-alignment of PacBio reads.
• PBcR (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina contigs.
• PacBioToCA (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina reads.
• Lordec (http://www.atgc-montpellier.fr/lordec/)
• Proovread (https://github.com/BioInf-Wuerzburg/proovread)
• LSC (http://www.healthcare.uiowa.edu/labs/au/LSC/)
39. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Consensus
Scaffolding
40. 2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus
Sanger, 454, Illumina,
PacBio
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
Minia Bloom filter + Brujn graph Illumina
Really fast
Only contigs
https://github.com/GATB/minia
2.4 Tools: Assemblers
41. 2. Sequencing, tools and computers.
2.4 Tools: Assemblers
... but there are more assemblers and information... Take a look to SeqAnswers
http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application?
Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly
Also highly recommendable:
42. 2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus
Sanger, 454, Illumina,
PacBio
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
Minia Bloom filter + Brujn graph Illumina
Really fast
Only contigs
https://github.com/GATB/minia
2.4 Tools: Assemblers
43. 2. Sequencing, tools and computers.
Type Technology Used Features Link
HGAP Overlap-layout-consensus PacBio
Recommended by
PacBio. Multiple
tools. Difficult to
https://github.com/PacificBiosciences/
Bioinformatics-Training/wiki/HGAP
Falcon Overlap-layout-consensus PacBio
Faster than
HGAP or
CABOG
https://github.com/PacificBiosciences/falcon
Sprai Overlap-layout-consensus PacBio
Easy to use
interface for
CABOG
http://zombie.cb.k.u-tokyo.ac.jp/sprai/
README.html
Canu Overlap-layout-consensus
PacBio, Oxford
Nanopore
Easy to use http://canu.readthedocs.org/
MECAT Overlap-layout-consensus
PacBio, Oxford
Nanopore
Fast assembler https://github.com/xiaochuanle/MECAT
Spades Brujn graphs
Illumina, Ion Torrent,
PacBio, Oxford
Nanopore
Diverse type of
input
http://cab.spbu.ru/software/spades/
MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html
Supernova Hybrid 10X Genomics
Used with 10X
genomics
https://support.10xgenomics.com/de-novo-
assembly/software/pipelines/latest/using/running
2.4 Tools: Assemblers
44. 2. Sequencing, tools and computers.
Type Technology Used Features Link
HGAP Overlap-layout-consensus PacBio
Recommended by
PacBio. Multiple
tools. Difficult to
https://github.com/PacificBiosciences/
Bioinformatics-Training/wiki/HGAP
Falcon Overlap-layout-consensus PacBio
Faster than
HGAP or
CABOG
https://github.com/PacificBiosciences/falcon
Sprai Overlap-layout-consensus PacBio
Easy to use
interface for
CABOG
http://zombie.cb.k.u-tokyo.ac.jp/sprai/
README.html
Canu Overlap-layout-consensus
PacBio, Oxford
Nanopore
Easy to use http://canu.readthedocs.org/
MECAT Overlap-layout-consensus
PacBio, Oxford
Nanopore
Fast assembler https://github.com/xiaochuanle/MECAT
Spades Brujn graphs
Illumina, Ion Torrent,
PacBio, Oxford
Nanopore
Diverse type of
input
http://cab.spbu.ru/software/spades/
MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html
Supernova Hybrid 10X Genomics
Used with 10X
genomics
https://support.10xgenomics.com/de-novo-
assembly/software/pipelines/latest/using/running
2.4 Tools: Assemblers
45. Scaffolding approaches:
1. Pair ends/Mate pair information (Since 2008) - Short Distances
2. Optical mapping data (Since 2012) - Medium-Long Distance
3. HiC (Since 2014) - Long-Very Long Distance
4. Linkage groups anchoring (Since 2001) -Very Long Distance
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
46. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
1. Pair ends/Mate pair information
Contigs
Pair ends
e.g. 300 bp e.g. 2000 bp
Mate pairs
1.1 Small fragments mapping
1.2. Scaffold building
2.1. Long fragments mapping
2.2. Scaffold building
NNN
NNN
NNN NNNNNNNN
0. Input
47. 2. Sequencing, tools and computers.
Technology Used Link
SSPACE Short reads, pairs http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE
SSPACE-Long Long reads
http://www.baseclear.com/genomics/bioinformatics/basetools/
SSPACE-longread
Bambus2 Pairs https://www.cbcb.umd.edu/software/bambus2
2.4 Tools: Stand Alone Read Pairs Scaffolders
49. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Scaffolds
Software
Decisions during the
Assembly Optimization
Scaffolds
Gap Filling
50. 2. Sequencing, tools and computers.
Type Technology Used Features Link
GapCloser Overlap-layout-consensus Illumina
C++
Free
http://soap.genomics.org.cn/soapdenovo.html
GapFiller Overlap-layout-consensus Any
Perl
Commercial*
http://www.baseclear.com/landingpages/
basetools-a-wide-range-of-bioinformatics-
solutions/gapfiller/
PBSuite Overlap-layout-consensus
454,
PacBio
Python,
Long reads
http://sourceforge.net/p/pb-jelly/wiki/
2.4 Tools: Gap Filling
51. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Software
Decisions during the
Assembly Optimization
Scaffolds Chromosomes
Long distance
scaffolding
52. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Scaffolding approaches:
1. Pair ends/Mate pair information (Since 2008) - Short Distances
2. Optical mapping data (Since 2012) - Medium-Long Distance
3. HiC (Since 2014) - Long-Very Long Distance
4. Linkage groups anchoring (Since 2001) -Very Long Distance
53. 2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Optical mapping is a technique for constructing ordered, genome-wide, high-resolution
restriction maps from single, stained molecules of DNA, called "optical maps". By mapping
the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of
resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that
sequence
https://en.wikipedia.org/wiki/Optical_mapping
2. Optical mapping data (Since 2012).
54. Chromosome conformation
capture techniques (often
abbreviated to 3C technologies
or 3C-based methods) are a set
of molecular biology methods
used to analyze the spatial
organization of chromatin in a
cell. These methods quantify
the number of interactions
between genomic loci that are
nearby in 3-D space, but may
be separated by many
nucleotides in the linear
genome
3. HiC, Chromatin Conformation Capture (Since 2014).
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
https://en.wikipedia.org/wiki/Chromosome_conformation_capture
55. 3. HiC, Chromatin Conformation Capture (Since 2014).
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Dudechenko et al, Science 07 Apr 2017: Vol. 356, Issue 6333, pp. 92-95
DOI: 10.1126/science.aal3327
56. 4. Linkage groups anchoring (Since 2001) -Very Long Distance
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Argyris et al, BMC Genomics201516:4
https://doi.org/10.1186/s12864-014-1196-3
57. 2. Sequencing, tools and computers.
2.5 Computers
Bigger is better:
• How much do you need depends of:
➡ how big is your genome ?
~ Human size (3Gb) require ~256 Gb to 1 Tb
➡ how many reads do you have, ?
More reads, more memory and hard disk.
➡ what software are you going to use ?
OLC uses more memory and time than DBG.
➡ what parameters are you going to use ?
Bigger Kmers, more memory.
Four options:
1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000).
2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU).
3. Use a supercomputing center associated with NSF, NIH, USDA... where they
offer reduced prices (iPlant, Indiana University....).
4. Collaborate with some group with a big server.
58. 2. Sequencing, tools and computers.
2.6 Assembly evaluation
Different methods to evaluate an assembly:
1. Assembly stats (total assembly size vs expected size; N50/L50…)
2. Read remapping, variant calling and coverage evaluation.
3. Comparison with alternative sequencing methods.
4. Gene Space Completeness (RNASeq & EST mapping, BUSCO).
5. Comparison with genetic information.
59. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
Rationale: Close that a sequence assembly is to the expected genome size,
better it will be the genome assembly.
• Ideal case: The final assembly will have as many sequences as
chromosomes have the genome.
• Real case: The assembly will be fragmented is sequences with different
sizes. Lower number of fragments, better quality.
60. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
During the assembly optimization will be generated several assemblies.The most used
parameters to evaluate the assembly are:
1. Total Assembly Size,
How far is this value from the estimated genome size
2. Total Number of Sequences (Scaffold/Contigs)
How far is this value from the number of chromosomes.
3. Longest scaffold/contig
4. Average scaffold/contig size
5. N50/L50 (or any other N/L)
Number sequence (N) and minimum size of them (L) that represents the 50% of
the assembly if the sequences are sorted by size, from bigger to smaller.
61. 2. Sequencing, tools and computers.
N50/L50
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
2.6 Assembly evaluation: Assembly stats
64. P. axillaris
Current Assembly Peaxi v1.6.2
Dataset Contigs Scaffolds
Total assembly size (Gb) 1.22 1.26
Total assembly sequences 109,892 83,639
Longest sequence (Mb) 0.57 8.56
Sequence length mean (Kb) 11.08 15.05
N90 (sequences) 13,481 1,051
L90 (Kb) 22.28 295.75
N50 (sequences) 3,943 309
L50 (Kb) 95.17 1,236.73
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
Example: Sequencing of the Petunia axillaris genome.
65. 0 10 20 30 40 50 60
0e+002e+074e+076e+078e+071e+08
63 Kmer Frequency for Petunia axillaris
Coverage
KmerCount
peak = 31
Estimated genome size = 1.38 Gb
Possiblesequencingerrors
Sequencing data (Kmer count)
2. Sequencing, tools and computers.
G = (N − B) * K / D * 2
G = genome size
N = total kmers
B = possible kmers with errors
K = kmer length
D = kmer peak
2.6 Assembly evaluation: Assembly stats
67. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Read remapping
Rationale: Low quality assembled regions will have:
• Low coverage (breaking points)
• High coverage and high number of polymorphisms (due the collapsing of
multiple copies.
68. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Read remapping
Short Processed Reads
Long Processed Reads
Assembly Mapped Reads
Bowtie2/BWA
BlasR
Coverage
Variants
Bedtools
FreeBayes
VariantsCoverageMapped reads
69. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with alternative sequencing methods
Rationale: A sequence assembly should agree with alternative sequencing
methods.
• Illumina based assembly —> PacBio based assembly
• Short distance scaffolding (< 200 Kb) —> Optical Mapping scaffolding.
• Long distance scaffolding (> 200 Kb) —> BACs, YACs and Genetic maps
Sharma et al. 2013
E.g. Potato genome
70. 2. Sequencing, tools and computers.
Rationale: The gene space is defined as the whole collection of genes in a
genome. A complete genome should have a complete gene space. There are
two ways to evaluate the completeness of a gene space.
• Check the completeness of a set of conserved genes (CEGMA/BUSCO).
- Maybe not all the conserved genes are in the studied genome
• Check the mapping rate of a transcriptome to a genome (RNASeq).
- The RNASeq may have contaminations of other organisms.
2.6 Assembly evaluation: Gene space completeness
71. 2. Sequencing, tools and computers.
Other way to estimate how good is an assembly, it is to run BUSCO over the
assembly and evaluate how much of the gene space was captured.
http://busco.ezlab.org/
2.6 Assembly evaluation: Gene space completeness
72. 2. Sequencing, tools and computers.
Other way to estimate how good is an assembly, it is to run BUSCO over the
assembly and evaluate how much of the gene space was captured.
2.6 Assembly evaluation: Gene space completeness
73. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with Genetic Information
Rationale: The assembly information should be in agreement with genetic
maps and recombination studies.
Resistent x Susceptible
F2 F2 F2 F2
F2
F2F2
F2F2
F2F2
F2F2
F2F2
F2
F2 F2
F2 F2 F2 F2 F2 F2 F2
DNA DNA DNA DNA DNA DNA DNA
000001111111111110
000000011111111110
000001111110000000
001111111110000000
000011111111111100
011111111111111100
000011111111000000
011111111111111100
00111111100000000
000000011111110000
011111111111100000
000000111111100000
001111111111111000
000111111111100000
000001111111111110
000000011111111110
000001111110000000
001111111110000000
000011111111111100
011111111111111100
000011111111000000
011111111111111100
00111111100000000
000000011111110000
011111111111100000
000000111111100000
001111111111111000
000111111111100000
Allele frequency
74. 2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with Genetic Information
Rationale: The assembly information should be in agreement with genetic
maps and recombination studies.
recombination area
scaffold misplacement
75. Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
76. 5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Arabidopsis thaliana (TAIR10)
(5 chromosomes with 119 Mb)
(96 gaps with 185 Kb total)
Solanum lycopersicum (v2.5)
(12 chromosomes with 824 Mb)
(26,866 gaps with 82 Mb total)
Petunia axillaris (v1.6.2)
(83,639 scaffolds with 1.26 Gb)
(26,268 gaps with 41 Mb total)
Nicotiana benthamiana (v0.4.4)
(141,339 scaffolds with 2.63 Gb*)
(337,433 gaps with 162 Mb total)
*0.61-0.67 Gb no assembled
77. Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
?
78. Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Genome size
vs
assembly size
79. Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Cytogenetic data
(Flow cytometry)
✴ http://data.kew.org/cvalues
Example: Petunia axillaris genome size 1.37 Gb
(White and Rees. 1987)
80. Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Sequencing data
(Kmer count)
• Jellyfish (http://www.genome.umd.edu/
jellyfish.html)
• BFCounter (http://
pritchardlab.stanford.edu/bfcounter.html)
81. 1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
82. 3. Things that you should know about genomes.
1. They have variable size, for example in angiosperm plants they range from 63 Mb
(Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data
at:
✴ Plants: http://data.kew.org/cvalues/
✴ Animals: http://www.genomesize.com/
2. They can be polyploids. It means that homoeologus regions with highly similar will
collapse during the assembly.
3. They can be highly heterozygous and polymorphic. In this case some of the alleles will
collapse, some of them not.The effective coverage will lower than expected.
4. They can have recent whole genome duplication (or triplication) events. Paralogous
genes may collapse during the assembly.
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default assemblers throw them away based in the Kmer histogram.
83. 3. Things that you should know about genomes.
3.1 Collapsing problem
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
Gene 1 A
Gene 1 B
CGCCCTTGACGACATGACGACA
Collapsed consensus Gene 1 A + Gene 1 B
✴ Polyploidy
✴ Whole Genome Duplication
Gene 1
Gene 1 A
Gene 1 B
Schmutz J et al. Genome Sequence of the
Paleoploid Soybean. Nature 2010 463:178-183
84. 3. Things that you should know about genomes.
3.1 Collapsing problem
✴ Whole Genome Duplication
http://chibba.agtec.uga.edu/duplication/index/home
85. 3. Things that you should know about genomes.
3.1 Collapsing evaluation
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
CGCCCTTGACGACATGACGACA Consensus
Reads
Read collapsing during the assembly process can be evaluated mapping reads
back to the consensus sequence and analyzing SNPs.
CACTTGACGACATGA
SNPs
=
Heterozygous Positions (AB)
+
Homoeologs Collapsing (ABBB,AABB,ABBB)
+
Paralogs Collapsing (variable)
86. 3. Things that you should know about genomes.
3.1 Collapsing solutions
1. Sequence the two progenitors and use them as a reference.
✓Approach used previously (cotton genome; Patterson A. et. al 2012).
❖ Information not contained in the references will be lost.
❖ Progenitors are not always available.
❖ Gene conversions and non-homologue recombinations
2. Use single molecule technologies such as PacBio or 10X Genomics
✓ Best approach because it will give the right phase for long sequence
chunks. It exists software to integrate big sequences (minimus2...)
3. Assembly everything with the higher Kmer, evaluate the collapsed regions
and rescue the haplotype/region using SNPs and pair end read information.
✓ Cheap, easy to apply for the current technologies.
❖ No software available
87. 3. Things that you should know about genomes.
3.2 Repeats and assemblers
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default some assemblers throw them away based in the kmer histogram. For example,
SOAPdenovo throw away anything up to 255.
It is convenient to do a Kmer analysis using Jellyfish (or other Kmer counter) before
do any assembly to analyze contaminations (bacterial contaminations may appears with
high Kmers content) and repeats.
Software to analyze repeats based
in sequence graphs:
RepeatExplorer
(http://repeatexplorer.umbr.cas.cz/)
Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-
generation sequencing data. Bmc Bioinformatics 11, 378 (2010)
88. 1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
89. 4. Examples of genome assemblies
Petunia axillaris Petunia inflata
Sinningia speciosa Persea americana
95. * http://data.kew.org/cvalues/
Petunia axillaris Petunia inflata
genome size
1.17 Gb
(White and Rees. 1987)
genome size
1.37 Gb
(White and Rees. 1987)
chromosome number
2n = 2x = 14
(White and Rees. 1987)
Petunia hybrida
4. Examples of genome assemblies Petunia axillaris and P. inflata
96. P. axillaris P. inflata
Processed reads*
(Millions)
Processed reads
coverage (X)
Processed reads*
(Millions)
Processed reads
coverage (X)
PE 170 bp (BGI 2011) 239 17 317 22
PE 350 bp (BGI 2011) 166 12 120 6
PE 500 bp (BGI 2011) 149 10 254 25
PE 800 bp (BGI 2011) 167 12 143 15
PE 1000 bp (UIL2013) 462 48 NA NA
TOTAL PAIR ENDS 1,163 99 834 68
MP 2 Kb (BGI 2011) 179 11 155 15
MP 5 Kb (BGI 2011) 120 8 200 20
MP 8 Kb (UIL2013) 115 10 104 17
MP 15 Kb (UIL2013) 98 9 92 15
TOTAL MATE PAIRS 513 38 551 67
TOTAL SHORT READS 1,696 137 1,385 135
PacBio Reads (UNIBE2013) 8,599 22 NA NA
* Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 20, min. length 50 bp), PrinSeq (to
remove the duplicated pairs) and Musket (To correct the reads).
Reads Stats
4. Examples of genome assemblies Petunia axillaris and P. inflata
97. Raw
PacBio
Corrected
PacBio
Processed
Illumina
Peaxi_v1.3.2
Scaffolds
PacBio 1
AHA
(assembly and scaffolding)
PBJELLY
(gap filling 1)
Pair End/Mate Pair Data
EC-TOOLS
(errorcorrection)
Scaffolds
PacBio 2
Contigs
SSPACE
(scaffolding)
Peaxi_v1.6.0Peaxi_v1.6.1
PBJELLY
(gap filling 2)
Peaxi_v1.6.2
Perl Script
(filtering)
P. axillaris Illumina-PacBio hybrid assembly
SOAPdenovo2
(assembly)
GapCloser
(gap filling 0)
* Assembly made by Michel Moser at UNIBE
4. Examples of genome assemblies Petunia axillaris and P. inflata
98. P. axillaris P. inflata
Current Assembly Peaxi v1.6.2 Peinf v1.0.2
Dataset Contigs Scaffolds Contigs Scaffolds
Total assembly size (Gb) 1.22 1.26 1.2 1.29
Total assembly sequences 109,892 83,639 213,254 136,283
Longest sequence (Mb) 0.57 8.56 0.57 5.81
Sequence length mean (Kb) 11.08 15.05 5.63 9.44
N90 (sequences) 13,481 1,051 41,876 2,450
L90 (Kb) 22.28 295.75 4.57 60.31
N50 (sequences) 3,943 309 9,813 406
L50 (Kb) 95.17 1,236.73 34.99 884.43
Current (Frozen) Genome Assembly Stats
* P. inflata assembly was performed
with SOAPdenovo2 and GapCloser
with an optimal Kmer=79
* Hybrid assembly
4. Examples of genome assemblies Petunia axillaris and P. inflata
99. Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Kmer Count* = 1.38 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
• Assembly size (contigs v1.6.2) = 1.22 Gb
Estimated genome not assembled = 110 - 120 Mb
4. Examples of genome assemblies Petunia axillaris and P. inflata
100. 4. Examples of genome assemblies Persea americana
Persea americana
genome size
0.90 Gb
2n = 2x = 24
(Arumuganathan and Earle, 1991)
101. 4. Examples of genome assemblies Persea americana
Experimental design
• Sequencing of a low heterozygosity avocado accession, VC75
• Base assembly made by self-corrected long reads and Canu
(PacBio > 75X)
• Polished with high genome coverage short reads with Racon
(Illumina > 100X).
• Chromo-scaffolding with HiC
102. 4. Examples of genome assemblies Persea americana
• Base assembly made by self-corrected long reads and Canu (PacBio > 75X)
Current Assembly Persia 1.0.0
Dataset Contigs
Total assembly size (Gb) 0.86
Total assembly sequences 3,204
Longest sequence (Mb) 10.56
Sequence length mean (Kb) 268.74
N90 (sequences) 850
L90 (Kb) 102.62
N50 (sequences) 136
L50 (Mb) 1.56
103. Estimated genome size for Persea americana:
• Flow Cytometry (Manosalva, 2018) = 0.90 Gb
• Kmer Count* = 1.28 Gb
• Assembly size (contigs v1.0.0) = 0.86 Gb
4. Examples of genome assemblies Persea americana
VC75
het = 0.16%
Estimated genome not
assembled = 40 - 420 Mb