Genomics Assembly Techniques and Technologies

APPLIED BIOINFORMATICS COURSE:
Genome Sequencing and Assembly
Aureliano Bombarely
School of Plant and Environmental Sciences (SPES)
Virginia Tech
aurebg@vt.edu

1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies

http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html

Genome = N x Sequence of DNA
Gentile F. et al. Direct Imaging of DNA Fibers: The Visage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
DNA
sequencing
???

DNA Sequencing:
“Process of determining the precise order of nucleotides within a DNA molecule.”
-Wikipedia
DNA Sanger Sequencing
1) PCR with ddNTPs (ddATP, ddGTP, ddCTP, ddTTP) and Taq-Polymerase; each incorporated ddNTP is a STOP
2) Chromatographic Separation (over time)
3) Chromatogram Read: GTCACCCTGAAT

Gentile F. et al. Direct Imaging of DNA Fibers: The Visage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACGATTTAAATGACGTTGTCGA
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACCCCGACGTTGTCGA
DNA
sequencing
ACGCGTTTTGTTGTGG
ACCCCTGGGGGGGTTGTCGA
Fragments
Sequence
assembly
ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG

Sequence assembly
=
Solve a puzzle and rebuild a DNA sequence from
its pieces (fragments)

Two strategies:
BAC-by-BAC sequencing (BACs)
Whole Genome Shotgun (WGS) sequencing

[Figure: timeline of sequenced genomes and sequencing technologies, 1977 to 2015]
Genomes: MS2 Bacteriophage (3.66 Kb), Epstein-Barr Virus (170 Kb), Haemophilus influenzae (1.83 Mb), Saccharomyces cerevisiae (12.1 Mb), Caenorhabditis elegans (100 Mb), Arabidopsis thaliana (157 Mb), Homo sapiens (3.2 Gb), Oryza sativa (420 Mb)
Technologies: Sanger, 454, Solexa/Illumina, SOLiD, Ion Torrent, PacBio, Oxford Nanopore
Sequenced genomes as of 2016/02/04: Viridiplantae 178, Metazoa 5907, Bacteria 7897

MS2 Bacteriophage (3.658 Kb) 1977
NO SOFTWARE USED

Haemophilus influenzae (1.83 Mb) 1995
Software:
TIGR ASSEMBLER (Smith-Waterman alignments)
Hardware:
SPARCenter 2000 (512 Mb RAM)
Sutton GG. et al. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects
Genome Science and Technology, 1995, 1:9-19
OVERLAPPING METHODOLOGY

OVERLAPPING METHODOLOGY
Reads:
ATGCGTGTGCGTGAGTG
GCGTGTGCGTGAGTGCC
GTGTGCGTGAGTGCCTA
TGCGCGTGCGTGAGTGC
GCGTGTGCGTGAGTGCG
TGCGTGAGCGTGAGTGCC

Sequence homology match (overlap):
ATGCGTGTGCGTGAGTG
    GTGTGCGTGAGTGCCTA
    |||||||||||||
Overlap clustering
Multiple sequence alignment
Consensus call:
ATGCGTGCGYGTGAGTGAGTGCC
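The overlap step can be illustrated with a minimal Python sketch. It assumes exact suffix-prefix matches only, whereas real overlap-based assemblers (such as the Smith-Waterman alignments mentioned for TIGR Assembler) tolerate mismatches and gaps. The function names and the `min_len` threshold are hypothetical choices for this illustration.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 5) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def merge_reads(a: str, b: str, min_len: int = 5) -> str:
    """Append the non-overlapping tail of `b` to `a` (empty string if no overlap)."""
    size = suffix_prefix_overlap(a, b, min_len)
    return a + b[size:] if size else ""


# Two reads taken from the overlap example on the slide.
read_1 = "ATGCGTGTGCGTGAGTG"
read_2 = "GTGTGCGTGAGTGCCTA"
overlap = suffix_prefix_overlap(read_1, read_2)
contig = merge_reads(read_1, read_2)
print(overlap, contig)
```

A full OLC assembler would compute such overlaps for all read pairs, lay the reads out along the overlap graph and then call a consensus from the multiple alignment.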

Homo sapiens (3.2 Gb) 2001
BAC-by-BAC sequencing
International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome.
Nature. 2001. 409:860-921
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
OVERLAPPING METHODOLOGY

Homo sapiens (3.2 Gb) 2001
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
Software:
WGA ASSEMBLER (CABOG)
Hardware:
40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and
160 cores); 5 days.

Ailuropoda melanoleuca (2.3 Gb) 2009
Li R. et al. The Sequence and De Novo Assembly of the Giant Panda Genome. Nature. 2010. 463:311-317
Software:
SOAPdenovo
Hardware:
Supercomputer with 32 cores and 512 Gb RAM.
DE BRUIJN GRAPHS METHODOLOGY

What is a Kmer?
Specific n-tuple or n-gram of nucleic acid or amino acid sequences.
-Wikipedia
(n-tuple: ordered list of elements; n-gram: contiguous sequence of n items from a given sequence of text)

ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 24 nt
5 Kmers of 20-mer:
ATGCGCAGTGGAGAGAGAGC
TGCGCAGTGGAGAGAGAGCG
GCGCAGTGGAGAGAGAGCGA
CGCAGTGGAGAGAGAGCGAT
GCAGTGGAGAGAGAGCGATG
N_kmers = L_read - Kmer_size + 1
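The Kmer decomposition of Sequence A can be reproduced with a few lines of Python. This is a minimal sketch; the function name `kmers` is a hypothetical helper, not part of any tool cited in these slides.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence_a = "ATGCGCAGTGGAGAGAGAGCGATG"
twenty_mers = kmers(sequence_a, 20)
for kmer in twenty_mers:
    print(kmer)
print(len(sequence_a), len(twenty_mers))
```

A sliding window of size k over a read of length L has L - k + 1 start positions, which is why a sequence shorter than k yields no Kmers at all.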

De Bruijn graph assembly (figure), in four steps:
1. Kmer transformation (e.g. Kmer-5): each read is split into its overlapping 5-mers.
2. Kmer count: identical Kmers are merged and counted.
3. De Bruijn graph construction: Kmers that overlap by K-1 bases are linked.
4. De Bruijn graph resolution: a path through the graph gives the consensus sequence.

Reads:
TGCGTGTGAGTGAGTGC
GCGTGTGAGTGAGTGCG
ATGCGTGTGCGTGAGT
GTGTGCGTGAGTGCCT

Kmer Count (node labels in the figure):
ATGCG 1, TGCGT 1, GCGTG 1, CGTGT 1, GTGTG 2, TGTGC 2, GTGCG 2, TGCGT 2, GCGTG 3, CGTGA 3, GTGAG 3, TGAGT 3, TGCGT 1, GCGTG 2

Resolved path (as shown): ATGCGTGTGCGTGAGT

Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291
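The Kmer counting and graph construction steps can be sketched in Python as follows. This is a minimal illustration, not the implementation of any assembler named in these slides: nodes are (K-1)-mers, each distinct Kmer is an edge from its prefix to its suffix, and `min_count` is a hypothetical threshold for dropping rare Kmers. Reverse complements and graph resolution are deliberately left out.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(kmer_counts, min_count=1):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it links to."""
    graph = defaultdict(set)
    for kmer, count in kmer_counts.items():
        if count >= min_count:  # optionally ignore rare, possibly erroneous k-mers
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# The four reads from the slide, decomposed with K = 5.
reads = [
    "TGCGTGTGAGTGAGTGC",
    "GCGTGTGAGTGAGTGCG",
    "ATGCGTGTGCGTGAGT",
    "GTGTGCGTGAGTGCCT",
]
counts = count_kmers(reads, k=5)
graph = build_de_bruijn(counts)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Nodes with more than one outgoing edge are the branching points that the resolution step has to untangle, usually with the help of Kmer counts and read pair information.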

OLC (Overlap-layout-consensus) algorithm is more
suitable for the low-coverage long reads, whereas the
DBG (De-Bruijn-Graph) algorithm is more suitable for
high-coverage short reads and especially for large
genome assembly

What is a Scaffold?
https://genome.jgi.doe.gov/help/scaffolds.jsf
A scaffold is a portion of the genome sequence reconstructed from end-
sequenced whole-genome shotgun clones. Scaffolds are composed of contigs
and gaps. A contig is a contiguous length of genomic sequence in which the order
of bases is known to a high conﬁdence level. Gaps occur where reads from the
two sequenced ends of at least one fragment overlap with other reads in two
different contigs (as long as the arrangement is otherwise consistent with the
contigs being adjacent). Since the lengths of the fragments are roughly known,
the number of bases between contigs can be estimated.

2.0 Overview of a Sequencing Project: Assembly
Decisions during the Experimental Design:
- Technology
- Library Preparation
- Sequencing Amount
Pipeline:
Fastq raw
  -> Reads processing and filtering
       1. Low quality reads (qscore) (Q30)
       2. Short reads (L50)
       3. PCR duplications (Only Genomes).
       4. Contaminations.
       5. Corrections
  -> Fastq Processed
  -> Consensus (Software; Identity %, Kmer ...; Assembly Optimization; Post-assembly Filtering)
  -> Contigs
  -> Scaffolding
  -> Scaffolds
  -> Gap Filling
  -> Scaffolds
  -> Long distance scaffolding
  -> Chromosomes

2.1 Technologies
Technology | Read length (bp) | Accuracy | Reads/Run | Time/Run | Cost/Mb
Applied Bio 3730XL (Sanger) | 400 - 900 | 99.9% | 384 | 4 h (12 runs/day) | $2,400
Roche 454 GS FLX (Pyrosequencing) | 700, Single/Pairs | 99.9% | 1,000,000 | 24 h | $10
Illumina HiSeq2500 (Seq. by synthesis) | 50-250, Single/Pairs | 99% | 4,000,000,000 | 24 to 120 h | $0.05 to $0.15
Illumina MiSeq (Seq. by synthesis) | 50-300, Single/Pairs | 99% | 44,000,000 | 24 to 72 h | $0.17
SOLiD 4 (Seq. by ligation) | 25-50, Single/Pairs | 99.9% | 1,400,000,000 | 168 h | $0.13
ION Torrent (Seq. by semiconductor) | 170-400, Single | 98% | 80,000,000 | 2 h | $2
Pacific Biosciences RSII (SMRT) | 10,000, Single | 85% (99.9%) | 750,000 | 4 h | $0.6
Oxford N. Minion (Nanopore sequencing) | 10,000, Single | 62% (96%) | 4,400,000 | 48 h | $0.02
10X Genomics | 20,000, Single | 99% | 1,000,000,000 | NA | NA

2.2 Libraries
★ Library types (orientations):
• Single reads
• Paired ends (PE) (150-800 bp insert size)
• Mate pairs (MP) (2-40 Kb insert size)
[Figure: forward (F) and reverse (R) read orientations for 454/Roche and Illumina libraries]

2.2 Libraries
• Why is the pair information important?
- De novo assembly:
Reads -> Consensus sequence (Contig)
Contigs + Pair Read information -> Scaffold (or Supercontig), with gaps shown as NNNNN
Scaffolds + Genetic information (markers) -> Pseudomolecule (or ultracontig)

2.3 Sequencing Amount
* Slide from MC. Schatz

2.3 Sequencing Amount
Depending on the genome complexity, the technology used and the assembler:
• More is better (if you have enough computational resources).
- Sanger > 10X (less for BAC-by-BAC approaches).
- 454 > 20X
- Illumina > 100X
- PacBio > 20X (corrected by Illumina) or 50X (self-corrected PacBio)
- 10X Genomics > 20X
• Polyploidy or high heterozygosity increases the amount of reads needed.
• The use of different library types (paired ends and mate pairs with different
insert sizes) is essential.
• Longer reads are preferable.
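The coverage figures above (10X, 20X, 100X) relate to the number of reads through a simple ratio: coverage = number of reads x read length / genome size. The Python sketch below applies it in both directions. The function names and the example numbers (a 1 Gb genome, 150 bp reads) are illustrative assumptions, not values from the course.

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 100X of 150 bp reads for a 1 Gb genome.
genome_size = 1_000_000_000
n_reads = reads_needed(100, 150, genome_size)
print(n_reads, round(coverage(n_reads, 150, genome_size), 2))
```

The estimate refers to processed reads; trimming, duplicate removal and contamination filtering all reduce the effective coverage, so the raw sequencing amount has to be higher.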

2.4 Tools: Read Quality Evaluation
a) Length of the read.
b) Bases with qscore > 20 or 30.
•FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
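The qscore thresholds can be made concrete with a short Python sketch. It assumes Phred+33 encoded quality strings (the fourth line of each FASTQ record in Sanger and recent Illumina files), where Q = -10 * log10(P_error); the function names and the example quality string are hypothetical.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def fraction_at_least(quality_line: str, threshold: int = 30, offset: int = 33) -> float:
    """Fraction of bases in the read whose score is >= threshold."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        return 0.0
    return sum(q >= threshold for q in scores) / len(scores)


# Hypothetical quality string: four Q40 bases, four Q20 bases, two Q2 bases.
quality = "IIII5555##"
print(phred_scores(quality))
print(fraction_at_least(quality, 30), fraction_at_least(quality, 20))
print(error_probability(20), error_probability(30))
```

Q20 corresponds to one expected error in 100 bases and Q30 to one in 1,000, which is why these two values are the usual cut-offs.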

2.4 Tools: Read Trimming and Filtering
• Fastx-Toolkit (http://hannonlab.cshl.edu)
• Ea-Utils (http://code.google.com/p/ea-utils/)
• PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
Software | Demultiplexing | Trimming/Filtering
Fastx-Toolkit | fastx_barcode_splitter | fastq_quality_filter
Ea-Utils | fastq-multx | fastq-mcf
PrinSeq | PrinSeq | PrinSeq
1. Adaptor removal.
4. PCR duplications (Only Genomes, Use PrinSeq).

2.4 Tools: Contaminations
5. Contaminations
Contaminations can be removed by mapping the reads against a reference containing the
contaminants, such as the E. coli and human genomes. The most common tools are Bowtie
or BWA (for short reads) and Blast (for long reads).

2.4 Tools: Read Corrections
6. Read Corrections
Read corrections are generally based on Kmer analysis.
Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage.
Bioinformatics. 2011. 27(13):i137-i141
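The idea behind Kmer-based correction can be sketched in Python: with enough coverage, a true Kmer is seen many times, while a Kmer that overlaps a sequencing error is seen only once or twice. The sketch below only flags the suspicious positions and does not correct them. The reads, the `min_count` threshold and the function names are hypothetical, and real correctors are considerably more elaborate.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_positions(read, counts, k, min_count=2):
    """Start positions of k-mers seen fewer than min_count times in the dataset."""
    return [
        i for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


# Hypothetical data: three error-free copies and one read with a G->T error at index 6.
true_read = "ATGCGTGAGTGCC"
error_read = "ATGCGTTAGTGCC"
reads = [true_read] * 3 + [error_read]
counts = count_kmers(reads, k=5)
print(weak_kmer_positions(true_read, counts, k=5))
print(weak_kmer_positions(error_read, counts, k=5))
```

The run of weak Kmers in the second read covers exactly the windows that contain the erroneous base, which is the signal a corrector uses to locate the error and test candidate substitutions.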

6. Read Corrections: Illumina
Usual Suspects:
• Quake (http://www.cbcb.umd.edu/software/quake/index.html)
• Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software)
• ECHO (http://uc-echo.sourceforge.net/)
• Corrector (http://soap.genomics.org.cn/soapdenovo.html)
• Musket (http://musket.sourceforge.net/)

6. Read Corrections: PacBio
Three options:
- Self-alignment of PacBio reads.
• PBcR (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina contigs.
• PacBioToCA (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina reads.
• Lordec (http://www.atgc-montpellier.fr/lordec/)
• Proovread (https://github.com/BioInf-Wuerzburg/proovread)
• LSC (http://www.healthcare.uiowa.edu/labs/au/LSC/)

2.4 Tools: Assemblers
Assembler | Type | Technology Used | Features | Link
Arachne | Overlap-layout-consensus | Sanger, 454 | Highly configurable | http://www.broadinstitute.org/crd/wiki/index.php/Main_Page
CABOG | Overlap-layout-consensus | Sanger, 454, Illumina, PacBio | Highly configurable | http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page
MIRA | Overlap-layout-consensus | Sanger, 454 | Highly configurable | http://sourceforge.net/apps/mediawiki/mira-assembler
gsAssembler | Overlap-layout-consensus | Sanger, 454 | Easy to use | http://454.com/products/analysis-software/index.asp
iAssembler | Overlap-layout-consensus | Sanger, 454 | Improves MIRA | http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS | de Bruijn graph | 454 or Illumina | Easy to use | http://www.bcgsc.ca/platform/bioinfo/software/abyss
ALLPATHS-LG | de Bruijn graph | 454 or Illumina | Good results | http://www.broadinstitute.org/software/allpaths-lg/blog
Ray | de Bruijn graph | 454 or Illumina | Slow but uses less memory | http://denovoassembler.sf.net/
SOAPdenovo2 | de Bruijn graph | 454 or Illumina | Fastest | http://soap.genomics.org.cn/soapdenovo.html
Velvet | de Bruijn graph | 454 or Illumina or SOLiD | SOLiD | http://www.ebi.ac.uk/~zerbino/velvet/
Minia | Bloom filter + de Bruijn graph | Illumina | Really fast; only contigs | https://github.com/GATB/minia

... but there are more assemblers and more information. Take a look at SeqAnswers:
http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application?Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly
Also highly recommendable:
Assembler | Type | Technology Used | Features | Link
HGAP | Overlap-layout-consensus | PacBio | Recommended by PacBio. Multiple tools. Difficult to | https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
Falcon | Overlap-layout-consensus | PacBio | Faster than HGAP or CABOG | https://github.com/PacificBiosciences/falcon
Sprai | Overlap-layout-consensus | PacBio | Easy to use interface for CABOG | http://zombie.cb.k.u-tokyo.ac.jp/sprai/README.html
Canu | Overlap-layout-consensus | PacBio, Oxford Nanopore | Easy to use | http://canu.readthedocs.org/
MECAT | Overlap-layout-consensus | PacBio, Oxford Nanopore | Fast assembler | https://github.com/xiaochuanle/MECAT
SPAdes | de Bruijn graphs | Illumina, Ion Torrent, PacBio, Oxford Nanopore | Diverse types of input | http://cab.spbu.ru/software/spades/
MaSuRCA | Hybrid | PacBio + Illumina | Hybrid assembly | http://www.genome.umd.edu/masurca.html
Supernova | Hybrid | 10X Genomics | Used with 10X Genomics | https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/running

Scaffolding approaches:
1. Paired ends/Mate pair information (Since 2008) - Short Distances
2. Optical mapping data (Since 2012) - Medium-Long Distance
3. HiC (Since 2014) - Long-Very Long Distance
4. Linkage groups anchoring (Since 2001) - Very Long Distance

1. Paired ends/Mate pair information
0. Input: Contigs, Paired ends (e.g. 300 bp) and Mate pairs (e.g. 2000 bp)
1.1. Small fragments mapping
1.2. Scaffold building (contigs joined by short gaps: NNN)
2.1. Long fragments mapping
2.2. Scaffold building (scaffolds joined by longer gaps: NNNNNNNN)

2.4 Tools: Stand Alone Read Pairs Scaffolders
Software | Technology Used | Link
SSPACE | Short reads, pairs | http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE
SSPACE-Long | Long reads | http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE-longread
Bambus2 | Pairs | https://www.cbcb.umd.edu/software/bambus2

2.4 Tools: Stand Alone Scaffolders

2.4 Tools: Gap Filling
Software | Type | Technology Used | Features | Link
GapCloser | Overlap-layout-consensus | Illumina | C++; Free | http://soap.genomics.org.cn/soapdenovo.html
GapFiller | Overlap-layout-consensus | Any | Perl; Commercial* | http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/
PBSuite | Overlap-layout-consensus | 454, PacBio | Python; Long reads | http://sourceforge.net/p/pb-jelly/wiki/

2. Optical mapping data (Since 2012).
Optical mapping is a technique for constructing ordered, genome-wide, high-resolution
restriction maps from single, stained molecules of DNA, called "optical maps". By mapping
the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of
resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that
sequence.
https://en.wikipedia.org/wiki/Optical_mapping

3. HiC, Chromatin Conformation Capture (Since 2014).
Chromosome conformation capture techniques (often abbreviated to 3C technologies
or 3C-based methods) are a set of molecular biology methods used to analyze the spatial
organization of chromatin in a cell. These methods quantify the number of interactions
between genomic loci that are nearby in 3-D space, but may be separated by many
nucleotides in the linear genome.
https://en.wikipedia.org/wiki/Chromosome_conformation_capture

3. HiC, Chromatin Conformation Capture (Since 2014).
Dudchenko et al., Science, 07 Apr 2017, Vol. 356, Issue 6333, pp. 92-95
DOI: 10.1126/science.aal3327

Argyris et al., BMC Genomics 2015, 16:4
https://doi.org/10.1186/s12864-014-1196-3

2.5 Computers
Bigger is better:
• How much you need depends on:
➡ How big is your genome?
A human-size genome (~3 Gb) requires ~256 Gb to 1 Tb of RAM.
➡ How many reads do you have?
More reads, more memory and hard disk.
➡ What software are you going to use?
OLC uses more memory and time than DBG.
➡ What parameters are you going to use?
Bigger Kmers, more memory.
Four options:
1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000).
2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU).
3. Use a supercomputing center associated with NSF, NIH, USDA... where they
offer reduced prices (iPlant, Indiana University....).
4. Collaborate with some group with a big server.

2.6 Assembly evaluation
Different methods to evaluate an assembly:
1. Assembly stats (total assembly size vs expected size; N50/L50…)
2. Read remapping, variant calling and coverage evaluation.
3. Comparison with alternative sequencing methods.
4. Gene Space Completeness (RNASeq & EST mapping, BUSCO).
5. Comparison with genetic information.

2.6 Assembly evaluation: Assembly stats
Rationale: The closer the assembly size is to the expected genome size, the
better the genome assembly.
• Ideal case: The final assembly will have as many sequences as
the genome has chromosomes.
• Real case: The assembly will be fragmented into sequences of different
sizes. The lower the number of fragments, the better the quality.

Several assemblies will be generated during the assembly optimization. The most used
parameters to evaluate an assembly are:
1. Total Assembly Size
How far is this value from the estimated genome size?
2. Total Number of Sequences (Scaffolds/Contigs)
How far is this value from the number of chromosomes?
3. Longest scaffold/contig
4. Average scaffold/contig size
5. N50/L50 (or any other N/L)
Number of sequences (N) and minimum size of them (L) that represent 50% of
the assembly when the sequences are sorted by size, from biggest to smallest.
Note that many tools and papers use the opposite naming (N50 for the length, L50 for the number of sequences).
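The N/L statistics can be computed with a short Python function. This sketch follows the naming used in these slides (N = number of sequences, L = size of the smallest sequence in that set); the function name `nl_stats` and the toy sizes are hypothetical.

```python
def nl_stats(sizes, percent=50):
    """Return (N, L): how many of the largest sequences are needed to reach
    `percent` % of the total size, and the size of the smallest one used."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    running = 0
    for n, size in enumerate(ordered, start=1):
        running += size
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return n, size
    raise ValueError("sizes must contain at least one sequence")


# Toy assembly of four sequences (arbitrary units), total size 100.
toy_sizes = [10, 40, 20, 30]
print(nl_stats(toy_sizes, 50))
print(nl_stats(toy_sizes, 90))
```

In the toy example the two largest sequences (40 + 30) already cover half of the assembly, so N50 is 2 sequences and L50 is 30; reaching 90% needs a third sequence of size 20.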

N50/L50
Total assembly size: 1000 Mb
Sequences ordered by descending size (Mb): 93 87 75 68 62 56 50 44 37 30 25 20 18 ...
50 % assembly: 500 Mb
N50 = 8 sequences (the first 7 sequences add up to 491 Mb, the first 8 to 535 Mb)
L50 = 44 Mb

N90/L90
90 % assembly: 900 Mb
N90 = 29 sequences
L90 = 12.5 Mb

Example: Sequencing of the Petunia axillaris genome.
P. axillaris, Current Assembly Peaxi v1.6.2
Dataset | Contigs | Scaffolds
Total assembly size (Gb) | 1.22 | 1.26
Total assembly sequences | 109,892 | 83,639
Longest sequence (Mb) | 0.57 | 8.56
Sequence length mean (Kb) | 11.08 | 15.05
N90 (sequences) | 13,481 | 1,051
L90 (Kb) | 22.28 | 295.75
N50 (sequences) | 3,943 | 309
L50 (Kb) | 95.17 | 1,236.73

Sequencing data (Kmer count)
[Figure: 63 Kmer Frequency for Petunia axillaris. X axis: Coverage (0 to 60); Y axis: Kmer Count (0 to 1e+08). The low-coverage Kmers are marked as possible sequencing errors; peak = 31.]
Estimated genome size = 1.38 Gb
G = (N − B) * K / D * 2
G = genome size
N = total kmers
B = possible kmers with errors
K = kmer length
D = kmer peak
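A Kmer histogram can be turned into a rough genome size estimate with a few lines of Python. The sketch below uses the common simplified form G ≈ (N − B) / D: the Kmers below an error cut-off are discarded and the remaining Kmer total is divided by the peak depth. It does not attempt to reproduce the exact scaling of the formula above, and the function name, the cut-off and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram: dict[int, int], error_cutoff: int) -> tuple[int, float]:
    """histogram maps a depth to the number of distinct k-mers seen at that depth.
    Returns (peak depth, estimated genome size in k-mer positions)."""
    # Drop low-depth k-mers, which are dominated by sequencing errors (B).
    solid = {depth: n for depth, n in histogram.items() if depth > error_cutoff}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    peak = max(solid, key=solid.get)                   # D: depth of the main peak
    total = sum(depth * n for depth, n in solid.items())  # N - B: k-mer occurrences kept
    return peak, total / peak


# Toy histogram: an error tail at depths 1-2 and a main peak around depth 10.
toy_histogram = {1: 1000, 2: 200, 9: 50, 10: 100, 11: 50}
peak, size = estimate_genome_size(toy_histogram, error_cutoff=2)
print(peak, size)
```

With real data the histogram would come from a Kmer counter such as Jellyfish, and the cut-off is usually placed at the valley between the error tail and the main peak.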

Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Kmer Count* = 1.38 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
• Assembly size (contigs v1.6.2) = 1.22 Gb
Estimated genome not assembled = 110 - 120 Mb

2.6 Assembly evaluation: Read remapping
Rationale: Low quality assembled regions will have:
• Low coverage (breaking points)
• High coverage and a high number of polymorphisms (due to the collapsing of
multiple copies).

2.6 Assembly evaluation: Read remapping
Short Processed Reads + Assembly -> Bowtie2/BWA -> Mapped Reads
Long Processed Reads + Assembly -> BlasR -> Mapped Reads
Mapped Reads -> Bedtools -> Coverage
Mapped Reads -> FreeBayes -> Variants

2.6 Assembly evaluation: Comparison with alternative sequencing methods
Rationale: A sequence assembly should agree with alternative sequencing
methods.
• Illumina based assembly —> PacBio based assembly
• Short distance scaffolding (< 200 Kb) —> Optical Mapping scaffolding.
• Long distance scaffolding (> 200 Kb) —> BACs, YACs and Genetic maps
Sharma et al. 2013
E.g. Potato genome

2.6 Assembly evaluation: Gene space completeness
Rationale: The gene space is defined as the whole collection of genes in a
genome. A complete genome should have a complete gene space. There are
two ways to evaluate the completeness of a gene space.
• Check the completeness of a set of conserved genes (CEGMA/BUSCO).
- Maybe not all the conserved genes are in the studied genome.
• Check the mapping rate of a transcriptome to a genome (RNASeq).
- The RNASeq may have contaminations of other organisms.

Another way to estimate how good an assembly is, is to run BUSCO on the
assembly and evaluate how much of the gene space was captured.
http://busco.ezlab.org/

2.6 Assembly evaluation: Comparison with Genetic Information
Rationale: The assembly information should be in agreement with genetic
maps and recombination studies.
[Figure: Resistant x Susceptible cross; F2 individuals; DNA; 0/1 marker strings (e.g. 000001111111111110); Allele frequency. Highlighted regions: recombination area, scaffold misplacement.]

2.7 Assembly stages
(stages are numbered 1 to 5 in the figure)
Whole Genome Representation | Sequence Status | Genes | Usability
Incomplete for non-repetitive regions | Small scaffolds and contigs | Incomplete genes | Markers development
Complete for non-repetitive regions | Medium scaffolds and contigs | Complete but 1-2 genes/contig | Gene mining
Complete for non-repetitive regions | Large scaffolds and contigs | Several dozens of genes/contig | Microsynteny
Complete for almost the whole genome | Pseudomolecules | Hundreds of genes/contig | Any (Synteny, Candidate gene by QTLs)
Complete genome | Pseudomolecules | Thousands of genes/contig |

2.7 Assembly stages
Arabidopsis thaliana (TAIR10)
(5 chromosomes with 119 Mb)
(96 gaps with 185 Kb total)
Solanum lycopersicum (v2.5)
(12 chromosomes with 824 Mb)
(26,866 gaps with 82 Mb total)
Petunia axillaris (v1.6.2)
(83,639 scaffolds with 1.26 Gb)
Nicotiana benthamiana (v0.4.4)
(141,339 scaffolds with 2.63 Gb*)
*0.61-0.67 Gb not assembled

2.7 Assembly stages: Genome size vs assembly size

2.7 Assembly stages
Cytogenetic data
(Flow cytometry)
✴ http://data.kew.org/cvalues
Example: Petunia axillaris genome size 1.37 Gb
(White and Rees. 1987)

2.7 Assembly stages
Sequencing data
(Kmer count)
• Jellyfish (http://www.genome.umd.edu/jellyfish.html)
• BFCounter (http://pritchardlab.stanford.edu/bfcounter.html)

3. Things that you should know about genomes
1. They have variable size. For example, in angiosperm plants they range from 63 Mb
(Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data
at:
✴ Plants: http://data.kew.org/cvalues/
✴ Animals: http://www.genomesize.com/
2. They can be polyploid. This means that highly similar homoeologous regions will
collapse during the assembly.
3. They can be highly heterozygous and polymorphic. In this case some of the alleles will
collapse and some will not. The effective coverage will be lower than expected.
4. They can have recent whole genome duplication (or triplication) events. Paralogous
genes may collapse during the assembly.
5. Repeats: most genomes are full of repeats and they are difficult to resolve. By
default assemblers throw them away based on the Kmer histogram.

3.1 Collapsing problem
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
Gene 1 A
Gene 1 B
CGCCCTTGACGACATGACGACA
Collapsed consensus Gene 1 A + Gene 1 B
✴ Polyploidy
✴ Whole Genome Duplication
Gene 1
Gene 1 A
Gene 1 B
Schmutz J et al. Genome Sequence of the
Paleoploid Soybean. Nature 2010 463:178-183

3.1 Collapsing problem
✴ Whole Genome Duplication
http://chibba.agtec.uga.edu/duplication/index/home

3.1 Collapsing evaluation
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
CGCCCTTGACGACATGACGACA Consensus
Reads
Read collapsing during the assembly process can be evaluated by mapping reads
back to the consensus sequence and analyzing SNPs.
CACTTGACGACATGA
SNPs
=
Heterozygous Positions (AB)
+
Homoeologs Collapsing (AAAB, AABB, ABBB)
+
Paralogs Collapsing (variable)

3.1 Collapsing solutions
1. Sequence the two progenitors and use them as a reference.
✓ Approach used previously (cotton genome; Paterson A. et al. 2012).
❖ Information not contained in the references will be lost.
❖ Progenitors are not always available.
❖ Gene conversions and non-homologous recombinations.
2. Use single molecule technologies such as PacBio or 10X Genomics.
✓ Best approach because it will give the right phase for long sequence
chunks. Software exists to integrate big sequences (minimus2...).
3. Assemble everything with the higher Kmer, evaluate the collapsed regions
and rescue the haplotype/region using SNPs and paired end read information.
✓ Cheap, easy to apply with the current technologies.
❖ No software available.
3.2 Repeats and assemblers
5. Repeats: most genomes are full of repeats and they are difficult to resolve. By
default some assemblers throw them away based on the Kmer histogram. For example,
SOAPdenovo throws away anything up to 255.
It is convenient to do a Kmer analysis using Jellyfish (or another Kmer counter) before
doing any assembly, to analyze contaminations (bacterial contaminations may appear with
high Kmer content) and repeats.
Software to analyze repeats based on sequence graphs:
RepeatExplorer
(http://repeatexplorer.umbr.cas.cz/)
Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11, 378 (2010)

4. Examples of genome assemblies
Petunia axillaris, Petunia inflata
Sinningia speciosa, Persea americana

4. Examples of genome assemblies: Sinningia speciosa
Genome Size
174 Mb* - 573 Mb**
(commercial accession = 337 Mb)
(Zaitlin and Pierce. 2010)
Chromosome number
2n = 2x = 26
(Zaitlin and Pierce. 2010)
* Accession “Antonio Dias”
** Accession “Dona Lourdes”

Reads Stats: S. speciosa
Library | Processed reads* (Millions) | Processed reads coverage (X)
PE 500 bp (GRCF 2013) | 253 | 75
TOTAL PAIR ENDS | 253 | 75
MP 2 Kb (BGI 2011) | 144 | 42
MP 5 Kb (BGI 2011) | 166 | 49
TOTAL MATE PAIRS | 310 | 91
TOTAL SHORT READS | 563 | 166
PACBIO READS | 1 | 21
* Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 30, min. length 50 bp), PrinSeq (to
remove the duplicated pairs) and Musket (to correct the reads).

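Sinningia speciosa "Avenida Niemeyer": assembly and annotation workflow (figure; layout inferred from its labels)
DNA -> PacBio reads (21X) -> Sprai -> PacBio assembly
DNA -> Illumina reads (166X) -> SOAPdenovo -> Illumina assembly
PacBio assembly + Illumina assembly -> PBJelly, SSPACE-Long -> Hybrid assembly
RNA -> Illumina reads -> Trinity -> Illumina assembly
Hybrid assembly -> MAKER -> Annotated genome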

Current Genome Assembly Stats: S. speciosa
Assembly | Sispe v0.0.4 (PE500b) | | Sispe v0.1.2 (PE500b+MP2Kb+MP5Kb) | | Sispe v0.3.8 (PE500b+MP2Kb+MP5Kb+PB) |
Dataset | Contigs | Scaffolds | Contigs | Scaffolds | Contigs | Scaffolds
Total assembly size (Mb) | 268 | 270 | 325 | 368 | 385 | 395
Total assembly sequences | 91,889 | 75,159 | 232,055 | 166,101 | 18,752 | 8,078
Longest sequence (Mb) | 0.25 | 0.34 | 0.37 | 1.84 | 1.61 | 2.93
Sequence length mean (Kb) | 2.91 | 3.59 | 1.4 | 2.21 | 20.55 | 48.97
N90 (sequences) | 29,676 | 26,892 | 51,679 | 4,470 | 4,139 | 1,776
L90 (Kb) | 1.16 | 1.27 | 0.57 | 5.68 | 14.78 | 35.34
N50 (sequences) | 4,030 | 3,369 | 5,242 | 554 | 682 | 337
L50 (Kb) | 15.07 | 17.21 | 12.9 | 152.59 | 130.05 | 281.78

Estimated genome size for S. speciosa:
• Flow Cytometry (Zaitlin & Pierce, 2010) = 337 Mb
• Kmer Count* = 355 Mb
• Assembly size (scaffolds v0.1.2) = 368 Mb
• Assembly size (contigs v0.1.2) = 325 Mb
• Assembly size (scaffolds v0.3.8) = 395 Mb
• Assembly size (contigs v0.3.8) = 385 Mb
Estimated genome not assembled v0.1.2 = 12 - 30 Mb
Estimated genome not assembled v0.3.8 = Unknown

4. Examples of genome assemblies: Petunia axillaris and P. inflata
Petunia axillaris Petunia inflata
genome size
1.17 Gb
genome size
1.37 Gb
chromosome number
2n = 2x = 14
Petunia hybrida
* http://data.kew.org/cvalues/

Reads Stats: P. axillaris and P. inflata
Library | P. axillaris processed reads* (Millions) | P. axillaris coverage (X) | P. inflata processed reads* (Millions) | P. inflata coverage (X)
PE 170 bp (BGI 2011) | 239 | 17 | 317 | 22
PE 350 bp (BGI 2011) | 166 | 12 | 120 | 6
PE 500 bp (BGI 2011) | 149 | 10 | 254 | 25
PE 800 bp (BGI 2011) | 167 | 12 | 143 | 15
PE 1000 bp (UIL2013) | 462 | 48 | NA | NA
TOTAL PAIR ENDS | 1,183 | 99 | 834 | 68
MP 2 Kb (BGI 2011) | 179 | 11 | 155 | 15
MP 5 Kb (BGI 2011) | 120 | 8 | 200 | 20
MP 8 Kb (UIL2013) | 115 | 10 | 104 | 17
MP 15 Kb (UIL2013) | 98 | 9 | 92 | 15
TOTAL MATE PAIRS | 513 | 38 | 551 | 67
TOTAL SHORT READS | 1,696 | 137 | 1,385 | 135
PacBio Reads (UNIBE2013) | 8,599 | 22 | NA | NA
* Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 20, min. length 50 bp), PrinSeq (to
remove the duplicated pairs) and Musket (to correct the reads).

P. axillaris Illumina-PacBio hybrid assembly (figure)
Inputs: Processed Illumina reads, Pair End/Mate Pair Data, Raw PacBio reads
Tools (role): SOAPdenovo2 (assembly), SSPACE (scaffolding), GapCloser (gap filling 0), EC-TOOLS (error correction), AHA (assembly and scaffolding), PBJELLY (gap filling 1), PBJELLY (gap filling 2), Perl Script (filtering)
Intermediate and final products: Contigs, Scaffolds, Corrected PacBio, Peaxi_v1.3.2, Scaffolds PacBio 1, Scaffolds PacBio 2, Peaxi_v1.6.0, Peaxi_v1.6.1, Peaxi_v1.6.2
* Assembly made by Michel Moser at UNIBE

Current (Frozen) Genome Assembly Stats
Species | P. axillaris | | P. inflata |
Current Assembly | Peaxi v1.6.2 | | Peinf v1.0.2 |
Dataset | Contigs | Scaffolds | Contigs | Scaffolds
Total assembly size (Gb) | 1.22 | 1.26 | 1.2 | 1.29
Total assembly sequences | 109,892 | 83,639 | 213,254 | 136,283
Longest sequence (Mb) | 0.57 | 8.56 | 0.57 | 5.81
Sequence length mean (Kb) | 11.08 | 15.05 | 5.63 | 9.44
N90 (sequences) | 13,481 | 1,051 | 41,876 | 2,450
L90 (Kb) | 22.28 | 295.75 | 4.57 | 60.31
N50 (sequences) | 3,943 | 309 | 9,813 | 406
L50 (Kb) | 95.17 | 1,236.73 | 34.99 | 884.43
* Hybrid assembly
* P. inflata assembly was performed with SOAPdenovo2 and GapCloser with an optimal Kmer=79

Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
Estimated genome not assembled = 110 - 120 Mb

4. Examples of genome assemblies Persea americana
Persea americana
genome size
0.90 Gb
2n = 2x = 24
(Arumuganathan and Earle, 1991)

Experimental design
• Sequencing of a low heterozygosity avocado accession, VC75
• Base assembly made by self-corrected long reads and Canu
(PacBio > 75X)
• Polished with high genome coverage short reads with Racon
(Illumina > 100X).
• Chromo-scaffolding with HiC

• Base assembly made by self-corrected long reads and Canu (PacBio > 75X)
Current Assembly Persia 1.0.0
Dataset Contigs
Total assembly size (Gb) 0.86
Total assembly sequences 3,204
Longest sequence (Mb) 10.56
Sequence length mean (Kb) 268.74
N90 (sequences) 850
L90 (Kb) 102.62
N50 (sequences) 136
L50 (Mb) 1.56

Estimated genome size for Persea americana:
• Flow Cytometry (Manosalva, 2018) = 0.90 Gb
VC75: het = 0.16%
Estimated genome not assembled = 40 - 420 Mb

• Chromo-scaffolding with HiC
[Figure: HiC contact map of the assembly (0 to 900 MB on both axes) with Coverage and the tracks mismatch_wide, mismatch_narrow, suspect, repeats_wide, depletion_score_narrow, coverage_wide and depletion_score_wide (at step 0).]
