SlideShare uma empresa Scribd logo
1 de 105
Baixar para ler offline
CURSO DE BIOINFORMATICA APLICADA:
Secuenciacion y Ensamblado de Genomas
Aureliano Bombarely
School of Plant and Environmental Sciences (SPES)
Virginia Tech
aurebg@vt.edu
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
1. A brief history of the sequence assembly.
http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
Genome = N x Sequence of DNA
Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
DNA
sequencing
???
1. A brief history of the sequence assembly.
DNA Sequencing:
“Process of determining the precise order of nucleotides within a DNA molecule.”
-Wikipedia
ddATP
ddGTP
ddCTP
ddTTP
Taq-Polymerase
STOP
time
2) Chromatographic
Separation
GTCACCCTGAAT
1) PCR with ddNTPs
3) Chromatogram Read
DNA Sanger Sequencing
1. A brief history of the sequence assembly.
Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACGATTTAAATGACGTTGTCGA
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACCCCGACGTTGTCGA
DNA
sequencing
ACGCGTTTTGTTGTGG
ACCCCTGGGGGGGTTGTCGA
Fragments
Sequence
assembly
ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG
1. A brief history of the sequence assembly.
Sequence assembly
=
Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
Image courtesy of iStock photo
1. A brief history of the sequence assembly.
Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
BACs BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
MS2Bacteriophage(3.66Kb)
Epstein-BarrVirus(170Kb)
Haemophilusinfluenzae(1.83Mb)
Saccharomycescerevisae(12.1Mb)
Caenorhabditiselegans(100Mb)
Arabidopsisthaliana(157Mb)
Homosapiens(3.2Gb)
Oryzasativa(420Mb)
Sanger
454
Solexa/Illumina
SOLiD
Ion Torrent
PacBio
OxN
2
3
4
7
2016/02/04 Sequenced
Genomes
Viridiplantae 178
Metazoa 5907
Bacteria 7897
1977 MS2 Bacteriophage (3.658 Kb) 1977
N
O
SO
FTW
A
RE
U
SED
1. A brief history of the sequence assembly.
Haemophilus influenzae (1.83 Mb) 1995
Software:
TIGR ASSEMBLER
Hardware:
SPARCenter 2000 (512 Mb RAM)
1. A brief history of the sequence assembly.
Haemophilus influenzae (1.83 Mb) 1995
Software:
TIGR ASSEMBLER Smith-Waterman alignments
Sutton GG. et al.TIGR Assembler:A New Tool to Assembly Large Shotgun Sequencing Projects
Genome Science and Technology, 1995,1:9-19
O
VERLA
PPIN
G
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
ATGCGTGTGCGTGAGTG
GCGTGTGCGTGAGTGCC
GTGTGCGTGAGTGCCTA
TGCGCGTGCGTGAGTGC
GCGTGTGCGTGAGTGCG
TGCGTGAGCGTGAGTGCC
OVERLAPPING
METHODOLOGY
Sequence
homology
match
ATGCGTGTGCGTGAGTG
GTGTGCGTGAGTGCCTA
|||||||||||||
Multiple
sequence
alignment
GCGTGTGCGTGAGTGCC
TGCGCGTGCGTGAGTGC
GCGTGTGCGTGAGTGCG
TGCGTGAGCGTGAGTGCC
ATGCGTGTGCGTGAGTG
GTGTGCGTGAGTGCCTA
Consensus
call
overlap
ATGCGTGCGYGTGAGTGAGTGCC
Overlap
clustering
Homo sapiens (3.2 Gb) 2001
BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome.
Nature. 2001. 409:860-921
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
O
VERLA
PPIN
G
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
Homo sapiens (3.2 Gb) 2001
Whole Genome Shotgun (WGS) sequencing
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
Software:
WGAASSEMBLER (CABOG)
Hardware:
40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and
160 cores); 5 days.
1. A brief history of the sequence assembly.
Ailuropoda melanoleura (2.3 Gb) 2009
Whole Genome Shotgun (WGS) sequencing
Li R. et al.The Sequence and the Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317
Software:
SOAPdenovo
Hardware:
Supercomputer with 32 cores and 512 Gb RAM.
BRU
IJN
G
RA
PH
S
M
ETH
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
What is a Kmer ?
ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 25 nt
Specific n-tuple or n-gram of nucleic acid or amino acid sequences.
-Wikipedia
ordered list
of elements
contiguous sequence
of n items from a given
sequence of text
5 Kmers of 20-mer
ATGCGCAGTGGAGAGAGAGC
TGCGCAGTGGAGAGAGAGCG
GCGCAGTGGAGAGAGAGCGA
CGCAGTGGAGAGAGAGCGAT
GCAGTGGAGAGAGAGCGATG
N_kmers = L_read - Kmer_size
1. A brief history of the sequence assembly.
Kmer
transformation
(e.g. Kmer-5)
Kmer
count
Bruijn Graph
Construction
TGCGTGTGAGTGAGTGC
GCGTGTGAGTGAGTGCG
ATGCGTGTGCGTGAGT
GTGTGCGTGAGTGCCT
TGAGT
GTGAG
CGTGA
GCGTG
TGCGT
GTGCG
TGTGC
GTGTG
CGTGT
GCGTG
TGCGT
ATGCG
GTGCC
AGTGC
GAGTG
TGAGT
GTGAG
CGTGA
GCGTG
TGCGT
GTGCG
TGTGC
GTGTG
TGCCT
AGTGC
GAGTG
TGAGT
GTGAG
TGCGT
GAGTG
TGAGT
GTGAG
AGTGA
GTGCG
AGTGA
GAGTG
TGAGT
GTGAG
TGTGA
GTGTG
CGTGT
GCGTG
AGTGC
GAGTG
TGAGT
GTGAG
Kmer Count
ATGCG 1
TGCGT 1
GCGTG 1
CGTGT 1
GTGTG 2
TGTGC 2
GTGCG 2
TGCGT 2
GCGTG 3
CGTGA 3
GTGAG 3
TGAGT 3
TGCGT 1
GCGTG 2
TGAGT
GTGAG
CGTGA
GCGTG
TGCGT
GTGCG
TGTGC
GTGTGCGTGTGCGTGTGCGTATGCG
CGTGT
GCGTG
TGCGT
Bruijn Graph
Resolution
1 1 1 1 2
2
2
2
3
3
3
3
1
2
2
ATGCGTGTGCGTGAGT
TGTGA
GTGTG
CGTGT
GCGTG
Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291
1. A brief history of the sequence assembly.
OLC (Overlap-layout-consensus) algorithm is more
suitable for the low-coverage long reads, whereas the
DBG (De-Bruijn-Graph) algorithm is more suitable for
high-coverage short reads and especially for large
genome assembly
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
What is a Scaffold?
https://genome.jgi.doe.gov/help/scaffolds.jsf
A scaffold is a portion of the genome sequence reconstructed from end-
sequenced whole-genome shotgun clones. Scaffolds are composed of contigs
and gaps. A contig is a contiguous length of genomic sequence in which the order
of bases is known to a high confidence level. Gaps occur where reads from the
two sequenced ends of at least one fragment overlap with other reads in two
different contigs (as long as the arrangement is otherwise consistent with the
contigs being adjacent). Since the lengths of the fragments are roughly known,
the number of bases between contigs can be estimated.
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Fastq raw
Fastq Processed
Contigs
Scaffolds
Decisions during the
Experimental Design
Software
Identity %, Kmer ...
Post-assembly Filtering
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
Chromosomes
Long distance
scaffolding
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Decisions during the
Experimental Design
2. Sequencing, tools and computers.
2.1 Technologies
Technology
Read length
(bp)
Accuracy Reads/Run Time/Run Cost/Mb
Applied Bio 3730XL
(Sanger)
400 - 900 99.9% 384
4 h
(12 runs/day)
$2,400
Roche 454 GS FLX
(Pyrosequencing)
700
Single/Pairs
99.9% 1,000,000 24h $10
Illumina HiSeq2500 (Seq.
by synthesis)
50-250
Single/Pairs
99% 4,000,000,000 24 to 120 h $0.05 to $0.15
Ilumina MiSeq
(Seq. by synthesis)
50-300
Single/Pairs
99% 44,000,000 24 to 72 h $0.17
SOLiD 4
(Seq. by ligation)
25-50
Single/Pairs
99.9% 1,400,000,000 168 h $0.13
ION Torrent
(Seq. by semiconductor)
170-400
Single
98% 80,000,000 2 h $2
Pacific Biosciences RSII
(SMRT)
10,000
Single
85%
(99.9%)
750,000 4 h $0.6
Oxford N. Minion
(Nanopore sequencing)
10,000
Single
62%
(96%)
4,400,000 48 h $0.02
10X Genomics
20,000
Single
99% 1,000,000,000 NA NA
2. Sequencing, tools and computers.
★ Library types (orientations):
• Single reads
• Pair ends (PE) (150-800 bp insert size)
• Mate pairs (MP) (2-40 Kb insert size)
F
F
F
F
R
R
R 454/Roche
Illumina
Illumina
2.2 Libraries
2. Sequencing, tools and computers.
• Why is important the pair information ?
F
- novo assembly:
Consensus sequence
(Contig)
Reads
Scaffold
(or Supercontig)
Pair Read information
NNNNN
Genetic information (markers)
Pseudomolecule
(or ultracontig)
NNNNN NN
2.2 Libraries
2. Sequencing, tools and computers.
2.3 Sequencing Amount
* Slide from MC. Schatz
2. Sequencing, tools and computers.
2.3 Sequencing Amount
Depending of the genome complexity, technology used and assembler:
• More is better (if you have enough computational resources).
- Sanger > 10X (less for BACs-by-BACs approaches).
- 454 > 20X
- Illumina > 100X
- PacBio > 20X (corrected by Illumina) or 50X (corrected PacBio)
- 10X Genomics > 20X
• Polyploidy or high heterozygosity increase the amount of reads needed.
• The use of different library types (pair ends and mate pairs with different
insert sizes is essential).
• Longer reads is preferable.
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
Chromosomes
Long distance
scaffolding
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
2. Sequencing, tools and computers.
2.4 Tools: Read Quality Evaluation
a) Length of the read.
b) Bases with qscore > 20 or 30.
•FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
2. Sequencing, tools and computers.
2.4 Tools: Read Trimming and Filtering
• Fastx-Toolkit (http://hannonlab.cshl.edu)
• Ea-Utils (http://code.google.com/p/ea-utils/)
• PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
Software Multiplexing Trimming/Filtering
Fastx-Toolkit fastx_barcode_splitter fastq_quality_filter
Ea-Utils fastq-multx fastq-mcf
PrinSeq PrinSeq PrinSeq
1. Adaptor removal.
2. Low quality reads (qscore) (Q30)
3. Short reads (L50)
4. PCR duplications (Only Genomes, Use PrinSeq).
2. Sequencing, tools and computers.
2.4 Tools: Contaminations
5. Contaminations
Contaminations can be removed mapping the reads against a reference with the
contaminants such as E. coli and human genomes.The most common tools are Bowtie
or BWA (for short reads) and Blast (for long reads).
2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections
Read corrections are generally based in the Kmer analysis.
Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage
Bioinformatics. 2011 27 (13):i137-i141
2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections: Illumina
• Quake (http://www.cbcb.umd.edu/software/quake/index.html)
• Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software)
• ECHO (http://uc-echo.sourceforge.net/)
• Corrector (http://soap.genomics.org.cn/soapdenovo.html)
• Musket (http://musket.sourceforge.net/)
Usual Suspects:
2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections: PacBio
Three options:
- Self-alignment of PacBio reads.
• PBcR (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina contigs.
• PacBioToCA (Celera Assembler: http://wgs-assembler.sourceforge.net/)
- Alignment of PacBio reads with Illumina reads.
• Lordec (http://www.atgc-montpellier.fr/lordec/)
• Proovread (https://github.com/BioInf-Wuerzburg/proovread)
• LSC (http://www.healthcare.uiowa.edu/labs/au/LSC/)
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Consensus
Scaffolding
2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus
Sanger, 454, Illumina,
PacBio
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
Minia Bloom filter + Brujn graph Illumina
Really fast
Only contigs
https://github.com/GATB/minia
2.4 Tools: Assemblers
2. Sequencing, tools and computers.
2.4 Tools: Assemblers
... but there are more assemblers and information... Take a look to SeqAnswers
http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application?
Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly
Also highly recommendable:
2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus
Sanger, 454, Illumina,
PacBio
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
Minia Bloom filter + Brujn graph Illumina
Really fast
Only contigs
https://github.com/GATB/minia
2.4 Tools: Assemblers
2. Sequencing, tools and computers.
Type Technology Used Features Link
HGAP Overlap-layout-consensus PacBio
Recommended by
PacBio. Multiple
tools. Difficult to
https://github.com/PacificBiosciences/
Bioinformatics-Training/wiki/HGAP
Falcon Overlap-layout-consensus PacBio
Faster than
HGAP or
CABOG
https://github.com/PacificBiosciences/falcon
Sprai Overlap-layout-consensus PacBio
Easy to use
interface for
CABOG
http://zombie.cb.k.u-tokyo.ac.jp/sprai/
README.html
Canu Overlap-layout-consensus
PacBio, Oxford
Nanopore
Easy to use http://canu.readthedocs.org/
MECAT Overlap-layout-consensus
PacBio, Oxford
Nanopore
Fast assembler https://github.com/xiaochuanle/MECAT
Spades Brujn graphs
Illumina, Ion Torrent,
PacBio, Oxford
Nanopore
Diverse type of
input
http://cab.spbu.ru/software/spades/
MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html
Supernova Hybrid 10X Genomics
Used with 10X
genomics
https://support.10xgenomics.com/de-novo-
assembly/software/pipelines/latest/using/running
2.4 Tools: Assemblers
2. Sequencing, tools and computers.
Type Technology Used Features Link
HGAP Overlap-layout-consensus PacBio
Recommended by
PacBio. Multiple
tools. Difficult to
https://github.com/PacificBiosciences/
Bioinformatics-Training/wiki/HGAP
Falcon Overlap-layout-consensus PacBio
Faster than
HGAP or
CABOG
https://github.com/PacificBiosciences/falcon
Sprai Overlap-layout-consensus PacBio
Easy to use
interface for
CABOG
http://zombie.cb.k.u-tokyo.ac.jp/sprai/
README.html
Canu Overlap-layout-consensus
PacBio, Oxford
Nanopore
Easy to use http://canu.readthedocs.org/
MECAT Overlap-layout-consensus
PacBio, Oxford
Nanopore
Fast assembler https://github.com/xiaochuanle/MECAT
Spades Brujn graphs
Illumina, Ion Torrent,
PacBio, Oxford
Nanopore
Diverse type of
input
http://cab.spbu.ru/software/spades/
MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html
Supernova Hybrid 10X Genomics
Used with 10X
genomics
https://support.10xgenomics.com/de-novo-
assembly/software/pipelines/latest/using/running
2.4 Tools: Assemblers
Scaffolding approaches:
1. Pair ends/Mate pair information (Since 2008) - Short Distances
2. Optical mapping data (Since 2012) - Medium-Long Distance
3. HiC (Since 2014) - Long-Very Long Distance
4. Linkage groups anchoring (Since 2001) -Very Long Distance
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
1. Pair ends/Mate pair information
Contigs
Pair ends
e.g. 300 bp e.g. 2000 bp
Mate pairs
1.1 Small fragments mapping
1.2. Scaffold building
2.1. Long fragments mapping
2.2. Scaffold building
NNN
NNN
NNN NNNNNNNN
0. Input
2. Sequencing, tools and computers.
Technology Used Link
SSPACE Short reads, pairs http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE
SSPACE-Long Long reads
http://www.baseclear.com/genomics/bioinformatics/basetools/
SSPACE-longread
Bambus2 Pairs https://www.cbcb.umd.edu/software/bambus2
2.4 Tools: Stand Alone Read Pairs Scaffolders
2. Sequencing, tools and computers.
2.4 Tools: Stand Alone Scaffolders
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Scaffolds
Software
Decisions during the
Assembly Optimization
Scaffolds
Gap Filling
2. Sequencing, tools and computers.
Type Technology Used Features Link
GapCloser Overlap-layout-consensus Illumina
C++
Free
http://soap.genomics.org.cn/soapdenovo.html
GapFiller Overlap-layout-consensus Any
Perl
Commercial*
http://www.baseclear.com/landingpages/
basetools-a-wide-range-of-bioinformatics-
solutions/gapfiller/
PBSuite Overlap-layout-consensus
454,
PacBio
Python,
Long reads
http://sourceforge.net/p/pb-jelly/wiki/
2.4 Tools: Gap Filling
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Software
Decisions during the
Assembly Optimization
Scaffolds Chromosomes
Long distance
scaffolding
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Scaffolding approaches:
1. Pair ends/Mate pair information (Since 2008) - Short Distances
2. Optical mapping data (Since 2012) - Medium-Long Distance
3. HiC (Since 2014) - Long-Very Long Distance
4. Linkage groups anchoring (Since 2001) -Very Long Distance
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Optical mapping is a technique for constructing ordered, genome-wide, high-resolution
restriction maps from single, stained molecules of DNA, called "optical maps". By mapping
the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of
resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that
sequence
https://en.wikipedia.org/wiki/Optical_mapping
2. Optical mapping data (Since 2012).
Chromosome conformation
capture techniques (often
abbreviated to 3C technologies
or 3C-based methods) are a set
of molecular biology methods
used to analyze the spatial
organization of chromatin in a
cell. These methods quantify
the number of interactions
between genomic loci that are
nearby in 3-D space, but may
be separated by many
nucleotides in the linear
genome
3. HiC, Chromatin Conformation Capture (Since 2014).
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
https://en.wikipedia.org/wiki/Chromosome_conformation_capture
3. HiC, Chromatin Conformation Capture (Since 2014).
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Dudechenko et al, Science 07 Apr 2017: Vol. 356, Issue 6333, pp. 92-95
DOI: 10.1126/science.aal3327
4. Linkage groups anchoring (Since 2001) -Very Long Distance
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Argyris et al, BMC Genomics201516:4
https://doi.org/10.1186/s12864-014-1196-3
2. Sequencing, tools and computers.
2.5 Computers
Bigger is better:
• How much do you need depends of:
➡ how big is your genome ?
~ Human size (3Gb) require ~256 Gb to 1 Tb
➡ how many reads do you have, ?
More reads, more memory and hard disk.
➡ what software are you going to use ?
OLC uses more memory and time than DBG.
➡ what parameters are you going to use ?
Bigger Kmers, more memory.
Four options:
1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000).
2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU).
3. Use a supercomputing center associated with NSF, NIH, USDA... where they
offer reduced prices (iPlant, Indiana University....).
4. Collaborate with some group with a big server.
2. Sequencing, tools and computers.
2.6 Assembly evaluation
Different methods to evaluate an assembly:
1. Assembly stats (total assembly size vs expected size; N50/L50…)
2. Read remapping, variant calling and coverage evaluation.
3. Comparison with alternative sequencing methods.
4. Gene Space Completeness (RNASeq & EST mapping, BUSCO).
5. Comparison with genetic information.
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
Rationale: Close that a sequence assembly is to the expected genome size,
better it will be the genome assembly.
• Ideal case: The final assembly will have as many sequences as
chromosomes have the genome.
• Real case: The assembly will be fragmented is sequences with different
sizes. Lower number of fragments, better quality.
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
During the assembly optimization will be generated several assemblies.The most used
parameters to evaluate the assembly are:
1. Total Assembly Size,
How far is this value from the estimated genome size
2. Total Number of Sequences (Scaffold/Contigs)
How far is this value from the number of chromosomes.
3. Longest scaffold/contig
4. Average scaffold/contig size
5. N50/L50 (or any other N/L)
Number sequence (N) and minimum size of them (L) that represents the 50% of
the assembly if the sequences are sorted by size, from bigger to smaller.
2. Sequencing, tools and computers.
N50/L50
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
2.6 Assembly evaluation: Assembly stats
2. Sequencing, tools and computers.
N50/L50
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
50 % assembly: 500 Mb
N50
N50 = 7 sequences
L50 = 50 Mb
2.6 Assembly evaluation: Assembly stats
2. Sequencing, tools and computers.
N90/L90
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
90 % assembly: 900 Mb
N90
N90 = 29 sequences
L90 = 12.5 Mb
2.6 Assembly evaluation: Assembly stats
P. axillaris
Current Assembly Peaxi v1.6.2
Dataset Contigs Scaffolds
Total assembly size (Gb) 1.22 1.26
Total assembly sequences 109,892 83,639
Longest sequence (Mb) 0.57 8.56
Sequence length mean (Kb) 11.08 15.05
N90 (sequences) 13,481 1,051
L90 (Kb) 22.28 295.75
N50 (sequences) 3,943 309
L50 (Kb) 95.17 1,236.73
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
Example: Sequencing of the Petunia axillaris genome.
0 10 20 30 40 50 60
0e+002e+074e+076e+078e+071e+08
63 Kmer Frequency for Petunia axillaris
Coverage
KmerCount
peak = 31
Estimated genome size = 1.38 Gb
Possiblesequencingerrors
Sequencing data (Kmer count)
2. Sequencing, tools and computers.
G = (N − B) * K / D * 2
G = genome size
N = total kmers
B = possible kmers with errors
K = kmer length
D = kmer peak
2.6 Assembly evaluation: Assembly stats
Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Kmer Count* = 1.38 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
• Assembly size (contigs v1.6.2) = 1.22 Gb
Estimated genome not assembled = 110 - 120 Mb
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Assembly stats
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Read remapping
Rationale: Low quality assembled regions will have:
• Low coverage (breaking points)
• High coverage and high number of polymorphisms (due the collapsing of
multiple copies.
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Read remapping
Short Processed Reads
Long Processed Reads
Assembly Mapped Reads
Bowtie2/BWA
BlasR
Coverage
Variants
Bedtools
FreeBayes
VariantsCoverageMapped reads
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with alternative sequencing methods
Rationale: A sequence assembly should agree with alternative sequencing
methods.
• Illumina based assembly —> PacBio based assembly
• Short distance scaffolding (< 200 Kb) —> Optical Mapping scaffolding.
• Long distance scaffolding (> 200 Kb) —> BACs, YACs and Genetic maps
Sharma et al. 2013
E.g. Potato genome
2. Sequencing, tools and computers.
Rationale: The gene space is defined as the whole collection of genes in a
genome. A complete genome should have a complete gene space. There are
two ways to evaluate the completeness of a gene space.
• Check the completeness of a set of conserved genes (CEGMA/BUSCO).
- Maybe not all the conserved genes are in the studied genome
• Check the mapping rate of a transcriptome to a genome (RNASeq).
- The RNASeq may have contaminations of other organisms.
2.6 Assembly evaluation: Gene space completeness
2. Sequencing, tools and computers.
Other way to estimate how good is an assembly, it is to run BUSCO over the
assembly and evaluate how much of the gene space was captured.
http://busco.ezlab.org/
2.6 Assembly evaluation: Gene space completeness
2. Sequencing, tools and computers.
Other way to estimate how good is an assembly, it is to run BUSCO over the
assembly and evaluate how much of the gene space was captured.
2.6 Assembly evaluation: Gene space completeness
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with Genetic Information
Rationale: The assembly information should be in agreement with genetic
maps and recombination studies.
Resistent x Susceptible
F2 F2 F2 F2
F2
F2F2
F2F2
F2F2
F2F2
F2F2
F2
F2 F2
F2 F2 F2 F2 F2 F2 F2
DNA DNA DNA DNA DNA DNA DNA
000001111111111110
000000011111111110
000001111110000000
001111111110000000
000011111111111100
011111111111111100
000011111111000000
011111111111111100
00111111100000000
000000011111110000
011111111111100000
000000111111100000
001111111111111000
000111111111100000
000001111111111110
000000011111111110
000001111110000000
001111111110000000
000011111111111100
011111111111111100
000011111111000000
011111111111111100
00111111100000000
000000011111110000
011111111111100000
000000111111100000
001111111111111000
000111111111100000
Allele frequency
2. Sequencing, tools and computers.
2.6 Assembly evaluation: Comparison with Genetic Information
Rationale: The assembly information should be in agreement with genetic
maps and recombination studies.
recombination area
scaffold misplacement
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Arabidopsis thaliana (TAIR10)
(5 chromosomes with 119 Mb)
(96 gaps with 185 Kb total)
Solanum lycopersicum (v2.5)
(12 chromosomes with 824 Mb)
(26,866 gaps with 82 Mb total)
Petunia axillaris (v1.6.2)
(83,639 scaffolds with 1.26 Gb)
(26,268 gaps with 41 Mb total)
Nicotiana benthamiana (v0.4.4)
(141,339 scaffolds with 2.63 Gb*)
(337,433 gaps with 162 Mb total)
*0.61-0.67 Gb no assembled
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
?
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Genome size
vs
assembly size
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Cytogenetic data
(Flow cytometry)
✴ http://data.kew.org/cvalues
Example: Petunia axillaris genome size 1.37 Gb
(White and Rees. 1987)
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Sequencing data
(Kmer count)
• Jellyfish (http://www.genome.umd.edu/
jellyfish.html)
• BFCounter (http://
pritchardlab.stanford.edu/bfcounter.html)
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
3. Things that you should know about genomes.
1. They have variable size, for example in angiosperm plants they range from 63 Mb
(Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data
at:
✴ Plants: http://data.kew.org/cvalues/
✴ Animals: http://www.genomesize.com/
2. They can be polyploids. It means that homoeologus regions with highly similar will
collapse during the assembly.
3. They can be highly heterozygous and polymorphic. In this case some of the alleles will
collapse, some of them not.The effective coverage will lower than expected.
4. They can have recent whole genome duplication (or triplication) events. Paralogous
genes may collapse during the assembly.
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default assemblers throw them away based in the Kmer histogram.
3. Things that you should know about genomes.
3.1 Collapsing problem
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
Gene 1 A
Gene 1 B
CGCCCTTGACGACATGACGACA
Collapsed consensus Gene 1 A + Gene 1 B
✴ Polyploidy
✴ Whole Genome Duplication
Gene 1
Gene 1 A
Gene 1 B
Schmutz J et al. Genome Sequence of the
Paleoploid Soybean. Nature 2010 463:178-183
3. Things that you should know about genomes.
3.1 Collapsing problem
✴ Whole Genome Duplication
http://chibba.agtec.uga.edu/duplication/index/home
3. Things that you should know about genomes.
3.1 Collapsing evaluation
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
CGCCCTTGACGACATGACGACA Consensus
Reads
Read collapsing during the assembly process can be evaluated mapping reads
back to the consensus sequence and analyzing SNPs.
CACTTGACGACATGA
SNPs
=
Heterozygous Positions (AB)
+
Homoeologs Collapsing (ABBB,AABB,ABBB)
+
Paralogs Collapsing (variable)
3. Things that you should know about genomes.
3.1 Collapsing solutions
1. Sequence the two progenitors and use them as a reference.
✓Approach used previously (cotton genome; Patterson A. et. al 2012).
❖ Information not contained in the references will be lost.
❖ Progenitors are not always available.
❖ Gene conversions and non-homologue recombinations
2. Use single molecule technologies such as PacBio or 10X Genomics
✓ Best approach because it will give the right phase for long sequence
chunks. It exists software to integrate big sequences (minimus2...)
3. Assembly everything with the higher Kmer, evaluate the collapsed regions
and rescue the haplotype/region using SNPs and pair end read information.
✓ Cheap, easy to apply for the current technologies.
❖ No software available
3. Things that you should know about genomes.
3.2 Repeats and assemblers
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default some assemblers throw them away based in the kmer histogram. For example,
SOAPdenovo throw away anything up to 255.
It is convenient to do a Kmer analysis using Jellyfish (or other Kmer counter) before
do any assembly to analyze contaminations (bacterial contaminations may appears with
high Kmers content) and repeats.
Software to analyze repeats based
in sequence graphs:
RepeatExplorer
(http://repeatexplorer.umbr.cas.cz/)
Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-
generation sequencing data. Bmc Bioinformatics 11, 378 (2010)
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
4. Examples of genome assemblies
4. Examples of genome assemblies
Petunia axillaris Petunia inflata
Sinningia speciosa Persea americana
Genome Size
174 Mb* - 573 Mb**
(commercial accession = 337Mb)
(Zaitlin and Pierce. 2010)
chromosome number
2n = 2x = 26
(Zaitlin and Pierce. 2010)
Sinningia speciosa
* Accession “Antonio Dias”
** Accession “Dona Lourdes”
4. Examples of genome assemblies Sinningia speciosa
S. speciosa
Processed reads*
(Millions)
Processed reads
coverage (X)
PE 500 bp (GRCF 2013) 253 75
TOTAL PAIR ENDS 253 75
MP 2 Kb (BGI 2011) 144 42
MP 5 Kb (BGI 2011) 166 49
TOTAL MATE PAIRS 310 91
TOTAL SHORT READS 563 166
PACBIO READS 1 21
* Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 30, min. length 50 bp), PrinSeq (to
remove the duplicated pairs) and Musket (To correct the reads).
Reads Stats
4. Examples of genome assemblies Sinningia speciosa
4. Examples of genome assemblies Sinningia speciosa
Sinningia speciosa
Avenida Niemeyer
DNA RNA
PacBio
reads
Illumina
reads
Illumina
reads
PacBio
assembly
Illumina
assembly
Hybrid
assembly
Annotated
genome
Illumina
assembly
SOAPdenovo TrinitySprai
PBJelly
SSPACE-Long
MAKER
21X 166 X
S. speciosa
Assembly
Sispe v0.0.4
(PE500b)
Sispe v0.1.2
(PE500b+MP2Kb+MP5Kb)
Sispe v0.3.8
(PE500b+MP2Kb+MP5Kb+PB)
Dataset Contigs Scaffolds Contigs Scaffolds Contigs Scaffolds
Total assembly
size (Mb) 268 270 325 368 385 395
Total assembly
sequences 91,889 75,159 232,055 166,101 18,752 8,078
Longest sequence
(Mb) 0.25 0.34 0.37 1.84 1.61 2.93
Sequence length
mean (Kb) 2.91 3.59 1.4 2.21 20.55 48.97
N90 (sequences) 29,676 26,892 51,679 4,470 4,139 1,776
L90 (Kb) 1.16 1.27 0.57 5.68 14.78 35.34
N50 (sequences) 4,030 3,369 5,242 554 682 337
L50 (Kb) 15.07 17.21 12.9 152.59 130.05 281.78
Current Genome Assembly Stats
4. Examples of genome assemblies Sinningia speciosa
Estimated genome size for S. speciosa:
• Flow Cytometry (Zaitlin & Pierce, 2010)= 337 Mb
• Kmer Count* = 355 Mb
• Assembly size (scaffolds v0.1.2) = 368 Mb
• Assembly size (contigs v0.1.2) = 325 Mb
• ssembly size (scaffolds v0.3.8) = 385 Mb
• Assembly size (contigs v0.3.8) = 395 Mb
Estimated genome not assembled v0.1.2 = 12 - 30 Mb
Estimated genome not assembled v0.3.8 = Unknown
4. Examples of genome assemblies Sinningia speciosa
* http://data.kew.org/cvalues/
Petunia axillaris Petunia inflata
genome size
1.17 Gb
(White and Rees. 1987)
genome size
1.37 Gb
(White and Rees. 1987)
chromosome number
2n = 2x = 14
(White and Rees. 1987)
Petunia hybrida
4. Examples of genome assemblies Petunia axillaris and P. inflata
P. axillaris P. inflata
Processed reads*
(Millions)
Processed reads
coverage (X)
Processed reads*
(Millions)
Processed reads
coverage (X)
PE 170 bp (BGI 2011) 239 17 317 22
PE 350 bp (BGI 2011) 166 12 120 6
PE 500 bp (BGI 2011) 149 10 254 25
PE 800 bp (BGI 2011) 167 12 143 15
PE 1000 bp (UIL2013) 462 48 NA NA
TOTAL PAIR ENDS 1,163 99 834 68
MP 2 Kb (BGI 2011) 179 11 155 15
MP 5 Kb (BGI 2011) 120 8 200 20
MP 8 Kb (UIL2013) 115 10 104 17
MP 15 Kb (UIL2013) 98 9 92 15
TOTAL MATE PAIRS 513 38 551 67
TOTAL SHORT READS 1,696 137 1,385 135
PacBio Reads (UNIBE2013) 8,599 22 NA NA
* Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 20, min. length 50 bp), PrinSeq (to
remove the duplicated pairs) and Musket (To correct the reads).
Reads Stats
4. Examples of genome assemblies Petunia axillaris and P. inflata
Raw
PacBio
Corrected
PacBio
Processed
Illumina
Peaxi_v1.3.2
Scaffolds
PacBio 1
AHA
(assembly and scaffolding)
PBJELLY
(gap filling 1)
Pair End/Mate Pair Data
EC-TOOLS
(errorcorrection)
Scaffolds
PacBio 2
Contigs
SSPACE

(scaffolding)
Peaxi_v1.6.0Peaxi_v1.6.1
PBJELLY
(gap filling 2)
Peaxi_v1.6.2
Perl Script
(filtering)
P. axillaris Illumina-PacBio hybrid assembly
SOAPdenovo2
(assembly)
GapCloser
(gap filling 0)
* Assembly made by Michel Moser at UNIBE
4. Examples of genome assemblies Petunia axillaris and P. inflata
P. axillaris P. inflata
Current Assembly Peaxi v1.6.2 Peinf v1.0.2
Dataset Contigs Scaffolds Contigs Scaffolds
Total assembly size (Gb) 1.22 1.26 1.2 1.29
Total assembly sequences 109,892 83,639 213,254 136,283
Longest sequence (Mb) 0.57 8.56 0.57 5.81
Sequence length mean (Kb) 11.08 15.05 5.63 9.44
N90 (sequences) 13,481 1,051 41,876 2,450
L90 (Kb) 22.28 295.75 4.57 60.31
N50 (sequences) 3,943 309 9,813 406
L50 (Kb) 95.17 1,236.73 34.99 884.43
Current (Frozen) Genome Assembly Stats
* P. inflata assembly was performed
with SOAPdenovo2 and GapCloser
with an optimal Kmer=79
* Hybrid assembly
4. Examples of genome assemblies Petunia axillaris and P. inflata
Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Kmer Count* = 1.38 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
• Assembly size (contigs v1.6.2) = 1.22 Gb
Estimated genome not assembled = 110 - 120 Mb
4. Examples of genome assemblies Petunia axillaris and P. inflata
4. Examples of genome assemblies Persea americana
Persea americana
genome size
0.90 Gb
2n = 2x = 24
(Arumuganathan and Earle, 1991)
4. Examples of genome assemblies Persea americana
Experimental design
• Sequencing of a low heterozygosity avocado accession, VC75
• Base assembly made by self-corrected long reads and Canu
(PacBio > 75X)
• Polished with high genome coverage short reads with Racon
(Illumina > 100X).
• Chromo-scaffolding with HiC
4. Examples of genome assemblies Persea americana
• Base assembly made by self-corrected long reads and Canu (PacBio > 75X)
Current Assembly Persia 1.0.0
Dataset Contigs
Total assembly size (Gb) 0.86
Total assembly sequences 3,204
Longest sequence (Mb) 10.56
Sequence length mean (Kb) 268.74
N90 (sequences) 850
L90 (Kb) 102.62
N50 (sequences) 136
L50 (Mb) 1.56
Estimated genome size for Persea americana:
• Flow Cytometry (Manosalva, 2018) = 0.90 Gb
• Kmer Count* = 1.28 Gb
• Assembly size (contigs v1.0.0) = 0.86 Gb
4. Examples of genome assemblies Persea americana
VC75
het = 0.16%
Estimated genome not
assembled = 40 - 420 Mb
4. Examples of genome assemblies Persea americana
assembly
assembly
0MB100MB200MB300MB400MB500MB600MB700MB800MB900MB
mismatch_wide.at.step.0.bed
mismatch_narrow.at.step.0.bed
suspect.at.step.0.bed
repeats_wide.at.step.0.bed
depletion_score_narrow.at.step.0.wig
coverage_wide.at.step.0.wig
depletion_score_wide.at.step.0.wig
Coverage
0 MB 100 MB 200 MB 300 MB 400 MB 500 MB 600 MB 700 MB 800 MB 900 MB
• Chromo-scaffolding with HiC
Recommended lecture

Mais conteúdo relacionado

Mais procurados

Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods Mrinal Vashisth
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentGeethanjaliAnilkumar2
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisUniversity of California, Davis
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingDayananda Salam
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing priyanka raviraj
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicsAthira RG
 
Next generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedNext generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedShweta Tiwari
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeSean Davis
 
Illumina infinium sequencing
Illumina infinium sequencingIllumina infinium sequencing
Illumina infinium sequencingAyush Jain
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seqJyoti Singh
 
Sequencing, Alignment and Assembly
Sequencing, Alignment and AssemblySequencing, Alignment and Assembly
Sequencing, Alignment and AssemblyShaun Jackman
 

Mais procurados (20)

Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Next generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedNext generation-sequencing.ppt-converted
Next generation-sequencing.ppt-converted
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...
 
Primer designing
Primer designingPrimer designing
Primer designing
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the Transcriptome
 
SNP Genotyping Technologies
SNP Genotyping TechnologiesSNP Genotyping Technologies
SNP Genotyping Technologies
 
Gene prediction method
Gene prediction method Gene prediction method
Gene prediction method
 
Illumina infinium sequencing
Illumina infinium sequencingIllumina infinium sequencing
Illumina infinium sequencing
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seq
 
Ion torrent
Ion torrentIon torrent
Ion torrent
 
Global alignment
Global alignmentGlobal alignment
Global alignment
 
Sequencing, Alignment and Assembly
Sequencing, Alignment and AssemblySequencing, Alignment and Assembly
Sequencing, Alignment and Assembly
 
ILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptxILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptx
 

Semelhante a Genomics Assembly Techniques and Technologies

2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Deanna Church
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisdrelamuruganvet
 
whole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdfwhole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdfCRISTIANALONSORODRIG1
 
01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for educationaryajayakottarathil
 
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)bedutilh
 
Sequence based Markers
Sequence based MarkersSequence based Markers
Sequence based Markerssukruthaa
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf7006ASWATHIRR
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Genome sequencing in vegetable crops
Genome sequencing in vegetable cropsGenome sequencing in vegetable crops
Genome sequencing in vegetable cropsBommesh
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assemblyc.titus.brown
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assemblyc.titus.brown
 

Semelhante a Genomics Assembly Techniques and Technologies (20)

Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
CROP GENOME SEQUENCING
CROP GENOME SEQUENCINGCROP GENOME SEQUENCING
CROP GENOME SEQUENCING
 
New generation Sequencing
New generation Sequencing New generation Sequencing
New generation Sequencing
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
whole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdfwhole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdf
 
01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education
 
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
 
Sequence based Markers
Sequence based MarkersSequence based Markers
Sequence based Markers
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Alignment Approaches II: Long Reads
Alignment Approaches II: Long ReadsAlignment Approaches II: Long Reads
Alignment Approaches II: Long Reads
 
Genome sequencing in vegetable crops
Genome sequencing in vegetable cropsGenome sequencing in vegetable crops
Genome sequencing in vegetable crops
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 

Mais de Aureliano Bombarely (10)

Lesson mitochondria bombarely_a20180927
Lesson mitochondria bombarely_a20180927Lesson mitochondria bombarely_a20180927
Lesson mitochondria bombarely_a20180927
 
PlastidEvolution
PlastidEvolutionPlastidEvolution
PlastidEvolution
 
Genomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to VariantsGenomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to Variants
 
RNAseq Analysis
RNAseq AnalysisRNAseq Analysis
RNAseq Analysis
 
BasicLinux
BasicLinuxBasicLinux
BasicLinux
 
PerlTesting
PerlTestingPerlTesting
PerlTesting
 
PerlScripting
PerlScriptingPerlScripting
PerlScripting
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
GoTermsAnalysisWithR
GoTermsAnalysisWithRGoTermsAnalysisWithR
GoTermsAnalysisWithR
 
BasicGraphsWithR
BasicGraphsWithRBasicGraphsWithR
BasicGraphsWithR
 

Último

ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 

Último (20)

ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 

Genomics Assembly Techniques and Technologies

  • 1. CURSO DE BIOINFORMATICA APLICADA: Secuenciacion y Ensamblado de Genomas Aureliano Bombarely School of Plant and Environmental Sciences (SPES) Virginia Tech aurebg@vt.edu
  • 2. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes. 4. Examples of genome assemblies
  • 3. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes. 4. Examples of genome assemblies
  • 4. 1. A brief history of the sequence assembly. http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
  • 5. Genome = N x Sequence of DNA Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix Nano Lett., 2012, 12 (12), pp 6453–6458 DNA sequencing ??? 1. A brief history of the sequence assembly.
  • 6. DNA Sequencing: “Process of determining the precise order of nucleotides within a DNA molecule.” -Wikipedia ddATP ddGTP ddCTP ddTTP Taq-Polymerase STOP time 2) Chromatographic Separation GTCACCCTGAAT 1) PCR with ddNTPs 3) Chromatogram Read DNA Sanger Sequencing 1. A brief history of the sequence assembly.
  • 7. Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix Nano Lett., 2012, 12 (12), pp 6453–6458 ACGCGGTGACGTTGTCGA ACGCGTTTTGTTGTGG ACGATTTAAATGACGTTGTCGA ACGCGGTGACGTTGTCGA ACGCGTTTTGTTGTGG ACCCCGACGTTGTCGA DNA sequencing ACGCGTTTTGTTGTGG ACCCCTGGGGGGGTTGTCGA Fragments Sequence assembly ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG 1. A brief history of the sequence assembly.
  • 8. Sequence assembly = Resolve a puzzle and rebuild a DNA sequence from its pieces (fragments) Image courtesy of iStock photo 1. A brief history of the sequence assembly.
  • 9. Resolve a puzzle and rebuild a DNA sequence from its pieces (fragments) BACs BAC-by-BAC sequencing Whole Genome Shotgun (WGS) sequencing 1. A brief history of the sequence assembly.
  • 10. 1. A brief history of the sequence assembly. 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 MS2Bacteriophage(3.66Kb) Epstein-BarrVirus(170Kb) Haemophilusinfluenzae(1.83Mb) Saccharomycescerevisae(12.1Mb) Caenorhabditiselegans(100Mb) Arabidopsisthaliana(157Mb) Homosapiens(3.2Gb) Oryzasativa(420Mb) Sanger 454 Solexa/Illumina SOLiD Ion Torrent PacBio OxN 2 3 4 7 2016/02/04 Sequenced Genomes Viridiplantae 178 Metazoa 5907 Bacteria 7897
  • 11. 1977 MS2 Bacteriophage (3.658 Kb) 1977 N O SO FTW A RE U SED 1. A brief history of the sequence assembly.
  • 12. Haemophilus influenzae (1.83 Mb) 1995 Software: TIGR ASSEMBLER Hardware: SPARCenter 2000 (512 Mb RAM) 1. A brief history of the sequence assembly.
  • 13. Haemophilus influenzae (1.83 Mb) 1995 Software: TIGR ASSEMBLER Smith-Waterman alignments Sutton GG. et al.TIGR Assembler:A New Tool to Assembly Large Shotgun Sequencing Projects Genome Science and Technology, 1995,1:9-19 O VERLA PPIN G M ETH O D O LO G Y 1. A brief history of the sequence assembly.
  • 14. 1. A brief history of the sequence assembly. ATGCGTGTGCGTGAGTG GCGTGTGCGTGAGTGCC GTGTGCGTGAGTGCCTA TGCGCGTGCGTGAGTGC GCGTGTGCGTGAGTGCG TGCGTGAGCGTGAGTGCC OVERLAPPING METHODOLOGY Sequence homology match ATGCGTGTGCGTGAGTG GTGTGCGTGAGTGCCTA ||||||||||||| Multiple sequence alignment GCGTGTGCGTGAGTGCC TGCGCGTGCGTGAGTGC GCGTGTGCGTGAGTGCG TGCGTGAGCGTGAGTGCC ATGCGTGTGCGTGAGTG GTGTGCGTGAGTGCCTA Consensus call overlap ATGCGTGCGYGTGAGTGAGTGCC Overlap clustering
  • 15. Homo sapiens (3.2 Gb) 2001 BAC-by-BAC sequencing Whole Genome Shotgun (WGS) sequencing International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature. 2001. 409:860-921 Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351 O VERLA PPIN G M ETH O D O LO G Y 1. A brief history of the sequence assembly.
  • 16. Homo sapiens (3.2 Gb) 2001 Whole Genome Shotgun (WGS) sequencing Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351 Software: WGAASSEMBLER (CABOG) Hardware: 40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and 160 cores); 5 days. 1. A brief history of the sequence assembly.
  • 17. Ailuropoda melanoleura (2.3 Gb) 2009 Whole Genome Shotgun (WGS) sequencing Li R. et al.The Sequence and the Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317 Software: SOAPdenovo Hardware: Supercomputer with 32 cores and 512 Gb RAM. BRU IJN G RA PH S M ETH O D O LO G Y 1. A brief history of the sequence assembly.
  • 18. 1. A brief history of the sequence assembly. What is a Kmer ? ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 25 nt Specific n-tuple or n-gram of nucleic acid or amino acid sequences. -Wikipedia ordered list of elements contiguous sequence of n items from a given sequence of text 5 Kmers of 20-mer ATGCGCAGTGGAGAGAGAGC TGCGCAGTGGAGAGAGAGCG GCGCAGTGGAGAGAGAGCGA CGCAGTGGAGAGAGAGCGAT GCAGTGGAGAGAGAGCGATG N_kmers = L_read - Kmer_size
  • 19. 1. A brief history of the sequence assembly. Kmer transformation (e.g. Kmer-5) Kmer count Bruijn Graph Construction TGCGTGTGAGTGAGTGC GCGTGTGAGTGAGTGCG ATGCGTGTGCGTGAGT GTGTGCGTGAGTGCCT TGAGT GTGAG CGTGA GCGTG TGCGT GTGCG TGTGC GTGTG CGTGT GCGTG TGCGT ATGCG GTGCC AGTGC GAGTG TGAGT GTGAG CGTGA GCGTG TGCGT GTGCG TGTGC GTGTG TGCCT AGTGC GAGTG TGAGT GTGAG TGCGT GAGTG TGAGT GTGAG AGTGA GTGCG AGTGA GAGTG TGAGT GTGAG TGTGA GTGTG CGTGT GCGTG AGTGC GAGTG TGAGT GTGAG Kmer Count ATGCG 1 TGCGT 1 GCGTG 1 CGTGT 1 GTGTG 2 TGTGC 2 GTGCG 2 TGCGT 2 GCGTG 3 CGTGA 3 GTGAG 3 TGAGT 3 TGCGT 1 GCGTG 2 TGAGT GTGAG CGTGA GCGTG TGCGT GTGCG TGTGC GTGTGCGTGTGCGTGTGCGTATGCG CGTGT GCGTG TGCGT Bruijn Graph Resolution 1 1 1 1 2 2 2 2 3 3 3 3 1 2 2 ATGCGTGTGCGTGAGT TGTGA GTGTG CGTGT GCGTG
  • 20. Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291 1. A brief history of the sequence assembly.
  • 21. OLC (Overlap-layout-consensus) algorithm is more suitable for the low-coverage long reads, whereas the DBG (De-Bruijn-Graph) algorithm is more suitable for high-coverage short reads and especially for large genome assembly 1. A brief history of the sequence assembly.
  • 22. 1. A brief history of the sequence assembly. What is a Scaffold? https://genome.jgi.doe.gov/help/scaffolds.jsf A scaffold is a portion of the genome sequence reconstructed from end- sequenced whole-genome shotgun clones. Scaffolds are composed of contigs and gaps. A contig is a contiguous length of genomic sequence in which the order of bases is known to a high confidence level. Gaps occur where reads from the two sequenced ends of at least one fragment overlap with other reads in two different contigs (as long as the arrangement is otherwise consistent with the contigs being adjacent). Since the lengths of the fragments are roughly known, the number of bases between contigs can be estimated.
  • 23. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes. 4. Examples of genome assemblies
  • 24. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Technology Library Preparation Sequencing Amount Fastq raw Fastq Processed Contigs Scaffolds Decisions during the Experimental Design Software Identity %, Kmer ... Post-assembly Filtering Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections Consensus Scaffolding Scaffolds Gap Filling Chromosomes Long distance scaffolding
  • 25. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Technology Library Preparation Sequencing Amount Decisions during the Experimental Design
  • 26. 2. Sequencing, tools and computers. 2.1 Technologies Technology Read length (bp) Accuracy Reads/Run Time/Run Cost/Mb Applied Bio 3730XL (Sanger) 400 - 900 99.9% 384 4 h (12 runs/day) $2,400 Roche 454 GS FLX (Pyrosequencing) 700 Single/Pairs 99.9% 1,000,000 24h $10 Illumina HiSeq2500 (Seq. by synthesis) 50-250 Single/Pairs 99% 4,000,000,000 24 to 120 h $0.05 to $0.15 Ilumina MiSeq (Seq. by synthesis) 50-300 Single/Pairs 99% 44,000,000 24 to 72 h $0.17 SOLiD 4 (Seq. by ligation) 25-50 Single/Pairs 99.9% 1,400,000,000 168 h $0.13 ION Torrent (Seq. by semiconductor) 170-400 Single 98% 80,000,000 2 h $2 Pacific Biosciences RSII (SMRT) 10,000 Single 85% (99.9%) 750,000 4 h $0.6 Oxford N. Minion (Nanopore sequencing) 10,000 Single 62% (96%) 4,400,000 48 h $0.02 10X Genomics 20,000 Single 99% 1,000,000,000 NA NA
  • 27. 2. Sequencing, tools and computers. ★ Library types (orientations): • Single reads • Pair ends (PE) (150-800 bp insert size) • Mate pairs (MP) (2-40 Kb insert size) F F F F R R R 454/Roche Illumina Illumina 2.2 Libraries
  • 28. 2. Sequencing, tools and computers. • Why is important the pair information ? F - novo assembly: Consensus sequence (Contig) Reads Scaffold (or Supercontig) Pair Read information NNNNN Genetic information (markers) Pseudomolecule (or ultracontig) NNNNN NN 2.2 Libraries
  • 29. 2. Sequencing, tools and computers. 2.3 Sequencing Amount * Slide from MC. Schatz
  • 30. 2. Sequencing, tools and computers. 2.3 Sequencing Amount Depending of the genome complexity, technology used and assembler: • More is better (if you have enough computational resources). - Sanger > 10X (less for BACs-by-BACs approaches). - 454 > 20X - Illumina > 100X - PacBio > 20X (corrected by Illumina) or 50X (corrected PacBio) - 10X Genomics > 20X • Polyploidy or high heterozygosity increase the amount of reads needed. • The use of different library types (pair ends and mate pairs with different insert sizes is essential). • Longer reads is preferable.
  • 31. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq raw Fastq Processed Contigs Scaffolds Software Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections Consensus Scaffolding Scaffolds Gap Filling Chromosomes Long distance scaffolding
  • 32. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq raw Fastq Processed Software Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections
  • 33. 2. Sequencing, tools and computers. 2.4 Tools: Read Quality Evaluation a) Length of the read. b) Bases with qscore > 20 or 30. •FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
  • 34. 2. Sequencing, tools and computers. 2.4 Tools: Read Trimming and Filtering • Fastx-Toolkit (http://hannonlab.cshl.edu) • Ea-Utils (http://code.google.com/p/ea-utils/) • PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) Software Multiplexing Trimming/Filtering Fastx-Toolkit fastx_barcode_splitter fastq_quality_filter Ea-Utils fastq-multx fastq-mcf PrinSeq PrinSeq PrinSeq 1. Adaptor removal. 2. Low quality reads (qscore) (Q30) 3. Short reads (L50) 4. PCR duplications (Only Genomes, Use PrinSeq).
  • 35. 2. Sequencing, tools and computers. 2.4 Tools: Contaminations 5. Contaminations Contaminations can be removed mapping the reads against a reference with the contaminants such as E. coli and human genomes.The most common tools are Bowtie or BWA (for short reads) and Blast (for long reads).
  • 36. 2. Sequencing, tools and computers. 2.4 Tools: Read Corrections 6. Read Corrections Read corrections are generally based in the Kmer analysis. Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage Bioinformatics. 2011 27 (13):i137-i141
  • 37. 2. Sequencing, tools and computers. 2.4 Tools: Read Corrections 6. Read Corrections: Illumina • Quake (http://www.cbcb.umd.edu/software/quake/index.html) • Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software) • ECHO (http://uc-echo.sourceforge.net/) • Corrector (http://soap.genomics.org.cn/soapdenovo.html) • Musket (http://musket.sourceforge.net/) Usual Suspects:
  • 38. 2. Sequencing, tools and computers. 2.4 Tools: Read Corrections 6. Read Corrections: PacBio Three options: - Self-alignment of PacBio reads. • PBcR (Celera Assembler: http://wgs-assembler.sourceforge.net/) - Alignment of PacBio reads with Illumina contigs. • PacBioToCA (Celera Assembler: http://wgs-assembler.sourceforge.net/) - Alignment of PacBio reads with Illumina reads. • Lordec (http://www.atgc-montpellier.fr/lordec/) • Proovread (https://github.com/BioInf-Wuerzburg/proovread) • LSC (http://www.healthcare.uiowa.edu/labs/au/LSC/)
  • 39. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq Processed Contigs Scaffolds Software Decisions during the Assembly Optimization Consensus Scaffolding
  • 40. 2. Sequencing, tools and computers. Type Technology Used Features Link Arachne Overlap-layout-consensus Sanger, 454 Highly configurable http://www.broadinstitute.org/crd/wiki/ index.php/Main_Page CABOG Overlap-layout-consensus Sanger, 454, Illumina, PacBio Highly configurable http://sourceforge.net/apps/mediawiki/wgs- assembler/index.php?title=Main_Page MIRA Overlap-layout-consensus Sanger, 454 Highly configurable http://sourceforge.net/apps/mediawiki/mira- assembler gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/ index.asp iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler ABySS Bruijn graph 454 or Illumina Easy to use http://www.bcgsc.ca/platform/bioinfo/software/ abyss ALLPATH-LG Bruijn graph 454 or Illumina Good results http://www.broadinstitute.org/software/allpaths- lg/blog Ray Bruijn graph 454 or Illumina Slow but use less memory http://denovoassembler.sf.net/ SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html Velvet Bruijn graph 454 or Illumina or SOLiD SOLiD http://www.ebi.ac.uk/~zerbino/velvet/ Minia Bloom filter + Brujn graph Illumina Really fast Only contigs https://github.com/GATB/minia 2.4 Tools: Assemblers
  • 41. 2. Sequencing, tools and computers. 2.4 Tools: Assemblers ... but there are more assemblers and information... Take a look to SeqAnswers http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application? Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly Also highly recommendable:
  • 42. 2. Sequencing, tools and computers. Type Technology Used Features Link Arachne Overlap-layout-consensus Sanger, 454 Highly configurable http://www.broadinstitute.org/crd/wiki/ index.php/Main_Page CABOG Overlap-layout-consensus Sanger, 454, Illumina, PacBio Highly configurable http://sourceforge.net/apps/mediawiki/wgs- assembler/index.php?title=Main_Page MIRA Overlap-layout-consensus Sanger, 454 Highly configurable http://sourceforge.net/apps/mediawiki/mira- assembler gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/ index.asp iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler ABySS Bruijn graph 454 or Illumina Easy to use http://www.bcgsc.ca/platform/bioinfo/software/ abyss ALLPATH-LG Bruijn graph 454 or Illumina Good results http://www.broadinstitute.org/software/allpaths- lg/blog Ray Bruijn graph 454 or Illumina Slow but use less memory http://denovoassembler.sf.net/ SOAPdenovo2 Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html Velvet Bruijn graph 454 or Illumina or SOLiD SOLiD http://www.ebi.ac.uk/~zerbino/velvet/ Minia Bloom filter + Brujn graph Illumina Really fast Only contigs https://github.com/GATB/minia 2.4 Tools: Assemblers
  • 43. 2. Sequencing, tools and computers. Type Technology Used Features Link HGAP Overlap-layout-consensus PacBio Recommended by PacBio. Multiple tools. Difficult to https://github.com/PacificBiosciences/ Bioinformatics-Training/wiki/HGAP Falcon Overlap-layout-consensus PacBio Faster than HGAP or CABOG https://github.com/PacificBiosciences/falcon Sprai Overlap-layout-consensus PacBio Easy to use interface for CABOG http://zombie.cb.k.u-tokyo.ac.jp/sprai/ README.html Canu Overlap-layout-consensus PacBio, Oxford Nanopore Easy to use http://canu.readthedocs.org/ MECAT Overlap-layout-consensus PacBio, Oxford Nanopore Fast assembler https://github.com/xiaochuanle/MECAT Spades Brujn graphs Illumina, Ion Torrent, PacBio, Oxford Nanopore Diverse type of input http://cab.spbu.ru/software/spades/ MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html Supernova Hybrid 10X Genomics Used with 10X genomics https://support.10xgenomics.com/de-novo- assembly/software/pipelines/latest/using/running 2.4 Tools: Assemblers
  • 44. 2. Sequencing, tools and computers. Type Technology Used Features Link HGAP Overlap-layout-consensus PacBio Recommended by PacBio. Multiple tools. Difficult to https://github.com/PacificBiosciences/ Bioinformatics-Training/wiki/HGAP Falcon Overlap-layout-consensus PacBio Faster than HGAP or CABOG https://github.com/PacificBiosciences/falcon Sprai Overlap-layout-consensus PacBio Easy to use interface for CABOG http://zombie.cb.k.u-tokyo.ac.jp/sprai/ README.html Canu Overlap-layout-consensus PacBio, Oxford Nanopore Easy to use http://canu.readthedocs.org/ MECAT Overlap-layout-consensus PacBio, Oxford Nanopore Fast assembler https://github.com/xiaochuanle/MECAT Spades Brujn graphs Illumina, Ion Torrent, PacBio, Oxford Nanopore Diverse type of input http://cab.spbu.ru/software/spades/ MaSurCa Hybrid PacBio + Illumina Hybrid assembly http://www.genome.umd.edu/masurca.html Supernova Hybrid 10X Genomics Used with 10X genomics https://support.10xgenomics.com/de-novo- assembly/software/pipelines/latest/using/running 2.4 Tools: Assemblers
  • 45. Scaffolding approaches: 1. Pair ends/Mate pair information (Since 2008) - Short Distances 2. Optical mapping data (Since 2012) - Medium-Long Distance 3. HiC (Since 2014) - Long-Very Long Distance 4. Linkage groups anchoring (Since 2001) -Very Long Distance 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly
  • 46. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly 1. Pair ends/Mate pair information Contigs Pair ends e.g. 300 bp e.g. 2000 bp Mate pairs 1.1 Small fragments mapping 1.2. Scaffold building 2.1. Long fragments mapping 2.2. Scaffold building NNN NNN NNN NNNNNNNN 0. Input
  • 47. 2. Sequencing, tools and computers. Technology Used Link SSPACE Short reads, pairs http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE SSPACE-Long Long reads http://www.baseclear.com/genomics/bioinformatics/basetools/ SSPACE-longread Bambus2 Pairs https://www.cbcb.umd.edu/software/bambus2 2.4 Tools: Stand Alone Read Pairs Scaffolders
  • 48. 2. Sequencing, tools and computers. 2.4 Tools: Stand Alone Scaffolders
  • 49. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Scaffolds Software Decisions during the Assembly Optimization Scaffolds Gap Filling
  • 50. 2. Sequencing, tools and computers. Type Technology Used Features Link GapCloser Overlap-layout-consensus Illumina C++ Free http://soap.genomics.org.cn/soapdenovo.html GapFiller Overlap-layout-consensus Any Perl Commercial* http://www.baseclear.com/landingpages/ basetools-a-wide-range-of-bioinformatics- solutions/gapfiller/ PBSuite Overlap-layout-consensus 454, PacBio Python, Long reads http://sourceforge.net/p/pb-jelly/wiki/ 2.4 Tools: Gap Filling
  • 51. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Software Decisions during the Assembly Optimization Scaffolds Chromosomes Long distance scaffolding
  • 52. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Scaffolding approaches: 1. Pair ends/Mate pair information (Since 2008) - Short Distances 2. Optical mapping data (Since 2012) - Medium-Long Distance 3. HiC (Since 2014) - Long-Very Long Distance 4. Linkage groups anchoring (Since 2001) -Very Long Distance
  • 53. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Optical mapping is a technique for constructing ordered, genome-wide, high-resolution restriction maps from single, stained molecules of DNA, called "optical maps". By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that sequence https://en.wikipedia.org/wiki/Optical_mapping 2. Optical mapping data (Since 2012).
  • 54. Chromosome conformation capture techniques (often abbreviated to 3C technologies or 3C-based methods) are a set of molecular biology methods used to analyze the spatial organization of chromatin in a cell. These methods quantify the number of interactions between genomic loci that are nearby in 3-D space, but may be separated by many nucleotides in the linear genome 3. HiC, Chromatin Conformation Capture (Since 2014). 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly https://en.wikipedia.org/wiki/Chromosome_conformation_capture
  • 55. 3. HiC, Chromatin Conformation Capture (Since 2014). 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Dudechenko et al, Science 07 Apr 2017: Vol. 356, Issue 6333, pp. 92-95 DOI: 10.1126/science.aal3327
  • 56. 4. Linkage groups anchoring (Since 2001) -Very Long Distance 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Argyris et al, BMC Genomics201516:4 https://doi.org/10.1186/s12864-014-1196-3
  • 57. 2. Sequencing, tools and computers. 2.5 Computers Bigger is better: • How much do you need depends of: ➡ how big is your genome ? ~ Human size (3Gb) require ~256 Gb to 1 Tb ➡ how many reads do you have, ? More reads, more memory and hard disk. ➡ what software are you going to use ? OLC uses more memory and time than DBG. ➡ what parameters are you going to use ? Bigger Kmers, more memory. Four options: 1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000). 2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU). 3. Use a supercomputing center associated with NSF, NIH, USDA... where they offer reduced prices (iPlant, Indiana University....). 4. Collaborate with some group with a big server.
  • 58. 2. Sequencing, tools and computers. 2.6 Assembly evaluation Different methods to evaluate an assembly: 1. Assembly stats (total assembly size vs expected size; N50/L50…) 2. Read remapping, variant calling and coverage evaluation. 3. Comparison with alternative sequencing methods. 4. Gene Space Completeness (RNASeq & EST mapping, BUSCO). 5. Comparison with genetic information.
  • 59. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Assembly stats Rationale: Close that a sequence assembly is to the expected genome size, better it will be the genome assembly. • Ideal case: The final assembly will have as many sequences as chromosomes have the genome. • Real case: The assembly will be fragmented is sequences with different sizes. Lower number of fragments, better quality.
  • 60. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Assembly stats During the assembly optimization will be generated several assemblies.The most used parameters to evaluate the assembly are: 1. Total Assembly Size, How far is this value from the estimated genome size 2. Total Number of Sequences (Scaffold/Contigs) How far is this value from the number of chromosomes. 3. Longest scaffold/contig 4. Average scaffold/contig size 5. N50/L50 (or any other N/L) Number sequence (N) and minimum size of them (L) that represents the 50% of the assembly if the sequences are sorted by size, from bigger to smaller.
  • 61. 2. Sequencing, tools and computers. N50/L50 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb) 2.6 Assembly evaluation: Assembly stats
  • 62. 2. Sequencing, tools and computers. N50/L50 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb) 50 % assembly: 500 Mb N50 N50 = 7 sequences L50 = 50 Mb 2.6 Assembly evaluation: Assembly stats
  • 63. 2. Sequencing, tools and computers. N90/L90 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb) 90 % assembly: 900 Mb N90 N90 = 29 sequences L90 = 12.5 Mb 2.6 Assembly evaluation: Assembly stats
  • 64. P. axillaris Current Assembly Peaxi v1.6.2 Dataset Contigs Scaffolds Total assembly size (Gb) 1.22 1.26 Total assembly sequences 109,892 83,639 Longest sequence (Mb) 0.57 8.56 Sequence length mean (Kb) 11.08 15.05 N90 (sequences) 13,481 1,051 L90 (Kb) 22.28 295.75 N50 (sequences) 3,943 309 L50 (Kb) 95.17 1,236.73 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Assembly stats Example: Sequencing of the Petunia axillaris genome.
  • 65. 0 10 20 30 40 50 60 0e+002e+074e+076e+078e+071e+08 63 Kmer Frequency for Petunia axillaris Coverage KmerCount peak = 31 Estimated genome size = 1.38 Gb Possiblesequencingerrors Sequencing data (Kmer count) 2. Sequencing, tools and computers. G = (N − B) * K / D * 2 G = genome size N = total kmers B = possible kmers with errors K = kmer length D = kmer peak 2.6 Assembly evaluation: Assembly stats
  • 66. Estimated genome size for P. axillaris: • Flow Cytometry (White & Rees,1987)= 1.37 Gb • Kmer Count* = 1.38 Gb • Assembly size (scaffolds v1.6.2) = 1.26 Gb • Assembly size (contigs v1.6.2) = 1.22 Gb Estimated genome not assembled = 110 - 120 Mb 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Assembly stats
  • 67. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Read remapping Rationale: Low quality assembled regions will have: • Low coverage (breaking points) • High coverage and high number of polymorphisms (due the collapsing of multiple copies.
  • 68. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Read remapping Short Processed Reads Long Processed Reads Assembly Mapped Reads Bowtie2/BWA BlasR Coverage Variants Bedtools FreeBayes VariantsCoverageMapped reads
  • 69. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Comparison with alternative sequencing methods Rationale: A sequence assembly should agree with alternative sequencing methods. • Illumina based assembly —> PacBio based assembly • Short distance scaffolding (< 200 Kb) —> Optical Mapping scaffolding. • Long distance scaffolding (> 200 Kb) —> BACs, YACs and Genetic maps Sharma et al. 2013 E.g. Potato genome
  • 70. 2. Sequencing, tools and computers. Rationale: The gene space is defined as the whole collection of genes in a genome. A complete genome should have a complete gene space. There are two ways to evaluate the completeness of a gene space. • Check the completeness of a set of conserved genes (CEGMA/BUSCO). - Maybe not all the conserved genes are in the studied genome • Check the mapping rate of a transcriptome to a genome (RNASeq). - The RNASeq may have contaminations of other organisms. 2.6 Assembly evaluation: Gene space completeness
  • 71. 2. Sequencing, tools and computers. Other way to estimate how good is an assembly, it is to run BUSCO over the assembly and evaluate how much of the gene space was captured. http://busco.ezlab.org/ 2.6 Assembly evaluation: Gene space completeness
  • 72. 2. Sequencing, tools and computers. Other way to estimate how good is an assembly, it is to run BUSCO over the assembly and evaluate how much of the gene space was captured. 2.6 Assembly evaluation: Gene space completeness
  • 73. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Comparison with Genetic Information Rationale: The assembly information should be in agreement with genetic maps and recombination studies. Resistent x Susceptible F2 F2 F2 F2 F2 F2F2 F2F2 F2F2 F2F2 F2F2 F2 F2 F2 F2 F2 F2 F2 F2 F2 F2 DNA DNA DNA DNA DNA DNA DNA 000001111111111110 000000011111111110 000001111110000000 001111111110000000 000011111111111100 011111111111111100 000011111111000000 011111111111111100 00111111100000000 000000011111110000 011111111111100000 000000111111100000 001111111111111000 000111111111100000 000001111111111110 000000011111111110 000001111110000000 001111111110000000 000011111111111100 011111111111111100 000011111111000000 011111111111111100 00111111100000000 000000011111110000 011111111111100000 000000111111100000 001111111111111000 000111111111100000 Allele frequency
  • 74. 2. Sequencing, tools and computers. 2.6 Assembly evaluation: Comparison with Genetic Information Rationale: The assembly information should be in agreement with genetic maps and recombination studies. recombination area scaffold misplacement
  • 75. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages
  • 76. 5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Arabidopsis thaliana (TAIR10) (5 chromosomes with 119 Mb) (96 gaps with 185 Kb total) Solanum lycopersicum (v2.5) (12 chromosomes with 824 Mb) (26,866 gaps with 82 Mb total) Petunia axillaris (v1.6.2) (83,639 scaffolds with 1.26 Gb) (26,268 gaps with 41 Mb total) Nicotiana benthamiana (v0.4.4) (141,339 scaffolds with 2.63 Gb*) (337,433 gaps with 162 Mb total) *0.61-0.67 Gb no assembled
  • 77. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages ?
  • 78. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Genome size vs assembly size
  • 79. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Cytogenetic data (Flow cytometry) ✴ http://data.kew.org/cvalues Example: Petunia axillaris genome size 1.37 Gb (White and Rees. 1987)
  • 80. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Sequencing data (Kmer count) • Jellyfish (http://www.genome.umd.edu/ jellyfish.html) • BFCounter (http:// pritchardlab.stanford.edu/bfcounter.html)
  • 81. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes. 4. Examples of genome assemblies
  • 82. 3. Things that you should know about genomes. 1. They have variable size, for example in angiosperm plants they range from 63 Mb (Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data at: ✴ Plants: http://data.kew.org/cvalues/ ✴ Animals: http://www.genomesize.com/ 2. They can be polyploids. It means that homoeologus regions with highly similar will collapse during the assembly. 3. They can be highly heterozygous and polymorphic. In this case some of the alleles will collapse, some of them not.The effective coverage will lower than expected. 4. They can have recent whole genome duplication (or triplication) events. Paralogous genes may collapse during the assembly. 5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By default assemblers throw them away based in the Kmer histogram.
  • 83. 3. Things that you should know about genomes. 3.1 Collapsing problem CACTTGACGACATGACG CTTGACGACATGACGAC CCCTTGACGACATGACG CGCCCTTGACGACATGA Gene 1 A Gene 1 B CGCCCTTGACGACATGACGACA Collapsed consensus Gene 1 A + Gene 1 B ✴ Polyploidy ✴ Whole Genome Duplication Gene 1 Gene 1 A Gene 1 B Schmutz J et al. Genome Sequence of the Paleoploid Soybean. Nature 2010 463:178-183
  • 84. 3. Things that you should know about genomes. 3.1 Collapsing problem ✴ Whole Genome Duplication http://chibba.agtec.uga.edu/duplication/index/home
  • 85. 3. Things that you should know about genomes. 3.1 Collapsing evaluation CACTTGACGACATGACG CTTGACGACATGACGAC CCCTTGACGACATGACG CGCCCTTGACGACATGA CGCCCTTGACGACATGACGACA Consensus Reads Read collapsing during the assembly process can be evaluated mapping reads back to the consensus sequence and analyzing SNPs. CACTTGACGACATGA SNPs = Heterozygous Positions (AB) + Homoeologs Collapsing (ABBB,AABB,ABBB) + Paralogs Collapsing (variable)
  • 86. 3. Things that you should know about genomes. 3.1 Collapsing solutions 1. Sequence the two progenitors and use them as a reference. ✓Approach used previously (cotton genome; Patterson A. et. al 2012). ❖ Information not contained in the references will be lost. ❖ Progenitors are not always available. ❖ Gene conversions and non-homologue recombinations 2. Use single molecule technologies such as PacBio or 10X Genomics ✓ Best approach because it will give the right phase for long sequence chunks. It exists software to integrate big sequences (minimus2...) 3. Assembly everything with the higher Kmer, evaluate the collapsed regions and rescue the haplotype/region using SNPs and pair end read information. ✓ Cheap, easy to apply for the current technologies. ❖ No software available
  • 87. 3. Things that you should know about genomes. 3.2 Repeats and assemblers 5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By default some assemblers throw them away based in the kmer histogram. For example, SOAPdenovo throw away anything up to 255. It is convenient to do a Kmer analysis using Jellyfish (or other Kmer counter) before do any assembly to analyze contaminations (bacterial contaminations may appears with high Kmers content) and repeats. Software to analyze repeats based in sequence graphs: RepeatExplorer (http://repeatexplorer.umbr.cas.cz/) Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next- generation sequencing data. Bmc Bioinformatics 11, 378 (2010)
  • 88. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes. 4. Examples of genome assemblies
  • 89. 4. Examples of genome assemblies Petunia axillaris Petunia inflata Sinningia speciosa Persea americana
  • 90. Genome Size 174 Mb* - 573 Mb** (commercial accession = 337Mb) (Zaitlin and Pierce. 2010) chromosome number 2n = 2x = 26 (Zaitlin and Pierce. 2010) Sinningia speciosa * Accession “Antonio Dias” ** Accession “Dona Lourdes” 4. Examples of genome assemblies Sinningia speciosa
  • 91. S. speciosa Processed reads* (Millions) Processed reads coverage (X) PE 500 bp (GRCF 2013) 253 75 TOTAL PAIR ENDS 253 75 MP 2 Kb (BGI 2011) 144 42 MP 5 Kb (BGI 2011) 166 49 TOTAL MATE PAIRS 310 91 TOTAL SHORT READS 563 166 PACBIO READS 1 21 * Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 30, min. length 50 bp), PrinSeq (to remove the duplicated pairs) and Musket (To correct the reads). Reads Stats 4. Examples of genome assemblies Sinningia speciosa
  • 92. 4. Examples of genome assemblies Sinningia speciosa Sinningia speciosa Avenida Niemeyer DNA RNA PacBio reads Illumina reads Illumina reads PacBio assembly Illumina assembly Hybrid assembly Annotated genome Illumina assembly SOAPdenovo TrinitySprai PBJelly SSPACE-Long MAKER 21X 166 X
  • 93. S. speciosa Assembly Sispe v0.0.4 (PE500b) Sispe v0.1.2 (PE500b+MP2Kb+MP5Kb) Sispe v0.3.8 (PE500b+MP2Kb+MP5Kb+PB) Dataset Contigs Scaffolds Contigs Scaffolds Contigs Scaffolds Total assembly size (Mb) 268 270 325 368 385 395 Total assembly sequences 91,889 75,159 232,055 166,101 18,752 8,078 Longest sequence (Mb) 0.25 0.34 0.37 1.84 1.61 2.93 Sequence length mean (Kb) 2.91 3.59 1.4 2.21 20.55 48.97 N90 (sequences) 29,676 26,892 51,679 4,470 4,139 1,776 L90 (Kb) 1.16 1.27 0.57 5.68 14.78 35.34 N50 (sequences) 4,030 3,369 5,242 554 682 337 L50 (Kb) 15.07 17.21 12.9 152.59 130.05 281.78 Current Genome Assembly Stats 4. Examples of genome assemblies Sinningia speciosa
  • 94. Estimated genome size for S. speciosa: • Flow Cytometry (Zaitlin & Pierce, 2010)= 337 Mb • Kmer Count* = 355 Mb • Assembly size (scaffolds v0.1.2) = 368 Mb • Assembly size (contigs v0.1.2) = 325 Mb • ssembly size (scaffolds v0.3.8) = 385 Mb • Assembly size (contigs v0.3.8) = 395 Mb Estimated genome not assembled v0.1.2 = 12 - 30 Mb Estimated genome not assembled v0.3.8 = Unknown 4. Examples of genome assemblies Sinningia speciosa
  • 95. * http://data.kew.org/cvalues/ Petunia axillaris Petunia inflata genome size 1.17 Gb (White and Rees. 1987) genome size 1.37 Gb (White and Rees. 1987) chromosome number 2n = 2x = 14 (White and Rees. 1987) Petunia hybrida 4. Examples of genome assemblies Petunia axillaris and P. inflata
  • 96. P. axillaris P. inflata Processed reads* (Millions) Processed reads coverage (X) Processed reads* (Millions) Processed reads coverage (X) PE 170 bp (BGI 2011) 239 17 317 22 PE 350 bp (BGI 2011) 166 12 120 6 PE 500 bp (BGI 2011) 149 10 254 25 PE 800 bp (BGI 2011) 167 12 143 15 PE 1000 bp (UIL2013) 462 48 NA NA TOTAL PAIR ENDS 1,163 99 834 68 MP 2 Kb (BGI 2011) 179 11 155 15 MP 5 Kb (BGI 2011) 120 8 200 20 MP 8 Kb (UIL2013) 115 10 104 17 MP 15 Kb (UIL2013) 98 9 92 15 TOTAL MATE PAIRS 513 38 551 67 TOTAL SHORT READS 1,696 137 1,385 135 PacBio Reads (UNIBE2013) 8,599 22 NA NA * Reads were processed using Fastq-mcf to trim low quality extremes (min. quality 20, min. length 50 bp), PrinSeq (to remove the duplicated pairs) and Musket (To correct the reads). Reads Stats 4. Examples of genome assemblies Petunia axillaris and P. inflata
  • 97. Raw PacBio Corrected PacBio Processed Illumina Peaxi_v1.3.2 Scaffolds PacBio 1 AHA (assembly and scaffolding) PBJELLY (gap filling 1) Pair End/Mate Pair Data EC-TOOLS (errorcorrection) Scaffolds PacBio 2 Contigs SSPACE
 (scaffolding) Peaxi_v1.6.0Peaxi_v1.6.1 PBJELLY (gap filling 2) Peaxi_v1.6.2 Perl Script (filtering) P. axillaris Illumina-PacBio hybrid assembly SOAPdenovo2 (assembly) GapCloser (gap filling 0) * Assembly made by Michel Moser at UNIBE 4. Examples of genome assemblies Petunia axillaris and P. inflata
  • 98. P. axillaris P. inflata Current Assembly Peaxi v1.6.2 Peinf v1.0.2 Dataset Contigs Scaffolds Contigs Scaffolds Total assembly size (Gb) 1.22 1.26 1.2 1.29 Total assembly sequences 109,892 83,639 213,254 136,283 Longest sequence (Mb) 0.57 8.56 0.57 5.81 Sequence length mean (Kb) 11.08 15.05 5.63 9.44 N90 (sequences) 13,481 1,051 41,876 2,450 L90 (Kb) 22.28 295.75 4.57 60.31 N50 (sequences) 3,943 309 9,813 406 L50 (Kb) 95.17 1,236.73 34.99 884.43 Current (Frozen) Genome Assembly Stats * P. inflata assembly was performed with SOAPdenovo2 and GapCloser with an optimal Kmer=79 * Hybrid assembly 4. Examples of genome assemblies Petunia axillaris and P. inflata
  • 99. Estimated genome size for P. axillaris: • Flow Cytometry (White & Rees,1987)= 1.37 Gb • Kmer Count* = 1.38 Gb • Assembly size (scaffolds v1.6.2) = 1.26 Gb • Assembly size (contigs v1.6.2) = 1.22 Gb Estimated genome not assembled = 110 - 120 Mb 4. Examples of genome assemblies Petunia axillaris and P. inflata
  • 100. 4. Examples of genome assemblies Persea americana Persea americana genome size 0.90 Gb 2n = 2x = 24 (Arumuganathan and Earle, 1991)
  • 101. 4. Examples of genome assemblies Persea americana Experimental design • Sequencing of a low heterozygosity avocado accession, VC75 • Base assembly made by self-corrected long reads and Canu (PacBio > 75X) • Polished with high genome coverage short reads with Racon (Illumina > 100X). • Chromo-scaffolding with HiC
  • 102. 4. Examples of genome assemblies Persea americana • Base assembly made by self-corrected long reads and Canu (PacBio > 75X) Current Assembly Persia 1.0.0 Dataset Contigs Total assembly size (Gb) 0.86 Total assembly sequences 3,204 Longest sequence (Mb) 10.56 Sequence length mean (Kb) 268.74 N90 (sequences) 850 L90 (Kb) 102.62 N50 (sequences) 136 L50 (Mb) 1.56
  • 103. Estimated genome size for Persea americana: • Flow Cytometry (Manosalva, 2018) = 0.90 Gb • Kmer Count* = 1.28 Gb • Assembly size (contigs v1.0.0) = 0.86 Gb 4. Examples of genome assemblies Persea americana VC75 het = 0.16% Estimated genome not assembled = 40 - 420 Mb
  • 104. 4. Examples of genome assemblies Persea americana assembly assembly 0MB100MB200MB300MB400MB500MB600MB700MB800MB900MB mismatch_wide.at.step.0.bed mismatch_narrow.at.step.0.bed suspect.at.step.0.bed repeats_wide.at.step.0.bed depletion_score_narrow.at.step.0.wig coverage_wide.at.step.0.wig depletion_score_wide.at.step.0.wig Coverage 0 MB 100 MB 200 MB 300 MB 400 MB 500 MB 600 MB 700 MB 800 MB 900 MB • Chromo-scaffolding with HiC