The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the causal agent of citrus greening, or Huanglongbing, a disease that threatens the citrus industry worldwide. This vector is the primary target of approaches to stop the spread of the pathogen.
Single-copy marker analysis of the current genome using BUSCO shows that a significant proportion (25%) of the 3,350 single-copy markers conserved in hemipterans are missing, with only 74% present as full-length copies. Manual genome annotation identified a number of misassemblies and missing genes in the current genome. This is due in part to the complexity introduced when assembling a heterogeneous sample containing DNA from multiple psyllids, potentially exacerbated by the use of short reads. To improve the quality of the genome assembly, we generated 36.2 Gb of PacBio long reads from 41 SMRT cells, a coverage of 80X for the 400-450 Mb genome. The Canu assembler was used to create an interim assembly (Diaci v1.9) with a contig N50 of 115.8 kb and 8,300 contigs. We are employing Dovetail Chicago libraries and a 10X Illumina library generated from a single psyllid, in conjunction with Bionano optical maps, to achieve long-range scaffolding of the genome. The final assembly will be polished with PacBio and Illumina paired-end reads, followed by scaffolding with Illumina mate-pair reads. This will be the first time all these methods have been applied to resolve an insect genome from a highly heterogeneous sample. The new assembly will be available at https://citrusgreening.org/.
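The coverage figure quoted in the abstract can be checked with simple arithmetic: total sequenced bases divided by genome size. The sketch below is only an illustration; the function name `coverage` is hypothetical, and the two genome sizes are the ends of the 400-450 Mb range given in the abstract.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Return the expected average sequencing depth (fold coverage)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


total_bases = 36.2e9  # 36.2 Gb of PacBio reads, as stated in the abstract

# Evaluate both ends of the 400-450 Mb genome size estimate.
for genome_mb in (400, 450):
    depth = coverage(total_bases, genome_mb * 1e6)
    print(f"{genome_mb} Mb genome: {depth:.1f}X")
```

Under these assumptions the depth is roughly 80X for a 450 Mb genome and about 90X for a 400 Mb genome, so the quoted 80X corresponds to the upper end of the size estimate.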
High quality arthropod genome assembly with single molecule reads and long-range scaffolding
1. www.citrusgreening.org
High quality arthropod genome assembly with single molecule reads and long-range scaffolding
Prashant S Hosmani1, Mirella Flores-Gonzalez1, Wayne Hunter2, Lukas A. Mueller1, Susan Brown3, and Surya Saha1
1Boyce Thompson Institute; 2USDA-ARS U.S. Horticultural Research Laboratory; 3Kansas State University
ss2489@cornell.edu @SahaSurya
Entomology 2017
Advances in Arthropod Genomics Workshop
3. www.citrusgreening.org
Citrus Greening: Huanglongbing
• Most significant disease of citrus worldwide
• More than $4.5 billion in lost citrus production and more than 8,200 lost jobs (2006/07 to 2010/11)
• Associated with the Gram-negative bacterium Candidatus Liberibacter asiaticus (CLas)
• Spread by insect vector, Diaphorina citri (Asian citrus psyllid, ACP)
Annie Kruse
4. www.citrusgreening.org
Omics resources and databases are required for identification of targets for interdiction
Genome Annotation
Target for interdiction molecules
Pathway Databases
Expression Networks
…….
Host
Vector
Pathogen
5. www.citrusgreening.org
Genome: Diaci1.1
Contigs: 161,988
Total length: 485 Mb
Longest: 1 Mb
Shortest: 201 bp
Ns: 19.3 Mb
Scaffold N50: 109,898 bp
Contig N50: 34,407 bp
Highly fragmented
Many examples of misassemblies!
Current Illumina assembly
http://biobeans.blogspot.com/2012/11/bioinformatics-genome-assembly.html
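The contig and scaffold N50 values on this slide summarize contiguity: N50 is the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are made up for illustration and are not taken from the Diaci assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

With the toy lengths above, the two longest contigs (100 + 80) already cover half of the 300 total bases, so the function returns 80. A highly fragmented assembly has many short sequences and therefore a low N50.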
6. www.citrusgreening.org
Pacbio assembly
                     Error rate 0.013    Error rate 0.015
Number of contigs    7,832               8,030
Total bases          462.8 Mb            493.1 Mb
Longest              1.6 Mb              1.7 Mb
Shortest             4.4 Kb              5 Kb
Average length       59.9 Kb             61.4 Kb
Contig N50           85.8 Kb             92.6 Kb
Koren 2017
Contiguous assembly with longer contigs
Multiple individuals in DNA sample
http://canu.readthedocs.io/en/stable/
7. www.citrusgreening.org
PBJelly scaffolding
                     Canu assembly    Scaffolded assembly v1.9
Number of contigs    7,832            8,352
Total bases          462.8 Mb         591.7 Mb
Longest              1.6 Mb           2 Mb
Shortest             4.4 Kb           1.5 Kb
Average length       59 Kb            70.8 Kb
Contig N50           85.8 Kb          115.8 Kb
5,290 gap extensions
535 gaps filled
Number of Ns: 0 bp
English 2012
8. www.citrusgreening.org
                     v1.91      v1.92 REFERENCE    v1.92 ALTERNATE
Number of contigs    3,681      1,918              1,763
Total bases          596 Mb     513 Mb             83.4 Mb
Longest              4.2 Mb     4.2 Mb             760.6 Kb
Shortest             1.5 Kb     6 Kb               1.5 Kb
Average length       162 Kb     267 Kb             47.3 Kb
Contig N50           620 Kb     755.7 Kb           75.1 Kb
Ns                   5.1 Mb     4.6 Mb             467 Kb
500 ng input DNA from a single male psyllid
Duplicated contigs added to alternate assembly
https://github.com/Gabaldonlab/redundans
https://github.com/broadinstitute/pilon/wiki
Error correction
• DNA sequencing data
• RNA sequencing data
• Duplication removal
• Scaffolding
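The v1.91 and v1.92 figures on this slide suggest that the reference and alternate sets together account for the v1.91 contigs. That reading is an assumption drawn from the numbers and from the note that duplicated contigs were moved to the alternate assembly. A quick sanity check, with the helper names `combined` and `consistent` being hypothetical:

```python
import math

# Values copied from the v1.91 / v1.92 table on this slide; sizes in Mb.
v191 = {"contigs": 3681, "total_mb": 596.0, "n_mb": 5.1}
reference = {"contigs": 1918, "total_mb": 513.0, "n_mb": 4.6}
alternate = {"contigs": 1763, "total_mb": 83.4, "n_mb": 0.467}


def combined(a, b):
    """Sum two sets of assembly statistics key by key."""
    return {key: a[key] + b[key] for key in a}


def consistent(whole, parts, rel_tol=0.02):
    """True if every statistic agrees within rel_tol (allows for rounding on the slide)."""
    return all(math.isclose(whole[k], parts[k], rel_tol=rel_tol) for k in whole)


total = combined(reference, alternate)
print(total["contigs"], round(total["total_mb"], 1), round(total["n_mb"], 3))
print(consistent(v191, total))
```

The contig counts add up exactly (1,918 + 1,763 = 3,681), and the base and N totals agree with v1.91 to within the rounding used on the slide.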
9. www.citrusgreening.org
Gene isoform sequencing (Iso-Seq)
Accurate gene models are necessary for targeting assays
• Majority of genes are alternatively spliced to produce multiple transcript isoforms.
• Iso-Seq generates full-length cDNA sequences (full-length transcripts and gene isoforms).
Current MCOT (de novo and genome-based) transcriptome is useful but fragmented
Korf 2013
11. www.citrusgreening.org
Mapping to D. citri genome
Isoforms mapped to D. citri v1.92
Total isoforms: 314,275
Isoseq provides a comprehensive (de novo and genome-based) transcriptome with full-length transcripts and a range of isoforms
Counts
Number of genes: 18,799 (30,562 in MCOT)
Number of isoforms: 61,086
Average number of isoforms/gene: 3.24
N50: 2.7 Kb
Longest: 9 Kb
Shortest: 100 bp
12. www.citrusgreening.org
Evaluating the assembly
Benchmarking sets of Universal Single-Copy Orthologs based on a set of 3350 single-copy orthologs from hemipteran species
              Complete    Fragmented    Missing
Diaci 1.1     74.8%       0.3%          24.9%
Diaci 1.92    85.2%       0.1%          14.7%
Paired-end RNAseq alignment
              Overall alignment rate    Concordant alignment rate
Diaci 1.1     82%                       0.62%
Diaci 1.92    88%                       60%
Average length of aligned coding sequence
              MCOT       Isoseq (full-length transcripts)
Diaci 1.1     1054 bp    470 bp
Diaci 1.92    1321 bp    699 bp
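To put the BUSCO percentages on this slide in terms of marker counts, they can be multiplied by the 3350 orthologs in the hemipteran set. The sketch below does only that arithmetic; the counts are approximations reconstructed from rounded percentages, not values reported by BUSCO, and the helper name `approx_counts` is hypothetical.

```python
TOTAL_MARKERS = 3350  # size of the hemipteran single-copy ortholog set

# Percentages copied from the slide.
busco_percent = {
    "Diaci 1.1": {"complete": 74.8, "fragmented": 0.3, "missing": 24.9},
    "Diaci 1.92": {"complete": 85.2, "fragmented": 0.1, "missing": 14.7},
}


def approx_counts(percentages, total=TOTAL_MARKERS):
    """Convert rounded percentages back to approximate marker counts."""
    return {category: round(total * pct / 100) for category, pct in percentages.items()}


for assembly, percentages in busco_percent.items():
    print(assembly, approx_counts(percentages))

# Change in missing markers between assemblies, in percentage points.
drop = busco_percent["Diaci 1.1"]["missing"] - busco_percent["Diaci 1.92"]["missing"]
print(f"Missing markers reduced by {drop:.1f} percentage points")
```

Under this approximation, about 834 markers are missing from Diaci 1.1 and about 492 from Diaci 1.92, a reduction of 10.2 percentage points.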
13. www.citrusgreening.org
Improved genome and annotation will expedite identification of targets for interdiction
Genome: Pacbio v1.92
Annotation: Isoseq
Target for interdiction molecules
Pathway Databases
Expression Networks
…….
Host
Vector
Pathogen
14. www.citrusgreening.org
Thank you!!
Utilizing system biology resources to decipher a tritrophic disease complex
Prashant Hosmani
Wednesday, 10:30 AM - 10:45 AM
Member Symposium: Applying Emerging Genomic Techniques to Control Invasive Species