Comparative genomics to the rescue: How complete is your plant genome sequence?
1. Comparative genomics to the rescue
How complete is your plant genome sequence?
Klaas Vandepoele
Ghent University - VIB, Belgium
5th Plant Genomics & Gene Editing Congress
16-17 March 2017, Amsterdam
plaza_genomics
2. Plant genome sequencing is booming
New and faster sequencing technologies
Generating a draft genome sequence has become cheap
The number of published plant genomes grows exponentially
>150 published plant genomes (credit: Usadel lab)
3. From read data to knowledge
The basic genome analysis toolkit:
Genome assembly
Structural annotation shows where genes are
Functional annotation tells you what genes do
Data availability of genome sequence & gene annotation
Facilitate biological discovery
4. Yet another “draft” plant genome
What is the quality and completeness of plant genome sequences?
The N50 denotes that 50% of the total assembly length is contained in scaffolds of length N50 or longer
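The N50 definition above can be made concrete with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `n50` and the scaffold lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up scaffold lengths (total = 300), for illustration only.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))  # 80 + 70 = 150 reaches half of 300, so N50 = 70
```

Note that the N50 only describes contiguity: an assembly can have a high N50 and still miss part of the gene space.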
6. Quality of a genome: what to expect?
Transcript mapping
7. Transcript mapping: tools & settings
• The transcript mapping score is stable (standard deviation < 1%) in bin sizes of at least 3,000 ESTs
• Estimating the assembled gene space correctly is challenging. The result is influenced by
  mapping tools
  coverage cutoffs
Intron-aware transcript mapping using GMAP
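The bin-size observation above can be explored with a small subsampling experiment. The sketch below is an illustration under stated assumptions, not the analysis behind the slide: it assumes each EST has already been classified as mapped or unmapped (for example from GMAP output filtered by a coverage cutoff), and it uses synthetic flags with an arbitrary 95% mapping rate. The function names are hypothetical.

```python
import random
import statistics


def mapping_score(mapped_flags):
    """Percentage of transcripts flagged as mapped."""
    return 100.0 * sum(mapped_flags) / len(mapped_flags)


def subsample_scores(mapped_flags, bin_size, n_bins=100, seed=0):
    """Mapping scores of n_bins random bins, each drawn without replacement."""
    rng = random.Random(seed)
    return [
        mapping_score(rng.sample(mapped_flags, bin_size))
        for _ in range(n_bins)
    ]


# Synthetic data: 50,000 ESTs, each mapped with probability 0.95 (assumption).
data_rng = random.Random(1)
flags = [data_rng.random() < 0.95 for _ in range(50_000)]

for bin_size in (100, 1_000, 3_000, 10_000):
    scores = subsample_scores(flags, bin_size)
    print(
        bin_size,
        round(statistics.mean(scores), 2),
        round(statistics.stdev(scores), 2),
    )
```

The standard deviation across bins shrinks as the bin size grows, which is the behaviour the stability criterion relies on. The exact values depend on the true mapping rate and on the data.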
8. Transcript mapping: library size
• If the libraries contain more than 10,000 ESTs, the EST mapping scores for A. thaliana libraries converge to the same value as for subsampling bins of >10,000 ESTs.
• RNA-Seq de novo assembled transcripts can lead to
  over-estimation of the expected number of genes (allelic transcripts, splice variants and fragmented transcripts)
  under-estimation due to the failure to reconstruct low-abundance transcripts
9. Estimating gene space completeness along an evolutionary scale
[Figure: expected gene spaces, ranging from evolutionarily conserved to species-specific, influenced by within-species and between-species diversity; shown against a species tree of life]
Approaches shown in the figure:
CEGMA: 248, single copy
BUSCO: 952, single copy
PLAZA CoreGF: 7k gene families
PLAZA CoreGF: 3k gene families
Transcript mapping
13. Evaluation
• Arabidopsis and Oryza have consistently high completeness scores
• Over-estimation of completeness by CEGMA
• Lolium: discrepancy between genome and gene set completeness
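A completeness score of the kind compared above boils down to asking which fraction of an expected set of conserved gene families is recovered. The sketch below is a simplified illustration, not the published coreGF implementation: the optional `weights` argument is a hypothetical hook for per-family weighting, and the family identifiers are invented.

```python
def completeness_score(expected_families, observed_families, weights=None):
    """Percentage of expected gene families that are observed.

    weights: optional dict mapping a family ID to a weight (hypothetical;
    families without an entry count as 1.0).
    """
    expected = set(expected_families)
    if not expected:
        raise ValueError("expected gene family set is empty")
    weights = weights or {}
    found = expected & set(observed_families)
    total = sum(weights.get(family, 1.0) for family in expected)
    hit = sum(weights.get(family, 1.0) for family in found)
    return 100.0 * hit / total


# Invented example: five core families, searched in the genome sequence
# and in the annotated gene set of the same species.
core = {"GF1", "GF2", "GF3", "GF4", "GF5"}
found_in_genome = {"GF1", "GF2", "GF3", "GF4", "GF9"}
found_in_gene_set = {"GF1", "GF2", "GF3"}

print(completeness_score(core, found_in_genome))    # 80.0
print(completeness_score(core, found_in_gene_set))  # 60.0
```

Scoring the genome sequence and the annotated gene set against the same expected families is what exposes a discrepancy such as the one reported for Lolium.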
14. Improving Lolium gene annotation
# loci per evidence source
2 transcriptomes, aligned with GenomeThreader
  de novo assembly: 300k
  Orthology-guided assembly: 80k
4 proteomes, aligned with GenomeThreader
  Brachypodium distachyon: 16k
  Oryza sativa: 11k
  Sorghum bicolor: 11k
  Zea mays: 10k
2 annotation sets
  Byrne et al. (2015): 28k
  ab initio predictions: 41k
EVM consensus: 39,967
Haas et al. (2008), Gremme et al. (2005), Ruttink et al. (2013)
15. Updated completeness scores Lolium
[Figure: completeness score (%), axis from 75 to 100, comparing Byrne et al. (2015) with the EVM consensus]
>900 new coreGF loci found in the genome!
16. Evaluation
• Arabidopsis and Oryza have consistently high completeness scores
• Over-estimation of completeness by CEGMA
• Lolium: discrepancy between genome and gene set completeness
• Cicer: EST mapping score much lower than BUSCO gene set or coreGF score
More than half of the unmapped sequences are of non-plant origin (mostly from Fusarium oxysporum)
Proper taxonomic binning of expected transcripts is essential!
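Taxonomic binning, as called for above, means separating transcripts by the taxonomic origin of their best database hit before computing a mapping score. The sketch below is a minimal illustration with assumptions: each transcript is assumed to come with a lineage for its best hit (an empty lineage meaning no hit), the lineages are simplified, and all identifiers are invented. How the lineages are obtained (for example from a similarity search against a protein database) is outside the sketch.

```python
def bin_by_taxonomy(best_hit_lineage, target_clade="Viridiplantae"):
    """Split transcript IDs into target-clade, other and unassigned bins.

    best_hit_lineage: dict mapping a transcript ID to a tuple of taxon
    names for its best hit; an empty tuple means no hit was found.
    """
    bins = {"target": [], "other": [], "unassigned": []}
    for transcript_id, lineage in best_hit_lineage.items():
        if not lineage:
            bins["unassigned"].append(transcript_id)
        elif target_clade in lineage:
            bins["target"].append(transcript_id)
        else:
            bins["other"].append(transcript_id)
    return bins


# Invented example with simplified lineages.
example = {
    "est_001": ("Eukaryota", "Viridiplantae", "Cicer arietinum"),
    "est_002": ("Eukaryota", "Fungi", "Fusarium oxysporum"),
    "est_003": (),
    "est_004": ("Eukaryota", "Viridiplantae", "Medicago truncatula"),
}

bins = bin_by_taxonomy(example)
print(bins["target"])      # ['est_001', 'est_004']
print(bins["other"])       # ['est_002']
print(bins["unassigned"])  # ['est_003']
```

Only the target bin should count towards the expected gene space. Whether unassigned transcripts are kept is a judgment call that depends on the data set.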
17. Guidelines to assess the quality of a new genome sequence
1. Estimate genome size using different methods
2. Define and evaluate the expected gene space based on transcript mapping AND evolutionary conservation
   Cleaning and mapping transcripts
   Prefer coreGF/BUSCO over CEGMA to model expected conserved genes
3. Large differences in completeness scores between the genome assembly and the annotated gene set can point to gene prediction issues
4. To perform cross-species genome comparisons, focus on genomes with complete and contiguous assemblies
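Guideline 3 can be turned into a simple automated check. The sketch below is only an illustration: the function name and the 5-percentage-point threshold are hypothetical choices, not values from the talk or the paper, and the scores in the example are invented.

```python
def flag_gene_prediction_issue(genome_score, gene_set_score, max_gap=5.0):
    """Return True when the gene set lags the genome by more than max_gap.

    Both scores are completeness percentages computed against the same
    expected gene space; max_gap (percentage points) is an arbitrary,
    hypothetical threshold.
    """
    return (genome_score - gene_set_score) > max_gap


# Invented scores, for illustration only.
print(flag_gene_prediction_issue(96.0, 82.0))  # True: large gap
print(flag_gene_prediction_issue(96.0, 94.5))  # False: scores agree
```

A flagged species is a candidate for re-annotation, as was done for Lolium with the EVM consensus.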
Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating
the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
18. • Gene family annotation and phylogenetic trees
• Traceable functional annotation (GO/InterPro/MapMan)
• Colinearity and synteny
• Integrative gene orthology inference
Highly integrative platform to translate knowledge from model to crop
• 55 species/genomes
• Highly scalable design
• Web-based mobile user interface
• Integrated Workbench for analysis of sets of genes
http://bioinformatics.psb.ugent.be/plaza/
19. Coverage gene function information
[Figure legend: blue = primary GO; green = GO projection (orthology + homology)]
Gene descriptions
Gene Ontology (Biological Process)
20. TRAPID: analysis of non-model transcriptomes
Homology-based ORF detection incl. frameshift correction
Gene family assignment
Functional annotation based on Gene Ontology and/or protein domains
Two reference databases: PLAZA 2.5 and OrthoMCL-DB
Applications
Sugarcane, wheat, Crocus sativus, conifers, Coffea arabica, Prunus
Dinoflagellates, diatoms, worms, fishes
SRA Viridiplantae
Transcriptomic
Van Bel, … & Vandepoele, Genome Biology 2013
21. Drought Tolerance Conferred to Sugarcane by Association with Gluconacetobacter diazotrophicus: A Transcriptomic View of Hormone Pathways
Vargas et al., PLoS One 2014
22. Further reading
Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating
the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
Proost, S., Van Bel, M. … and Vandepoele, K. (2015). PLAZA 3.0: an access point for plant
comparative genomics. Nucleic Acids Res 43, D974-D981.
Vandepoele, K. (2017). A Guide to the PLAZA 3.0 Plant Comparative Genomic Database.
Methods Mol Biol 1533, 183-200.
Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., and Vandepoele, K. (2013).
TRAPID: an efficient online tool for the functional and comparative analysis of de novo RNA-
Seq transcriptomes. Genome Biol 14, R134.
Code freely available to efficiently compute coreGF completeness score
Want a free PDF? Check out PLAZA poster
23. PLAZA 3.0 user statistics (2016)
>11,000 users (+13%), 370K page views (+30%)
Users from >95 countries
Intensively used by academia (>400 citations) and industry
25. PLAZA Workbench
Create a custom gene set (~experiment) using gene identifiers or BLAST
External/internal gene IDs (e.g. AN3, AT5G28640, GRMZM2G180246_T01)
BLAST interface can be used to map sequence data from a non-model species to a reference species present in PLAZA
A toolbox is available to analyze user-defined gene sets (~experiment)
2,132 registered users processed 11,875 Workbench experiments