Comparative genomics to the rescue: How complete is your plant genome sequence?

Klaas Vandepoele
Ghent University - VIB, Belgium
5th Plant Genomics & Gene Editing Congress
16-17 March 2017, Amsterdam
plaza_genomics

Plant genome sequencing is booming
 New and faster sequencing technologies
 Generating a draft genome sequence has become cheap
 The number of published plant genomes grows exponentially
>150 published plant genomes (Credit: Usadel lab)

From read data to knowledge
The basic genome analysis toolkit:
 Genome assembly
 Structural annotation shows where genes are
 Functional annotation tells you what genes do
 Data availability of genome sequence & gene annotation
Facilitate biological discovery

Yet another “draft” plant genome
 What is the quality and completeness of plant genome sequences?
The N50 denotes that 50% of the total assembly length is contained in scaffolds of length N50 or longer
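
The N50 definition above can be turned into a few lines of code. This is a minimal sketch, assuming scaffold lengths are already available as a list of integers; the function name `n50` and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until at least half of the
    # total assembly length is covered; that scaffold's length is the N50.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up scaffold lengths, total assembly length 300
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))
```

Here the two longest scaffolds (80 + 70) already hold half of the 300 units, so the N50 is 70. Note that N50 measures contiguity only; it says nothing about whether the gene space is complete.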

Quality of a genome: what to expect?

Quality of a genome: what to expect?
Transcript mapping

Transcript mapping: tools & settings
• The transcript mapping score is stable (standard deviation < 1%) in bin sizes of at least 3,000 ESTs
• Correctly estimating the assembled gene space is challenging; the estimate is influenced by
 mapping tools
 coverage cutoffs
Intron-aware transcript mapping using GMAP
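
To make the influence of cutoffs and bin size concrete, here is a minimal sketch of a transcript mapping score. It assumes the coverage and identity of each transcript's best alignment have already been parsed from the aligner output; the function names, the default cutoffs (0.8 and 0.9) and the toy data are hypothetical and not taken from the slides.

```python
import random
import statistics


def mapping_score(alignments, min_coverage=0.8, min_identity=0.9):
    """Percentage of transcripts whose best alignment passes both cutoffs.

    `alignments` maps transcript ID -> (coverage, identity) of its best
    genomic alignment, or None if the transcript did not align at all.
    The default cutoffs are illustrative assumptions.
    """
    if not alignments:
        raise ValueError("no transcripts given")
    mapped = sum(
        1
        for hit in alignments.values()
        if hit is not None and hit[0] >= min_coverage and hit[1] >= min_identity
    )
    return 100.0 * mapped / len(alignments)


def bin_score_stdev(alignments, bin_size, n_bins=20, seed=0, **cutoffs):
    """Standard deviation of the mapping score over random bins of transcripts."""
    rng = random.Random(seed)
    ids = sorted(alignments)
    scores = []
    for _ in range(n_bins):
        # Draw one bin of transcripts without replacement and score it.
        sample = rng.sample(ids, bin_size)
        scores.append(mapping_score({t: alignments[t] for t in sample}, **cutoffs))
    return statistics.stdev(scores)


# Toy data: (coverage, identity) per transcript, None = no alignment
toy = {
    "est1": (0.95, 0.98),
    "est2": (0.50, 0.99),
    "est3": None,
    "est4": (0.85, 0.92),
}
print(mapping_score(toy))                    # default, hypothetical cutoffs
print(mapping_score(toy, min_coverage=0.4))  # a looser coverage cutoff
```

With the default cutoffs two of the four toy transcripts count as mapped; loosening the coverage cutoff adds a third, which is exactly the sensitivity to settings described above. Calling `bin_score_stdev` with increasing `bin_size` on a real library would show at which bin size the score stabilises.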

Transcript mapping: library size
• If the libraries contain more than 10,000 ESTs, the EST mapping scores for A. thaliana libraries converge to the same value as for subsampling bins of >10,000 ESTs.
• RNA-Seq de novo assembled transcripts can lead to
 over-estimation of the expected number of genes (allelic transcripts, splice variants and fragmented transcripts)
 under-estimation due to the failure to reconstruct low-abundance transcripts

Estimating gene space completeness along an evolutionary scale
[Figure: expected gene spaces arranged from species-specific to evolutionarily conserved, influenced by within-species diversity and between-species diversity, shown against the species tree of life. Labels:]
CEGMA: 248, single copy
BUSCO: 952, single copy
PLAZA CoreGF: 7k gene families
Transcript mapping
PLAZA CoreGF: 3k gene families
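
All conservation-based methods listed here share the same idea: define a set of genes or gene families expected to be present, then report the percentage actually found. The sketch below assumes family membership has already been determined (for example by similarity search); the family identifiers are invented, and the optional `weights` argument is an illustrative assumption, not the published coreGF or BUSCO scoring scheme.

```python
def completeness(core_families, found_families, weights=None):
    """Percentage of the expected core gene families that were found.

    `weights` optionally maps family ID -> weight, so that some families
    can count more than others. Without it, every family counts equally.
    """
    core = set(core_families)
    if not core:
        raise ValueError("the core set is empty")
    if weights is None:
        weights = {family: 1.0 for family in core}
    total = sum(weights[family] for family in core)
    # Families found but not in the core set are ignored.
    hit = sum(weights[family] for family in core & set(found_families))
    return 100.0 * hit / total


core = ["HOM1", "HOM2", "HOM3", "HOM4"]  # invented family IDs
found = ["HOM1", "HOM3", "HOM9"]         # HOM9 is not a core family
print(completeness(core, found))

example_weights = {"HOM1": 1.0, "HOM2": 0.5, "HOM3": 0.25, "HOM4": 0.25}
print(completeness(core, found, example_weights))
```

The same function can be applied to the families detected in a genome assembly and to those represented in its annotated gene set, which gives the two scores compared later in the talk.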

Biases in the expected “conserved” gene space

A diverse set of genome quality metrics

Evaluation
• Arabidopsis and Oryza have consistently high completeness scores
• Over-estimation of completeness by CEGMA
• Lolium: discrepancy between genome and gene set completeness

Improving Lolium gene annotation
Evidence sources (# loci):
2 transcriptomes, aligned with GenomeThreader
 de novo assembly: 300k
 orthology-guided assembly: 80k
4 proteomes, aligned with GenomeThreader
 Brachypodium distachyon: 16k
 Oryza sativa: 11k
 Sorghum bicolor: 11k
 Zea mays: 10k
2 annotation sets
 Byrne et al. (2015): 28k
 ab initio predictions: 41k
EVM consensus: 39,967 loci
Haas et al. (2008), Gremme et al. (2005), Ruttink et al. (2013)

Updated completeness scores Lolium
[Figure: completeness score (%, axis from 75 to 100) for the Byrne et al. (2015) annotation and the EVM consensus. Labels: CEGMA (248, single copy), BUSCO (952, single copy), PLAZA CoreGF (7k gene families), transcript mapping, PLAZA CoreGF (3k gene families)]
>900 new coreGF loci found in the genome!

Evaluation
• Cicer: EST mapping score much lower than BUSCO geneset or coreGF score
More than half of the unmapped sequences are of non-plant origin (mostly from Fusarium oxysporum)
Proper taxonomic binning of expected transcripts is essential!
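
The binning step can be sketched as a simple filter applied before the mapping score is computed. This assumes every transcript already carries the taxonomic lineage of its best database hit; the function name, the lineage lists and the transcript IDs are invented for illustration.

```python
def split_by_taxon(best_hit_lineage, target="Viridiplantae"):
    """Split transcript IDs by whether their best-hit lineage contains `target`.

    `best_hit_lineage` maps transcript ID -> list of taxon names, ordered
    from broad to narrow. Transcripts with an empty lineage (no hit) end up
    in the second list together with the non-target hits.
    """
    keep, other = [], []
    for transcript_id, lineage in best_hit_lineage.items():
        (keep if target in lineage else other).append(transcript_id)
    return keep, other


# Invented example: two plant transcripts and one fungal contaminant
lineages = {
    "est1": ["Eukaryota", "Viridiplantae", "Cicer arietinum"],
    "est2": ["Eukaryota", "Fungi", "Fusarium oxysporum"],
    "est3": ["Eukaryota", "Viridiplantae", "Medicago truncatula"],
}
plant, non_plant = split_by_taxon(lineages)
print(plant)
print(non_plant)
```

Only the transcripts in the first list should be used as the expected gene space. Leaving contaminants in the denominator lowers the mapping score for reasons unrelated to assembly quality.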

Guidelines to assess the quality of a new genome sequence
1. Estimate genome size using different methods
2. Define and evaluate the expected gene space based on transcript mapping AND evolutionary conservation
 Cleaning and mapping transcripts
 Prefer coreGF/BUSCO over CEGMA to model expected conserved genes
3. Large differences in completeness scores between genome assembly / annotated gene set can point to gene prediction issues
4. To perform cross-species genome comparisons, focus on genomes with complete and contiguous assemblies
Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
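
Guideline 3 can be checked mechanically once both scores are available per method. The sketch below is only an illustration: the 5-point threshold, the function name and the example scores are assumptions, not values from the talk or the paper.

```python
def large_gaps(scores, max_gap=5.0):
    """Return {method: gap} where the genome score exceeds the gene set score.

    `scores` maps method name -> (genome completeness, gene set completeness),
    both in percent. `max_gap` is a hypothetical threshold in percentage points.
    """
    return {
        method: genome - geneset
        for method, (genome, geneset) in scores.items()
        if genome - geneset > max_gap
    }


# Invented scores for one species
example = {"BUSCO": (95.0, 80.0), "coreGF": (96.0, 94.0)}
print(large_gaps(example))
```

A method reported here means that conserved genes are detectable in the assembly but missing from the annotation, which is a reason to revisit gene prediction rather than the assembly itself.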

• Gene family annotation and phylogenetic trees
• Traceable functional annotation (GO/InterPro/MapMan)
• Colinearity and synteny
• Integrative gene orthology inference
 Highly integrative platform to translate knowledge from model to crop
• 55 species/genomes
• Highly scalable design
• Web-based mobile user interface
• Integrated Workbench for analysis of sets of genes
http://bioinformatics.psb.ugent.be/plaza/

Coverage of gene function information
blue = primary GO; green = GO projection (orthology + homology)
Gene descriptions
Gene Ontology (Biological Process)

TRAPID: analysis of non-model transcriptomes
 Homology-based ORFs detection incl. frameshift correction
 Gene family assignment
 Functional annotation based on Gene Ontology and/or protein domains
 Two reference databases: PLAZA 2.5 and OrthoMCL-DB
 Applications
 Sugar cane, wheat, Crocus sativus, conifers, Coffea arabica, Prunus
 Dinoflagellates, diatoms, worms, fishes
[Figure: SRA Viridiplantae transcriptomic data]
Van Bel, … & Vandepoele, Genome Biology 2013

Drought Tolerance Conferred to Sugarcane by Association with
Gluconacetobacter diazotrophicus: A Transcriptomic View of
Hormone Pathways
Vargas et al., PLoS One 2014

Further reading
Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
Proost, S., Van Bel, M. … and Vandepoele, K. (2015). PLAZA 3.0: an access point for plant comparative genomics. Nucleic Acids Res 43 (Database issue), D974-D981.
Vandepoele, K. (2017). A Guide to the PLAZA 3.0 Plant Comparative Genomic Database. Methods Mol Biol 1533, 183-200.
Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., and Vandepoele, K. (2013). TRAPID: an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes. Genome Biol 14, R134.
Code freely available to efficiently compute coreGF completeness score
Want a free PDF? Check out PLAZA poster

PLAZA 3.0 user statistics (2016)
 >11,000 users (+13%), 370K page views (+30%)
 Users from >95 countries
 Intensively used by
 academia (>400 citations)
 industry

PLAZA Workbench
 Create a custom gene set (~experiment) using gene identifiers or BLAST
 External/internal gene IDs (e.g. AN3, AT5G28640, GRMZM2G180246_T01)
 BLAST interface can be used to map sequence data from a non-model species to a reference species present in PLAZA
 A toolbox is available to analyze user-defined gene sets (~experiment)
 2,132 registered users processed 11,875 Workbench experiments
