Towards a Reference Genome for Switchgrass (Panicum virgatum) - Schmutz jeremy
1. Towards a Reference Genome for Switchgrass (Panicum virgatum)
Jeremy Schmutz, Jarrod Chapman, Jerry Jenkins, Jane Grimwood, Kerrie Barry, Gerald A. Tuskan, Daniel S. Rokhsar & many others
2. DOE Joint Genome Institute
Mission: Serving as a genomic user facility in support of DOE mission science
• Funded by Biological and Environmental Research (BER)
• Walnut Creek, CA
• ~270 employees
• HiSeq (9), MiSeq (6), PacBio (2), 454 (1)
• Includes partner laboratories such as HudsonAlpha
funded for specific goals
Bioenergy, Carbon cycling, Biogeochemistry
Plants, Fungi, Microbes, Metagenomes
3. JGI & BRCs
• Development of next-generation
bioenergy crops
• Discovery and design of enzymes and
microbes with novel biomass-degrading
capabilities
• Development of transformational
microbe-mediated strategies for biofuel
production
4. JGI Plant Program
Flagship Plant Genomes
• Flagship Comparative Genomics
• Resequencing & Population Diversity
• Transcriptomics & sequence based functional assays for Flagship Plants
• Community Organization
• Plant Customization
• QTLs and Genotype/Phenotype Links
5. JGI Plant Flagship Genomes
1. Provide complete genomic resources and
genomes of direct DOE mission importance
2. Support efforts for cellulosic biofuel
development and feedstock customization
3. Foster communities to develop research
programs around DOE plants
4. Build a solid foundation for diversity and
functional studies
7. Introduction to switchgrass
Plus:
• High cellulosic yields, marginal land, low input
plant
• Existing agronomy knowledge and breeding
from planting as a forage crop
• Perennial crop which can be annually harvested
after establishment
• Widespread native species in North America,
resistant to American pests
• Presumably large adaptive variation across the
growing regions
Minus:
• Widespread native species in North America,
very difficult to contain large scale plantings of
improved varieties
• Obligate outcrossing polyploid species
8. Switchgrass is difficult
Obligate outcrossing tetraploid!
• Difficult to inbreed
• 4 copies of genes (or maybe 2,
or 1 or more)
• Variation within and between
subgenomes
• Genome is only a reference for
one plant, AP13, not for all
switchgrass or even all
Panicum virgatum individuals
[Figure: example alleles across the four haplotypes]
A1 A2 B1 B2
A  C  A  A
A  A  C  C
A  T  C  G
C C
10. Switchgrass Reference Project
• Goal: Produce a reference genome of AP13
that can be used for everything from marker
assisted breeding and QTL identification to
direct functional work on understanding cell
wall biosynthesis
• Project has spanned several phases:
1. Resource development
2. Initial whole genome shotgun sequencing (v0.0)
3. Localization and assembly on chromosomes
(v1.0)
4. Ongoing improvement through direct
sequencing of localized regions (v2.0)
11. Project origins
• Started as a BRC project to
produce a reference genome
(initiated by JBEI in Nov. 2008)
• Cultivar selected by the group:
Alamo AP13
• DNA was isolated by Jeff
Bennetzen’s group @ UGA and
Rod Wing’s group @ AU for BAC
libraries and sequencing
• Began sequencing in early 2010,
work for developing resources
was in progress
13. BAC libraries & BES
• Generated BAC libraries with 2
cut sites, 330k BES and some
clone-based sequencing;
average insert sizes 110 kb and
144 kb.
• Clones available from CUGI
www.genome.clemson.edu
With: Pam Ronald’s Group @ JBEI
PLoS ONE, April 2012, Sharma et al.
14. EST & transcripts
• 510,000 Sanger ESTs from 9 tissues with C. Tobias
• 169,079 Sanger ESTs from cell wall, 11.5 million 454
ESTs from VS16 and AP13 with BESC/Noble
Table 1. Switchgrass 454-cDNA libraries and 454-ESTs
NCBI SRA accession # | JGI lib code | Tissues | Plant growth stage/conditions | # of good ESTs | Mean length (bp)
Summer VS16 454 data
SRX026147 | CFBB | Whole shoot | Leaf development | 259,106 | 201
SRX026148 | CFBC | Whole root | Leaf development | 205,466 | 222
SRX026149 | CFBF | Whole shoot | Stem elongation | 194,426 | 194
SRX026150 | CFBG | Whole root | Stem elongation | 174,053 | 190
SRX026151-2 | CFCZ | Whole shoot | Reproductive stage | 219,230 | 189
SRX026153-4 | CFFA | Whole root | Reproductive stage | 234,107 | 205
SRX026155-6 | CFFB | Panicles including seeds | Reproductive stage | 220,933 | 212
Sub-total | | | | 1,507,321 | 202
Alamo AP13 454 data
SRX057824 | CCXN | Whole shoot | Stem elongation | 733,173 | 202
SRX057825 | CCXO | Whole root | Stem elongation | 667,612 | 206
SRX057830 | CFXX | Whole shoot | Leaf development | 1,236,020 | 419
SRX057831 | CFXY | Whole root | Leaf development | 1,214,630 | 375
SRX057828 | CFXW | Whole shoot | Stem elongation | 1,357,290 | 223
SRX057829 | CGGO | Whole root | Stem elongation | 1,040,192 | 404
SRX057827 | CGFF | Whole shoot | Reproductive stage | 547,278 | 320
SRX057826 | CGFC | Whole root | Reproductive stage | 998,691 | 388
SRX057834 | CGTX | Panicles including seeds | Reproductive stage | 1,096,949 | 384
SRX057833 | CGFI | Whole shoot | Stem elongation 2 w/drought | 362,346 | 213
SRX057832 | CGGU | Whole root | Stem elongation 2 w/drought | 918,585 | 337
Sub-total | | | | 10,172,766 | 316
Total | | | | 11,680,087 |
Development of an integrated
transcript sequence database and a
gene expression atlas for gene
discovery and analysis in switchgrass
(Panicum virgatum L.) – Zhang et al.
2013 Plant Journal.
http://switchgrassgenomics.noble.org
15. Foxtail millet
• Foxtail millet sequenced with 8x Sanger sequence
• Demonstrate using it as a comparative basis to
reconstruct switchgrass chromosomes
Nature Biotech, May 13, 2012
17. Tetraploid Switchgrass
Began with 8.3x 454 linear sequencing and added 6.5x
454 XLR+ longer sequencing (14.5x total coverage)
78% longer read length
54% longer HQ length
2x the yield per run
18. V0.0 Release PAG 2012
• Linear 454 > 200 bp
• Sampled both the “A”
and “B” genomes in
the tetraploid
• Assembled using
Newbler V2.6
• Results:
• Contig N50 of 3.8 Kb
• 1.466 Gb of total
sequence assembled.
• 80 contigs > 50 Kb
• 1,663 contigs > 25Kb
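The contig N50 quoted above is the contig length at which contigs of that size or longer hold at least half of the assembled bases. A minimal sketch of that calculation follows; the contig lengths are made up for illustration and are not the switchgrass data.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total past half is the N50.
        if running >= half:
            return length
    raise ValueError("no contigs supplied")


# Illustrative lengths only, not real assembly data.
example = [8000, 5000, 3800, 3000, 2000, 1200]
print(n50(example))
```

With many short contigs, as in a fragmented tetraploid assembly, the N50 sits far below the longest contig, which is why a handful of contigs above 50 kb can coexist with an N50 of a few kb.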
19. Annotation?
• Annotation was based solely
on Sanger ESTs and homology
(foxtail millet, rice,
Brachypodium, and
sorghum)
• 65,878 total loci containing
protein-coding transcripts
• 4,193 total alternatively
spliced transcripts
For primary transcripts:
• Average number of exons: 4.1
• Median exon length: 160
• Median intron length: 126
• Complete genes: 47,302
• Incomplete gene with start codon: 5,862
• Incomplete gene with stop codon: 10,459
20. Is this a genome?
How do we organize these fractured 410k
pieces into a reference genome and put
them into the correct subgenome?
The Map! The Map! The Map!
21. Genetic map AP13 x VS16
Switchgrass mapping population planted out at Noble
22. Mapping strategy
[Diagram: VS16 x AP13 cross]
250 offspring (F1) of VS16 x AP13,
sequenced in pools of 10-12 (depth <~1X)
Find short sequences that are:
(1) Polymorphic in one parent
(2) Found in only one subgenome
(3) Not found in the other parent
These are simple markers to track by
resequencing F1 offspring.
Directly observe recombination in the
polymorphic parent.
Select markers like:
AAAAAAAATCTCGTATGCATGGAGTACTAAATGAAGTCTATTTGCAAAAC A 15 T 12
AAAAAAAATCTCTCCAGGGCAAAAATAAAAAAATGAAAAAGAAAAAAAAA A 13 C 14
AAAAAAAATCTTCGTGAGGAATTTTCTGTGCACTTTAAGTCTTCAATAAC A 12 G 14
113,325 AP13-derived markers and 236,622 VS16-derived markers
Mapping population: Malay Saha
Map development: Jarrod Chapman
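The example marker lines can be read as a 50-mer followed by two alleles and their read counts. A minimal sketch of parsing such lines and keeping candidates whose two alleles are both well supported and roughly balanced is given below. The column interpretation, the `is_balanced` helper, and its thresholds are assumptions for illustration; the slides do not state how markers were actually filtered.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    kmer: str
    allele_a: str
    count_a: int
    allele_b: str
    count_b: int


def parse_marker(line: str) -> Marker:
    """Parse 'KMER BASE COUNT BASE COUNT' (layout assumed from the slide)."""
    kmer, a, count_a, b, count_b = line.split()
    return Marker(kmer, a, int(count_a), b, int(count_b))


def is_balanced(m: Marker, min_count: int = 10, max_ratio: float = 1.5) -> bool:
    """Hypothetical filter: both alleles seen often enough and close to 1:1."""
    low, high = sorted((m.count_a, m.count_b))
    return low > 0 and low >= min_count and high / low <= max_ratio


lines = [
    "AAAAAAAATCTCGTATGCATGGAGTACTAAATGAAGTCTATTTGCAAAAC A 15 T 12",
    "AAAAAAAATCTCTCCAGGGCAAAAATAAAAAAATGAAAAAGAAAAAAAAA A 13 C 14",
    "AAAAAAAATCTTCGTGAGGAATTTTCTGTGCACTTTAAGTCTTCAATAAC A 12 G 14",
]
markers = [parse_marker(line) for line in lines]
kept = [m for m in markers if is_balanced(m)]
print(len(kept), "of", len(markers), "candidates kept")
```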
23. Initial VS16 map examples
First round of the map is based on 106 offspring
and VS16-specific subgenome differences, and
covers ~87K markers
24. New map examples
1. Second mapping round uses all 250 offspring and AP13 subgenome differences
2. Added additional markers from WGS assembly
3. 130k typed subgenome-specific markers + 418k markers that are linked to these
26. Organizing assembly with map
• Original Newbler Assembly:
• 14.5x read coverage
• 556,117 contigs (1,466.3 Mb)
• V0.0 Release at PAG Jan 2012:
• 410,030 contigs (1,358.1 Mb)
• 73,010 contigs (426 Mb) mapped
and annotated to 21,624 FTM
genes.
27. Assembly process
1. Bin contigs into linkage groups (189,942 binned)
2. Remove subgenome duplicates (35,683 removed)
3. Collapse subgenome contigs (36,467 collapsed)
Scaffolding (performed using ABySS, starting with 117,792 contigs):
4. Scaffold each linkage group using 5.5x, 2x250, 800 bp MiSeq
5. Scaffold using 18x, 2x100, 4 kb & 5 kb LFPE
6. Scaffold using 6x, 2x100, 9 kb LFPE
7. Eliminate redundant ends on adjacent scaffolded contigs
8. Position scaffolds on genetic map
9. Order scaffolds using P. hallii synteny
THIS SYSTEM IS NOT STATIC AND IS EASILY EXTENDED TO INCLUDE
ADDITIONAL DATA, CLONE SEQUENCE, LONG READ DATA, …
28. Panicum hallii (panic grass)
• Native southwest grass
• Closely related (~4 MYR) to
tetraploid switchgrass
• Drought tolerance model
• 660Mb, mostly inbred
• w/Tom Juenger at UT-Austin
29. P. hallii synteny
• 31mers used to identify shared
content
• P. virgatum scaffolds binned
on P. hallii
• Orientation of P. virgatum
scaffolds relative to P. hallii
determined using BLAT
[Figure: 65 P. virgatum scaffolds ordered on P. hallii super_61, before and after]
30. Adjacent subgenome duplicates
CHR 01: 2,291 corrections, 459 bp average
(100 bp to 4 kb), 1.05 Mb removed from 44.7 Mb
[Figure: original scaffold vs. corrected scaffold (1.4 kb eliminated)]
31. Map Integration
• 548,932 AP13 specific subgenome
markers used to position scaffolds and
syntenic blocks
• 56,088 map joins (with 10kb Ns) were
made for 18 (2x9) linkage groups
• Map positions contain sizable blocks of
contigs (10-20) that align to the same map
position and cannot be ordered or oriented;
they are placed within the context of other
scaffolds
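A map join with 10 kb of Ns can be sketched as simple string concatenation with a fixed gap. The function name and the toy sequences are hypothetical; only the 10 kb gap size comes from the slide.

```python
GAP = "N" * 10_000  # 10 kb of Ns per join, as stated on the slide


def join_scaffolds(ordered_seqs):
    """Concatenate scaffolds, already in map order, with a 10 kb N gap between each pair."""
    return GAP.join(ordered_seqs)


def count_joins(ordered_seqs):
    """Number of joins needed to link the scaffolds into one sequence."""
    return max(len(ordered_seqs) - 1, 0)


toy = ["ACGT", "GG", "T"]
joined = join_scaffolds(toy)
print(len(joined), count_joins(toy))
```

The fixed gap records that two scaffolds are adjacent on the map without claiming anything about the true distance between them.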
34. Clone alignments against scaffolds
[Figures: 158 kb clone on syntenic block; 107 kb clone on syntenic block]
35. Chromosome Assignments
• Asked switchgrass “power” users for recommendation
on naming
• Assigned using shared 21mer content with S. italica
• Designation of "a" assigned to the P. virgatum LG that
contained more shared 21mers with S. italica
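The "a" designation rule can be written down directly once shared 21-mer counts against S. italica are available for a homeologous pair of linkage groups. The sketch below is hypothetical: the function name, the linkage group names, and the counts are invented, and raising an error on a tie is an assumption, since the slide does not say how ties were handled.

```python
def designate_subgenomes(shared_counts):
    """Label a homeologous pair: 'a' for the linkage group sharing more
    21-mers with S. italica, 'b' for the other.

    shared_counts maps exactly two linkage group names to shared 21-mer counts.
    """
    if len(shared_counts) != 2:
        raise ValueError("expected exactly two linkage groups")
    (lg1, n1), (lg2, n2) = shared_counts.items()
    if n1 == n2:
        # Assumption: a tie is treated as undecidable rather than broken arbitrarily.
        raise ValueError("tie in shared 21-mer counts")
    return {lg1: "a", lg2: "b"} if n1 > n2 else {lg1: "b", lg2: "a"}


# Invented names and counts, for illustration only.
print(designate_subgenomes({"LG_x": 120_000, "LG_y": 95_000}))
```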
36. Panicum virgatum V1.0 release
• Read coverage: 14.5x
• Release size: 1.22 Gb
• Contig L50: 5.7 KB
• 636.1 Mb of sequence
in chromosomes
• 593.4 Mb off
chromosomes,
includes duplicate
sequences
37. Annotation – resources
• Latest JGI pipeline for integrating RNA-seq data and
available EST data
• Included: original Sanger ESTs, 454 ESTs, minimal
FLcDNAs, 370 million pairs from GLBRC cultivars,
710 million pairs of RNA-seq data from germinating
seed, stem-node, stem-internode, blade, immature
flower
• Homology: rice, Brachypodium, foxtail millet, sorghum, maize,
Arabidopsis, soybean, poplar and Swiss-Prot
39. Annotation results
• Primary transcripts (loci): 98,007
• Alternative transcripts: 27,432
• Total transcripts: 125,439
• For primary transcripts:
– Average number of exons: 3.9
– Median exon length: 183
– Median intron length: 133
Length | EST support | Peptide homology
100% | 57,584 | 7,311
95% | 62,327 | 33,251
90% | 63,848 | 41,653
75% | 66,121 | 58,319
50% | 68,391 | 77,960
PLEASE DO NOT GET ATTACHED TO GENE NAMES!
41. Paralogous gene analysis
• Genes “A”, “B”, and remaining
contigs aligned using BLASTP
• Alignments Screened:
• >80% identity and >80%
coverage
• Length of query and target
amino acid sequences had to be
within 20% of one another
• There are a total of:
• 29,357 “A” genes
• 27,522 “B” genes
• 41,128 genes in remaining
contigs
• 98,007 total genes
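The alignment screen above can be expressed as a small predicate over per-hit values. This is a sketch under assumptions the slide does not settle: coverage is taken as aligned length over query length, and "within 20%" is measured against the longer of the two sequences. The function and parameter names are hypothetical.

```python
def passes_screen(identity_pct, aligned_len, query_len, target_len):
    """Apply the >80% identity, >80% coverage, and 20% length-difference screen.

    Assumptions: coverage = aligned_len / query_len; the length difference is
    compared against 20% of the longer sequence.
    """
    coverage = aligned_len / query_len
    longer = max(query_len, target_len)
    lengths_close = abs(query_len - target_len) <= 0.20 * longer
    return identity_pct > 80 and coverage > 0.80 and lengths_close


# Invented hits: (identity %, aligned length, query length, target length)
hits = [
    (92.0, 270, 300, 310),  # passes all three criteria
    (79.9, 290, 300, 300),  # identity too low
    (95.0, 150, 300, 300),  # coverage too low
    (95.0, 290, 300, 400),  # lengths differ by more than 20%
]
print([passes_screen(*hit) for hit in hits])
```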
42. SNP rates in chromosomes
 | AP13 | VS16
Heterozygous SNPs | 1,449,600 | 581,106
Homozygous SNPs | 10,406 | 1,482,882
Heterozygous INDELs | 864 | 924
Homozygous INDELs | 1,920 | 9,391
Assembly Length | 1,103.8 Mb | 1,103.8 Mb
Callable Bases | 466.0 Mb | 241.6 Mb
Heterozygous Rate (Callable) | 3.111 per Kb | 2.405 per Kb
Homozygous Rate (Callable) | 0.022 per Kb | 6.137 per Kb
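The per-kb rates in the table appear to be SNP counts (indels excluded) divided by callable bases. That reading is inferred from the numbers, not stated on the slide. A minimal sketch of the arithmetic, which agrees with the tabulated rates to within rounding of the callable-base figures:

```python
def rate_per_kb(variant_count, callable_mb):
    """Variants per kilobase of callable sequence (callable size given in Mb)."""
    return variant_count / (callable_mb * 1000)


# Counts and callable bases copied from the table.
ap13_het = rate_per_kb(1_449_600, 466.0)
ap13_hom = rate_per_kb(10_406, 466.0)
vs16_het = rate_per_kb(581_106, 241.6)
vs16_hom = rate_per_kb(1_482_882, 241.6)

for label, value in [
    ("AP13 heterozygous", ap13_het),
    ("AP13 homozygous", ap13_hom),
    ("VS16 heterozygous", vs16_het),
    ("VS16 homozygous", vs16_hom),
]:
    print(f"{label}: {value:.3f} per kb")
```

The contrast between the two columns follows from AP13 being the reference individual: its own reads differ from the assembly mostly at heterozygous sites, while VS16 also carries many fixed differences.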
44. Towards switchgrass 2.0
1. Build new AP13 and VS16 maps from recent
sequence data (116 genotypes + 250 originals) to
help with subgenome localization
2. Upgrade mate pair data for AP13 to new, longer,
better, stronger mate pairs
3. Continue directed clone based sequencing of
switchgrass important regions
4. Version 2.0 of the genome with ~300-400 Mb of locally
sequenced contigs, integrated into chromosomes
PLEASE DO NOT GET ATTACHED TO ORDER!
45. Improvement project
Short scaffolds: selected properly projecting clones.
Long scaffolds: tiling path covering cell wall genes.
[Figure labels: Selected, Not Selected, Cell Wall EST, Redundant Clone]
 | Long Scaffs | Short Scaffs | Total
Chromosomes | 335 | 3,237 | 3,572
Remaining | 6 | 1,182 | 1,188
Total | 341 | 4,419 | 4,760
46. Improvement project
• 96 well clone based pool with individual indexes
• Sequenced as 2x250 on HiSeq2500, assembled and minimal
manual finishing
• Add 96 paired, sized libraries run as ¼ MiSeq run as needed
47. Switchgrass V2.0
Inputs: WGS contigs, clone contigs, new HiSeq frags, new LMP pairs, new genetic map
1. Bin contigs and clones into linkage groups
2. Remove subgenome duplicates from clones and contigs
3. Scaffold each linkage group using 2x250, 800 bp pairs
4. Scaffold using LMP pairs
5. Eliminate redundant ends on adjacent scaffolded contigs
6. Position scaffolds on genetic map
7. Order scaffolds using P. hallii synteny
48. Current JGI switchgrass projects
1. Community diversity project for up to 50 genotypes (12
sampled to date) – Laura Bartley
2. eQTL study of 90 genotypes (2 samples per) for biomass/cell
wall trait variants – with Laura Bartley and Malay Saha
3. BRC switchgrass projects, QTL mapping, engineered mutants
Purpose | Class | Targets | Genotypes | Contributors
Support Genetics | Mapping Parents | Biparental mapping parents, NAM parents | 22 | Saha, Brummer, Tobias, Wu, Bonos
 | Diversity | Each phylogenetic group from Lui et al. 2012, including octaploids; Mexican and NE accessions | 9 | Casler, Auer, Juenger
Genome Structure Determination | Genomic variants | dihaploid, selfed, intermediate genome size | 10 | Wu, Tobias, Brummer
Baseline Data | Interesting phenotypes | Transformed genotype, Other | 1 | Wang
Current total | | | 42 |
49. Panicum hallii projects
1. Produce draft assembly V1.0 for Panicum hallii
2. Diploid panicum eQTLs for segregating
drought/biomass traits in HAL x FIL cross
3. Diversity sampling
[Figure: Hall’s panicgrass diversity, west Texas to east Texas, Juenger Lab; labels: FIL2, HAL, HAL x FIL2]
51. Acknowledgments
DOE JGI
Jane Grimwood (HA)
Jerry Jenkins (HA)
Jarrod Chapman
Shengqiang Shu
Dan Rokhsar
Kerrie Barry
BESC
Gerald Tuskan, ORNL
Katrien Devos, UGA
Yi-Ching Lee, Noble
Malay Saha, Noble
Michael Udvardi, Noble
Jiyi Zhang, Noble
JBEI
Pam Ronald, UC-Davis
Manoj Sharma, UC-Davis
Rita Sharma, UC-Davis
Others
Laura Bartley, OU
Christian Tobias, SGEC
Chris Saski, CUGI (BAC libs)
Tom Juenger, UT-Austin
Funding Sources
DOE DE-AC02-05CH11231
ARRA UC Berkeley
52. Questions for discussion
• How often should we update the
switchgrass genome and annotation?
• What else can the JGI do that would be
immediately useful for the switchgrass
community?
• JGI CSP2015 deadline for LOI will be
March 2014
• Comprehensive community proposals are
greatly preferred!