4. ‘the complete set of phylogenetic trees derived
from the proteome of an organism’
Sicheritz-Pontén and Andersson, 2001. Nuc. Acids Res. 29: 545
genome-wide events
+
gene family-specific events
August 2012. At Daitoku-ji Temple, Kyoto
5. Hypothesis A / Hypothesis B / Hypothesis C
[Figure: three alternative hypotheses (A, B, C), each drawn on a tree of human, chicken, shark, cyclostomes (lamprey, hagfish), amphioxus, and tunicate]
- Mol. phylogeny of 55 gene families
Kuraku et al., 2009. MBE
- Globin gene phylogeny
Hoffmann et al., 2010. PNAS
- Sea lamprey genome analysis
Smith, Kuraku et al., 2013.
Nature Genetics
- Composition of Hox/Dlx clusters
Neidert et al., 2001. PNAS
Irvine et al., 2002. J Exp Zool B
Force et al., 2002. J Exp Zool B etc
- Mol. phylogeny of 33 gene families
Escriva et al., 2002. MBE
- Amphioxus genome
Putnam et al., 2008. Nature
- ParaHox clusters
Furlong et al., 2007. MBE
12. Our experiences at GRAS
・Main applications: RNA-seq & ChIP-seq
・Diverse non-model organisms for RNA-seq
・Troubleshooting with tight wet-dry communication
・Many requests with limited sample amounts
13. To obtain a complete genome and an original transcriptome
・Sequencers ‘can’ produce ‘data’ from problematic samples
Low quality DNA/RNA, contamination, over-amplification, …
・Look carefully for acceptable pricing and service contents
e.g. How many reads do you need?
・Longer Illumina reads are not necessarily beneficial
~150 bp on HiSeq and ~300 bp on MiSeq (as of September 2013)
Prep of libraries with longer inserts
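The question "How many reads do you need?" can be answered roughly from the target mean coverage. The sketch below is a back-of-the-envelope estimate only; the function name, the 1 Gb genome size, and the 50x target are illustrative assumptions, not values from the slides.

```python
import math

def reads_needed(genome_size_bp, target_coverage, read_length_bp, paired=True):
    """Estimate the number of reads (or read pairs, if paired=True)
    required to reach a target mean coverage.

    coverage = (number of reads * read length) / genome size
    """
    bases_needed = genome_size_bp * target_coverage
    bases_per_unit = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_unit)

# Hypothetical example: 1 Gb genome, 50x coverage, 150 bp paired-end reads
print(reads_needed(1_000_000_000, 50, 150))  # → 166666667 read pairs
```

The estimate ignores duplicates, adapter trimming, and uneven coverage, so the real requirement is usually higher.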
15. Species | Sequenced at | Gene model by | Sequencing technology | Published in | # of authors | Started in
sea lamprey | Wash. Univ. | Yandell lab / Ensembl | Sanger | Nat. Genet. (2013) | 59 | 2005?
soft-shelled turtle | BGI | BGI / Ensembl | Illumina | Nat. Genet. (2013) | 34 | 2010
coelacanth | Broad Institute | Broad / Ensembl | Illumina | Nature (2013) | 91 | 2011
16. Sequenced at Wash. Univ. Genome Institute
International consortium
Smith, Kuraku, et al. 2013.
Nature Genetics
Contributed analysis
Vertebrate ‘new genes’
GC & codon usage bias
Myelin-associated genes
In-house annotation effort
Trained gene prediction setting
available at Augustus web server
GC-content & codon usage bias
Qiu et al., 2011. BMC Genomics
Horizontal gene transfer
Kuraku et al., 2012. Genome Biol. Evol.
http://www.ensembl.org/Petromyzon_marinus/Info/Index
Coding genes: 10,415
Incomplete genome assembly: Pax6 missing
Incomplete gene annotation: Fgf8/17-A missing
(as of September 2013; release 73)
17. Amino acid composition
Methods: Correspondence analysis (CA) of the frequencies of the 20 amino acids
Deviation of ‘gene model’ in lamprey genome
Smith, Kuraku, et al. 2013. Nature Genetics
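For readers unfamiliar with correspondence analysis, the following is a minimal sketch of the standard SVD-based computation on a species-by-amino-acid count table. It is not the pipeline used in the cited paper; the function name and the toy counts are assumptions, and NumPy is assumed to be installed.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return (row_coords, col_coords, inertia) for a table of
    non-negative counts with no all-zero rows or columns."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                # correspondence matrix
    r = P.sum(axis=1)              # row masses
    c = P.sum(axis=0)              # column masses
    expected = np.outer(r, c)      # independence model
    # Standardized residuals from the independence model
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of rows (e.g. species) and columns (amino acids)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    inertia = sv ** 2              # variance explained per axis
    return row_coords, col_coords, inertia

# Hypothetical counts: 3 species (rows) x 4 amino acids (columns)
toy = [[30, 10, 5, 15],
       [12, 25, 8, 9],
       [6, 7, 20, 14]]
rows, cols, inertia = correspondence_analysis(toy)
print(rows[:, :2])                 # species positions on the first two axes
```

A species whose gene models have a deviating amino acid composition would sit apart from the others on the leading axes.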
18. Codon usage bias
Methods: RSCU (Sharp et al., 1986) and ENc (Wright, 1990)
[Figure: codon usage compared among sea lamprey, stickleback, Tetraodon, Takifugu, platypus, medaka, dog, human, mouse, ghost shark, zebrafish, chicken, anole lizard, opossum, and X. tropicalis; axis label 'N']
Heavy use of GC-rich codons
Qiu et al., 2011. BMC Genomics
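RSCU (relative synonymous codon usage) is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it for a single in-frame coding sequence using the standard genetic code; the function name and the toy sequence are assumptions, and ENc is not covered here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def rscu(cds):
    """Return {codon: RSCU} for amino acids that occur in the sequence
    and have more than one synonymous codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # drop a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = defaultdict(list)              # amino acid -> synonymous codons
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0 or len(family) == 1:
            continue                          # absent, or Met/Trp (always 1)
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values

# Toy sequence: four leucine codons, three of them the GC-rich CTG
result = rscu("CTGCTGCTGCTA")
print(result["CTG"], result["CTA"])  # → 4.5 1.5
```

In practice the counts would be pooled over many genes before computing RSCU; values well above 1 for GC-ending codons would reflect the bias noted on this slide.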
19. Genomic DNA
Sanger, 454, Illumina, and/or PacBio
Heterochromatin etc.
Raw reads
Assembly
Repeats, regions with low depth
Genome assembly (contigs/scaffolds)
Gene prediction (after ‘training’)
‘Unusual’ genes
‘Gene model’
(protein-coding sequences)
Reference: transcriptome, annotated genes in GenBank
20. Genomic DNA
Sanger, 454, Illumina, and/or PacBio
Raw reads
Assembly
Genome assembly (contigs/scaffolds)
Gene prediction (after ‘training’)
‘Gene model’
(protein-coding sequences)
Reference: transcriptome, annotated genes in GenBank
21. (cf. Assemblathon2 - Bradnam et al., 2013)
‘NG50’ instead of N50
CEGMA (Parra et al., 2007) – coverage of CEGs
CGAL, REAPR, ALE – evaluation by identifying misassemblies
QUAST – computation of assembly summary
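The difference between N50 and NG50 is only the denominator: N50 uses the total assembly length, NG50 uses the estimated genome size, which makes assemblies of different total length comparable. A minimal sketch follows; the function name and the toy contig lengths are assumptions.

```python
def nx50(lengths, total=None):
    """Length L such that sequences of length >= L together cover at
    least half of `total`.

    total=None          -> N50  (half of the assembly length)
    total=genome size   -> NG50 (half of the estimated genome size)
    Returns None if the assembly covers less than half of `total`.
    """
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

contigs = [80, 70, 50, 40, 30, 20]      # toy contig lengths (total 290)
print(nx50(contigs))                    # N50  → 70
print(nx50(contigs, total=400))         # NG50 → 50 (assumed genome size 400)
```

When the assembly is shorter than the genome, NG50 is lower than N50, so a fragmented but incomplete assembly cannot look good simply by omitting sequence.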
22. Species | Assembly release | # of CEGs found (including ‘partial’) | Published?
human | GRCh37 (hg19) | 248 | First draft in 2001
mouse | GRCm38 (mm10) | 239 | First draft in 2002
X. tropicalis | JGI_4.2 | 239 | Hellsten et al., 2010
coelacanth | LatCal1 | 236 | Amemiya et al., 2013
spotted gar | LepOcu1 | 235 | unpublished
soft-shelled turtle | PelSin_1.0 | 232 | Wang et al., 2013
anole lizard | AnoCar2.0 | 231 | Alföldi et al., 2011
zebrafish | Zv9 | 230 | Howe et al., 2013
chicken | galGal4 | 220 |
chicken | WASHUC2.63 (galGal3) | 210 | First draft in 2004
Japanese lamprey | LetCam1 | 199 | Mehta et al., 2013
sea lamprey | PerMar1 | 172 | Smith et al., 2013
little skate | version2 | 77 | unpublished
elephant shark | (1.4x) | 58 | Venkatesh et al., 2007
Total: 248 core eukaryotic genes (CEGs)
23. Genomic DNA
Sanger, 454, Illumina, and/or PacBio
Raw reads
Assembly
Genome assembly (contigs/scaffolds)
Gene prediction (after ‘training’)
‘Gene model’
(protein-coding sequences)
Reference: transcriptome, annotated genes in GenBank
24. (cf. Assemblathon2 - Bradnam et al., 2013)
‘NG50’ instead of N50
CEGMA (Parra et al., 2007) – coverage of CEGs
CGAL, REAPR, ALE – evaluation by identifying misassemblies
QUAST – computation of assembly summary
‘Annotation Turnover’ and ‘AED’ (Eilbeck et al., 2009)
Also, run CEGMA to check transcript diversity?
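AED (annotation edit distance) is commonly described as 1 minus the average of nucleotide-level sensitivity and specificity between two annotations of the same gene. The sketch below follows that description; the function names and exon coordinates are assumptions, and it does not reproduce any particular tool's implementation.

```python
def covered_positions(exons):
    """Set of positions covered by (start, end) intervals, ends inclusive."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions

def annotation_edit_distance(reference_exons, predicted_exons):
    """AED = 1 - (SN + SP) / 2 at the nucleotide level.

    SN: fraction of reference bases also covered by the prediction
    SP: fraction of predicted bases also covered by the reference
    0.0 means identical coverage; 1.0 means no overlap at all.
    """
    ref = covered_positions(reference_exons)
    pred = covered_positions(predicted_exons)
    if not ref or not pred:
        return 1.0
    shared = len(ref & pred)
    sensitivity = shared / len(ref)
    specificity = shared / len(pred)
    return 1.0 - (sensitivity + specificity) / 2

# Hypothetical gene: the prediction truncates the second exon
reference = [(1, 100), (201, 300)]
predicted = [(1, 100), (201, 250)]
print(annotation_edit_distance(reference, predicted))  # → 0.125
```

Building position sets is fine for single genes; for genome-scale comparisons an interval-based overlap would be the more efficient choice.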
27. - Phylogenetic properties of your species of interest
e.g. Ploidy level, distance to close relatives, …
www.genomesize.com, www.timetree.org
- Any clues about its molecular attributes?
e.g. GC-content, repeats, intron/UTR length, …
Using existing resources at SRA & Sanger traces at NCBI dbEST
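A quick first look at GC-content can be taken from any existing sequences for the species, for example ESTs or public reads. The sketch below computes overall GC-content and GC at third codon positions (GC3) for an in-frame coding sequence; the function names and the toy sequences are assumptions.

```python
def gc_content(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); NaN if none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else float("nan")

def gc3(cds):
    """GC-content at third codon positions of an in-frame coding sequence."""
    return gc_content(cds[2::3])

print(gc_content("ATGC"))       # → 0.5
print(gc3("ATGGCCAAG"))         # third positions G, C, G → 1.0
```

A GC3 far from that of well-annotated relatives is a hint that gene prediction parameters trained on those relatives may not transfer well.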
28. - Sequence the genome or the transcriptome?
Any existing or emerging resources?
- RNA-seq: sequence identification or quantification?
- Sample prep mostly determines the fate of the project
Quantification with Qubit; rRNA removal controlled with BioAnalyzer
Replication > Depth (Rapaport et al., 2013. Genome Biol.)
- Rigorous QC of prepared libraries before sequencing
ChIP-qPCR before ChIP-seq
29. - Fostering more productive sequencing facilities in Japan
GRAS welcomes visits from facility managers and staff
- Education of researchers with dual (wet/dry) capabilities
‘A sequencer or a bioinformatician?’
Learning material: ‘Unix & Perl for Biologists’ by Korf Lab
http://korflab.ucdavis.edu/unix_and_Perl/
- Importing latest information from overseas
→ shigehiro-kuraku@cdb.riken.jp