The document outlines the Tetrahymena thermophila macronuclear genome project led by TIGR in collaboration with Stanford and UCSB. The goals of the project were to generate 8x coverage of the macronuclear genome, assemble the genome sequences, and create databases for the sequence and annotation. The project achieved its goals, sequencing over 100 million bases and releasing the data publicly on their website and in GenBank.
2. Acknowledgements
• Ed Orias
• Members of Tetrahymena steering
committee
• Members of Tetrahymena Genome
Advisory Board
• NSF/Pat Dennis
• NIGMS/Tony Carter
• Tetrahymena research community
TIGR
3. Genome Project Planning -
coordinated by Ed Orias at UCSB
• 8/99 Workshop in Ciliate Genomics
• 10/99 First Meeting of Tetrahymena Genome Project
Steering Committee
• 10/00 Second Meeting of Tetrahymena Genome Project
Steering Committee
• 8/01 Third Meeting of Tetrahymena Genome Project
Steering Committee
5. Details of Project
• Collaboration between
– TIGR (Jonathan Eisen, Malcolm Gardner, Steven
Salzberg, others)
– Stanford (Mike Cherry)
– UCSB (Ed Orias)
• Funding
– NSF Microbial Genome Program
– NIH-NIGMS
6. Major Goals of Project
• ~8x coverage of macronuclear genome of
strain SB210
• Generation of genome assemblies
• Creation and maintenance of two genome
databases
– Sequence and automated-annotation - TIGR
– Tetrahymena Genome Database - Stanford
9. Why Tetrahymena?
• Model alveolate and ciliate
• Free living, pure culture, non pathogenic
• Genetic unicellular eukaryotic model:
• Processes and cellular components not found in
yeasts
• Organelle function: cilia, phagosome, nucleoli,
centrosomes
• Robust and novel molecular genetic tools
• Large research community
• Heterologous expression of alveolate genes
10. Major Discoveries Using Tetrahymena
• Dynein and its unidirectional motor
activity
• Ribozymes, self-splicing RNA
• Telomere structure, telomerase &
telomerase RNA
• Role of histone acetylation in control
of gene expression
• Role of RNAi in developmental DNA
rearrangements
11. Tools in Tetrahymena
• Genetic tools
– Conjugation, genetic-crossing, inducible self-fertilization
– Transformation, gene disruption, gene replacement
– Gene overexpression, ribosome antisense repression
• Many genomic resources
– Genetic maps (for mic and mac)
– Physical maps
– EST projects
• Ease of use
– Grows fast (1.5 h doubling) in pure culture
– Large cell size
– Large T° range for growth
– Storage in liquid N2
– Large scale sub-cellular compartment fractionation
12. Tetrahymena’s two nuclear genomes
Micronucleus (MIC)
Germline Genome
(Silent)
5 pairs of chromosomes
Macronucleus (MAC)
Somatic genome
(Expressed)
250-300 chromosomes
@ ~45 copies each
14. Macronuclear Genome
• Little repetitive DNA
• 180 Mbp genome
• Little evidence for large duplications
• No centromeres
• Few and small introns
• No alternative splicing reported
• Genes are lower AT (63%) than rest of the
genome (83%)
15. Major Achievements
• 8x coverage achieved September 20, 2003
• Shotgun assembly finished September 25,
2003
• Sequence and assembly data released to
TIGR web site October 1, 2003
• Traces released to NCBI trace archive
October 15, 2003
16. Why sequence the Mac?
• Advantages:
– It contains all the genes and control elements
required for life
– IES loss removes the vast majority of the
germline’s repeated sequences
• Special challenges
– Assembling a highly fragmented genome.
– Relating the MAC genome sequence to the MIC
genome.
17. Macronuclear DNA Libraries
Library   Size of DNA used (kb)   % Good sequences   % No insert
TTAAA     1.5-2.0                 95                 0
TUAAA     2.0-3.0                 90                 0
TXAAA     3.0-4.0                 88                 1
TYAAA     4.0-6.0                 85                 1
TQAAA     6.0-10.0                45                 27
Made by Bill Nierman at TIGR
18. Sequencing
• Sequencing done at the J. Craig Venter
Science Foundation’s Joint Technology
Center
• 1,197,106 reads, primarily from the 4-6
kb library
• Average edited length 815 bp
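As a rough cross-check of these figures, raw coverage can be estimated as reads times average read length divided by target size. The sketch below assumes the read count is 1,197,106 and uses the "bases in scaffolds" figure from the assembly slide as the target size; the function name is illustrative only. It gives about 9.2x, close to the reported 9.01x. The small gap is plausibly because the assembler counts only the reads it actually placed.

```python
# Figures quoted on the sequencing and assembly slides.
reads = 1_197_106                 # number of shotgun reads (assumed reading of the slide)
mean_read_length = 815            # average edited read length, bp
bases_in_scaffolds = 106_196_540  # assembled bases, taken as the target size


def raw_coverage(n_reads: int, read_len: float, target_size: int) -> float:
    """Total sequenced bases divided by the size of the target."""
    return n_reads * read_len / target_size


estimate = raw_coverage(reads, mean_read_length, bases_in_scaffolds)
print(f"{estimate:.2f}x")
```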
19. Assembly
• Celera Assembler with modifications by Mihai Pop,
Art Delcher, Steven Salzberg, et al.
Scaffolds            2988
Contigs              4223
Bases in scaffolds   106,196,540
Largest contig       715,652
Largest scaffold     2,217,035
Coverage             9.01
N50 scaffolds        464,449
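For readers unfamiliar with the N50 statistic in this table, here is a minimal sketch of the usual definition: sort the scaffolds from longest to shortest and report the length at which the running total first reaches half of all bases. The scaffold lengths below are toy values, not project data, and the function is an illustration rather than the Celera Assembler's own code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences at least as long as the N50 together contain
    at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


toy_scaffolds = [10, 8, 5, 3, 2]  # hypothetical lengths, not project data
print(n50(toy_scaffolds))
```

With the toy input, half of the 28 bases is reached inside the second-longest scaffold, so the function returns 8.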
20. Data Release
• All raw data is in the NCBI Trace Archive
• Sequences and assemblies are available at
http://www.tigr.org/tdb/e2k1/ttg/ and will
be available in GenBank
• Assemblies will be released monthly if
there are any improvements
21. Assorted statistics
Feature Stat
Number of “capped” scaffolds 114
Fraction of the genome residing in capped scaffolds 40%
Fraction of the genome residing in scaffolds capped on at least one end 75%
Post-genomic estimate of the number of MAC chromosomes 292
Number of sequenced RAPS found in single scaffolds 93/94 tested
Longest single-contig scaffold 716 kb
Longest scaffold 2.2 Mb
Longest capped scaffold (on both ends) 1.1 Mb
Shortest capped scaffold (on both ends) 37.5 kb
Estimated fold-redundancy of MIC sequence in the TIGR sequence database 0.1 fold
22. Accuracy?
• No scaffolds are larger than the corresponding
MAC chromosomes
• All independently assorting loci match different
scaffolds, and all co-assorting loci match either the
same scaffold or scaffolds whose summed size is less
than the size of the cognate MAC chromosome
• Previously obtained Cbs-adjacent sequences that
match to untelomerized scaffolds invariably do so
at scaffold ends.
23. Scaffold to MAC Chromosome Size Ratio
[Figure: scaffold-to-chromosome size ratio (y-axis, 0.00 to 1.80) plotted against MAC chromosome size in Mb (x-axis, 0 to 3.5). Legend: Observed; "0.9 & 1.1 Lines".]
24. Estimating the number
of MAC chromosomes
• 114 “closed” scaffolds (= MAC chromosomes)
encompass 40% of the genome sequence in scaffolds.
• If the size distribution of these scaffolds is
representative, then, by proportionality,
• The entire genome is estimated to contain ~290 MAC
chromosomes.
• This number falls within the range of earlier
estimates, suggesting that few, if any, MAC
chromosomes are missing from the TIGR
Tetrahymena sequence
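The proportionality argument on this slide can be written out as a short calculation. The sketch below uses the rounded 40% figure from the slide, which gives 285. The ~290 quoted here (and 292 on the statistics slide) would correspond to a slightly smaller, unrounded fraction near 39%. That is an inference from the arithmetic, not something the slides state. The function name is illustrative.

```python
closed_scaffolds = 114      # scaffolds with telomeres on both ends
fraction_of_genome = 0.40   # share of scaffold sequence they hold (rounded on the slide)


def estimate_total(count: int, fraction: float) -> float:
    """Scale a count by the fraction of the genome it represents."""
    return count / fraction


print(round(estimate_total(closed_scaffolds, fraction_of_genome)))

# Sensitivity to rounding of the 40% figure.
for fraction in (0.39, 0.40, 0.41):
    print(fraction, round(estimate_total(closed_scaffolds, fraction)))
```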
25. Assembly Issues
• rRNA and mitochondrial contigs are
considered “repetitive” due to the higher
depth of coverage
• Reran assembly in three subsets
– rRNA
– mitochondrial
– other sequences
26. Assembly 2
                     rRNA     Mitochondria   Major chromosomes
Scaffolds            2        1              1971
Contigs              2        1              2955
Bases in scaffolds   12,166   45,538         103,927,049
Largest contig                45,538         715,652
Largest scaffold     12,166   45,538         2,214,258
Coverage             635x     17.85x         9.08x
28. Tetrahymena Genome Database
• Phenotypes associated with gene knockouts,
replacements and other types of mutations.
• Gene regulation information from the literature.
• Post-translational modifications.
• Linkage & physical maps
• DNA polymorphisms
• Experimental protocols
• Links to other sites
30. Paul Doerder, Cleveland State
Immobilization antigens (i-ag)
Major GPI-linked cell surface protein
o related to surface proteins of disease-causing protists
o encoded by at least 8 families of paralogs expressed
under different conditions of temperature and salinity
o members of H, L, J and S families already sequenced
Tetrahymena Genome Project:
o additional H, L, J and S paralogs and pseudogenes have
been identified
o candidate I, T, M and P i-ag genes currently being
tested by RT-PCR and real-time PCR
31. Todd Hennessey, Buffalo
• Identified ecto-ATPase that he’s been trying to
clone for the past 7 years
• Making a knockout
• Identified "lysozyme receptor" that he’s been
trying to clone for the past 5 years
• We screened some antisense ribosome mutants,
got an interesting phenotype (extended backward
swimming in Ba++), BLASTed the short antisense
sequence into the database and now have 1.7kb of
sequence to use to make a knockout
32. Kathleen Karrer, Marquette
• We have just today had a paper accepted by
Eukaryotic Cell, pending revisions, which
was significantly enhanced by analysis of
the database. There are two undergraduate
co-authors on the paper.
33. Cliff Brunk, UCLA
T. thermophila genes detected by CUI
[Figure: CUI versus Gene Position. x-axis: Nucleotide Position (23500 to 43500); y-axis scale 0 to 70; labels: 1000/CUI, Nucleotide.]
34. David Asai, Harvey Mudd College
• Dynein heavy chains are very large ORFs (ca. 16
kb) and traditional cloning etc. has been a slow
go.
• We were able to use the database to complete the
determination of the sequence of the major
cytoplasmic dynein heavy chain gene, DYH1, and
we are extending our information on the second
cytoplasmic dynein heavy chain, DYH2.
• Further, we have been able to walk "in silico"
upstream of the DYH1 gene in order to make
constructs for the N-terminal tagging of the
heavy chain.
35. J. Smith, K. Belay, S. Beeser, A. Keuroghlian, R.E. Pearlman, K.W.M. Siu
TIGR – sequences
TIGR – scaffolds
Translate in 6 reading frames using ciliate code
Use these files as databases of all known proteins in Tetrahymena thermophila in these two mass spectrometry related searching programs (in-house):
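The translation step can be illustrated with a minimal six-frame translator. This is a sketch, not the group's in-house pipeline. It assumes the "ciliate code" is NCBI translation table 6, in which TAA and TAG encode glutamine and TGA is the only stop codon. The function names and the input sequence are made up for illustration.

```python
BASES = "TCAG"
# NCBI translation table 6 (ciliate nuclear code): TAA and TAG encode Q,
# leaving TGA as the only stop codon.
CILIATE_AA = "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: CILIATE_AA[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate one frame; codons containing ambiguous bases become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> dict:
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical input: under the standard code, TAA and TAG would be stops.
for name, protein in six_frames("ATGTAATAGTGA").items():
    print(name, protein)
```

In frame +1 the toy sequence reads through TAA and TAG as glutamine and stops only at TGA. This is why a standard-code translation would truncate many Tetrahymena proteins.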
36. Gel approach…
• Ciliary axonemal proteins from Tetrahymena thermophila
• Excise
• Digest with trypsin
– Identify based on tryptic fingerprint using translated T. thermophila database (MS-FIT).
– Sequence individual peptides and identify using MASCOT and translated T. thermophila database.
2D LC/MS/MS approach…
• Ciliary axonemal proteins from Tetrahymena thermophila
• Digest with trypsin
• Divide into 30 fractions using SCX
• Run each fraction on a 1.5 hour reverse phase gradient (C18 column) into a mass spectrometer, acquiring a CID spectrum of each peptide in the solution.
• Identify using MASCOT and translated T. thermophila database.
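Both approaches rely on matching observed tryptic peptides against peptides predicted from the translated database. A minimal sketch of the in-silico digestion step follows. It uses the commonly applied rule that trypsin cleaves after K or R unless the next residue is P. The protein string is hypothetical, and real search programs such as MASCOT also allow for missed cleavages and modifications, which this sketch ignores.

```python
def trypsin_digest(protein: str) -> list:
    """Cleave after K or R unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


# Hypothetical sequence, chosen to show the proline exception.
print(trypsin_digest("MKRPAAKGG"))
```

The R followed by P is not cleaved, so the example yields the peptides MK, RPAAK and GG.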
38. Preliminary Summary (using Gel approach):
Axonemal proteins found:
• Alpha Tubulin
• Beta Tubulin
• Unnamed protein product
• Axoneme central apparatus protein
• Chain A, Tryparedoxin Ii / Thioredoxin Peroxidase
/ Peroxiredoxin 2 / Natural Killer Cell Enhancing Factor
• Hypothetical Protein
• Dynein, 70 kDa intermediate chain
• Calmodulin like protein / Outer dynein arm-docking complex
• Axonemal leucine-rich repeat protein
• Testes specific A2 / Meichroacidin / phosphatidylinositol-4-phosphate
• invl / putative ankyrin repeat protein / Ankyrin 3
• Calmodulin
• Radial spokehead-like protein
• Flagellar Radial Spoke protein
• ABC transporter
Membrane proteins found (tubulins found in previous experiments):
• Hypothetical Protein
• Xenobiotic reductase
• SerH3 immobilization antigen
• NADH:flavin oxidoreductase
39. Preliminary Analysis of the Tetrahymena Phagosome Proteome
L. Klobutcher (Univ. Connecticut Health Ctr.) & R. Pearlman (York Univ.)
[Figure label: Oral Apparatus]
*Components of the mouse phagosome proteome
(Garin et al. J. Cell Biol. 152:165, 2001)
40. Doug Chalker, Wash. U.
Using the genome sequence to predict genes
that we are going to use this semester as the
focus of an undergraduate lab class.
We are going to knock out these genes and
study the phenotypes. This will bring
up-to-date research techniques into the
undergraduate classroom.
41. Marty Gorovsky, Rochester
• Expansion of a family of cysteine proteases
• Two new histone H3 genes
• One new histone H2A gene
42. Kapler: Gene Amplification and DNA Replication Con
rDNA minichromosome (21 kb)
Macronuclear development: amplified 5,000-fold
Vegetative replication: once per cell cycle
Biochemically purified trans-acting factors: TIF1, TIF4
TIGR genome sequencing project: Bioinformatics
Immediate impact on two funded research projects
• Kapler: NIH (GMS)
(Cis- and trans-acting determinants for replication and amplification
of the rDNA minichromosome)
Strong candidates identified for orthologs of Orc1,2,4,5,6, Cdc6,
Mcm2-6, Cdt1
• Kapler and Orias (co-PIs): NSF (Eukaryotic Genetics)
(Genetic dissection of replicons in non-rDNA chromosomes)
Complete sequence of 16 non-rDNA minichromosomes
(size range 37.4-99.5 kb)
43. ID new genes by blasting
• 3 new histones, including a cen-P homolog (Gorovsky)
• 16 new ciliogenesis-induced genes with known homologs (Gorovsky)
• 51 novel ciliogenesis-induced genes with no known homologs (Gorovsky)
• 55 new cysteine protease genes – only one in GenBank (Gorovsky)
• 8 strong candidates for proteins involved in replication and amplification of the rDNA minichromosome (Kapler)
• Completing the very long (~16 kb) dynein heavy chain ORFs (Asai)
• Orthologues of light chains and light intermediate chains characterized in other systems (Asai)
• 2-3 families of homing endonucleases (Karrer)
• 20 nuclear transport proteins; interest, MIC vs. MAC (Jahn)
• New heat shock proteins (Miceli)
• New stress response proteins (oxidative and UV), including some never reported in protozoa (Miceli)
• Subunits of heterotrimeric G-proteins (Miceli)
• Tetrahymenol (cholesterol surrogate) cyclase; bacterial-related, possible LGT (Matsuda)
• Many snoRNA candidate genes (Nielsen)
• New alternative family of U1-3 spliceosomal RNAs (Nielsen)
• Glutamic-dehydrogenase; regulation-wise, “missing link” between bacterial and animal GDH; lacks “off” switch, just like mutant GDH that in children causes insulin hypersecretion (Smith)
• 16 complete minichromosomes (37.5 to 99 kb) for a study of origins of replication (Kapler)
45. Other Ciliate Projects
• Paramecium genomic survey (Dr. Linda Sperling,
Centre de Genetique Moleculaire, CNRS, France)
• European rumen ciliate cDNA project (C. Jamie
Newbold, Rowett Research Institute, Aberdeen,
UK)
• Oxytricha (Spirotrich ciliate) micronuclear BAC
project (Laura Landweber, Princeton University)
• Ichthyophthirius EST sequencing proposal
(Theodore G. Clark, Cornell University)
46. Relating MIC and MAC genomes
• Paired sequence tags from MAC
chromosome ends adjacent to Cbs
junctions
• MIC:MAC relational genetic and physical
maps of sequenced DNA polymorphisms
(not shown)
48. Ordering and Orienting Tetrahymena MAC
Chromosome DNA in the Micronuclear
Genome: Genominoes
Chromosome Breakage
Junction Sequence
Scaffold
Sequence
49. Current state of MIC Genominoes
I’m sending you a Word document with the status
before I tel-linked the 273 additional scaffold
ends.
Their tel-adjacent sequence was blasted against our
paired Cbs tags on Friday.
I should be able to send you a slide with longer
“contigs” of scaffolds within the next couple of
days (please let me know what the hard deadline
is).
50. Fraction of the genome
in Tel-linked Scaffolds
Scaffold Number % genome
-----------------------------
Both tels 114 40
One tel 120 35
No tel 289 25
-----------------------------
Total tel-linked
scaffold ends: 348
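The totals in this table can be checked with a few lines of arithmetic. The sketch below simply re-derives the 348 tel-linked ends (two per doubly capped scaffold, one per singly capped scaffold) and the 75% of the genome in scaffolds capped on at least one end. The tuple layout is an illustrative choice.

```python
# (scaffold class, number of scaffolds, telomeres per scaffold, % of genome)
classes = [
    ("Both tels", 114, 2, 40),
    ("One tel", 120, 1, 35),
    ("No tel", 289, 0, 25),
]

tel_linked_ends = sum(count * tels for _, count, tels, _ in classes)
pct_capped_at_least_once = sum(pct for _, _, tels, pct in classes if tels > 0)

print(tel_linked_ends)           # total tel-linked scaffold ends
print(pct_capped_at_least_once)  # % of genome in scaffolds with at least one telomere
```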
Editor's Notes
Genome sizes estimated from careful cytospectrophotometry in the 1970s. 180 Mb = Drosophila size.
MAC chromosome copy # exception: rDNA @ ~9,000 copies per MAC (by quantitative DNA hybridization)
Chromosome #s:
MIC: Direct microscopic observations (1950s)
Quantitative measurements in stained pulsed-field gels (1980s)
Cbs = chromosome breakage site
IES = internally eliminated segment
Fig 2. Same data as in Fig. 1, except that the y-axis represents the scaffold-to-chromosome size ratio. The band bounded by purple dots represents the region of +/- 10% error associated with our pulsed-field gel/Southern blot/hybridization estimates of MAC chromosome size.
Sequences from a Cbs-enriched library of MIC DNA inserts provide sequence tags for the pair of MAC chromosome ends generated at each junction. Characterization of the entire set of Cbs junctions allows the order and orientation of MAC chromosomes in the MIC genome to be determined, as shown by analogy in the next slide.