1. Lecture 10:
EVE 161:
Microbial Phylogenomics
Lecture #10:
Era III: Genome Sequencing
UC Davis, Winter 2014
Instructor: Jonathan Eisen
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
2. Where we are going and where we have been
• Previous lecture:
! 9: rRNA Case Study - Built Environment
• Current Lecture:
! 10: Genome Sequencing
• Next Lecture:
! 11: Genome Sequencing II
4. insight progress
Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.
1. Library construction
(i) Isolate DNA
(ii) Fragment DNA
(iii) Clone DNA
2. Random sequencing phase
(i) Sequence DNA (15,000 sequences per Mb)
3. Closure phase
(i) Assemble sequences
(ii) Close gaps
(iii) Edit
(iv) Annotation
4. Complete genome sequence
Article text visible around the figure (two columns, truncated at both ends):
... analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria (refs 13, 14). This led to the suggestion of possible extensive gene exchanges between these species and archaeal ...
... be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes (refs 21, 22) and another finding that some types of gene might be more prone to gene transfer than others (ref 23). It seems therefore likely that horizontal gene ...
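The figure quotes 15,000 sequences per Mb for the random sequencing phase. As a rough sketch of what that read density means, the snippet below converts it to fold coverage and to the expected fraction of bases hit by no read under the usual Lander-Waterman idealization. The 500 bp average read length is an assumption for illustration; it is not given in the slides.

```python
import math

def shotgun_coverage(reads_per_mb, read_length_bp):
    """Average fold coverage for a given read density and read length."""
    return reads_per_mb * read_length_bp / 1_000_000

def expected_uncovered_fraction(coverage):
    """Idealized (Lander-Waterman style) chance that a base is hit by no read."""
    return math.exp(-coverage)

READS_PER_MB = 15_000   # from the figure
READ_LENGTH = 500       # assumed value, not given in the slides

c = shotgun_coverage(READS_PER_MB, READ_LENGTH)
print(f"coverage: {c:.1f}x")
print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.5f}")
```

Even at several-fold coverage a small fraction of the genome is expected to stay unsequenced by chance, which is one reason a separate closure phase follows the random phase.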
7. TIGR
8. Why Completeness is Important
• Improves characterization of genome features
• Gene order, replication origins
• Better comparative genomics
• Genome duplications, inversions
• Presence and absence of particular genes can be very
important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies
9. General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
10. General Steps in Analysis of Complete Genomes
• Structural Annotation
! Identification/prediction of genes
! Characterization of gene features
! Characterization of genome features
• Functional Annotation
! Prediction of gene function
! Prediction of pathways
! Integration with known biological data
• Evolutionary Annotation
! Comparative genomics
11. Structural Annotation I: Genes in Genomes
• Protein coding genes.
! In long open reading frames
! ORFs interrupted by introns in eukaryotes
! Take up most of the genome in prokaryotes, but only a
small portion of the eukaryotic genome
• RNA-only genes
! Transfer RNA
! ribosomal RNA
! snoRNAs (guide ribosomal and transfer RNA
maturation)
! intron splicing
! guiding mRNAs to the membrane for translation
! gene regulation—this is a growing list
12. Structural Annotation II: Other Features to Find
• Gene control sequences
! Promoters
! Regulatory elements
• Transposable elements, both active and defective
! DNA transposons and retrotransposons
! Many types and sizes
• Other Repeated sequences.
! Centromeres and telomeres
! Many with unknown (or no) function
• Unique sequences that have no obvious function
13. How to Find ncRNAs
• The most universal genes, such as tRNA and rRNA, are very conserved and thus
easy to detect. Finding them first removes some areas of the genome from further
consideration.
• One easy approach to finding common RNA genes is just looking for sequence
homology with related species: a BLAST search will find most of them quite easily
• Functional RNAs are characterized by secondary structure caused by base pairing
within the molecule.
• Determining the folding pattern is a matter of testing many possibilities to find the
one with the minimum free energy, which is the most stable structure.
• The free energy calculations are in turn based on experiments where short synthetic
RNA molecules are melted
• Related to this is the concept that paired regions (stems) will be conserved across
species lines even if the individual bases aren’t conserved. That is, if there is an A-U
pairing in one species, the same position might be occupied by a G-C in another
species.
• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is
cancelled by a compensating mutation at another site.
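The idea that a stem position stays paired while the bases themselves change can be checked directly on an alignment. The sketch below is a minimal illustration, not how any particular ncRNA finder works: it asks whether two alignment columns can pair in every sequence while using more than one kind of base pair. The toy sequences are made up.

```python
# Watson-Crick pairs plus the G-U wobble pair common in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def is_covarying(alignment, i, j):
    """True if columns i and j can pair in every sequence while using
    more than one distinct base pair across the alignment."""
    seen = set()
    for seq in alignment:
        pair = (seq[i], seq[j])
        if pair not in PAIRS:
            return False
        seen.add(pair)
    return len(seen) > 1

# Toy alignment (made-up sequences): columns 0 and 6 act as one stem position.
toy = ["AGGCCCU",
       "GGGCCCC",
       "CGGCCCG"]
print(is_covarying(toy, 0, 6))  # pairs A-U, G-C, C-G: paired but variable
print(is_covarying(toy, 1, 5))  # always G-C: paired but not variable
```

Columns that pair and vary together are stronger evidence for a structured RNA than columns that are simply conserved.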
14. RNA Structure
• RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intensive and breaks the usual rule that RNA base pairing follows a nested pattern.
– But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
– There is an element of dynamic programming here as well.
– And, “stochastic context-free grammars”, something I really don’t want to approach right now!
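The "start short, build to larger" strategy can be sketched with the classic Nussinov-style recursion. Note the swap: this toy version maximizes the number of nested base pairs instead of minimizing free energy, so it illustrates the dynamic programming only and is not a real folding program. The minimum hairpin loop size of 3 is an assumed parameter.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion).
    min_loop is the minimum number of unpaired bases inside a hairpin loop."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    # Fill the table from short subsequences to longer ones.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j is unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(nussinov_max_pairs("GGGAAAUCC"))
```

Because the recursion only ever combines an inner and a left-hand subproblem, it can represent nested pairs but not pseudoknots, which is the limitation the slide describes.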
15. Finding tRNAs
• tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
• Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.
• In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
16. Bacterial / Archaeal Protein Coding Genes
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.
– Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
– stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough so that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.
– But, a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon
– however, some aren’t recognizable
– genes in operons don’t always have a separate ribosome binding site for each gene
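The start and stop codon rules above translate into a simple ORF scan. The sketch below is a minimal illustration under stated assumptions: it looks at the forward strand only, uses only ATG/GTG/TTG as starts, and the `min_codons` cutoff is an arbitrary illustrative parameter. Because the true start may not be the first start codon in the ORF, it also reports the later in-frame candidates.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, alt_starts) tuples for forward-strand ORFs.
    start: index of the first in-frame start codon
    end: index just past the stop codon
    alt_starts: later in-frame start codons that could be the real start"""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        starts = []
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon in STOPS:
                # Close the ORF if it has a start and is long enough.
                if starts and (pos - starts[0]) // 3 >= min_codons:
                    orfs.append((starts[0], pos + 3, starts[1:]))
                starts = []
            elif codon in STARTS:
                starts.append(pos)
    return sorted(orfs)

print(find_orfs("CCATGAAAGTGCCCTAAGG"))
```

A fuller version would also scan the reverse complement and would not discard ORFs merely because they overlap a neighbour by a few codons.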
17. Composition Methods
• The frequency of various codons is different in coding regions as
compared to non-coding regions.
– This extends to G-C content, dinucleotide frequencies, and other
measures of composition. Dicodons (groups of 6 bases) are often
used
– Well documented experimentally.
• The composition varies between different proteins of course, and
it is affected within a species by the amounts of the various
tRNAs present
– horizontally transferred genes can also confuse things: they tend to
have compositions that reflect their original species.
– A second group with unusual compositions are highly expressed
genes.
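A composition method can be sketched as a log-odds score: estimate in-frame word frequencies from known coding and non-coding sequence, then sum the log ratio over a candidate region. This is a minimal illustration, not the method of any specific gene finder; the training sequences are made up, and `word=3` scores codons while `word=6` would score dicodons.

```python
import math
from collections import Counter
from itertools import product

def word_freqs(seqs, word=3):
    """In-frame word frequencies (word=3: codons, word=6: dicodons),
    smoothed with +1 pseudocounts so unseen words keep a nonzero frequency."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - word + 1, 3):
            counts[s[i:i + word]] += 1
    vocab = ["".join(p) for p in product("ACGT", repeat=word)]
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def log_odds(seq, coding, noncoding, word=3):
    """Sum of log(P_coding / P_noncoding) over in-frame words.
    Positive totals favour coding, negative totals favour non-coding."""
    score = 0.0
    for i in range(0, len(seq) - word + 1, 3):
        w = seq[i:i + word]
        if w in coding:  # skip words with ambiguous bases
            score += math.log(coding[w] / noncoding[w])
    return score

# Toy training data (made-up sequences, far too small for real use).
coding = word_freqs(["GCGAAAGCGCTG" * 5])
noncoding = word_freqs(["TTTTATAATTTA" * 5])
print(log_odds("GCGAAAGCG", coding, noncoding) > 0)
print(log_odds("TTTTATAAT", coding, noncoding) < 0)
```

Since horizontally transferred and highly expressed genes have unusual composition, a single trained table like this would tend to score them poorly.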
18. Eukaryotic Genes Harder to Find
• Some fundamental differences between prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
– First step: find repeated sequences and RNA genes
– Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.
• Most eukaryotic genes are split into exons and introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation usually starts at the first ATG in the mRNA
– thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.
• Many fewer eukaryotic genomes have been sequenced
19. Exons
• Exon sequences can often be identified by sequence conservation,
at least roughly.
• Dicodon statistics, as used for prokaryotes, are also useful
– eukaryotic genomes tend to contain many isochores, regions of
different GC content, and composition statistics can vary between
isochores.
• The initial and terminal exons contain untranslated regions, and
thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information about
the sequences surrounding each possible GT/AG pair (introns typically
begin with GT and end with AG), then running this information through
some combination of decision tree, Markov models, discriminant analysis,
or neural networks, in an attempt to massage the data into giving a
reliable score.
– In general, sites are more likely to be correct if predicted by multiple
methods
– Experimental data from ESTs can be very helpful here.
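The first step of splice junction prediction, collecting every possible donor/acceptor pair, can be sketched as follows. This assumes the canonical intron boundaries (GT at the 5' end, AG at the 3' end); the length limits are illustrative parameters, and the scoring step (Markov models, neural networks and so on) is deliberately left out.

```python
def candidate_introns(seq, min_len=20, max_len=10_000):
    """List (donor, acceptor_end) index pairs such that the candidate intron
    seq[donor:acceptor_end] starts with 'GT' and ends with 'AG'."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors
            if min_len <= a - d <= max_len]

# Toy sequence (made up): one GT, then 20 C's, then an AG.
toy = "AAGGT" + "C" * 20 + "AGGA"
for donor, end in candidate_introns(toy):
    print(donor, end, toy[donor:end])
```

The number of candidates grows quickly with sequence length, which is why the surrounding sequence context and EST evidence are needed to pick the real junctions.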
21. Functional Classification I: GO
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (Ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
– For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
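The difference between a directed acyclic graph and a tree shows up when collecting all the less specific terms implied by an annotation. The sketch below uses a hand-made toy ontology; the term "toy term X" and its parent links are invented for illustration and are not real GO relationships.

```python
# Toy ontology (made-up links): child term -> list of parent terms.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "toy term X": ["binding", "catalytic activity"],  # two parents: DAG, not tree
}

def ancestors(term, parents=PARENTS):
    """All less specific terms reachable by following parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in found:       # a term can be reached by several paths
            found.add(t)
            stack.extend(parents[t])
    return found

print(sorted(ancestors("toy term X")))
```

Annotating a gene product with a specific term therefore implies every ancestor along every path, which a simple tree walk to a single parent would miss.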
22. Functional Classification II: Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
• Enzyme functions are given unique numbers by the Enzyme Commission.
– Across many species, the enzymes that perform a specific function are usually
evolutionarily related. However, this isn’t necessarily true. There are cases of two
entirely different enzymes evolving similar functions.
– Often, two or more gene products in a genome will have the same E.C. number.
– E.C. numbers are four integers separated by dots. The left-most number is the
least specific
– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose
components indicate the following groups of enzymes:
• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a
polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• Top level E.C. numbers:
– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between
molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
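Because each E.C. number nests from least to most specific, it can be split into its prefixes mechanically. The sketch below is a small illustration using only the six top-level classes listed on this slide; the function names are hypothetical, and it expects the "EC 3.4.11.4" spelling rather than "E.C.".

```python
# Top-level classes as listed on the slide.
TOP_LEVEL = {1: "oxidoreductases", 2: "transferases", 3: "hydrolases",
             4: "lyases", 5: "isomerases", 6: "ligases"}

def ec_levels(ec):
    """Split 'EC 3.4.11.4' into nested prefixes, least specific first."""
    digits = ec.replace("EC", "").strip().split(".")
    if len(digits) != 4 or not all(d.isdigit() for d in digits):
        raise ValueError(f"not a four-part EC number: {ec!r}")
    return [".".join(digits[:i]) for i in range(1, 5)]

def ec_class(ec):
    """Name of the top-level class for an EC number."""
    return TOP_LEVEL[int(ec_levels(ec)[0])]

print(ec_levels("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```

Two gene products that share all four levels are predicted to catalyse the same reaction, even when, as noted above, they are not evolutionarily related.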
23. Functional Prediction
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).
• Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.
– Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
24. Functional Prediction II: Membrane Spanning
• Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
– There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions
– There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.
• There are 2 main protein structures that cross membranes.
– Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a bundle. Length = 15-30 amino acids.
– Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let small hydrophilic molecules diffuse across the bacterial outer membrane.
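Membrane-spanning alpha helices are stretches of mostly hydrophobic residues, so a common first pass is a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the 1.6 cutoff are conventional choices for that scale and are assumptions here, not values from the slides. It will not detect beta barrels.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydropathy_windows(protein, window=19):
    """Mean hydropathy of each full-length window along the protein."""
    vals = [KD[aa] for aa in protein.upper()]
    return [sum(vals[i:i + window]) / window
            for i in range(len(vals) - window + 1)]

def candidate_tm_starts(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to be helix candidates."""
    return [i for i, mean in enumerate(hydropathy_windows(protein, window))
            if mean > threshold]

# Toy protein (made up): charged ends around a 19-residue hydrophobic core.
toy = "D" * 10 + "LIVALLIVALLIVALLIVA" + "K" * 10
print(candidate_tm_starts(toy))
```

Overlapping windows above the cutoff would normally be merged into one predicted helix; that step is omitted to keep the sketch short.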
25. Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and
computational analyses
• Many diverse approaches
• All are improved by “phylogenomic” type analyses that
integrate evolutionary reconstructions and understanding
of how new functions evolve
26. Functional Prediction
• Identification of motifs
! Short regions of sequence similarity that are indicative
of general activity
! e.g., ATP binding
• Homology/similarity based methods
! Gene sequence is searched against a database of
other sequences
! If significantly similar genes are found, their functional
information is used
• Problem
! Genes frequently have similarity to hundreds of motifs
and multiple genes, not all with the same function
28. H. pylori genome - 1997
“The ability of H. pylori to
perform mismatch repair is
suggested by the presence of
methyl transferases, mutS
and uvrD. However,
orthologues of MutH and
MutL were not identified.”
30. Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins. Tip labels, several of them appearing more than once: Yeast, Human, Mouse, Rat, Fly, Xenla, Celeg, Arath, Spombe, Neucr, mSaco, Aquae, Strpy, Bacsu, Synsp, Deira, Helpy, Borbu, Metth, Trepa, Chltr, Theaq, Thema, Ecoli, Neigo.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
32. Overlaying Functions onto Tree
[Figure: the MutS family tree with subfamily labels overlaid on its clades: MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5, MSH6.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
33. MutS Subfamilies
• MutS1: Bacterial MMR
• MSH1: Euk - mitochondrial MMR
• MSH2: Euk - all MMR in nucleus
• MSH3: Euk - loop MMR in nucleus
• MSH6: Euk - base:base MMR in nucleus
• MutS2: Bacterial - function unknown
• MSH4: Euk - meiotic crossing-over
• MSH5: Euk - meiotic crossing-over
34. Functional Prediction Using Tree
[Figure: the MutS family tree with a function assigned to each subfamily clade:]
• MutS1 - Bacterial Mismatch and Loop Repair
• MutS2 - Unknown Functions
• MSH1 - Mitochondrial Repair
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
• MSH3 - Nuclear Repair Of Loops
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over
• MSH6 - Nuclear Repair Of Mismatches
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
36. Blast Search of H. pylori “MutS”
Sequences producing significant alignments:                  Score (bits)  E Value
sp|P73625|MUTS_SYNY3  DNA MISMATCH REPAIR PROTEIN            117           3e-25
sp|P74926|MUTS_THEMA  DNA MISMATCH REPAIR PROTEIN             69           1e-10
sp|P44834|MUTS_HAEIN  DNA MISMATCH REPAIR PROTEIN             64           3e-09
sp|P10339|MUTS_SALTY  DNA MISMATCH REPAIR PROTEIN             62           2e-08
sp|O66652|MUTS_AQUAE  DNA MISMATCH REPAIR PROTEIN             57           4e-07
sp|P23909|MUTS_ECOLI  DNA MISMATCH REPAIR PROTEIN             57           4e-07
• Blast search pulls up Syn. sp MutS#2 with a much higher score (much lower E value)
than other MutS homologs
• Based on this TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
37. High Mutation Rate in H. pylori
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
38. Phylogenomics
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart with three columns (EXAMPLE A, METHOD, EXAMPLE B). Example A follows genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2 and 3; Example B follows genes 1 to 6. The METHOD column lists the steps:]
1. CHOOSE GENE(S) OF INTEREST
2. IDENTIFY HOMOLOGS
3. ALIGN SEQUENCES
4. CALCULATE GENE TREE
5. OVERLAY KNOWN FUNCTIONS ONTO TREE
6. INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Bottom panel: ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN), showing the duplication that gave rise to the A and B genes. Inferred duplications in the gene trees are marked "Duplication?" and one inference is marked "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
40. Chemosynthetic Symbionts
Eisen et al. 1992. J. Bact. 174: 3416
41. Carboxydothermus hydrogenoformans
• Isolated from a Russian hot spring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome determined (Wu et al. 2005 PLoS Genetics 1: e65)
42. Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
43. Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
44. Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in organisms of interest against all other genomes
• Ask: Yes or No, is each gene found in each other species?
• Cluster genes by distribution patterns (profiles)
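The three steps above can be sketched in a few lines once the search results are reduced to yes/no calls. This is a minimal illustration: the gene and genome names are hypothetical, the presence calls are given rather than computed from searches, and genes are grouped only when their profiles match exactly, whereas a real analysis would typically cluster similar profiles too.

```python
from collections import defaultdict

def cluster_by_profile(presence, genomes):
    """Group genes whose yes/no presence pattern across genomes is identical."""
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(g in found_in for g in genomes)
        clusters[profile].append(gene)
    return dict(clusters)

# Toy data (hypothetical gene and genome names).
genomes = ["genome1", "genome2", "genome3", "genome4"]
presence = {
    "geneA": {"genome1", "genome3"},
    "geneB": {"genome1", "genome3"},
    "geneC": {"genome1", "genome2", "genome3", "genome4"},
}

clusters = cluster_by_profile(presence, genomes)
for profile, genes in clusters.items():
    print("".join("1" if p else "0" for p in profile), sorted(genes))
```

Genes that share a profile (here geneA and geneB) become candidates for involvement in the same process, even without any sequence similarity between them.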
45. Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
46. B. subtilis new sporulation genes
47. Functional Prediction III: Colocalization
• Operon structure is often maintained over fairly large taxonomic ranges.
– Sometimes gene order is altered, and sometimes one or more enzymes are missing.
– But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.
• This is an operon that contains part of the glycolytic pathway.
– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate dehydrogenase
– 6: central glycolytic gene regulator
53. After the Genomes
• Better analysis and annotation
• Comparative genomics
• Functional genomics (Experimental analysis of gene
function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome wide genetic experiments