Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.
2. The Genome
•
The genome contains all the biological information required to
build and maintain any given living organism
•
The genome contains the organism's molecular history
•
Decoding the biological information encoded in these molecules
will have an enormous impact on our understanding of biology
3. Genomics
1.
Structural genomics: genetic and physical mapping of genomes.
2.
Functional genomics: analysis of gene function (and of non-gene elements).
3.
Comparative genomics: comparison of genomes across species.
Includes structural and functional genomics.
Evolutionary genomics.
4. Human Genome Project
The Human Genome Project promised to revolutionise medicine and explain every base of our DNA.
Large MEDICAL GENETICS focus
Identify variation in the genome that is disease causing
Determine how individual genes play a role in health and disease
5. Human Genome Project & Functional Genome
It cost about 3 billion dollars and took 13 years to complete (1990 to 2003), 2 years fewer than the 15 initially planned.
•
Approx 200 Mb still in progress
– Heterochromatin
– Repetitive
6. Genomics & Genome annotation
The first genome annotation software system was designed in 1995 by Dr. Owen White at The Institute for Genomic Research, the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae
It involves assembling the reads to form contigs, then either aligning them to a reference genome (reference assembly) or performing de novo assembly to obtain the complete genome
Variations such as mutations, SNPs, InDels etc. can be identified
The genome is then annotated by structural and functional annotation
The whole genome is mapped as an image in an easily understandable manner
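The variant identification step above can be illustrated with a minimal Python sketch. It assumes the sample has already been aligned to the reference and that both rows are given as equal-length strings with "-" for gaps; the function name call_variants and the tuple layout are hypothetical choices for illustration, not the output of any particular variant caller.

```python
def call_variants(ref_aln, sample_aln):
    """List differences between two gapped, equal-length alignment rows.

    Returns tuples of (reference position, type, reference base, sample base).
    Positions are 1-based on the ungapped reference; an insertion is reported
    at the reference position immediately before it.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned rows must have equal length")
    variants = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1          # advance only on real reference bases
        if r == s:
            continue              # identical column, nothing to report
        if r == "-":
            variants.append((ref_pos, "insertion", "-", s))
        elif s == "-":
            variants.append((ref_pos, "deletion", r, "-"))
        else:
            variants.append((ref_pos, "SNP", r, s))
    return variants


if __name__ == "__main__":
    for variant in call_variants("ACGT-ACGT", "ACCTTAC-T"):
        print(variant)
```

Real pipelines work from read alignments and quality scores rather than a single pairwise alignment, so this only shows the bookkeeping of positions and variant types.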
11. Genome Annotation
The process of identifying the locations of genes and coding regions in a genome and determining what those genes do
Finding the structural elements and attaching their related functions to each genome location
12. Genome Annotation
Gene structure prediction
Identifying elements (introns/exons, CDS, start, stop) in the genome
Gene function prediction
Attaching biological information to these elements, e.g. which protein an exon will code for
13. Structural annotation
Structural annotation - identification of genomic elements
Open reading frames and their localisation
gene structure
coding regions
location of regulatory motifs
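As a rough illustration of locating open reading frames, the sketch below scans the three forward-strand reading frames for an ATG followed in-frame by a stop codon. The function name find_orfs and the min_codons parameter are assumptions made for this example; the reverse strand, alternative start codons and introns are deliberately ignored, so this is not how a production gene finder works.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Reporting the frame alongside the coordinates keeps the "localisation" part of the annotation explicit.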
16. Genome Repeats & features
Polymorphic between individuals/populations
Percentage of repetitive sequences in different organisms
Genome               Genome Size (Mb)    % Repeat
Aedes aegypti        1,300               ~70
Anopheles gambiae    260                 ~30
Culex pipiens        540                 ~50
Microsatellite
Minisatellite
Tandem repeat
Short tandem repeat
SSR
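Short tandem repeats such as the microsatellites/SSRs listed above can be found with a simple pattern search. The sketch below is a minimal Python example using a regular expression with a back-reference; the function name find_ssrs and the default thresholds (unit length 1 to 6, at least 4 copies) are assumptions chosen for illustration, and only perfect repeats are detected.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats of a short unit.

    Returns (start, end, unit, copies) with 0-based, end-exclusive coordinates.
    The lazy quantifier makes the regex prefer the shortest repeating unit.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


if __name__ == "__main__":
    example = "GGC" + "AT" * 5 + "TTC" + "CAG" * 4 + "T"
    for hit in find_ssrs(example):
        print(hit)
```

Dedicated repeat finders also handle imperfect and interspersed repeats, which this pattern does not attempt.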
17. Finding repeats as a preliminary to gene prediction
Repeat discovery
Homology based approaches
Use RepeatMasker to search the genome and mask the sequence
18. Masked sequence
A repeat-masked sequence is an artificial construction in which regions thought to be repetitive are marked with X’s
Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs (transposable elements) in the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
Positions/locations are not affected by masking
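To make the point about coordinates concrete, here is a minimal Python sketch of hard masking. It assumes the repeat intervals are already known (for example from a repeat finder) and are given as 0-based, end-exclusive (start, end) pairs; the function name hard_mask and that interval convention are assumptions for this example, not RepeatMasker's own interface.

```python
def hard_mask(seq, repeats, mask_char="x"):
    """Replace every base inside the given intervals with mask_char.

    The sequence length is unchanged, so downstream coordinates stay valid.
    """
    masked = list(seq)
    for start, end in repeats:
        # Clip intervals to the sequence so bad input cannot grow the string.
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    original = "atgagcttcgatagcgatcag"
    masked = hard_mask(original, [(6, 12)])
    print(original)
    print(masked)
    print(len(original) == len(masked))
```

Because each repetitive base is swapped for exactly one mask character, a feature found at position N in the masked sequence is at position N in the original.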
19. Types of Masking - Hard or Soft?
Sometimes we want to mark up repetitive sequence but not to exclude it from
downstream analyses. This is achieved using a format known as soft-masked, in which repetitive regions are written in lower case and the rest of the sequence in upper case
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
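Because a soft-masked sequence keeps the underlying bases, the hard-masked form and the repeat coordinates can both be derived from it. The Python sketch below illustrates this under the usual convention that lower case marks repeats; the function names soft_to_hard and masked_intervals are hypothetical helpers written for this example.

```python
import re


def soft_to_hard(seq, mask_char="X"):
    """Turn a soft-masked sequence into a hard-masked one.

    Every lower-case (repeat) base is replaced; upper-case bases are kept.
    """
    return re.sub(r"[a-z]", mask_char, seq)


def masked_intervals(seq):
    """Return (start, end) of each lower-case run, 0-based and end-exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


if __name__ == "__main__":
    soft = "TACTATTggcttctctagaATCTCG"
    print(soft_to_hard(soft))
    print(masked_intervals(soft))
    print(soft.upper())  # removing the soft mask is just upper-casing
```

The reverse conversion is not possible: once bases are replaced by a mask character the original sequence cannot be recovered, which is why soft masking is preferred when repeats should stay available to downstream tools.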
21. Structural annotation
Identification of genomic elements
Open reading frames and their localization
Coding regions
Location of regulatory motifs
Start/Stop
Splice sites
Non-coding regions/RNAs
Introns
22. Methods
Similarity
•
Similarity between sequences, which does not necessarily imply any evolutionary linkage
Ab initio prediction
•
Prediction of gene structure from first principles using only the genome sequence
25. Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
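Problem: handling fuzzy alignment regions around splice sites
Examples: EST2Genome, exonerate, genewise
(Augustus and Prodigal are primarily ab initio predictors, although Augustus can also take alignment evidence as hints)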
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
27. Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their
structure or by similarity with known examples
tRNAscan - uses an HMM and covariance model for prediction of tRNA genes
Rfam - a suite of covariance models and HMMs trained against a large number of different RNA gene families
28. Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set
34. Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Regulation and interactions involved
Expression
•
Utilize the known structural annotation to predict the protein sequence
35. Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against non-redundant protein databases (NCBI nr, SwissProt) to find similar sequences
Visually assess the top 5-10 hits to identify whether these have been assigned a function
Functions are assigned to the predicted elements on the basis of these hits
36. Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein Domain
Secretome
See http://expasy.org/tools/ for a good list of possible prediction algorithms
37. Functional annotation - Other features (Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO)
Cellular component
Molecular function
Biological process
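One way to picture GO-based annotation is as a table of gene-to-term assignments that can be grouped by the three aspects above. The Python sketch below is only an illustration: the gene identifiers are made up, the tuple layout is an assumption, and group_by_aspect is a hypothetical helper rather than part of any GO toolkit. The GO terms used (ATP binding, nucleus, translation) are real terms chosen as examples.

```python
from collections import defaultdict

ASPECTS = ("cellular_component", "molecular_function", "biological_process")


def group_by_aspect(annotations):
    """Group (gene_id, go_id, aspect, term_name) tuples by GO aspect.

    Returns {aspect: {gene_id: [(go_id, term_name), ...]}}.
    """
    grouped = {aspect: defaultdict(list) for aspect in ASPECTS}
    for gene_id, go_id, aspect, term_name in annotations:
        if aspect not in grouped:
            raise ValueError("unknown GO aspect: %s" % aspect)
        grouped[aspect][gene_id].append((go_id, term_name))
    return {aspect: dict(genes) for aspect, genes in grouped.items()}


if __name__ == "__main__":
    # Hypothetical gene products annotated with real GO terms.
    example = [
        ("gene_0001", "GO:0005524", "molecular_function", "ATP binding"),
        ("gene_0001", "GO:0005634", "cellular_component", "nucleus"),
        ("gene_0002", "GO:0006412", "biological_process", "translation"),
    ]
    for aspect, genes in group_by_aspect(example).items():
        print(aspect, genes)
```

Keeping the aspect with each assignment makes it easy to ask separate questions about where a product acts, what it does at the molecular level, and which process it takes part in.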
38. Practical - FUNCTIONAL ANNOTATION
Homology Based Method
Set up a BLAST database (nucleotide/protein)
BLAST the genome.fasta sequences against it for annotations (nucleotide/protein)
Filter and sort the BLAST hits by E-value, keeping hits at or below a cutoff (e.g. E-value <= 0.01)
Assign functions
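The filtering step of the practical can be sketched in Python as follows. The example assumes the BLAST results were saved in the standard 12-column tabular format (outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name best_hits, the 0.01 cutoff and the query/subject names in the sample data are assumptions for illustration only.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=0.01):
    """Keep the best hit per query among hits with E-value <= max_evalue.

    "Best" means lowest E-value, with ties broken by highest bit score.
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                      # skip blank and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue                      # lower E-values are better
        key = (evalue, -bitscore)
        current = best.get(hit["qseqid"])
        if current is None or key < current[0]:
            best[hit["qseqid"]] = (key, hit["sseqid"], evalue, bitscore)
    return {query: entry[1:] for query, entry in best.items()}


if __name__ == "__main__":
    # Made-up hits for two hypothetical predicted ORFs.
    sample = "\n".join([
        "orf1\tsubjectA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380",
        "orf1\tsubjectB\t45.0\t180\t90\t5\t5\t180\t10\t190\t1e-20\t95.5",
        "orf2\tsubjectC\t30.0\t60\t40\t2\t1\t60\t1\t60\t0.5\t28.1",
    ])
    print(best_hits(sample))
```

The function of the best subject sequence is then a candidate annotation for the query, which still needs the manual check of the top hits described earlier.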
40. Conclusion
Annotation accuracy depends on the supporting data available at the time of annotation; updating the annotation is necessary
Gene predictions will change over time as new data become available (NCBI), including sequences that are more similar than the previous ones
Functional assignments will change over time as new data become available (characterization of hypothetical proteins)
Try to describe Genome annotation as a process
Emphasize the ongoing nature of annotation.
There is no real end point to the annotation process (only artificially defined ones)
Best to think of this as a ‘best guess’ annotation