Genome annotation 2013

Genome Annotation
Karan Veer Singh,
Scientist.
NBAGR, Karnal,
India


The Genome
• The genome contains all the biological information required to build and maintain any given living organism
• The genome contains the organism's molecular history
• Decoding the biological information encoded in these molecules will have an enormous impact on our understanding of biology

Genomics
1. Structural genomics: genetic and physical mapping of genomes.
2. Functional genomics: analysis of gene function (and non-genes).
3. Comparative genomics: comparison of genomes across species.
   • Includes structural and functional genomics.
   • Evolutionary genomics.

Human Genome Project

The Human Genome Project promised to revolutionise medicine and explain every base of our DNA.
Large MEDICAL GENETICS focus:
• Identify variation in the genome that is disease causing
• Determine how individual genes play a role in health and disease

Human Genome Project & Functional Genome

• It cost about 3 billion dollars and took 13 years to complete (1990 to 2003), 2 fewer than the 15 initially planned.
• Approx. 200 Mb still in progress
  – Heterochromatin
  – Repetitive

Genomics & Genome annotation

• The first genome annotation software system was designed in 1995 by Dr. Owen White at The Institute for Genomic Research, which sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae
• Genome analysis involves assembling the reads into contigs, then assembling against a reference genome (reference assembly) or by de novo assembly to obtain the complete genome
• Variations such as mutations, SNPs, InDels etc. can be identified
• The genome is then annotated by structural and functional annotation
• The whole genome is mapped as an image in an easily understandable manner

Input 1 to Genome Viewer: Variant Annotation

Input 2 to Genome Viewer: Structural Annotation
• Structural Annotation: AUGUSTUS (version 2.5.5)

Input 3 to Genome Viewer: Functional Annotation

Genome Annotation
• The process of identifying the locations of genes and the coding regions in a genome to determine what those genes do
• Finding and attaching the structural elements and their related functions to each genome location

Genome Annotation

Gene structure prediction: identifying elements (introns/exons, CDS, stop, start) in the genome

Gene function prediction: attaching biological information to these elements, e.g. which protein an exon will code for

Structural annotation
Structural annotation: identification of genomic elements
• Open reading frames and their localisation
• Gene structure
• Coding regions
• Location of regulatory motifs

Functional annotation
Functional annotation: attaching biological information to genomic elements
• Biochemical function
• Biological function
• Regulation involved

Genome annotation - workflow
Genome sequence

Repeats

Masked or un-masked genome sequence
Structural annotation-Gene finding
nc-RNAs (tRNA, rRNA),
Introns

Protein-coding genes

View in Genome viewer

Genome Repeats & features
Polymorphic between individuals/populations
• Percentage of repetitive sequences in different organisms

Genome              Genome Size (Mb)    % Repeat
Aedes aegypti       1,300               ~70
Anopheles gambiae   260                 ~30
Culex pipiens       540                 ~50

• Microsatellite
• Minisatellite
• Tandem repeat
• Short tandem repeat
• SSR
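To make the repeat classes above concrete, here is a minimal sketch of how perfect short tandem repeats (microsatellites/SSRs) could be located with a regular expression. The function name `find_ssrs` and its thresholds (unit length 1 to 6 bp, at least 4 copies) are illustrative assumptions, not values from the slides, and real repeat finders also tolerate imperfect copies.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats in a DNA string.

    Returns a list of (start, end, unit, copies) tuples with 0-based,
    end-exclusive coordinates. Thresholds are illustrative assumptions.
    """
    # ([ACGT]{a,b}?) captures the shortest possible repeat unit;
    # \1{n,} then requires at least n further copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


if __name__ == "__main__":
    # A (CA)5 dinucleotide repeat flanked by non-repetitive bases.
    print(find_ssrs("GGCACACACACATT"))
```

The coordinates returned here are positions in the unmodified sequence, which is what a masking step would need.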


Finding repeats as a preliminary to gene prediction
• Repeat discovery
  – Homology based approaches
  – Use RepeatMasker to search the genome and mask the sequence

Masked sequence

• A repeatmasked sequence is an artificial construction in which the regions thought to be repetitive are marked with X's
• Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs in the final annotation set

>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct

>my sequence (repeatmasked)
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct

Positions/locations are not affected by masking

Types of Masking: Hard or Soft?

• Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved using a format known as soft-masking, in which repeats are written in lowercase; hard-masking replaces them with x's.

>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT

>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT

>my sequence (hardmasked)
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
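The relationship between the two masking styles can be sketched in a few lines: a soft-masked sequence carries enough information to derive the hard-masked one, and neither changes sequence length, so coordinates stay valid. The helper names `hard_mask` and `masked_intervals` are hypothetical, and the sketch assumes the convention that unmasked bases are uppercase and repeats lowercase.

```python
import re


def hard_mask(soft_masked, mask_char="x"):
    """Replace every lowercase (soft-masked) base with mask_char.

    Assumes unmasked bases are uppercase, as in the soft-masked example.
    """
    return "".join(mask_char if base.islower() else base for base in soft_masked)


def masked_intervals(soft_masked):
    """Return 0-based, end-exclusive (start, end) runs of lowercase bases."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", soft_masked)]


if __name__ == "__main__":
    soft = "ATGAGCggcttcATCTCG"
    hard = hard_mask(soft)
    # Masking never inserts or deletes characters, so positions are preserved.
    assert len(hard) == len(soft)
    print(hard)
    print(masked_intervals(soft))
```

Keeping the soft-masked file and deriving the hard-masked one on demand avoids losing the repeat sequence itself.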

Genome sequence

Map repeats

Masked or un-masked
Gene finding- structural annotation
nc-RNAs, Introns



Structural annotation
Identification of genomic elements
• Open reading frames and their localization
• Coding regions
• Location of regulatory motifs
• Start/Stop
• Splice sites
• Non-coding regions/RNAs
• Introns

Methods
• Similarity
  – Similarity between sequences, which does not necessarily imply any evolutionary linkage
• Ab initio prediction
  – Prediction of gene structure from first principles using only the genome sequence

Genefinding
• ab initio
• similarity

ab initio prediction
Signals read from the genome sequence:
• Coding potential
• ATG & Stop codons
• Splice sites

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
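As a toy illustration of the "ATG & Stop codons" signal, the sketch below scans the three forward reading frames for open reading frames. It is not how Augustus, Glimmer or the other tools listed work (those use probabilistic models that also score coding potential and splice sites); the function name `find_orfs` and the `min_codons` threshold are assumptions made for the example, and introns and the reverse strand are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Naive forward-strand ORF scan.

    Returns (start, end, frame) tuples, 0-based and end-exclusive, where
    the span runs from the first ATG in a frame through its in-frame stop
    codon. min_codons counts codons before the stop (illustrative value).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    # ATG AAA TTT GGG TAA sits in frame 2 of this short sequence.
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A scan like this over-predicts heavily on real genomes, which is why the tools above add statistical evidence on top of start and stop signals.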


Genefinding - similarity
• Use known coding sequence to define coding regions
  – EST sequences
  – Peptide sequences
• Problem: alignments are fuzzy around splice sites and hard to handle
• Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal

Gene-finding - comparative
• Use two or more genomic sequences to predict genes based on conservation of exon sequences
• Examples: Twinscan and SLAM


Genefinding - non-coding RNA genes
• Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
• tRNAscan: uses an HMM and a covariance model for prediction of tRNA genes
• Rfam: a suite of HMMs trained against a large number of different RNA genes

Gene-finding omissions

Alternative isoforms
• Currently there is no good method for predicting alternative isoforms
• Only created where supporting transcript evidence is present

Pseudogenes
• Each genome project has a fuzzy definition of pseudogenes
• Badly curated/described across the board

Promoters
• Rarely a priority for a genome project
• Some algorithms exist but usually not integrated into an annotation set

Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)

~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial
--singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=tr
--progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea
our_genome.fasta >structural_annotation.gff

Prokaryotes: PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa
-f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt

Structural Annotation: output

• Structural annotation was conducted using AUGUSTUS (version 2.5.5), with magnaporthe_grisea as the species model
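A quick sanity check on a structural annotation is to tally the feature types in the GFF output. The sketch below relies only on the standard nine tab-separated GFF3 columns (seqid, source, type, start, end, score, strand, phase, attributes); the function name `count_gff_features` and the sample records are hypothetical and are not actual AUGUSTUS output.

```python
from collections import Counter


def count_gff_features(lines):
    """Count features per type (column 3) in GFF3-style lines.

    Comment lines starting with '#' and lines without exactly nine
    tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, made up for illustration only.
    sample = [
        "##gff-version 3",
        "contig1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1",
        "contig1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
        "contig1\tAUGUSTUS\tCDS\t600\t900\t0.95\t+\t2\tID=g1.t1.cds;Parent=g1.t1",
    ]
    print(count_gff_features(sample))
```

For a real file, the same function can be fed an open file handle, for example `count_gff_features(open("structural_annotation.gff"))`.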


Genome
→ Transcription → Primary transcript
→ RNA processing → Processed mRNA (m7G cap, ATG ... STOP, AAAn tail)
→ Translation → Polypeptide
→ Protein folding → Folded protein
→ Find function → Functional activity (enzyme activity: A → B)

Attaching biological information to genomic elements
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression

• Use the known structural annotation to obtain the predicted protein sequence
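Going from a predicted coding sequence to a protein sequence is a table lookup. The sketch below builds the standard genetic code and translates a CDS until the first stop codon; the function name `translate` and the use of "X" for unrecognised codons are choices made for this example, and organisms with alternative genetic codes would need a different table.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence into protein, stopping at the first stop codon.

    Codons containing characters other than A, C, G, T become 'X';
    a trailing partial codon is ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAAATTTGGGTAA"))
```

The resulting protein string is what would be searched against protein databases in the homology-based step.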


Functional annotation – Homology Based
• Predicted exons/CDS/ORFs are searched against the non-redundant protein databases (NCBI, SwissProt) for similarities
• Visually assess the top 5-10 hits to identify whether these have been assigned a function
• Functions are assigned

Functional annotation - Other features
• Other features which can be determined
  – Signal peptides
  – Transmembrane domains
  – Low complexity regions
  – Various binding sites, glycosylation sites etc.
  – Protein domains
  – Secretome
• See http://expasy.org/tools/ for a good list of possible prediction algorithms

Functional annotation - Other features (Ontologies)
• Use of ontologies to annotate gene products
  – Gene Ontology (GO)
    – Cellular component
    – Molecular function
    – Biological process

Practical - FUNCTIONAL ANNOTATION

Homology Based Method
• Set up a BLAST database for nucleotide/protein
• BLAST the genome.fasta for annotations (nucleotide/protein)
• Filter BLAST hits by E-value (cutoff 0.01; keep hits with E-value <= 0.01) for nucleotide/protein
• Assign functions
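The filtering and assignment steps can be sketched as follows, assuming the BLAST results were saved in the standard 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `best_hits`, the rule of keeping the highest bit score per query, and the sample rows are assumptions for illustration, not the author's pipeline.

```python
def best_hits(blast_lines, max_evalue=0.01):
    """Keep the best-scoring hit per query among hits with E-value <= max_evalue.

    Expects 12-column tab-separated BLAST tabular lines. Returns a dict
    mapping query id to (subject id, evalue, bitscore).
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # not significant at the chosen cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical hits, made up for illustration only.
    rows = [
        "g1\tsp|P1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t400",
        "g1\tsp|P2\t60.0\t180\t70\t2\t1\t180\t5\t184\t1e-10\t120",
        "g2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    ]
    print(best_hits(rows))
```

Queries with no hit under the cutoff (g2 in the sample) are simply absent from the result and would remain unannotated or be labelled as hypothetical proteins.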


Functional annotation- output

August 2008

Bioinformatics tools for Comparative Genomics
of Vectors


Conclusion
• Annotation accuracy depends on the supporting data available at the time of annotation; updating the information is necessary
• Gene predictions will change over time as new data become available (NCBI) that are more similar than previous ones
• Functional assignments will change over time as new data become available (characterization of hypothetical proteins)


Genome Viewer
The files that can be visualised:
• Annotation files
• Indel files
• Consensus sequence

Comparative Genomics
