SlideShare uma empresa Scribd logo
1 de 28
Structural Annotation
      Overview

           Sucheta Tripathy
           Virginia Bioinformatics Institute,
           Blacksburg
           6th July 2012

07/07/12
Gene Finding -- outline
              Patterns
              Frame Consistency
              Dicodon frequencies
              PSSMs
              Coding Potential and Fickett’s statistics
              Fusion of Information
              Sensitivity and Specificity
              Prediction programs
              Known problems



07/07/12
Genes are all about Patterns – real life example




07/07/12
Gene Prediction Methods

Common sets of rules
 Homology

 Ab initio methods
          Compositional information
          Signal information




07/07/12
Pattern Recognition in Gene Finding
     atgttggacagactggacgcaagacgtgtggatgaactcgttttggagctg
      aacaaggctctatacgtacttaatcaagcggggcgtttgtggagcgagt
      tacttcacaaaaagctagccaatttgggttcaatgcagtgcctgaccgaca
      tgggtatgtattagtaacgtttggaagaagaaactgttgtggttggtgt
      ttatgcagacaatctacaggtgactgcaacgaattcaactctcgtggaca
      gttttttcgttgatttacaggacctctcggtaaaggactatgaagaggtg
      acaaaattcttggggatgcgcatttcttatgcgcctgaaaatgggtatgatta
      tatatcgagaagtgacaacccgggaaatgataaaggataa


    atggagaggatgctggagacggtcaagacgaccatcacccctgcgcaggcaa
    tgaagctgtttactgcacccaaagaacctcaagcgaacctggcccgag
    cacttcatgtacttggtggccatctcggaggcctgcggtggtacttagtcctgaata
    acgtcgtgccgtacgcgtccgcggatctacgaacggtcctgat
    agccaaagtggacggcacgcgtgtcgactacctacagcaagctgaggaactg
    gcgcatttcgcgcaatcctgggagcttgaagcgcgcacgaagaacatt




    We need to study the basic structures of genes first ….!
07/07/12
Gene Structure – Common sets of rules
   Generally true: all long (> 300 bp) orfs in prokaryotic genomes
    encode genes

But this may not necessarily be true for eukaryotic genomes
        ATAGATAAGGTGTAGAGATAGAGAGATAGAGAGGAATAGGACAGG
        T
   Eukaryotic introns begin with GT and end with AG(donor and
    acceptor sites)
          CT(A/G)A(C/T) 20-50 bases upstream of acceptor site.




07/07/12
Gene Structure
              Each coding region (exon or whole gene) has a fixed
               translation frame
              A coding region always sits inside an ORF of same reading
               frame
              All exons of a gene are on the same strand.
              Neighboring exons of a gene could have different reading
               frames .
              Exons need to be Frame consistent!

           GATGGGACGACAGATAAGGTGATAGATGGTAGGCAGCAG
            0  3  6 9 12                   01




07/07/12
Gene Structure – reading frame consistency
               Neighboring exons of a gene should be frame-consistent
           Frame 0                         Frame 2
            GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGT
            C1             16                33
               GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGT
               C1                     16                             40
               Frame 0
                                                                      Frame 2
                 Exon1 (1,16) -> Frame = a = 0 ; i = 1 and j = 16
                 Case1: Exon2 (33,100): Frame = b = 2; m = 33 and n = 100
                 Case2: Exon2 (40,100): Frame = b = 2; m = 40 and n =100

                 exon1 (i, j) in frame a and exon2 (m, n) in frame b are consistent if

                                 b = (m - j - 1 + a) mod 3

07/07/12                      …31,34,37,40,43,46,49,52…
Codon Frequencies
              Coding sequences are translated into protein sequences
              We found the following – the dimer frequency in protein sequences
               is NOT evenly distributed

              Organism specific!!!!!!!!!!!

                     The average frequency is ¼%
                     (1/20 * 1/20 = 1/400 = ¼%)
                     Some amino acids prefer to be
                     next to each other
                     Some other amino acids prefer to
                     be not next to each other




                                                         H arabidopsidis
07/07/12
ALA ARG ASN AS CYS GLU GLN GLY HIS                       ILE   LEU LYS MET PHE PRO SER THR TRP TYR VAL
                P
A   .99 .5  .27 .45 .13 .52 .34 .51 .19                      .38   .83   .46   .2    .31   .37   .73   .56   .09   .18   .65

R   .5      .5      .2    .3   .1    .4    .3    .4    .18   .25   .63   .37   .14   .22   .26   .54   .34   .08   .17   .46

N   .31     .19     .11   .2   .05   .23   .23   .26   .07   .13   .27   .16   .07   .11   .15   .24   .16   .04   .08   .27


D   .54     .32     .17   .47 .08    .51   .19   .42   .13   .25   .48   .26   .12   .20   .24   .40   .29   .06   .14   .5


C   .14     .11     .05   .09 .04    .09   .06   .13   .04   .08   .14   .08   .03   .06   .07   .14   .10   .02   .04   .13

E   .57     .43     .22   .42 .09    .59   .30   .33   .16   .28   .64   .40   .17   .20   .21   .44   .36   .06   .16   .44

Q   .34     .31     .11   .20 .06    .27   .29   .20   .12   .15   .45   .21   .10   .13   .17   .29   .22   .05   .10   .29

G   .50     .39     .22   .37 .11    .37   .21   .50   .16   .28   .50   .33   .14   .23   .21   .54   .35   .07   .17   .46

H   .21     .17     .07   .14 .04    .16   .10   .17   .08   .09   .22   .10   .05   .09   .12   .17   .11   .03   .06   .21

I   .37     .25     .13   .27 .08    .27   .15   .27   .09   .15   .34   .22   .08   .14   .18   .29   .21   .04   .11   .32

L   .79     .65     .30   .53 .16    .62   .45   .50   .25   .31   .97   .47   .19   .32   .44   .71   .49   .10   .22   .67

K   .43     .41     .19   .26 .08    .35   .24   .26   .14   .20   .49   .41   .13   .17   .20   .37   .32   .07   .15   .33

M   .23     .17     .09   .17 .04    .19   .12   .14   .06   .10   .25   .15   .07   .08   .11   .20   .17   .03   .06   .17

F   .30     .20     .11   .22 .07    .22   .13   .26   .08   .14   .33   .15   .08   .14   .14   .27   .18   .04   .10   .28

P   .35     .28     .13   .23 .05    .27   .16   .24   .10   .17   .39   .19   .10   .15   .26   .42   .29   .04   .09   .33

S   .68     .51     .26   .46 .14    .43   .26   .55   .17   .33   .68   .38   .18   .28   .36   .93   .55   .09   .17   .56

T   .51     .36     .19   .29 .10    .31   .21   .37   .12   .25   .52   .32   .13   .21   .31   .54   .42   .07   .13   .39

W   0.07 .08        .05   .06 .02    .06   .05   .06   .03   .06   .12   .07   .03   .04   .04   .09   .07   .02   .03   .07

Y   .21     .16     .09   .15 .05    .16   .09   .17   .06   .10   .23   .11   .06   .10   .1    .17   .13   .03   .07   .2
         07/07/12
V   .69     .42     .24   .46 .13    .48   .31   .42   .18   .29   .72   .37   .16   .27   .33   .55   .42   .07   .18   .61
07/07/12
Dicodon Frequencies
          Believe it or not – the biased (uneven) dimer frequencies are the
           foundation of many gene finding programs!

          Basic idea – if a dimer has lower than average dimer frequency;
           this means that proteins prefer not to have such dimers in its
           sequence;


           Hence if we see a dicodon encoding this dimer, we
           may want to bet against this dicodon being in a
           coding region!


07/07/12
Dicodon Frequencies - Examples

              Relative frequencies of a di-codon in coding versus non-
               coding
                  frequency of dicodon X (e.g, AAAAAA) in coding region, total number of occurrences
                   of X divided by total number of dicocon occurrences
                  frequency of dicodon X (e.g, AAAAAA) in noncoding region, total number of
                   occurrences of X divided by total number of dicodon occurrences




               In human genome, frequency of dicodon “AAA AAA” is
               ~1% in coding region versus ~5% in non-coding region

               Question: if you see a region with many “AAA AAA”,
               would you guess it is a coding or non-coding region?

07/07/12
Basic idea of gene finding
              Most dicodons show bias towards either coding or non-
               coding regions; only fraction of dicodons is neutral
              Foundation for coding region identification


                Regions consisting of dicodons that mostly tend
                 to be in coding regions are probably coding
                    regions; otherwise non-coding regions


              Dicodon frequencies are key signal used for coding region
               detection; all gene finding programs use this information


07/07/12
Prediction of Translation Starts Using PSSM
   Certain nucleotides prefer to be in certain position around start
    “ATG” and other nucleotides prefer not to be there
        TCTAGAAGATGGCAGTGGCGAAGA
                                                          A 0,0,0,100 ,0,    75,100, 75 ATG
        TCTAGAAAATGACAGTGGCGAAGA
                                                          T 100,0,100,0,0,   25, 0, 0 ATG
        TCTAGAAAATGGCAGTAGCGAAGA
                                                          G 0, 0, 0, 0, 75    ,0, 0, 25 ATG
        TCTACT A AATGATAGTAGCGAAGA
                                               ATG        C 0,100,0,0, 25     ,0, 0, 0 ATG


    A
    C
    G
    T
                         -4     -3   -2   -1         +3   +4        +5           +6

                        CACC ATG GC
                        TCGA ATG TT
   The “biased” nucleotide distribution is information! It is a basis for
    translation start prediction

  Question:      which one is more probable to be a translation start?
07/07/12
Prediction of Translation Starts
              Mathematical model: Fi (X): frequency of X (A, C, G, T) in
               position i
              Score a string by Σ log (Fi (X)/0.25)

               CACC ATG GC                              TCGA ATG TT
           log (58/25) + log (49/25) + log (40/25) +   log (6/25) + log (6/25) + log (15/25) +
           log (50/25) + log (43/25) + log (39/25) =   log (15/25) + log (13/25) + log (14/25) =

           0.37 + 0.29 + 0.20 + 0.30 + 0.24 + 0.29     -(0.62 + 0.62 + 0.22 + 0.22 + 0.28 +
                                                       0.25)
           = 1.69
                                                       = -2.54

                                 The model captures our intuition!

      A
      C
      G
      T
07/07/12
Prediction of Exons
              For each orf, find all donor and acceptor candidates by finding GT
               and YAG motifs




                   CGT      GT      CAG CAG          CAG
              Score each donor and acceptor candidate using our position-
               specific weight matrix models

              Find all pairs of (acceptor, donor) above some thresholds

              Score the coding potential of the segment [donor, acceptor], using
               the hexmer model



07/07/12
Prediction of Exons
              For each segment [acceptor, donor], we get three scores (coding
               potential, donor score, acceptor score)

              Various possibilities
                  all three scores are high – probably true exon
                  all three scores are low – probably not a real exon

                  all in the middle -- ??
                  some scores are high and some are low -- ??

              What are the rules for exon prediction?




07/07/12
Prediction of Exons
              A “classifier” can be trained to separate exons from non-exons,
               based on the three scores
              Closer to reality – other factors could also help to distinguish exons
               from non-exons


                                                                        exon length
                                                                        distribution

                          150 bp



                               coding density is different in regions
                                   with different G+C contents


 50% G+C

                                   A practical gene finding software may use many
07/07/12
                                   features to distinguish exons from non-exons
Codon preference

   Uneven usage of amino acids in the existing proteins
   Uneven usage of synonymous codons.

    S= AGGACG, when read in frame 1, it results in the sequence




07/07/12
Gene Prediction in a New Genome
              One way is to identify a set of genes in the new genome through
               homology search against known genes in GenBank
                  BLAST, FASTA, Smith-Waterman

              So we can get some “coding” regions for training a gene finder

              How about non-coding regions?


                    We can use the intergenic or inter-
                    exon regions if we can identify any
              With these known genes and non-genes, we can train a gene finder
               just like before …..

07/07/12
Evaluation of Gene prediction
                       TP     FP            TN          FN       TP
Real


Predicted


          Sensitivity = No. of Correct exons/No. of actual exons(Measurement of
           False negative) -> How many are discarded by mistake
                              Sn = TP/TP+FN
          Specificity = No. of Correct exons/No. of predicted exons(Measurement of
           False positive) -> How many are included by mistake
                               Sp=TP/TP+FP

          CC = Metric for combining both

                    (TP*TN) – (FN*FP)/sqrt( (TP+FN)*(TN+FP)*(TP+FP)*(TN+FN) )


    07/07/12
Gene Prediction in a New Genome
              We have assumed that some genes and non-genes are known
               before starting training a “gene finder” for that genome!



              What if we want to develop a gene finder for a new genome that
               has just been sequenced?




07/07/12
Challenges of Gene finder


   Alternative splicing
   Nested/overlapping genes
   Extremely long/short genes
   Extremely long introns
   Non-canonical introns
   Split start codons
   UTR introns
   Non-ATG triplet as the start codon
   Polycistronic genes
   Repeats/transposons




07/07/12
Known Gene Finders

              GeneScan
              GeneMarkHMM
              Fgenesh
              GlimmerHMM
              GeneZilla
              SNAP
              PHAT
              AUGUSTUS
              Genie




07/07/12
Problem: Find Exon-Intron boundaries and start, stop codon for this
     sequence.




atgttggacagactggacgcaagacgtgtggatgaactcgttttggagctgaacaaggctctatacgtacttaa
tcaagcggggcgtttgtggagcgagt
tacttcacaaaaagctagccaatttgggttcaatgcagtgcctgaccgacatgggtatgtattagtaacgtttgg
aagaagaaactgttgtggttggtgt
ttatgcagacaatctacaggtgactgcaacgaattcaactctcgtggacagttttttcgttgatttacaggacctct
cggtaaaggactatgaagaggtg
acaaaattcttggggatgcgcatttcttatgcgcctgaaaatgggtatgattatatatcgagaagtgacaacccg
ggaaatgataaaggataa




07/07/12
Hints:




      1.   Build PSSMs for start, donor, acceptor sites.
      2.   Mark all donor, acceptor, start, stop sites.
      3.   Eliminate unlikely donor, acceptor sites.
      4.   Score each site from PSSM.
      5.   Check frame compatibility.
      6.   Run a Blastx to nr database.
      7.   Translate and check if peptide sequence follows the dipeptide
           frequency norm.


       Web Resources: http://vmd.vbi.vt.edu
                      http://augustus.gobics.de/submission
                      http://blast.ncbi.nlm.nih.gov/Blast.cgi

07/07/12
07/07/12

Mais conteúdo relacionado

Destaque

Destaque (13)

Bioinformatic
BioinformaticBioinformatic
Bioinformatic
 
Genome analysis2
Genome analysis2Genome analysis2
Genome analysis2
 
Rna secondary structure prediction
Rna secondary structure predictionRna secondary structure prediction
Rna secondary structure prediction
 
Homeobox genes
Homeobox genesHomeobox genes
Homeobox genes
 
08 Kjm206 Expression Vector, Plasmid Vector
08 Kjm206 Expression Vector, Plasmid Vector08 Kjm206 Expression Vector, Plasmid Vector
08 Kjm206 Expression Vector, Plasmid Vector
 
Experession vectores ppt bijan zare
Experession vectores ppt bijan zareExperession vectores ppt bijan zare
Experession vectores ppt bijan zare
 
Expression vectors
Expression vectorsExpression vectors
Expression vectors
 
Vectors
VectorsVectors
Vectors
 
Expression vectors
Expression vectorsExpression vectors
Expression vectors
 
Vectors
Vectors Vectors
Vectors
 
Cloning Vector
Cloning Vector Cloning Vector
Cloning Vector
 
Molecular Cloning - Vectors: Types & Characteristics
Molecular Cloning -  Vectors: Types & CharacteristicsMolecular Cloning -  Vectors: Types & Characteristics
Molecular Cloning - Vectors: Types & Characteristics
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Semelhante a Structural Annotation Methods Overview

Primer designgeneprediction
Primer designgenepredictionPrimer designgeneprediction
Primer designgenepredictionSucheta Tripathy
 
Bioc4010 lectures 1 and 2
Bioc4010 lectures 1 and 2Bioc4010 lectures 1 and 2
Bioc4010 lectures 1 and 2Dan Gaston
 
9789740329152
97897403291529789740329152
9789740329152CUPress
 
Plz I need your Help With these Question on page 1 2 3 As.pdf
Plz I need your Help With these Question on page 1 2 3  As.pdfPlz I need your Help With these Question on page 1 2 3  As.pdf
Plz I need your Help With these Question on page 1 2 3 As.pdfshreedattaagenciees2
 
生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLAKeiichiro Ono
 
PRISM Semantic Integration Approach
PRISM Semantic Integration ApproachPRISM Semantic Integration Approach
PRISM Semantic Integration Approachimgcommcall
 
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...Amro Elfeki
 
ORS poster 2017
ORS poster 2017 ORS poster 2017
ORS poster 2017 Jess Rames
 
Automatic Drusen Classification
Automatic Drusen ClassificationAutomatic Drusen Classification
Automatic Drusen ClassificationDarshan Shah
 
Southern, Western & Northern Bloting.pdf
Southern, Western & Northern Bloting.pdfSouthern, Western & Northern Bloting.pdf
Southern, Western & Northern Bloting.pdfRajendraChavhan3
 
Distribution and virulence spectrum of wheat stem rush in Tigray, Ethiopia
Distribution and virulence spectrum of wheat stem rush in Tigray, EthiopiaDistribution and virulence spectrum of wheat stem rush in Tigray, Ethiopia
Distribution and virulence spectrum of wheat stem rush in Tigray, EthiopiaCIMMYT
 

Semelhante a Structural Annotation Methods Overview (18)

Primer designgeneprediction
Primer designgenepredictionPrimer designgeneprediction
Primer designgeneprediction
 
10th nov2010
10th nov201010th nov2010
10th nov2010
 
Bioc4010 lectures 1 and 2
Bioc4010 lectures 1 and 2Bioc4010 lectures 1 and 2
Bioc4010 lectures 1 and 2
 
9789740329152
97897403291529789740329152
9789740329152
 
DNA Microarray
DNA MicroarrayDNA Microarray
DNA Microarray
 
Plz I need your Help With these Question on page 1 2 3 As.pdf
Plz I need your Help With these Question on page 1 2 3  As.pdfPlz I need your Help With these Question on page 1 2 3  As.pdf
Plz I need your Help With these Question on page 1 2 3 As.pdf
 
生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA
 
PRISM Semantic Integration Approach
PRISM Semantic Integration ApproachPRISM Semantic Integration Approach
PRISM Semantic Integration Approach
 
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...
Analysis of Annual Rainfall Climate Variability in Saudi Arabia by Using Spec...
 
Southern Blotting.pdf
Southern Blotting.pdfSouthern Blotting.pdf
Southern Blotting.pdf
 
Microarray
MicroarrayMicroarray
Microarray
 
ORS poster 2017
ORS poster 2017 ORS poster 2017
ORS poster 2017
 
Dna sequencing
Dna sequencingDna sequencing
Dna sequencing
 
Development of marker-assisted selection (MAS) technology in crop improvement...
Development of marker-assisted selection (MAS) technology in crop improvement...Development of marker-assisted selection (MAS) technology in crop improvement...
Development of marker-assisted selection (MAS) technology in crop improvement...
 
Automatic Drusen Classification
Automatic Drusen ClassificationAutomatic Drusen Classification
Automatic Drusen Classification
 
Mohammed ayaad
Mohammed ayaadMohammed ayaad
Mohammed ayaad
 
Southern, Western & Northern Bloting.pdf
Southern, Western & Northern Bloting.pdfSouthern, Western & Northern Bloting.pdf
Southern, Western & Northern Bloting.pdf
 
Distribution and virulence spectrum of wheat stem rush in Tigray, Ethiopia
Distribution and virulence spectrum of wheat stem rush in Tigray, EthiopiaDistribution and virulence spectrum of wheat stem rush in Tigray, Ethiopia
Distribution and virulence spectrum of wheat stem rush in Tigray, Ethiopia
 

Mais de Sucheta Tripathy (20)

Gal
GalGal
Gal
 
Ramorum2016 final
Ramorum2016 finalRamorum2016 final
Ramorum2016 final
 
Motif andpatterndatabase
Motif andpatterndatabaseMotif andpatterndatabase
Motif andpatterndatabase
 
Databases ii
Databases iiDatabases ii
Databases ii
 
Snps and microarray
Snps and microarraySnps and microarray
Snps and microarray
 
Stat2013
Stat2013Stat2013
Stat2013
 
26 nov2013seminar
26 nov2013seminar26 nov2013seminar
26 nov2013seminar
 
Stat2013
Stat2013Stat2013
Stat2013
 
Presentation2013
Presentation2013Presentation2013
Presentation2013
 
Lecture7,8
Lecture7,8Lecture7,8
Lecture7,8
 
Lecture5,6
Lecture5,6Lecture5,6
Lecture5,6
 
Lecture 3,4
Lecture 3,4Lecture 3,4
Lecture 3,4
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
Sequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSASequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSA
 
Databases Part II
Databases Part IIDatabases Part II
Databases Part II
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Genome sequencingprojects
Genome sequencingprojectsGenome sequencingprojects
Genome sequencingprojects
 
Human encodeproject
Human encodeprojectHuman encodeproject
Human encodeproject
 
Tyler presentation
Tyler presentationTyler presentation
Tyler presentation
 
Tyler presentation
Tyler presentationTyler presentation
Tyler presentation
 

Último

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

Último (20)

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

Structural Annotation Methods Overview

  • 1. Structural Annotation Overview Sucheta Tripathy Virginia Bioinformatics Institute, Blacksburg 6th July 2012 07/07/12
  • 2. Gene Finding -- outline  Patterns  Frame Consistency  Dicodon frequencies  PSSMs  Coding Potential and Fickett’s statistics  Fusion of Information  Sensitivity and Specificity  Prediction programs  Known problems 07/07/12
  • 3. Genes are all about Patterns – real life example 07/07/12
  • 4. Gene Prediction Methods Common sets of rules  Homology  Ab initio methods  Compositional information  Signal information 07/07/12
  • 5. Pattern Recognition in Gene Finding  atgttggacagactggacgcaagacgtgtggatgaactcgttttggagctg aacaaggctctatacgtacttaatcaagcggggcgtttgtggagcgagt tacttcacaaaaagctagccaatttgggttcaatgcagtgcctgaccgaca tgggtatgtattagtaacgtttggaagaagaaactgttgtggttggtgt ttatgcagacaatctacaggtgactgcaacgaattcaactctcgtggaca gttttttcgttgatttacaggacctctcggtaaaggactatgaagaggtg acaaaattcttggggatgcgcatttcttatgcgcctgaaaatgggtatgatta tatatcgagaagtgacaacccgggaaatgataaaggataa atggagaggatgctggagacggtcaagacgaccatcacccctgcgcaggcaa tgaagctgtttactgcacccaaagaacctcaagcgaacctggcccgag cacttcatgtacttggtggccatctcggaggcctgcggtggtacttagtcctgaata acgtcgtgccgtacgcgtccgcggatctacgaacggtcctgat agccaaagtggacggcacgcgtgtcgactacctacagcaagctgaggaactg gcgcatttcgcgcaatcctgggagcttgaagcgcgcacgaagaacatt We need to study the basic structures of genes first ….! 07/07/12
  • 6. Gene Structure – Common sets of rules  Generally true: all long (> 300 bp) orfs in prokaryotic genomes encode genes But this may not necessarily be true for eukaryotic genomes ATAGATAAGGTGTAGAGATAGAGAGATAGAGAGGAATAGGACAGG T  Eukaryotic introns begin with GT and end with AG(donor and acceptor sites)  CT(A/G)A(C/T) 20-50 bases upstream of acceptor site. 07/07/12
  • 7. Gene Structure  Each coding region (exon or whole gene) has a fixed translation frame  A coding region always sits inside an ORF of same reading frame  All exons of a gene are on the same strand.  Neighboring exons of a gene could have different reading frames .  Exons need to be Frame consistent! GATGGGACGACAGATAAGGTGATAGATGGTAGGCAGCAG 0 3 6 9 12 01 07/07/12
  • 8. Gene Structure – reading frame consistency  Neighboring exons of a gene should be frame-consistent Frame 0 Frame 2 GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGT C1 16 33 GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGT C1 16 40 Frame 0 Frame 2 Exon1 (1,16) -> Frame = a = 0 ; i = 1 and j = 16 Case1: Exon2 (33,100): Frame = b = 2; m = 33 and n = 100 Case2: Exon2 (40,100): Frame = b = 2; m = 40 and n =100 exon1 (i, j) in frame a and exon2 (m, n) in frame b are consistent if b = (m - j - 1 + a) mod 3 07/07/12 …31,34,37,40,43,46,49,52…
  • 9. Codon Frequencies  Coding sequences are translated into protein sequences  We found the following – the dimer frequency in protein sequences is NOT evenly distributed  Organism specific!!!!!!!!!!! The average frequency is ¼% (1/20 * 1/20 = 1/400 = ¼%) Some amino acids prefer to be next to each other Some other amino acids prefer to be not next to each other H arabidopsidis 07/07/12
  • 10. ALA ARG ASN AS CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL P A .99 .5 .27 .45 .13 .52 .34 .51 .19 .38 .83 .46 .2 .31 .37 .73 .56 .09 .18 .65 R .5 .5 .2 .3 .1 .4 .3 .4 .18 .25 .63 .37 .14 .22 .26 .54 .34 .08 .17 .46 N .31 .19 .11 .2 .05 .23 .23 .26 .07 .13 .27 .16 .07 .11 .15 .24 .16 .04 .08 .27 D .54 .32 .17 .47 .08 .51 .19 .42 .13 .25 .48 .26 .12 .20 .24 .40 .29 .06 .14 .5 C .14 .11 .05 .09 .04 .09 .06 .13 .04 .08 .14 .08 .03 .06 .07 .14 .10 .02 .04 .13 E .57 .43 .22 .42 .09 .59 .30 .33 .16 .28 .64 .40 .17 .20 .21 .44 .36 .06 .16 .44 Q .34 .31 .11 .20 .06 .27 .29 .20 .12 .15 .45 .21 .10 .13 .17 .29 .22 .05 .10 .29 G .50 .39 .22 .37 .11 .37 .21 .50 .16 .28 .50 .33 .14 .23 .21 .54 .35 .07 .17 .46 H .21 .17 .07 .14 .04 .16 .10 .17 .08 .09 .22 .10 .05 .09 .12 .17 .11 .03 .06 .21 I .37 .25 .13 .27 .08 .27 .15 .27 .09 .15 .34 .22 .08 .14 .18 .29 .21 .04 .11 .32 L .79 .65 .30 .53 .16 .62 .45 .50 .25 .31 .97 .47 .19 .32 .44 .71 .49 .10 .22 .67 K .43 .41 .19 .26 .08 .35 .24 .26 .14 .20 .49 .41 .13 .17 .20 .37 .32 .07 .15 .33 M .23 .17 .09 .17 .04 .19 .12 .14 .06 .10 .25 .15 .07 .08 .11 .20 .17 .03 .06 .17 F .30 .20 .11 .22 .07 .22 .13 .26 .08 .14 .33 .15 .08 .14 .14 .27 .18 .04 .10 .28 P .35 .28 .13 .23 .05 .27 .16 .24 .10 .17 .39 .19 .10 .15 .26 .42 .29 .04 .09 .33 S .68 .51 .26 .46 .14 .43 .26 .55 .17 .33 .68 .38 .18 .28 .36 .93 .55 .09 .17 .56 T .51 .36 .19 .29 .10 .31 .21 .37 .12 .25 .52 .32 .13 .21 .31 .54 .42 .07 .13 .39 W 0.07 .08 .05 .06 .02 .06 .05 .06 .03 .06 .12 .07 .03 .04 .04 .09 .07 .02 .03 .07 Y .21 .16 .09 .15 .05 .16 .09 .17 .06 .10 .23 .11 .06 .10 .1 .17 .13 .03 .07 .2 07/07/12 V .69 .42 .24 .46 .13 .48 .31 .42 .18 .29 .72 .37 .16 .27 .33 .55 .42 .07 .18 .61
  • 12. Dicodon Frequencies  Believe it or not – the biased (uneven) dimer frequencies are the foundation of many gene finding programs!  Basic idea – if a dimer has lower than average dimer frequency; this means that proteins prefer not to have such dimers in its sequence; Hence if we see a dicodon encoding this dimer, we may want to bet against this dicodon being in a coding region! 07/07/12
  • 13. Dicodon Frequencies - Examples  Relative frequencies of a di-codon in coding versus non- coding  frequency of dicodon X (e.g, AAAAAA) in coding region, total number of occurrences of X divided by total number of dicocon occurrences  frequency of dicodon X (e.g, AAAAAA) in noncoding region, total number of occurrences of X divided by total number of dicodon occurrences In human genome, frequency of dicodon “AAA AAA” is ~1% in coding region versus ~5% in non-coding region Question: if you see a region with many “AAA AAA”, would you guess it is a coding or non-coding region? 07/07/12
  • 14. Basic idea of gene finding  Most dicodons show bias towards either coding or non- coding regions; only fraction of dicodons is neutral  Foundation for coding region identification Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise non-coding regions  Dicodon frequencies are key signal used for coding region detection; all gene finding programs use this information 07/07/12
  • 15. Prediction of Translation Starts Using PSSM  Certain nucleotides prefer to be in certain position around start “ATG” and other nucleotides prefer not to be there TCTAGAAGATGGCAGTGGCGAAGA A 0,0,0,100 ,0, 75,100, 75 ATG TCTAGAAAATGACAGTGGCGAAGA T 100,0,100,0,0, 25, 0, 0 ATG TCTAGAAAATGGCAGTAGCGAAGA G 0, 0, 0, 0, 75 ,0, 0, 25 ATG TCTACT A AATGATAGTAGCGAAGA ATG C 0,100,0,0, 25 ,0, 0, 0 ATG A C G T -4 -3 -2 -1 +3 +4 +5 +6 CACC ATG GC TCGA ATG TT  The “biased” nucleotide distribution is information! It is a basis for translation start prediction  Question: which one is more probable to be a translation start? 07/07/12
  • 16. Prediction of Translation Starts  Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i  Score a string by Σ log (Fi (X)/0.25) CACC ATG GC TCGA ATG TT log (58/25) + log (49/25) + log (40/25) + log (6/25) + log (6/25) + log (15/25) + log (50/25) + log (43/25) + log (39/25) = log (15/25) + log (13/25) + log (14/25) = 0.37 + 0.29 + 0.20 + 0.30 + 0.24 + 0.29 -(0.62 + 0.62 + 0.22 + 0.22 + 0.28 + 0.25) = 1.69 = -2.54 The model captures our intuition! A C G T 07/07/12
  • 17. Prediction of Exons  For each orf, find all donor and acceptor candidates by finding GT and YAG motifs CGT GT CAG CAG CAG  Score each donor and acceptor candidate using our position- specific weight matrix models  Find all pairs of (acceptor, donor) above some thresholds  Score the coding potential of the segment [donor, acceptor], using the hexmer model 07/07/12
  • 18. Prediction of Exons  For each segment [acceptor, donor], we get three scores (coding potential, donor score, acceptor score)  Various possibilities  all three scores are high – probably true exon  all three scores are low – probably not a real exon  all in the middle -- ??  some scores are high and some are low -- ??  What are the rules for exon prediction? 07/07/12
  • 19. Prediction of Exons  A “classifier” can be trained to separate exons from non-exons, based on the three scores  Closer to reality – other factors could also help to distinguish exons from non-exons exon length distribution 150 bp coding density is different in regions with different G+C contents 50% G+C A practical gene finding software may use many 07/07/12 features to distinguish exons from non-exons
  • 20. Codon preference  Uneven usage of amino acids in the existing proteins  Uneven usage of synonymous codons.  S= AGGACG, when read in frame 1, it results in the sequence 07/07/12
  • 21. Gene Prediction in a New Genome  One way is to identify a set of genes in the new genome through homology search against known genes in GenBank  BLAST, FASTA, Smith-Waterman  So we can get some “coding” regions for training a gene finder  How about non-coding regions? We can use the intergenic or inter- exon regions if we can identify any  With these known genes and non-genes, we can train a gene finder just like before ….. 07/07/12
  • 22. Evaluation of Gene prediction TP FP TN FN TP Real Predicted  Sensitivity = No. of Correct exons/No. of actual exons(Measurement of False negative) -> How many are discarded by mistake Sn = TP/TP+FN  Specificity = No. of Correct exons/No. of predicted exons(Measurement of False positive) -> How many are included by mistake Sp=TP/TP+FP  CC = Metric for combining both (TP*TN) – (FN*FP)/sqrt( (TP+FN)*(TN+FP)*(TP+FP)*(TN+FN) ) 07/07/12
  • 23. Gene Prediction in a New Genome  We have assumed that some genes and non-genes are known before starting training a “gene finder” for that genome!  What if we want to develop a gene finder for a new genome that has just been sequenced? 07/07/12
  • 24. Challenges of Gene finder  Alternative splicing  Nested/overlapping genes  Extremely long/short genes  Extremely long introns  Non-canonical introns  Split start codons  UTR introns  Non-ATG triplet as the start codon  Polycistronic genes  Repeats/transposons 07/07/12
  • 25. Known Gene Finders  GeneScan  GeneMarkHMM  Fgenesh  GlimmerHMM  GeneZilla  SNAP  PHAT  AUGUSTUS  Genie 07/07/12
  • 26. Problem: Find Exon-Intron boundaries and start, stop codon for this sequence. atgttggacagactggacgcaagacgtgtggatgaactcgttttggagctgaacaaggctctatacgtacttaa tcaagcggggcgtttgtggagcgagt tacttcacaaaaagctagccaatttgggttcaatgcagtgcctgaccgacatgggtatgtattagtaacgtttgg aagaagaaactgttgtggttggtgt ttatgcagacaatctacaggtgactgcaacgaattcaactctcgtggacagttttttcgttgatttacaggacctct cggtaaaggactatgaagaggtg acaaaattcttggggatgcgcatttcttatgcgcctgaaaatgggtatgattatatatcgagaagtgacaacccg ggaaatgataaaggataa 07/07/12
  • 27. Hints: 1. Build PSSMs for start, donor, acceptor sites. 2. Mark all donor, acceptor, start, stop sites. 3. Eliminate unlikely donor, acceptor sites. 4. Score each site from PSSM. 5. Check frame compatibility. 6. Run a Blastx to nr database. 7. Translate and check if peptide sequence follows the dipeptide frequency norm. Web Resources: http://vmd.vbi.vt.edu http://augustus.gobics.de/submission http://blast.ncbi.nlm.nih.gov/Blast.cgi 07/07/12