2012
                    DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE




Probabilistic Latent Semantic Analysis
           for prediction of
    Gene Ontology annotations
     Davide Chicco, Pietro Pinoli, Marco Masseroli
              davide.chicco@elet.polimi.it
Summary


1. The problem
    • Biomolecular annotations
    • Prediction of biomolecular annotations
2. The methods
    • SVD – Singular Value Decomposition
    • pLSA – Probabilistic Latent Semantic Analysis
3. Evaluation
    • Evaluation data set
    • Evaluation results
4. Conclusions

       Davide Chicco @ PhDay2012                      2
Biomolecular annotations

• The concept of annotation: association of nucleotide or amino
  acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies,
  sometimes structured as ontologies, where every controlled
  term of the vocabulary is associated with a unique
  alphanumeric code

• The association of such a code with a gene or protein ID
  constitutes an annotation

    Gene / Protein  --- Annotation (gene2bff) ---  Biological function feature

Biomolecular annotations (2)

• The association of an information/feature with a gene or
  protein ID constitutes an annotation



• Annotation example:

  • gene: GD4

  • feature: “is present in the mitochondrial membrane”


Prediction of biomolecular annotations

• Many available annotations in different databanks

• However, available annotations are incomplete

• Only a few of them represent highly reliable, human–curated
  information



• To support and quicken the time–consuming curation process,
  prioritized lists of computationally predicted annotations
  are extremely useful

• These lists could be generated by software that implements
  Machine Learning algorithms


Annotation prediction through
  Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$
   − m rows: genes / proteins
   − n columns: annotation terms
   A(i,j) = 1 if gene / protein i is annotated to term j or to any
      descendant of j in the considered ontology structure (true
      path rule)
   A(i,j) = 0 otherwise (it is unknown)
               term01      term02   term03   term04   …   termN
    gene01         0          0       0        0      …     0
    gene02         0          1       1        0      …     1
       …          …           …       …        …      …    …
     geneM         0          0       0        0      …     0

Singular Value Decomposition – SVD

Compute the SVD:

             $A = U \Sigma V^T$

Compute the reduced rank approximation:

             $A_k = U_k \Sigma_k V_k^T$

  • An annotation prediction is performed by computing a reduced
    rank approximation $A_k$ of the annotation matrix A
    (where 0 < k < r, with r the number of nonzero singular values
    of A, i.e. the rank of A)
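The reduced rank approximation can be sketched with NumPy as follows. This is an illustrative example on a made-up 4 x 4 matrix; the function name `reduced_rank_approximation` and the tolerance used to count nonzero singular values are assumptions, not part of the original slides.

```python
import numpy as np


def reduced_rank_approximation(A, k):
    """Return A_k = U_k Sigma_k V_k^T, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    r = int(np.sum(s > 1e-10))  # number of nonzero singular values (rank of A)
    if not 0 < k < r:
        raise ValueError(f"k must satisfy 0 < k < r, with r = {r}")
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical annotation matrix (rows: genes, columns: terms).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

A_k = reduced_rank_approximation(A, k=2)
print(np.round(A_k, 2))
```

The entries of `A_k` are real valued; entries where A is 0 but `A_k` is relatively large are the candidates for predicted annotations.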




Probabilistic Latent Semantic Analysis - pLSA


pLSA:
   • An alternative to the SVD method
   • Based on Latent Semantic Indexing (LSI)

Latent Semantic Indexing – LSI:
    • Identifies latent relationships between different elements
       in a certain class
        − e.g. between documents and words within them
        − between genes and their biomolecular features
           described by controlled annotation terms

    • Maps class elements to a vector space of reduced
      dimensionality, and then analyzes it

Probabilistic Latent Semantic Analysis - pLSA (2)

Suppose you have:
 • A set of genes G = {g1, …, gm} related to a set of feature
   terms F = {f1, …, fn} which, together, form a set of controlled
   biomolecular annotations


 • A set of class variables T = {t1, …, tK},
  called topics, with every feature
  term f ∈ F that can be associated
  with a topic t ∈ T

The pLSA statistical model associates
 an unobserved class variable
 (topic) with each observation
 (a feature term and gene pair)
Probabilistic Latent Semantic Analysis - pLSA (3)

• P(f | t): probability of a feature term f to be associated with a
  topic t
• P(t | g): probability of getting a topic t by selecting a gene g
• The following conditions hold:

     • $\forall t \in T, \quad \sum_{f \in F} P(f \mid t) = 1$

     • $\forall g \in G, \quad \sum_{t \in T} P(t \mid g) = 1$

• The joint probability between g and f is given by:


                 $P(g, f) = \sum_{t \in T} P(t) \, P(g \mid t) \, P(f \mid t)$
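To make the joint probability formula concrete, here is a small NumPy sketch. The dimensions and the randomly drawn distributions are made up for illustration; only the formula itself comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_terms, n_topics = 4, 5, 2  # hypothetical sizes

# Randomly drawn model parameters (illustrative values only).
P_t = rng.dirichlet(np.ones(n_topics))                          # P(t), shape (K,)
P_g_given_t = rng.dirichlet(np.ones(n_genes), size=n_topics).T  # P(g|t), shape (genes, K)
P_f_given_t = rng.dirichlet(np.ones(n_terms), size=n_topics).T  # P(f|t), shape (terms, K)

# P(g, f) = sum over t of P(t) * P(g|t) * P(f|t)
P_gf = np.einsum("t,gt,ft->gf", P_t, P_g_given_t, P_f_given_t)

# The same sum written as a matrix product.
P_gf_matrix = P_g_given_t @ np.diag(P_t) @ P_f_given_t.T

print(np.round(P_gf, 3))
```

Because each factor is a proper probability distribution, the entries of `P_gf` sum to 1 over all gene and feature term pairs.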


Probabilistic Latent Semantic Analysis - pLSA (4)

Model training
 • Aim: maximum likelihood estimation of P(f|t) by using the
   Expectation Maximization (EM) algorithm on a training set

     $L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$                [1]

Model validation
 • A validation set of genes and feature terms, with the same feature
   terms as the training set but completely different genes

 • Aim: maximize the formula in [1], but by using the P(f|t)
   calculated in the training phase and varying the parameters
   P(t|g) related to the new genes in the validation set
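The validation step can be sketched as the usual pLSA "folding-in" procedure: P(f|t) stays fixed and only P(t|g) of a new gene is re-estimated by EM. The snippet below is an illustrative sketch under that assumption; the function name `fold_in`, the toy P(f|t) matrix and the reading of a(g, f) as the binary annotation vector of the gene are not taken from the slides.

```python
import numpy as np


def fold_in(a_new, P_f_given_t, n_iter=50, eps=1e-12):
    """Estimate P(t|g) for one new gene while keeping P(f|t) fixed.

    a_new:        annotation vector of the new gene, shape (n_terms,)
    P_f_given_t:  trained P(f|t), shape (n_terms, n_topics)
    """
    n_topics = P_f_given_t.shape[1]
    P_t_given_g = np.full(n_topics, 1.0 / n_topics)  # uniform start
    for _ in range(n_iter):
        # E-step: P(t | g, f) is proportional to P(f|t) * P(t|g)
        posterior = P_f_given_t * P_t_given_g[None, :]
        posterior /= posterior.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate P(t|g) from the annotated terms only
        P_t_given_g = (a_new[:, None] * posterior).sum(axis=0)
        P_t_given_g /= P_t_given_g.sum() + eps
    return P_t_given_g


# Hypothetical trained model: topic 0 covers terms 0-1, topic 1 covers terms 2-3.
P_f_given_t = np.array([
    [0.5, 0.0],
    [0.5, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
])
a_new = np.array([1.0, 1.0, 0.0, 0.0])  # new gene annotated to terms 0 and 1

P_t_given_g = fold_in(a_new, P_f_given_t)
print(np.round(P_t_given_g, 3))
```

In this toy case the new gene is annotated only to terms covered by the first topic, so nearly all of its topic probability ends up there.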

Probabilistic Latent Semantic Analysis - pLSA (5)


EM Algorithm:
It seeks a maximum likelihood estimate by iteratively
    applying:
• Expectation step: the a posteriori probabilities of the
    latent variables t are computed as

                         $P(t \mid g, f) \propto P(t) \, P(g, f \mid t)$

• Maximization step: the parameter values are updated
  in order to maximize the log-likelihood.
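The two steps can be sketched in NumPy as below. This is a minimal illustration and not the authors' implementation: it assumes the parametrization P(t), P(g|t), P(f|t) of the joint probability formula, takes P(g, f | t) = P(g|t) P(f|t), uses the annotation matrix entries as the weights a(g, f), and the function name `plsa_em` and the toy matrix are hypothetical.

```python
import numpy as np


def plsa_em(A, n_topics, n_iter=50, seed=0, eps=1e-12):
    """Fit P(t), P(g|t), P(f|t) to the annotation matrix A by EM."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n_genes, n_terms = A.shape
    P_t = np.full(n_topics, 1.0 / n_topics)
    P_g_t = rng.random((n_genes, n_topics))
    P_g_t /= P_g_t.sum(axis=0)
    P_f_t = rng.random((n_terms, n_topics))
    P_f_t /= P_f_t.sum(axis=0)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: P(t | g, f) proportional to P(t) P(g|t) P(f|t)
        joint = P_t[None, None, :] * P_g_t[:, None, :] * P_f_t[None, :, :]
        P_gf = joint.sum(axis=2)
        log_likelihoods.append(float(np.sum(A * np.log(P_gf + eps))))
        posterior = joint / (P_gf[:, :, None] + eps)
        # M-step: re-estimate the parameters from the weighted posteriors
        weighted = A[:, :, None] * posterior
        P_g_t = weighted.sum(axis=1)
        P_f_t = weighted.sum(axis=0)
        P_t = weighted.sum(axis=(0, 1))
        P_g_t /= P_g_t.sum(axis=0) + eps
        P_f_t /= P_f_t.sum(axis=0) + eps
        P_t /= P_t.sum() + eps
    return P_t, P_g_t, P_f_t, log_likelihoods


# Hypothetical annotation matrix (rows: genes, columns: terms).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

P_t, P_g_t, P_f_t, log_likelihoods = plsa_em(A, n_topics=2)
print(log_likelihoods[0], log_likelihoods[-1])
```

Tracking the log-likelihood across iterations is a simple convergence check: it should not decrease from one iteration to the next.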




Probabilistic Latent Semantic Analysis - pLSA (5)

In comparison to SVD:


                   $U_k = [P(g_i \mid t_k)]_{ik}$
                   $\Sigma_k = \mathrm{diag}[P(t_k)]_k$
                   $V_k = [P(f_j \mid t_k)]_{jk}$
             $A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$

Probabilistic Latent Semantic Analysis - pLSA (6)

The pLSA model imposes the constraint:

                      $\forall g \in G, \quad \sum_{f \in F} P(f \mid g) = 1$

• This can bias the prediction because the more annotations a
  gene has, the lower its average conditional probability is
• To avoid such bias we propose a normalized extension of pLSA:
    • $\forall g \in G$:
        i.    Compute: $M = \max_{f \in F} P(f \mid g)$

        ii.   Compute the normalized P(f | g) vector as:

                     $P(f \mid g)_{norm} = \frac{1}{M} P(f \mid g)$

• Thus, the feature terms with the highest conditional probability
  for a gene are always predicted to be annotated to that gene
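The normalization step amounts to dividing each gene's row of conditional probabilities by its maximum. A minimal NumPy sketch follows; the function name `normalize_conditionals` and the example values are made up for illustration.

```python
import numpy as np


def normalize_conditionals(P_f_given_g):
    """Divide each row P(. | g) by its maximum M, so the top term scores 1.

    Rows are genes, columns are feature terms. Each row sums to 1,
    so its maximum M is strictly positive and the division is safe.
    """
    M = P_f_given_g.max(axis=1, keepdims=True)
    return P_f_given_g / M


# Hypothetical P(f | g) for two genes and three feature terms.
P_f_given_g = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
])

P_norm = normalize_conditionals(P_f_given_g)
print(P_norm)
```

After normalization the best-scoring term of every gene has value 1, regardless of how many annotations the gene has.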
Evaluation of the prediction

    To evaluate the prediction, we compare each A(i,j) element to its
    corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

•      if A(i,j) = 1 & Ak(i,j) > τ:    AC: Annotation Confirmed         (AC ← AC + 1)
•      if A(i,j) = 1 & Ak(i,j) ≤ τ:    AR: Annotation to be Reviewed    (AR ← AR + 1)
•      if A(i,j) = 0 & Ak(i,j) ≤ τ:    NAC: No Annotation Confirmed     (NAC ← NAC + 1)
•      if A(i,j) = 0 & Ak(i,j) > τ:    AP: Annotation Predicted         (AP ← AP + 1)
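The four counters can be computed for a given threshold with a few boolean masks. The sketch below uses a made-up 2 x 2 example; the function name `evaluate` is hypothetical.

```python
import numpy as np


def evaluate(A, A_k, tau):
    """Count AC, AR, NAC and AP for one threshold tau."""
    known = np.asarray(A) == 1      # annotation present in the input matrix
    above = np.asarray(A_k) > tau   # annotation predicted in the output matrix
    return {
        "AC": int(np.sum(known & above)),     # A = 1, A_k > tau
        "AR": int(np.sum(known & ~above)),    # A = 1, A_k <= tau
        "NAC": int(np.sum(~known & ~above)),  # A = 0, A_k <= tau
        "AP": int(np.sum(~known & above)),    # A = 0, A_k > tau
    }


# Hypothetical input matrix and prediction scores.
A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.7],
                [0.2, 0.1]])

counts = evaluate(A, A_k, tau=0.5)
print(counts)
```

The four counters always add up to the number of entries of A, since every entry falls into exactly one category.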



New concept: Receiver Operating Characteristic (ROC) curve
Starting from the annotation prediction evaluation factors we just
introduced:

                                        Input      Output
 AC: Annotation Confirmed               Yes         Yes
 AR: Annotation to be Reviewed          Yes         No
 NAC: No Annotation Confirmed           No          No
 AP: Annotation Predicted               No          Yes


We can draw the Receiver Operating Characteristic curves for
every prediction:

 On the x axis, the annotation to be reviewed rate: AR / (AC + AR)

 On the y axis, the annotation predicted rate: AP / (AP + NAC)
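One way to obtain the points of such a curve is to sweep the threshold τ and compute both rates at each value, using the definitions AR / (AC + AR) and AP / (AP + NAC) reported with the evaluation results. The following is a sketch on made-up data; the function name `roc_points` and the convention of returning 0 when a denominator is empty are assumptions.

```python
import numpy as np


def roc_points(A, A_k, thresholds):
    """Return (AR rate, AP rate) pairs, one per threshold."""
    known = np.asarray(A) == 1
    scores = np.asarray(A_k)
    points = []
    for tau in thresholds:
        above = scores > tau
        ac = np.sum(known & above)
        ar = np.sum(known & ~above)
        nac = np.sum(~known & ~above)
        ap = np.sum(~known & above)
        # Assumed convention: a rate with an empty denominator is reported as 0.
        ar_rate = ar / (ac + ar) if (ac + ar) else 0.0
        ap_rate = ap / (ap + nac) if (ap + nac) else 0.0
        points.append((float(ar_rate), float(ap_rate)))
    return points


# Hypothetical input matrix and prediction scores.
A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.7],
                [0.2, 0.1]])

points = roc_points(A, A_k, thresholds=[0.0, 0.5, 1.0])
print(points)
```

As τ grows, fewer entries exceed it, so the AR rate rises while the AP rate falls.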

Evaluation data set

•     We considered the Gene Ontology annotations of two organisms:
      Gallus gallus (Chicken) and Bos taurus (Cattle)
     − Excluding the less reliable Inferred from Electronic Annotation (IEA) annotations

•     After this, the six organism/ontology data sets were:

      Organism           Ontology    Genes     Terms     Annotations (direct)
      Gallus gallus           BP      275       527       738
      Gallus gallus           CC      260       148       478
      Gallus gallus           MF      309       225       509
      Bos taurus              BP      512       930       1,557
      Bos taurus              CC      497       234       921
      Bos taurus              MF      543       422       934

     with total (true-path-rule) annotations about 10 times more
     than the direct annotations
Evaluation results




[Figure: ROC curves of annotation to be reviewed rate AR / (AC + AR) and
annotation predicted rate AP / (AP + NAC) for Bos taurus (Cattle):
Cellular Component (top left), Molecular Function (top) and Biological
Process (left), for SVD with best truncation value (in red) and for
pLSAnorm with best number of topics (in green)]
Evaluation results (3)

• As an aggregated indicator of prediction performance, we
  computed the Area Under the Curve (AUC) in the [0, 0.01] range
  of AP rate values
   − We are interested in the low range of AP rate, since it
     corresponds to top-ranked predictions of newly inferred
     annotations (AP) with the highest score
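A partial AUC of this kind could be computed roughly as follows. This is only a sketch under stated assumptions: the slides do not say how the area is integrated or normalized, so the code assumes the curve is given as points with the AP rate as the integration variable, uses the trapezoidal rule, and divides by the width of the range to obtain a percentage. The function name `partial_auc` and the example curve are hypothetical.

```python
import numpy as np


def partial_auc(ap_rate, other_rate, ap_max=0.01):
    """Area under the curve for AP rate in [min(ap_rate), ap_max], as a percentage.

    Assumptions (not from the slides): trapezoidal integration over the
    AP rate, linear interpolation at ap_max, normalization by ap_max.
    """
    x = np.asarray(ap_rate, dtype=float)
    y = np.asarray(other_rate, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    y_end = np.interp(ap_max, x, y)  # curve value at the right boundary
    keep = x < ap_max
    xs = np.append(x[keep], ap_max)
    ys = np.append(y[keep], y_end)
    area = np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]) / 2.0)
    return 100.0 * area / ap_max


# Hypothetical curve that rises linearly from 0 to 1 over AP rate in [0, 0.01].
ap_rate = [0.0, 0.005, 0.01, 1.0]
other_rate = [0.0, 0.5, 1.0, 1.0]

auc_percent = partial_auc(ap_rate, other_rate)
print(auc_percent)
```

With this normalization a curve that stays at 1 over the whole range would score 100%, and the linear example above scores half of that.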

    Area under ROC curves (AUC, %) and execution time (sec)

    Organism        Ontology   AUC (SVD)   AUC (pLSAnorm)   Time (SVD)   Time (pLSAnorm)
    Bos taurus      BP         44.30       34.75            33           28,188
    Bos taurus      CC         53.03       27.31            36           4,674
    Bos taurus      MF         80.96       30.69            11           1,890
    Gallus gallus   BP         47.33       44.83            98           3,990
    Gallus gallus   CC         75.39       37.22            10           796
    Gallus gallus   MF         65.76       29.87            5            422

Conclusions

• We proposed the pLSAnorm method as a novel contribution in
  the context of prediction of genomic ontological annotations
   - Our pLSAnorm method gives better predictions than the
     Singular Value Decomposition (SVD) method
   - The higher execution time of pLSAnorm compared with SVD calls for
     further optimization; it currently limits its use to off-line
     analysis or small data sets




Conclusions (2)

• Our approach is not limited to the Gene Ontology considered
  here and can be applied to any controlled annotations

• Increasingly available multiple annotations from different
  controlled vocabularies and ontologies could be jointly
  considered to further improve prediction reliability (both in
  SVD and pLSAnorm)




Probabilistic Latent Semantic Analysis for
prediction of Gene Ontology annotations




        Thank you for your attention





"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations" - Davide Chicco (PoliMi) @ Dei PoliMi PhDay 2012

  • 1. 2012 DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations Davide Chicco, Pietro Pinoli, Marco Masseroli davide.chicco@elet.polimi.it
  • 2. Summary 1. The problem • Biomolecular annotations • Prediction of biomolecular annotations 2. The methods • SVD – Singular Value Decomposition • pLSA – Probabilistic Latent Semantic Analysis 3. Evaluation • Evaluation data set • Evaluation results 4. Conclusions Davide Chicco @ PhDay2012 2
  • 3. Biomolecular annotations • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code • The association of such a code with a gene or protein ID constitutes an annotation Gene / Biological function feature Protein Annotation gene2bff Davide Chicco @ PhDay2012 3
  • 4. Biomolecular annotations (2) • The association of an information/feature with a gene or protein ID constitutes an annotation • Annotation example: • gene: GD4 • feature: “is present in the mitochondrial membrane” Gene / Biological function feature Protein Annotation gene2bff Davide Chicco @ PhDay2012 4
  • 5. Prediction of biomolecular annotations • Many available annotations in different databanks • However, available annotations are incomplete • Only a few of them represent highly reliable, human–curated information • To support and quicken the time–consuming curation process, prioritized lists of computationally predicted annotations are extremely useful • These lists could be generated softwares based that implement Machine Learning algorithms Davide Chicco @ PhDay2012 5
  • 6. Annotation prediction through Singular Value Decomposition – SVD • Annotation matrix A  {0, 1} m x n − m rows: genes / proteins − n columns: annotation terms A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule) A(i,j) = 0 otherwise (it is unknown) term01 term02 term03 term04 … termN gene01 0 0 0 0 … 0 gene02 0 1 1 0 … 1 … … … … … … … geneM 0 0 0 0 … 0 Davide Chicco @ PhDay2012 7
  • 7. Annotation prediction through Singular Value Decomposition – SVD • Annotation matrix A  {0, 1} m x n − m rows: genes / proteins − n columns: annotation terms A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule) A(i,j) = 0 otherwise (it is unknown) term01 term02 term03 term04 … termN gene01 0 0 0 0 … 0 gene02 0 1 1 0 … 1 … … … … … … … geneM 0 0 0 0 … 0 Davide Chicco @ PhDay2012 8
  • 8. Singular Value Decomposition – SVD Compute SVD: A  U V T  U V T V TA  U V T A A U A  U V T  Compute reduced rank approximation: Ak  U k kkVk U kU kVkkTVkTU k  kVkT A AT     k A Ak  U k kVkT  k • An annotation prediction is performed by computing a reduced k rank approximation Ak of the annotation matrix A (where 0 < k < r, with r the number of non zero singular values of A, i.e. the rank of A) Davide Chicco @ PhDay2012 9
  • 9. Probabilistic Latent Semantic Analysis - pLSA
      pLSA:
      • An alternative to the SVD method
      • Based on Latent Semantic Indexing (LSI)
      Latent Semantic Indexing – LSI:
      • Identifies latent relationships between different elements in a certain class
          − e.g. between documents and the words within them
          − between genes and their biomolecular features described by controlled annotation terms
      • Maps class elements to a vector space of reduced dimensionality, and then analyzes it
  • 10. Probabilistic Latent Semantic Analysis - pLSA (2)
      Suppose you have:
      • A set of genes G = {g1, …, gn} related to a set of feature terms F = {f1, …, fn} which, together, form a set of controlled biomolecular annotations
      • A set of class variables T = {t1, …, tn}, called topics, where every feature term f ∈ F can be associated with a topic t ∈ T
      The pLSA statistical model associates an unobserved class variable (topic) with each observation (a feature term and gene pair)
  • 11. Probabilistic Latent Semantic Analysis - pLSA (3)
      • P(f | t): probability that a feature term f is associated with a topic t
      • P(t | g): probability of getting a topic t by selecting a gene g
      • The following conditions hold:
          $\forall t \in T, \; \sum_{f \in F} P(f \mid t) = 1$
          $\forall g \in G, \; \sum_{t \in T} P(t \mid g) = 1$
      • The joint probability of g and f is given by:
          $P(g, f) = \sum_{t \in T} P(t) \, P(g \mid t) \, P(f \mid t)$
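The slide defines P(t | g) but writes the joint probability with P(g | t). The two forms are linked by Bayes' rule, as the short derivation below shows. It assumes, as is standard for pLSA, that f and g are conditionally independent given t; the marginal P(g) is a symbol introduced here and does not appear on the slide.

```latex
\begin{aligned}
P(g, f) &= P(g) \sum_{t \in T} P(t \mid g)\, P(f \mid t) \\
        &= \sum_{t \in T} P(t)\, P(g \mid t)\, P(f \mid t),
\qquad \text{since } P(g)\, P(t \mid g) = P(t)\, P(g \mid t).
\end{aligned}
```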
  • 12. Probabilistic Latent Semantic Analysis - pLSA (4)
      Model training
      • Aim: maximum likelihood estimation of P(f|t), using the Expectation Maximization (EM) algorithm on a training set:
          $L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$   [1]
      Model validation
      • The validation set of genes and feature terms has the same feature terms as the training set, but completely different genes
      • Aim: maximize the formula in [1], using the P(f|t) calculated in the training phase and varying the parameters P(t|g) related to the new genes in the validation set
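The validation step, in which P(f|t) stays fixed and only P(t|g) is re-estimated for an unseen gene, can be sketched as below. This procedure is often called folding-in in the pLSA literature. The function name, the argument layout, and the reading of a(g, f) as the gene's row of the annotation matrix are assumptions made for this sketch.

```python
def fold_in(a_row, p_f_given_t, iters=50):
    """Estimate P(t|g) for one new gene while keeping P(f|t) fixed.

    a_row[f]          -- annotation indicator (or count) of the new gene for feature f
    p_f_given_t[t][f] -- P(f|t) learned on the training set, not modified here
    """
    n_topics = len(p_f_given_t)
    p_t_given_g = [1.0 / n_topics] * n_topics  # uniform starting point
    for _ in range(iters):
        expected = [0.0] * n_topics
        for f, weight in enumerate(a_row):
            if weight == 0:
                continue
            joint = [p_t_given_g[t] * p_f_given_t[t][f] for t in range(n_topics)]
            z = sum(joint)
            if z == 0.0:
                continue
            for t in range(n_topics):
                expected[t] += weight * joint[t] / z  # E-step posterior P(t|g,f)
        total = sum(expected)
        if total == 0.0:
            break
        p_t_given_g = [x / total for x in expected]  # M-step for P(t|g) only
    return p_t_given_g


# Hypothetical trained model with two topics over four feature terms.
p_f_given_t = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]]
new_gene = [1, 1, 0, 0]
p_t_given_g = fold_in(new_gene, p_f_given_t)
# Predicted P(f|g) for the new gene: sum over topics of P(t|g) * P(f|t).
p_f_given_g = [sum(p_t_given_g[t] * p_f_given_t[t][f] for t in range(2)) for f in range(4)]
```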
  • 13. Probabilistic Latent Semantic Analysis - pLSA (5)
      EM algorithm: it seeks a maximum likelihood estimate by iteratively applying:
      • Expectation step: the a posteriori probabilities of the latent variables t are computed, as
          $P(t \mid g, f) \propto P(t) \, P(g, f \mid t)$
      • Maximization step: the parameter values are updated in order to maximize the log-likelihood
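A compact sketch of the two EM steps for the symmetric model P(g, f) = Σ_t P(t) P(g|t) P(f|t) follows. It is written in plain Python for a small toy matrix and is not the authors' implementation; the function name `plsa_em`, the random initialization, and the use of the annotation matrix entries as a(g, f) are assumptions.

```python
import math
import random


def plsa_em(a, n_topics, iters=50, seed=0):
    """Fit P(t), P(g|t), P(f|t) to a gene-by-feature matrix `a` with EM."""
    rng = random.Random(seed)
    n_genes, n_feats = len(a), len(a[0])

    def normalized(values):
        s = sum(values)
        return [v / s for v in values]

    p_t = [1.0 / n_topics] * n_topics
    p_g_t = [normalized([rng.random() + 0.1 for _ in range(n_genes)]) for _ in range(n_topics)]
    p_f_t = [normalized([rng.random() + 0.1 for _ in range(n_feats)]) for _ in range(n_topics)]
    history = []  # log-likelihood of the parameters used in each iteration
    for _ in range(iters):
        new_t = [0.0] * n_topics
        new_g = [[0.0] * n_genes for _ in range(n_topics)]
        new_f = [[0.0] * n_feats for _ in range(n_topics)]
        loglik = 0.0
        for g in range(n_genes):
            for f in range(n_feats):
                if a[g][f] == 0:
                    continue
                # E-step: P(t|g,f) is proportional to P(t) P(g|t) P(f|t)
                joint = [p_t[t] * p_g_t[t][g] * p_f_t[t][f] for t in range(n_topics)]
                p_gf = sum(joint)
                loglik += a[g][f] * math.log(p_gf)
                for t in range(n_topics):
                    w = a[g][f] * joint[t] / p_gf
                    new_t[t] += w
                    new_g[t][g] += w
                    new_f[t][f] += w
        history.append(loglik)
        # M-step: re-estimate the parameters from the expected counts
        total = sum(new_t)
        for t in range(n_topics):
            if new_t[t] > 0:
                p_g_t[t] = [x / new_t[t] for x in new_g[t]]
                p_f_t[t] = [x / new_t[t] for x in new_f[t]]
            p_t[t] = new_t[t] / total
    return p_t, p_g_t, p_f_t, history


# Toy annotation matrix with two obvious gene/term groups.
a = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
p_t, p_g_t, p_f_t, history = plsa_em(a, n_topics=2)
```

The recorded log-likelihood values should not decrease from one iteration to the next, which is a convenient sanity check on any EM implementation.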
  • 14. Probabilistic Latent Semantic Analysis - pLSA (5)
      In comparison to SVD:
          $U_k = [P(g_i \mid t_k)]_{ik}$
          $\Sigma_k = \mathrm{diag}[P(t_k)]_k$
          $V_k = [P(f_j \mid t_k)]_{jk}$
          $A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$
  • 15. Probabilistic Latent Semantic Analysis - pLSA (6)
      • The pLSA model imposes the constraint:
          $\forall g \in G, \; \sum_{f \in F} P(f \mid g) = 1$
      • This can bias the prediction, because the more annotations a gene has, the lower its average conditional probability is
      • To avoid such bias we propose a normalized extension of pLSA. For every gene g ∈ G:
          i. Compute: $M = \max_{f \in F} P(f \mid g)$
          ii. Compute the normalized P(f | g) vector as: $P(f \mid g)_{norm} = \frac{1}{M} P(f \mid g)$
      • Thus, the feature terms with the highest conditional probability for a gene are always predicted to be annotated to that gene
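The normalization in steps i and ii amounts to dividing each gene's P(f | g) vector by its maximum. A minimal sketch, with a hypothetical function name and made-up probability vectors, shows how it removes the penalty on genes with many annotations:

```python
def normalize_by_max(p_f_given_g):
    """Divide a gene's P(f|g) vector by its maximum, so the top term scores 1."""
    m = max(p_f_given_g)
    if m == 0:
        return list(p_f_given_g)  # nothing to rescale
    return [p / m for p in p_f_given_g]


few = normalize_by_max([0.5, 0.5, 0.0, 0.0])       # gene with two likely terms
many = normalize_by_max([0.25, 0.25, 0.25, 0.25])  # gene with four likely terms
```

Before normalization the second gene's best terms score 0.25 and would fall below a threshold of, say, 0.4; after normalization the best terms of both genes score 1.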
  • 16. Evaluation of the prediction
      To evaluate the prediction, we compare each element A(i,j) with its corresponding Ak(i,j), for each real threshold τ with 0 ≤ τ ≤ 1.0:
      • if A(i,j) = 1 and Ak(i,j) > τ: AC, Annotation Confirmed (AC ← AC + 1)
      • if A(i,j) = 1 and Ak(i,j) ≤ τ: AR, Annotation to be Reviewed (AR ← AR + 1)
      • if A(i,j) = 0 and Ak(i,j) ≤ τ: NAC, No Annotation Confirmed (NAC ← NAC + 1)
      • if A(i,j) = 0 and Ak(i,j) > τ: AP, Annotation Predicted (AP ← AP + 1)
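The four counters can be computed in a single pass over the two matrices. The sketch below follows the rules on the slide; the function name and the toy matrices are illustrative assumptions.

```python
def evaluate(A, Ak, tau):
    """Count AC, AR, NAC and AP by comparing A with the scores in Ak at threshold tau."""
    counts = {"AC": 0, "AR": 0, "NAC": 0, "AP": 0}
    for row, row_k in zip(A, Ak):
        for a, score in zip(row, row_k):
            if a == 1:
                counts["AC" if score > tau else "AR"] += 1
            else:
                counts["AP" if score > tau else "NAC"] += 1
    return counts


A = [[1, 0], [1, 0]]
Ak = [[0.9, 0.6], [0.2, 0.1]]
result = evaluate(A, Ak, tau=0.5)
```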
  • 17. New concept: Receiver Operating Characteristic (ROC) curve
      Starting from the annotation prediction evaluation categories just introduced:

                                          Input   Output
        AC: Annotation Confirmed          Yes     Yes
        AR: Annotation to be Reviewed     Yes     No
        NAC: No Annotation Confirmed      No      No
        AP: Annotation Predicted          No      Yes

      We can draw the Receiver Operating Characteristic curve for every prediction:
      • On the x axis, the annotation to be reviewed rate: AR / (AC + AR)
      • On the y axis, the annotation predicted rate: AP / (AP + NAC)
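One point of the curve is obtained per threshold. The sketch below uses the two rates as they are written on the evaluation results slide, AR / (AC + AR) and AP / (AP + NAC); the function name, the handling of empty denominators and the toy data are assumptions.

```python
def roc_points(A, Ak, thresholds):
    """Return one (AR rate, AP rate) pair per threshold."""
    points = []
    for tau in thresholds:
        ac = ar = nac = ap = 0
        for row, row_k in zip(A, Ak):
            for a, score in zip(row, row_k):
                if a == 1 and score > tau:
                    ac += 1
                elif a == 1:
                    ar += 1
                elif score > tau:
                    ap += 1
                else:
                    nac += 1
        # Guard against empty denominators (an assumption, not from the slides).
        ar_rate = ar / (ac + ar) if ac + ar else 0.0
        ap_rate = ap / (ap + nac) if ap + nac else 0.0
        points.append((ar_rate, ap_rate))
    return points


A = [[1, 0], [1, 0]]
Ak = [[0.9, 0.6], [0.2, 0.1]]
points = roc_points(A, Ak, thresholds=[0.0, 0.5, 1.0])
```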
  • 18. Evaluation data set
      • We considered the Gene Ontology annotations of two organisms: Gallus gallus (Chicken) and Bos taurus (Cattle)
          − Excluding the less reliable Inferred Electronic Annotations
      • After this, the resulting data sets were:

        Organism        Ontology   Genes   Terms   Annotations (direct)
        Gallus gallus   BP           275     527                    738
        Gallus gallus   CC           260     148                    478
        Gallus gallus   MF           309     225                    509
        Bos taurus      BP           512     930                  1,557
        Bos taurus      CC           497     234                    921
        Bos taurus      MF           543     422                    934

      with total (true-path-rule) annotations about 10 times more than the direct annotations
  • 19. Evaluation results
      • ROC curves of the annotation to be reviewed rate, AR / (AC + AR), and the annotation predicted rate, AP / (AP + NAC), for Bos taurus (Cattle) Cellular Component (top left), Molecular Function (top) and Biological Process (left), for SVD with the best truncation value (in red) and for pLSAnorm with the best number of topics (in green)
  • 20. Evaluation results (3)
      • As an aggregated indicator of prediction performance, we computed the Area Under the Curve (AUC) in the [0; 0.01] range of AP rate values
          − We are interested in the low range of the AP rate, since it corresponds to the top-ranked predictions of newly inferred annotations (AP) with the highest score

      Area under ROC curves (AUC, %) and execution time (sec)

        Organism        Ontology   AUC SVD   AUC pLSAnorm   Time SVD   Time pLSAnorm
        Bos taurus      BP           44.30          34.75         33          28,188
        Bos taurus      CC           53.03          27.31         36           4,674
        Bos taurus      MF           80.96          30.69         11           1,890
        Gallus gallus   BP           47.33          44.83         98           3,990
        Gallus gallus   CC           75.39          37.22         10             796
        Gallus gallus   MF           65.76          29.87          5             422
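An area restricted to a low range of one rate can be approximated with the trapezoidal rule, clipping the last segment at the upper bound. The sketch below is only one possible reading of the slide: it takes the AP rate as the first coordinate of each point, returns the raw area without any rescaling to a percentage, and uses made-up curve points.

```python
def partial_auc(points, x_max):
    """Trapezoidal area under a piecewise-linear curve for x up to x_max."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= x_max:
            break
        if x1 > x_max:
            # Clip the last segment at x_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (x_max - x0) / (x1 - x0)
            x1 = x_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curve points: (AP rate, other rate).
curve = [(0.0, 0.0), (0.005, 0.4), (0.02, 0.7)]
area = partial_auc(curve, x_max=0.01)
```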
  • 21. Conclusions
      • We proposed the pLSAnorm method as a novel contribution to the prediction of genomic ontological annotations
          - Our pLSAnorm method gives better predictions than the Singular Value Decomposition (SVD) method
          - The higher execution time of pLSAnorm compared with SVD calls for further optimization, and currently limits its use to off-line analysis or small data sets
  • 22. Conclusions (2)
      • Our approach is not limited to the Gene Ontology considered here and can be applied to any controlled annotations
      • The increasingly available annotations from multiple controlled vocabularies and ontologies could be considered jointly to further improve prediction reliability (both in SVD and pLSAnorm)
  • 23. Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations
      Thank you for your attention