"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations" - Davide Chicco (PoliMi) @ Dei PoliMi PhDay 2012

2012
DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE

Probabilistic Latent Semantic Analysis
for prediction of
Gene Ontology annotations
Davide Chicco, Pietro Pinoli, Marco Masseroli
davide.chicco@elet.polimi.it

Summary

1. The problem
• Biomolecular annotations
• Prediction of biomolecular annotations
2. The methods
• SVD – Singular Value Decomposition
• pLSA – Probabilistic Latent Semantic Analysis
3. Evaluation
• Evaluation data set
• Evaluation results
4. Conclusions

Davide Chicco @ PhDay2012

Biomolecular annotations

• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies,
sometimes structured as ontologies, where every controlled
term of the vocabulary is associated with a unique
alphanumeric code

• The association of such a code with a gene or protein ID
constitutes an annotation

[Diagram: Gene / Protein linked by an Annotation (gene2bff) to a Biological function feature]


Biomolecular annotations (2)

• The association of an information/feature with a gene or
protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”



Prediction of biomolecular annotations

• Many available annotations in different databanks

• However, available annotations are incomplete

• Only a few of them represent highly reliable, human–curated
information

• To support and quicken the time–consuming curation process,
prioritized lists of computationally predicted annotations
are extremely useful

• These lists could be generated by software that implements
Machine Learning algorithms


Annotation prediction through
Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)

          term01  term02  term03  term04  …  termN
gene01    0       0       0       0       …  0
gene02    0       1       1       0       …  1
…         …       …       …       …       …  …
geneM     0       0       0       0       …  0
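
The construction of A under the true path rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_annotation_matrix`, the `parents` dictionary and the toy ontology are all assumptions made for the example.

```python
import numpy as np

def build_annotation_matrix(genes, terms, direct, parents):
    """Build A in {0,1}^(m x n); each direct annotation is also
    propagated to every ancestor term (true path rule)."""
    gene_idx = {g: i for i, g in enumerate(genes)}
    term_idx = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=np.uint8)
    for gene, term in direct:
        stack, seen = [term], set()
        while stack:                      # walk up the ontology
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            A[gene_idx[gene], term_idx[t]] = 1
            stack.extend(parents.get(t, ()))
    return A

# Hypothetical toy ontology: term03 -> term02 -> term01 (child -> parent)
genes = ["gene01", "gene02"]
terms = ["term01", "term02", "term03"]
parents = {"term03": ["term02"], "term02": ["term01"]}
A = build_annotation_matrix(genes, terms, [("gene02", "term03")], parents)
```

Here gene02 is directly annotated only to term03, yet its row is filled for term02 and term01 as well, while gene01 keeps an all-zero (unknown) row.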


Annotation prediction through
Singular Value Decomposition – SVD (2)

Compute SVD:
$A = U \Sigma V^T$

Compute reduced rank approximation:
$A_k = U_k \Sigma_k V_k^T$

• An annotation prediction is performed by computing a reduced
rank approximation Ak of the annotation matrix A
(where 0 < k < r, with r the number of non-zero singular values
of A, i.e. the rank of A)
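
As a rough sketch of this step, the reduced rank approximation can be computed with NumPy's `np.linalg.svd`. The helper name `svd_predict` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def svd_predict(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy annotation matrix (4 genes x 4 terms), assumed for the example
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
r = np.linalg.matrix_rank(A)   # number of non-zero singular values
A_k = svd_predict(A, k=2)      # real-valued scores, same shape as A
```

The entries of `A_k` are real values rather than 0/1, so they can be compared against a threshold to decide which annotations are predicted.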


Probabilistic Latent Semantic Analysis - pLSA

pLSA:
• An alternative to the SVD method
• Based on Latent Semantic Indexing (LSI)

Latent Semantic Indexing – LSI:
• Identifies latent relationships between different elements
in a certain class
− e.g. between documents and words within them
− between genes and their biomolecular features
described by controlled annotation terms

• Maps class elements to a vector space of reduced
dimensionality, and then analyzes it


Probabilistic Latent Semantic Analysis - pLSA (2)

Suppose you have:
• A set of genes G = {g1, …, gm} related to a set of feature
terms F = {f1, …, fn} which, together, form a set of controlled
biomolecular annotations

• A set of class variables T = {t1, …, tK},
called topics, with every feature
term f ∈ F that can be associated
with a topic t ∈ T

The pLSA statistical model associates
an unobserved class variable
(topic) with each observation
(a feature term and gene pair)


• P(f | t): probability of a feature term f to be associated with a
topic t
• P(t | g): probability of getting a topic t by selecting a gene g
• The following conditions hold:

• $\forall t \in T, \; \sum_{f \in F} P(f \mid t) = 1$

• $\forall g \in G, \; \sum_{t \in T} P(t \mid g) = 1$

• The joint probability between g and f is given by:

$P(g, f) = \sum_{t \in T} P(t) \, P(g \mid t) \, P(f \mid t)$
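
The joint probability and the normalization conditions can be checked numerically on random parameters. The sketch below assumes toy sizes and random distributions; all variable names are illustrative and not taken from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 5, 2  # toy numbers of genes, feature terms and topics (assumed)

def random_columns(rows, cols):
    """Random non-negative matrix whose columns each sum to 1."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

P_t = rng.random(K)
P_t /= P_t.sum()                 # P(t)
P_g_t = random_columns(m, K)     # P(g|t): column t sums to 1 over genes
P_f_t = random_columns(n, K)     # P(f|t): column t sums to 1 over feature terms

# P(g, f) = sum_t P(t) P(g|t) P(f|t), computed for all pairs at once
P_gf = (P_g_t * P_t) @ P_f_t.T   # shape (m, n)

# P(t|g) follows from Bayes' rule: P(t|g) = P(t) P(g|t) / P(g)
P_tg = P_g_t * P_t
P_t_g = P_tg / P_tg.sum(axis=1, keepdims=True)
```

With parameters that satisfy the conditions, `P_gf` sums to 1 over all gene and feature term pairs, and each row of `P_t_g` sums to 1 over the topics.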



Model training
• Aim: maximum likelihood estimation of P(f|t) by using the
Expectation Maximization (EM) algorithm, on a training set

$L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$ [1]

Model validation
• A validation set of genes and feature terms, with the same feature
terms as the training set but completely different genes

• Aim: maximize the formula in [1], but by using the P(f|t)
calculated in the training phase and varying the parameters
P(t|g) related to the new genes in the validation set
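
One common way to realize this validation step is a "fold-in" procedure: P(f|t) stays fixed and only P(t|g) is re-estimated for the new genes. The sketch below assumes that procedure and the decomposition P(f|g) = Σ_t P(f|t) P(t|g); the function name `fold_in` and the toy data are hypothetical, and the slides do not state that this exact update was used.

```python
import numpy as np

def fold_in(A_new, P_f_t, n_iter=50, eps=1e-12):
    """Estimate P(t|g) for new genes while keeping P(f|t) fixed."""
    m, K = A_new.shape[0], P_f_t.shape[1]
    P_t_g = np.full((m, K), 1.0 / K)            # uniform start
    for _ in range(n_iter):
        # E-step: post[g, f, t] proportional to P(f|t) P(t|g)
        post = P_t_g[:, None, :] * P_f_t[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + eps
        # M-step: P(t|g) proportional to sum_f a(g,f) P(t|g,f)
        P_t_g = (A_new[:, :, None] * post).sum(axis=1)
        P_t_g /= P_t_g.sum(axis=1, keepdims=True) + eps
    return P_t_g

# Toy trained P(f|t): topic 0 covers terms 0-1, topic 1 covers terms 2-3
P_f_t = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
A_new = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]])
P_t_g = fold_in(A_new, P_f_t)
```

In this toy case each new gene is assigned almost entirely to the topic that covers its annotated terms. A gene with no annotations at all would get an all-zero row, which a real implementation would need to handle separately.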



EM Algorithm:
It seeks a maximum likelihood estimate by iteratively
applying:
• Expectation step: in which the a posteriori probabilities for the
latent variables t are computed, as

$P(t \mid g, f) \propto P(t) \, P(g, f \mid t)$

• Maximization step: in which the parameter values are updated
in order to maximize the log-likelihood.
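
The two steps can be sketched with the standard pLSA updates for the parameters P(t), P(g|t) and P(f|t), taking P(g, f | t) = P(g|t) P(f|t). This is a minimal dense-array illustration under those assumptions, not the authors' code; `plsa_em` and its arguments are names chosen for the example.

```python
import numpy as np

def plsa_em(A, K, n_iter=50, seed=0, eps=1e-12):
    """Fit P(t), P(g|t), P(f|t) to the annotation matrix A with EM."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    P_t = np.full(K, 1.0 / K)
    P_g_t = rng.random((m, K)); P_g_t /= P_g_t.sum(axis=0)
    P_f_t = rng.random((n, K)); P_f_t /= P_f_t.sum(axis=0)
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior[g, f, t] = P(t | g, f)
        joint_t = P_t * P_g_t[:, None, :] * P_f_t[None, :, :]  # (m, n, K)
        P_gf = joint_t.sum(axis=2)
        log_liks.append(float((A * np.log(P_gf + eps)).sum()))  # formula [1]
        posterior = joint_t / (P_gf[:, :, None] + eps)
        # M-step: re-estimate parameters from posterior-weighted annotations
        weighted = A[:, :, None] * posterior
        P_g_t = weighted.sum(axis=1)
        P_f_t = weighted.sum(axis=0)
        P_t = weighted.sum(axis=(0, 1))
        P_g_t /= P_g_t.sum(axis=0) + eps
        P_f_t /= P_f_t.sum(axis=0) + eps
        P_t /= P_t.sum()
    return P_t, P_g_t, P_f_t, log_liks

A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
P_t, P_g_t, P_f_t, log_liks = plsa_em(A, K=2)
```

The (m, n, K) posterior array keeps the code short but is memory-hungry; for realistic data sets a sparse formulation over the non-zero entries of A would be the usual choice.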



In comparison to SVD:

$U_k = [P(g_i \mid t_k)]_{ik}$
$\Sigma_k = \mathrm{diag}[P(t_k)]_k$
$V_k = [P(f_j \mid t_k)]_{jk}$
$A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$



The pLSA model imposes the constraint:

$\forall g \in G, \; \sum_{f \in F} P(f \mid g) = 1$

• This can bias the prediction because the more annotations a
gene has, the lower its average conditional probability is
• To avoid such bias we propose a normalized extension of pLSA:
• $\forall g \in G$:
i. Compute: $M = \max_{f \in F} P(f \mid g)$

ii. Compute the normalized P(f | g) vector as:
$P(f \mid g)_{norm} = \frac{1}{M} P(f \mid g)$
• Thus, the feature terms with the highest conditional probability
for a gene are always predicted as annotated to that gene
(their normalized value is 1)
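
The normalization above amounts to dividing each gene's row of conditional probabilities by its maximum. A minimal sketch, with a hypothetical function name and toy values:

```python
import numpy as np

def normalize_rows_by_max(P_f_given_g):
    """Divide each gene's P(f|g) vector by its maximum M."""
    M = P_f_given_g.max(axis=1, keepdims=True)
    return P_f_given_g / M

# Toy P(f|g): one row per gene, each row sums to 1
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.1, 0.8]])
P_norm = normalize_rows_by_max(P)
```

After normalization every row has a maximum of exactly 1, so genes with many annotations are no longer penalized by having their probability mass spread thinly.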

Evaluation of the prediction

To evaluate the prediction, we compare each A(i,j) element to its
corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed
(AC ← AC+1)
• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed
(AR ← AR+1)

• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed
(NAC ← NAC+1)
• if A(i,j) = 0 & Ak(i,j) > τ: AP: Annotation Predicted
(AP ← AP+1)
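
The four counters can be computed for one threshold with boolean masks. The function name `evaluate` and the toy matrices below are assumptions for illustration.

```python
import numpy as np

def evaluate(A, A_k, tau):
    """Count AC, AR, NAC and AP for a single threshold tau."""
    known = A == 1        # annotation present in the input matrix
    above = A_k > tau     # annotation predicted by the model
    return {
        "AC": int(np.sum(known & above)),
        "AR": int(np.sum(known & ~above)),
        "NAC": int(np.sum(~known & ~above)),
        "AP": int(np.sum(~known & above)),
    }

A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.7],
                [0.2, 0.1]])
counts = evaluate(A, A_k, tau=0.5)
```

The four counts always add up to the number of entries of A, since every (gene, term) pair falls into exactly one category.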


New concept: Receiver Operating Characteristic
(ROC) curve
Starting from the annotation prediction evaluation factors we just
introduced

                                  Input   Output
• AC: Annotation Confirmed        Yes     Yes
• AR: Annotation to be Reviewed   Yes     No
• NAC: No Annotation Confirmed    No      No
• AP: Annotation Predicted        No      Yes

We can draw the Receiver Operating Characteristic curves for
every prediction:

• On the x axis, the annotation to be reviewed rate: AR / (AC + AR)

• On the y axis, the annotation predicted rate: AP / (AP + NAC)

Evaluation data set

• We considered the Gene Ontology annotations of organisms:
Gallus gallus (Chicken), and Bos taurus (Cattle)
− Excluding less reliable Inferred Electronic Annotations

• After this, the resulting data sets were:

Organism        Ontology   Genes   Terms   Annotations (direct)
Gallus gallus   BP         275     527     738
Gallus gallus   CC         260     148     478
Gallus gallus   MF         309     225     509
Bos taurus      BP         512     930     1,557
Bos taurus      CC         497     234     921
Bos taurus      MF         543     422     934

with total (true-path-rule) annotations about 10 times more
than the direct annotations

Evaluation results

• ROC curves of the annotation to be
reviewed rate AR / (AC + AR) and the
annotation predicted rate AP / (AP +
NAC) for Bos taurus (Cattle): Cellular
Component (top left), Molecular
Function (top) and Biological Process
(left), for SVD with the best truncation value
(in red) and for pLSAnorm with the best
number of topics (in green)
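
The two rates named in the caption can be computed over a sweep of thresholds to obtain the points of one curve. The sketch below is illustrative only: `rates` is a hypothetical helper, and it assumes A contains both known (1) and unknown (0) entries so that neither denominator is zero.

```python
import numpy as np

def rates(A, A_k, taus):
    """Return the AR rate and AP rate for each threshold in taus."""
    known = A == 1
    ar_rate, ap_rate = [], []
    for tau in taus:
        above = A_k > tau
        ac = np.sum(known & above)
        ar = np.sum(known & ~above)
        ap = np.sum(~known & above)
        nac = np.sum(~known & ~above)
        ar_rate.append(ar / (ac + ar))    # AR / (AC + AR)
        ap_rate.append(ap / (ap + nac))   # AP / (AP + NAC)
    return np.array(ar_rate), np.array(ap_rate)

A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.7],
                [0.2, 0.1]])
ar, ap = rates(A, A_k, taus=[0.0, 0.5, 1.0])
```

As the threshold rises, fewer entries exceed it, so the AR rate grows while the AP rate shrinks.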

Evaluation results (3)

• As an aggregated indicator of prediction performance, we
computed the Area Under the Curve (AUC) in the [0; 0.01] range
of AP rate values
− We are interested in the low range of AP rate, since it
corresponds to top-ranked predictions of newly inferred
annotations (AP) with the highest score
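
A partial area of this kind can be approximated with the trapezoidal rule over the curve points that fall in the chosen range. The sketch assumes the AP rate is the horizontal coordinate and makes no assumption about how the slides convert the area to a percentage; `partial_area` is a hypothetical helper name.

```python
import numpy as np

def partial_area(x, y, x_max):
    """Trapezoidal area under the curve (x, y) from min(x) up to x_max."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    # Keep the points below x_max and add an interpolated point at x_max
    grid = np.concatenate([x[x < x_max], [x_max]])
    vals = np.interp(grid, x, y)
    return float(np.sum(np.diff(grid) * (vals[1:] + vals[:-1]) / 2))

# Toy curve on the diagonal, restricted to the range [0, 0.5]
area = partial_area([0.0, 0.5, 1.0], [0.0, 0.5, 1.0], x_max=0.5)
```

For the range used in the slides, `x_max` would be 0.01 and the inputs would be the rate arrays obtained by sweeping the threshold τ.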

Area under ROC curves (AUC) % and execution time (sec)

Organism        Ontology   AUC % (SVD)   AUC % (pLSAnorm)   Time (SVD)   Time (pLSAnorm)
Bos taurus      BP         44.30         34.75              33           28,188
Bos taurus      CC         53.03         27.31              36           4,674
Bos taurus      MF         80.96         30.69              11           1,890
Gallus gallus   BP         47.33         44.83              98           3,990
Gallus gallus   CC         75.39         37.22              10           796
Gallus gallus   MF         65.76         29.87              5            422


Conclusions

• We proposed the pLSAnorm method as a novel contribution in
the context of prediction of genomic ontological annotations
- Our pLSAnorm method gives better predictions than the
Singular Value Decomposition (SVD) method
- The higher execution time of pLSAnorm compared with SVD calls for
further optimization; it currently limits the use of pLSAnorm to
off-line analysis or small data sets


Conclusions (2)

• Our approach is not limited to the Gene Ontology considered
here and can be applied to any controlled annotations

• Increasingly available multiple annotations from different
controlled vocabularies and ontologies could be jointly
considered to further improve prediction reliability (both in
SVD and pLSAnorm)


Probabilistic Latent Semantic Analysis for
prediction of Gene Ontology annotations

Thank you for your attention