Self-Similarity Metric for Index Pruning in Conceptual Vector Space Models

Dario Bonino, Fulvio Corno
Dipartimento di Automatica ed Informatica
Politecnico di Torino

dario.bonino@polito.it

http://elite.polito.it

Agenda

Introduction
Problem statement
Self-Similarity based pruning
Experimental results
Conclusion

2008-09-02 Self-Similarity metric for Index Pruning - WebS 2008 @ DEXA

Semantic IR

New generation search tools exploiting
conceptual information
Many techniques
Logic and reasoning
Annotation
Natural Language Processing
Latent Semantic Indexing
Research still open but some convergences are
emerging
Several researchers independently chose to work
on Conceptual Vector Space Models


C-VSM vs VSM

Differences
                     C-VSM                                     VSM
Doc features         Concepts                                  Words
Vector components    Related to the strength of association    Related to word frequency
                     to a concept


Index pruning

Commonalities
Very similar models and data structures
Need for large indexes
Reducing the index size (ideally) improves the
search efficiency
This operation is called Index Pruning
Index Pruning can be
On-line
Applicable in parallel to indexing
Works on single documents
Off-line
During idle time
Rebuilds the whole index

Objectives

Long-term goal
To analyze storage and pruning techniques for C-VSM indexes

Current objective
On-line pruning
Index pruning based on document-local information
Design of a Self-Similarity metric for index pruning
Implementation of a simple index pruning algorithm
based on the Self-Similarity metric



C-VSM: a formal definition

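C-VSM
$C\text{-}VSM = (C, D, A), \quad A \subseteq D \times C \times \mathbb{R}^{+}$
C: set of concepts of a domain ontology
D: set of documents
A: set of annotations

Annotations
Each annotation $a \in A$, $a = (d, c, w)$, associates a document d to a concept c with a weight w
[Figure: document d linked to concept c by an edge labeled with the weight w]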


Documents in C-VSM

In C-VSM a document is represented by a vector,
whose components are the weights $w_i$ of
annotations toward domain concepts
$V(d) = (w_1, w_2, w_3, \ldots, w_{|C|})$
Where
$w_i = \{ w : (d, c_i, w) \in A \}$
[Figure: document vector $d_i$ in the concept space with axes c1, c2, c3 and components w1, w2, w3]
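As a rough illustration of the definition above, the following Python sketch builds document vectors from annotation triples. The concept names, document names and weights are made-up toy data, and treating a missing annotation as weight 0 is an assumption, not something the slides state.

```python
from collections import defaultdict

# Hypothetical toy data: identifiers and weights are invented for illustration.
CONCEPTS = ["c1", "c2", "c3"]

# Each annotation is a (document, concept, weight) triple with a positive weight.
ANNOTATIONS = [
    ("d1", "c1", 0.5),
    ("d1", "c3", 2.0),
    ("d2", "c2", 1.0),
]


def document_vectors(concepts, annotations):
    """Return {document: [w_1, ..., w_|C|]} built from annotation triples."""
    position = {concept: i for i, concept in enumerate(concepts)}
    vectors = defaultdict(lambda: [0.0] * len(concepts))
    for doc, concept, weight in annotations:
        # Assumption: concepts without an annotation keep weight 0.
        vectors[doc][position[concept]] = weight
    return dict(vectors)


vectors = document_vectors(CONCEPTS, ANNOTATIONS)
print(vectors["d1"])  # [0.5, 0.0, 2.0]
```

Most components are zero when a document is annotated with only a few concepts, which is why a real index would likely store only the non-zero entries.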


Self-similarity metric

Defined as the cosine similarity between the
original document vector d and its pruned version
d'
$S(V(d), V(d')) = \cos(V(d), V(d')) = \frac{V(d) \cdot V(d')}{|V(d)| \, |V(d')|}$
[Figure: vectors d and d' in the concept space with axes c1, c2, c3, separated by the angle α]
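The cosine definition above can be sketched in a few lines of Python. The function name `self_similarity` and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math


def self_similarity(original, pruned):
    """Cosine similarity between a document vector and its pruned version."""
    dot = sum(a * b for a, b in zip(original, pruned))
    norm_original = math.sqrt(sum(a * a for a in original))
    norm_pruned = math.sqrt(sum(b * b for b in pruned))
    if norm_original == 0.0 or norm_pruned == 0.0:
        # Assumption: similarity with an all-zero vector is taken to be 0.
        return 0.0
    return dot / (norm_original * norm_pruned)


v_d = [3.0, 4.0, 0.0]
v_d_pruned = [0.0, 4.0, 0.0]  # the weight on the first concept was removed
print(self_similarity(v_d, v_d_pruned))  # 0.8
```

When V(d') is obtained only by zeroing components of V(d), the dot product equals |V(d')|², so the score reduces to |V(d')| / |V(d)|. In the example this is 4 / 5.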


Self-similarity pruning

General definition

Given a document d represented by its vector
V(d), find a new representation V(d') such that,
|V(d')|<|V(d)|
for any query q, the difference
|S(V(d),V(q))-S(V(d'),V(q))| is minimal
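To make the quantity being minimized concrete, this Python sketch evaluates |S(V(d),V(q)) - S(V(d'),V(q))| for two hypothetical queries. The vectors and the helper names are illustrative assumptions only.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_similarity_shift(doc, pruned_doc, query):
    """How much pruning changes the document's score for one query."""
    return abs(cosine(doc, query) - cosine(pruned_doc, query))


doc = [3.0, 4.0, 0.0]
pruned = [0.0, 4.0, 0.0]          # weight on the first concept removed
query_on_kept = [0.0, 1.0, 0.0]   # asks for a concept that survived pruning
query_on_removed = [1.0, 0.0, 0.0]  # asks for the concept that was pruned

shift_kept = query_similarity_shift(doc, pruned, query_on_kept)
shift_removed = query_similarity_shift(doc, pruned, query_on_removed)
print(round(shift_kept, 3), round(shift_removed, 3))  # 0.2 0.6
```

The shift depends on the query, which is unknown at indexing time. This is one way to read the motivation for using a query-independent, document-local proxy such as self-similarity.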


Greedy algorithm

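Self similarity prune (V(d), τ)
τ = self-similarity threshold
V(d') = V(d)
while (S(V(d),V(d')) >= τ)
{
    i = argmin(V(d')i) over non-zero components //find the lowest weight
    V(d')i = 0 //delete annotation
}
return V(d')
[Figure: d and its pruned version d' in the concept space with axes c1, c2, c3 and weights w1, w2, w3]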
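A minimal Python sketch of the greedy idea follows. It is a variant, not the authors' implementation: it tests the threshold before committing each removal, so the returned vector never falls below τ, whereas the loop on the slide tests after removing. It also assumes positive weights and never prunes a document down to the empty vector. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def self_similarity_prune(vector, tau):
    """Zero the lowest non-zero weights while self-similarity stays >= tau."""
    pruned = list(vector)
    while True:
        nonzero = [i for i, w in enumerate(pruned) if w != 0.0]
        if len(nonzero) <= 1:
            break  # assumption: keep at least one annotation per document
        lowest = min(nonzero, key=lambda i: pruned[i])
        candidate = list(pruned)
        candidate[lowest] = 0.0  # tentatively delete the weakest annotation
        if cosine(vector, candidate) < tau:
            break  # this removal would cross the threshold, so stop here
        pruned = candidate
    return pruned


print(self_similarity_prune([0.1, 0.2, 0.9, 0.4], tau=0.95))
# [0.0, 0.0, 0.9, 0.4]
```

In the example, removing 0.1 and 0.2 keeps the self-similarity at roughly 0.975, while also removing 0.4 would drop it to about 0.89, so pruning stops after two removals. Only the document's own vector is needed, which matches the on-line, document-local setting.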



Metrics (1/2)

Ranking similarity
Measures similarity of search results obtained
using
The ranking $r_o$ deriving from the original index
The ranking $r_p$ deriving from the pruned index
The simplest and most widely used metric is the
Symmetric Difference Score (@ top k results)
$r_o \,\Delta\, r_p = (r_o - r_p) \cup (r_p - r_o)$
$R(r_o, r_p) = 1 - \frac{|r_o \,\Delta\, r_p|}{2k}$
R=1 perfect match, R=0 no match
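The score above can be computed with Python sets, as in this sketch. The function name and the sample rankings are hypothetical, and the sketch assumes both rankings contain at least k results.

```python
def ranking_similarity(original_ranking, pruned_ranking, k):
    """Symmetric Difference Score over the top-k results of two rankings."""
    top_original = set(original_ranking[:k])
    top_pruned = set(pruned_ranking[:k])
    # Documents that appear in exactly one of the two top-k sets.
    symmetric_difference = top_original ^ top_pruned
    return 1.0 - len(symmetric_difference) / (2 * k)


r_o = ["d1", "d2", "d3", "d4"]  # ranking from the original index (toy data)
r_p = ["d2", "d1", "d5", "d3"]  # ranking from the pruned index (toy data)
score = ranking_similarity(r_o, r_p, k=3)
print(round(score, 3))  # 0.667
```

Because it works on sets, the score ignores the order of results within the top k: swapping d1 and d2 in the example costs nothing, while replacing d3 with d5 does.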


Metrics (2/2)

Compression ratio

Measures the amount of pruning achieved by a
given compression algorithm
$CR = \frac{|\mathit{prunedEntries}|}{|\mathit{originalEntries}|}$


Experimental setting (1/2)

Semantic IR system
H-DOSE, http://dose.sourceforge.net
Uses a C-VSM
Shallow indexing based on a bag of words technique
Document test sets
Sider
Subset of the e-Class ontology on the iron and steel
industry (677 concepts)
250 documents gathered from the web and manually
classified
12 queries
Available on request (mail to dario.bonino@polito.it)


Experimental setting (2/2)

Document test sets (continued...)
Passepartout
Ontology on disabilities developed in collaboration with
the municipality of Turin (181 concepts, 20 different
relations)
Documents: all news and docs published on the
Passepartout web site from 2004 to 2006 (around 2400
pages)
12 queries
Available on request (mail to dario.bonino@polito.it),
ontology in Italian


CR vs Self-similarity

[Figure: compression ratio (CR) vs self-similarity threshold τ]
Limited at τ > 60% (for lower values R becomes
too low)


Ranking Similarity - Sider

Ranking similarity vs Compression Ratio


Ranking Similarity - Passepartout

Ranking similarity vs Compression Ratio


Query time vs pruning

Passepartout

Sider


Discussion (1/2)

Sider
Quite controlled
Small
Smoother behavior
Quite satisfactory performance
80% similarity @ 30% pruning
Passepartout
Medium-sized
Captured “in the wild”
Complex behavior
Fair performance
65% similarity @ 20% pruning


Open Issues

Test sets
Small
Relatively custom
Few or no standard test sets available for Semantic
IR systems
We are working on
CNN news + KIM ontology
Acquis corpus + EuroVoc

Semantic IR system
Quite simple indexing technique
Sensitive to composition of the bag of words



Conclusions

Index pruning is expected to become a critical
issue for Semantic IR systems (as already
happens for traditional IR)
Self-similarity pruning can be applied on-line,
reaching relatively good performance
On-line pruning does not preclude off-line pruning,
which may lead to better results
Experimentation on bigger and less controlled
datasets is needed (however, there is a significant
lack of test data)
Porting of traditional algorithms to Semantic
IR systems should be investigated

Thank you!

Questions?

Dario Bonino - dario.bonino@polito.it

