Self-Similarity Metric for Index Pruning in Conceptual Vector Space Models
1. Self-Similarity Metric for Index Pruning in Conceptual Vector Space Models
Dario Bonino, Fulvio Corno
Dipartimento di Automatica ed Informatica
Politecnico di Torino
dario.bonino@polito.it
http://elite.polito.it
2. Agenda
Introduction
Problem statement
Self-Similarity based pruning
Experimental results
Conclusion
2008-09-02 Self-Similarity metric for Index Pruning - WebS 2008 @ DEXA 2
3. Semantic IR
New generation search tools exploiting conceptual information
Many techniques
Logic and reasoning
Annotation
Natural Language Processing
Latent Semantic Indexing
Research is still open, but some convergence is emerging
Several researchers independently chose to work on Conceptual Vector Space Models (C-VSM)
4. C-VSM vs VSM
Differences
                     C-VSM                                  VSM
Doc features         Concepts                               Words
Vector components    Related to the strength of             Related to word frequency
                     association to a concept
5. Index pruning
Commonalities
Very similar models and data structures
Need for large indexes
Reducing the index size (ideally) improves search efficiency
This operation is called Index Pruning
Index Pruning can be
On-line
Applicable in parallel to indexing
Works on single documents
Off-line
During idle time
Rebuilds the whole index
6. Objectives
Long-term goal
To analyze storage and pruning techniques for C-VSM indexes
Current objective
On-line pruning
Index pruning based on document-local information
Design of a Self-Similarity metric for index pruning
Implementation of a simple index pruning algorithm based on the Self-Similarity metric
8. C-VSM: a formal definition
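C-VSM
$\text{C-VSM} = (C, D, A), \quad A \subseteq D \times C \times \mathbb{R}^{+}$
C: set of concepts of a domain ontology
D: set of documents
A: set of annotations
Annotations
Each annotation $a \in A$, $a = (d, c, w)$, associates a document $d$ to a concept $c$ with a weight $w$
[Figure: document d linked to concept c by an edge of weight w]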
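As a minimal sketch of the definition above, the triple (C, D, A) can be held in plain Python containers. The names `Annotation` and `CVSM` and the sample concepts are hypothetical and only illustrate the structure; they are not taken from H-DOSE. Weights are assumed to be strictly positive.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Annotation(NamedTuple):
    """One element a = (d, c, w) of the annotation set A."""
    doc: str
    concept: str
    weight: float


@dataclass
class CVSM:
    """Hypothetical container for the triple (C, D, A)."""
    concepts: list[str]                                        # C, in a fixed order
    documents: set[str] = field(default_factory=set)           # D
    annotations: set[Annotation] = field(default_factory=set)  # A

    def annotate(self, doc: str, concept: str, weight: float) -> None:
        # A is a subset of D x C x R+, so reject anything outside it.
        if concept not in self.concepts:
            raise ValueError(f"unknown concept: {concept}")
        if weight <= 0:
            raise ValueError("weights are assumed to be strictly positive")
        self.documents.add(doc)
        self.annotations.add(Annotation(doc, concept, weight))


model = CVSM(concepts=["steel", "furnace", "alloy"])
model.annotate("d1", "steel", 0.8)
model.annotate("d1", "alloy", 0.3)
```

Storing annotations as explicit triples keeps the index sparse: a concept a document is not annotated with simply has no entry.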
9. Documents in C-VSM
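In C-VSM a document is represented by a vector whose components are the weights $w_i$ of its annotations toward the domain concepts
$V(d) = (w_1, w_2, w_3, \ldots, w_{|C|})$
where
$w_i = \{\, w : (d, c_i, w) \in A \,\}$
[Figure: document d_i drawn as a vector in the concept space with axes c1, c2, c3 and components w1, w2, w3]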
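The construction of V(d) can be sketched as follows. The function name `document_vector` and the sample data are illustrative assumptions; the sketch also assumes that a concept with no annotation for the document contributes a weight of 0.

```python
def document_vector(doc, concepts, annotations):
    """Return V(d): one weight per concept, in the order given by `concepts`.

    `annotations` is an iterable of (document, concept, weight) triples.
    Concepts without an annotation for `doc` get 0.0 (an assumption).
    """
    weights = {c: w for d, c, w in annotations if d == doc}
    return [weights.get(c, 0.0) for c in concepts]


concepts = ["c1", "c2", "c3"]
annotations = [("d1", "c1", 0.5), ("d1", "c3", 0.2), ("d2", "c2", 0.9)]
print(document_vector("d1", concepts, annotations))  # [0.5, 0.0, 0.2]
```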
10. Self-similarity metric
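Defined as the cosine similarity between the original document vector d and its pruned version d'
$S(V(d), V(d')) = \cos(V(d), V(d')) = \dfrac{V(d) \cdot V(d')}{|V(d)|\,|V(d')|}$
[Figure: vectors d and d' in the concept space (axes c1, c2, c3), separated by the angle α]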
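A minimal sketch of the metric on plain Python lists. The function name `self_similarity` is an assumption, as is the convention of returning 0 when either vector is all zeros.

```python
import math


def self_similarity(v, v_pruned):
    """Cosine similarity S(V(d), V(d')) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, v_pruned))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in v_pruned))
    # Convention assumed here: a zero vector has similarity 0 with anything.
    return dot / norm if norm else 0.0


v = [3.0, 4.0, 0.0]
print(self_similarity(v, v))                # 1.0
print(self_similarity(v, [0.0, 4.0, 0.0]))  # 0.8
```

When d' is obtained only by zeroing components of d, the dot product equals |V(d')|², so the metric reduces to |V(d')| / |V(d)|; in the example, 4 / 5 = 0.8.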
12. Self-similarity pruning
General definition
Given a document d represented by its vector V(d), find a new representation V(d') such that
|V(d')| < |V(d)|
for any query q, the difference |S(V(d),V(q)) - S(V(d'),V(q))| is minimal
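The quantity to be minimized can be measured for a given set of queries, as in the sketch below. The helper names `cosine` and `score_deviation` and the sample vectors are hypothetical; the definition ranges over any query, whereas this example only checks two.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_deviation(v_doc, v_pruned, v_query):
    """|S(V(d), V(q)) - S(V(d'), V(q))| for a single query vector."""
    return abs(cosine(v_doc, v_query) - cosine(v_pruned, v_query))


v_doc = [0.9, 0.1, 0.4]
v_pruned = [0.9, 0.0, 0.4]  # the lowest weight has been removed
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
worst = max(score_deviation(v_doc, v_pruned, q) for q in queries)
```

In this example the largest deviation comes from the query that touches only the pruned concept, since the pruned document no longer matches it at all.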
13. Greedy algorithm
Self-similarity prune (V(d), τ)        τ = self-similarity threshold
V(d') = V(d)
while (S(V(d), V(d')) >= τ)
{
    i = argmin(V(d')i : V(d')i > 0)   // find the lowest non-zero weight
    V(d')i = 0                        // delete annotation
}
return V(d')
[Figure: vectors d and d' in the concept space (axes c1, c2, c3) with components w1, w2, w3]
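The greedy procedure can be sketched in Python as follows. The names `cosine` and `self_similarity_prune` are assumptions. Unlike a literal reading of the loop on the slide, this variant checks the threshold before committing each deletion, so the returned vector always satisfies S >= τ, and it never removes the last remaining annotation; both choices are assumptions, not the authors' stated behaviour.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_similarity_prune(v_doc, tau):
    """Greedily zero the smallest non-zero weights while S(V(d), V(d')) >= tau."""
    pruned = list(v_doc)
    while True:
        nonzero = [i for i, w in enumerate(pruned) if w > 0]
        if len(nonzero) <= 1:
            break  # never empty the vector completely (assumption)
        i = min(nonzero, key=lambda j: pruned[j])  # lowest non-zero weight
        candidate = pruned[:i] + [0.0] + pruned[i + 1:]
        if cosine(v_doc, candidate) < tau:
            break  # the next deletion would cross the threshold
        pruned = candidate
    return pruned


print(self_similarity_prune([0.9, 0.1, 0.4, 0.05], tau=0.95))  # [0.9, 0.0, 0.4, 0.0]
```

The procedure uses only the document's own vector, which is what makes it applicable on-line, one document at a time.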
15. Metrics (1/2)
Ranking similarity
Measures the similarity of the search results obtained using
The ranking $r_o$ derived from the original index
The ranking $r_p$ derived from the pruned index
The simplest and most widely used metric is the Symmetric Difference Score (at the top k results)
$r_o \,\Delta\, r_p = (r_o - r_p) \cup (r_p - r_o)$
$R(r_o, r_p) = 1 - \dfrac{|r_o \,\Delta\, r_p|}{2k}$
R=1 perfect match, R=0 no match
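A minimal sketch of the Symmetric Difference Score using Python sets. The function name `ranking_similarity` and the sample rankings are illustrative assumptions; rankings are taken to be lists of document identifiers ordered by decreasing score.

```python
def ranking_similarity(r_original, r_pruned, k):
    """R = 1 - |r_o symmetric-difference r_p| / (2k), on the top-k results."""
    top_o = set(r_original[:k])
    top_p = set(r_pruned[:k])
    # `^` on sets is the symmetric difference (r_o - r_p) | (r_p - r_o).
    return 1 - len(top_o ^ top_p) / (2 * k)


r_o = ["d1", "d2", "d3", "d4"]
r_p = ["d1", "d3", "d5", "d2"]
print(ranking_similarity(r_o, r_p, k=4))  # 0.75
```

The score only compares which documents appear in the two top-k sets; the order inside the top k does not affect it.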
16. Metrics (2/2)
Compression ratio
Measures the amount of pruning achieved by a given compression algorithm
$CR = \dfrac{|\mathit{prunedEntries}|}{|\mathit{originalEntries}|}$
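The ratio can be computed over a collection of document vectors as sketched below. The function name `compression_ratio` is hypothetical, and the sketch assumes that "pruned entries" means the annotations removed by pruning (so a higher CR means more pruning); the slide does not state this explicitly.

```python
def compression_ratio(original_vectors, pruned_vectors):
    """CR = |prunedEntries| / |originalEntries| over a set of document vectors.

    Assumption: "pruned entries" are the non-zero weights removed by pruning.
    """
    def count(vectors):
        return sum(1 for v in vectors for w in v if w > 0)

    original = count(original_vectors)
    removed = original - count(pruned_vectors)
    return removed / original if original else 0.0


original = [[0.9, 0.1, 0.4, 0.05], [0.2, 0.0, 0.7, 0.0]]
pruned = [[0.9, 0.0, 0.4, 0.0], [0.0, 0.0, 0.7, 0.0]]
print(compression_ratio(original, pruned))  # 0.5
```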
17. Experimental setting (1/2)
Semantic IR system
H-DOSE, http://dose.sourceforge.net
Uses a C-VSM
Shallow indexing based on a bag of words technique
Document test sets
Sider
Subset of the e-Class ontology on siderurgy (677
concepts)
250 documents gathered from the web and manually
classified
12 queries
Available on request (mail to dario.bonino@polito.it)
18. Experimental setting (2/2)
Document test sets (continued...)
Passepartout
Ontology on disabilities developed in collaboration with the municipality of Turin (181 concepts, 20 different relations)
Documents: all news and documents published on the Passepartout web site from 2004 to 2006 (around 2400 pages)
12 queries
Available on request (mail to dario.bonino@polito.it),
ontology in Italian
19. CR vs Self-similarity
τ = self-similarity threshold
Limited to τ > 60% (for lower values R becomes too low)
20. Ranking Similarity - Sider
Ranking similarity vs Compression Ratio
21. Ranking Similarity - Passepartout
Ranking similarity vs Compression Ratio
22. Query time vs pruning
Passepartout
Sider
24. Open Issues
Test sets
Small
Relatively custom
Few or no standard test sets are available for Semantic IR systems
We are working on
CNN news + KIM ontology
Acquis corpus + Eurovoc
Semantic IR system
Quite simple indexing technique
Sensitive to composition of the bag of words
26. Conclusions
Index pruning is expected to become a critical issue for Semantic IR systems (as it already is for traditional IR)
Self-similarity pruning can be applied on-line with relatively good performance
On-line pruning does not prevent subsequent off-line pruning, possibly leading to better results
Experimentation on bigger and less controlled datasets is needed (however, there is a significant lack of test data)
Porting of traditional algorithms to Semantic IR systems should be investigated
27. Thank you!
Questions?
Dario Bonino - dario.bonino@polito.it