Self-Similarity Metric for Index
Pruning in Conceptual Vector
        Space Models
          Dario Bonino, Fulvio Corno
  Dipartimento di Automatica ed Informatica
             Politecnico di Torino

           dario.bonino@polito.it


                                    http://elite.polito.it
Agenda

    Introduction
    Problem statement
    Self-Similarity based pruning
    Experimental results
    Conclusion




2008-09-02   Self-Similarity metric for Index Pruning - WebS 2008 @ DEXA   2
Semantic IR

    New generation search tools exploiting
    conceptual information
    Many techniques
         Logic and reasoning
         Annotation
         Natural Language Processing
         Latent Semantic Indexing
    Research still open but some convergences are
    emerging
         Several researchers independently chose to work
         on Conceptual Vector Space Models

C-VSM vs VSM

    Differences
                                  C-VSM                          VSM
         Doc features             Concepts                       Words
         Vector components        Related to the strength of     Related to word
                                  association to a concept       frequency




Index pruning

    Commonalities
         Very similar models and data structures
         Need for large indexes
         Reducing the index size (ideally) improves the
         search efficiency
    This operation is called Index Pruning
    Index Pruning can be
         On-line
             Applicable in parallel to indexing
             Works on single documents
         Off-line
             During idle time
             Rebuilds the whole index
Objectives

    Long-term goal
         To analyze storage and pruning techniques for
         C-VSM indexes

    Current objective
         On-line pruning
             Index pruning based on document-local information
             Design of a Self-Similarity metric for index pruning
             Implementation of a simple index pruning algorithm
             based on the Self-Similarity metric




C-VSM: a formal definition

    C-VSM
         $\text{C-VSM} = \langle C, D, A \rangle$
             C: set of concepts of a domain ontology
             D: set of documents
             A: set of annotations
    Annotations
         $A \subseteq D \times C \times \mathbb{R}^{+}$
         Each annotation $a = \langle d, c, w \rangle \in A$
             Associates a document d to a concept c with a weight w

Documents in C-VSM

    In C-VSM a document is represented by a vector,
    whose components are the weights w_i of
    annotations toward domain concepts
         $V(d) = (w_1, w_2, w_3, \ldots, w_{|C|})$
    Where
         $w_i = \{\, w : \langle d, c_i, w \rangle \in A \,\}$
    [Figure: document d_i in the concept space (c1, c2, c3), with components w1, w2, w3]
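As an illustration of the definition above, the following minimal Python sketch builds V(d) from a list of annotation triples. The names `Annotation` and `document_vector` and the sample data are hypothetical, and the sketch assumes at most one annotation per (document, concept) pair, with a weight of 0 for concepts the document is not annotated with.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of an annotation <d, c, w>.
Annotation = Tuple[str, str, float]


def document_vector(doc: str, concepts: List[str],
                    annotations: List[Annotation]) -> List[float]:
    """Return V(d): one weight per concept, 0.0 where d has no annotation."""
    # Assumption: at most one annotation per (document, concept) pair.
    weights: Dict[str, float] = {c: w for d, c, w in annotations if d == doc}
    return [weights.get(c, 0.0) for c in concepts]


# Illustrative data only, not taken from the experiments.
concepts = ["c1", "c2", "c3"]
annotations = [("d1", "c1", 0.5), ("d1", "c3", 2.0), ("d2", "c2", 1.0)]
v_d1 = document_vector("d1", concepts, annotations)
```

Storing only the non-zero weights (a sparse representation) would be the natural choice for a real index; the dense list here just mirrors the vector notation of the slide.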

Self-similarity metric

    Defined as the cosine similarity between the
    original document vector d and its pruned version
    d'
         $S(V(d), V(d')) = \cos(V(d), V(d')) = \frac{V(d) \cdot V(d')}{|V(d)|\,|V(d')|}$
    [Figure: d and d' in the concept space (c1, c2, c3), separated by the angle α]
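The metric can be computed directly from the two weight vectors. The sketch below is a plain-Python cosine similarity; the function name `self_similarity` is chosen here for illustration, and returning 0.0 when either vector has zero norm is an assumption of this sketch, since the slide does not cover that case.

```python
import math
from typing import Sequence


def self_similarity(v: Sequence[float], v_pruned: Sequence[float]) -> float:
    """Cosine similarity between a document vector and its pruned version."""
    dot = sum(a * b for a, b in zip(v, v_pruned))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in v_pruned))
    # Assumption: an all-zero vector is treated as having similarity 0.
    return dot / norm if norm > 0 else 0.0


# Illustrative vectors: the pruned version drops the first component.
original = [3.0, 4.0, 0.0]
pruned = [0.0, 4.0, 0.0]
s = self_similarity(original, pruned)  # 16 / (5 * 4) = 0.8
```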

Self-similarity pruning

     General definition

         Given a document d represented by its vector
         V(d), find a new representation V(d') such that,
             |V(d')|<|V(d)|
             for any query q, the difference
             |S(V(d),V(q))-S(V(d'),V(q))| is minimal
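A short derivation, not stated on the slides, may help connect this definition to the self-similarity metric. It assumes that pruning only sets components of V(d) to zero; K is a symbol introduced here for the set of indices that are kept.

```latex
\begin{aligned}
V(d) \cdot V(d') &= \sum_{i \in K} w_i^2 = |V(d')|^2 \\
S(V(d), V(d')) &= \frac{V(d) \cdot V(d')}{|V(d)|\,|V(d')|}
                = \frac{|V(d')|}{|V(d)|}
                = \sqrt{\frac{\sum_{i \in K} w_i^2}{\sum_{i=1}^{|C|} w_i^2}}
\end{aligned}
```

Under that assumption, for a fixed number of removed entries the self-similarity is largest when the removed weights are the smallest ones, which is consistent with a strategy that removes the lowest weights first.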



Greedy algorithm

 Self-similarity prune (V(d), τ)          τ = self-similarity threshold

 V(d') = V(d)
 while (S(V(d),V(d')) >= τ)
 {
      i = argmin(V(d')i : V(d')i > 0)  //find the lowest non-zero weight
      V(d')i = 0  //delete annotation
 }
 return V(d')
 [Figure: d and its pruned version d' in the concept space (c1, c2, c3)]
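The greedy loop can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: the function names are hypothetical, the sketch checks the similarity before committing each deletion so that the returned vector never falls below τ, and it always keeps at least one non-zero weight. Both stopping rules are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def self_similarity_prune(v: List[float], tau: float) -> List[float]:
    """Greedily zero the lowest non-zero weights while S(V(d), V(d')) >= tau."""
    pruned = list(v)
    while True:
        nonzero = [i for i, w in enumerate(pruned) if w != 0.0]
        if len(nonzero) <= 1:          # assumption: keep at least one weight
            return pruned
        i = min(nonzero, key=lambda j: abs(pruned[j]))  # lowest remaining weight
        candidate = list(pruned)
        candidate[i] = 0.0             # tentatively delete the annotation
        if cosine(v, candidate) < tau:  # deletion would drop below the threshold
            return pruned
        pruned = candidate


# Illustrative vector only, not taken from the experiments.
v = [3.0, 0.5, 4.0, 1.0]
result = self_similarity_prune(v, 0.9)
```

With these illustrative weights, removing 0.5 and then 1.0 keeps the similarity above 0.9, while also removing 3.0 would bring it down to about 0.78, so two entries are pruned.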

Metrics (1/2)

    Ranking similarity
         Measures the similarity of the search results obtained
         using
             The ranking r_o derived from the original index
             The ranking r_p derived from the pruned index
         The simplest and most widely used metric is the
             Symmetric Difference Score (@ top k results)
                $r_o \,\triangle\, r_p = (r_o - r_p) \cup (r_p - r_o)$
                $R(r_o, r_p) = 1 - \frac{|r_o \,\triangle\, r_p|}{2k}$
                R=1 perfect match, R=0 no match
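A minimal Python sketch of the Symmetric Difference Score follows. The function name and the sample rankings are illustrative, and the sketch assumes that both rankings contain at least k results; because the score is set-based, the order of documents inside the top k does not affect it.

```python
from typing import Hashable, Sequence


def ranking_similarity(r_o: Sequence[Hashable], r_p: Sequence[Hashable], k: int) -> float:
    """R(r_o, r_p) = 1 - |r_o symmetric-difference r_p| / (2k), on the top k results."""
    top_o = set(r_o[:k])
    top_p = set(r_p[:k])
    return 1.0 - len(top_o ^ top_p) / (2 * k)


# Illustrative rankings: the pruned index swaps d3 for d5 in the top 3.
r_original = ["d1", "d2", "d3", "d4"]
r_pruned = ["d2", "d1", "d5", "d3"]
score = ranking_similarity(r_original, r_pruned, k=3)  # 1 - 2/6
```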



Metrics (2/2)

    Compression ratio

         Measures the amount of pruning achieved by a
         given compression algorithm
         $CR = \frac{|\mathit{prunedEntries}|}{|\mathit{originalEntries}|}$




Experimental setting (1/2)

    Semantic IR system
         H-DOSE, http://dose.sourceforge.net
             Uses a C-VSM
             Shallow indexing based on a bag of words technique
    Document test sets
         Sider
             Subset of the e-Class ontology on siderurgy (677
             concepts)
             250 documents gathered from the web and manually
             classified
             12 queries
             Available on request (mail to dario.bonino@polito.it)


Experimental setting (2/2)

    Document test sets (continued...)
         Passepartout
             Ontology on disabilities developed in collaboration with
             the municipality of Turin (181 concepts, 20 different
             relations)
             Documents: all news and docs published on the
             Passepartout web site from 2004 to 2006 (around 2400
             pages)
             12 queries
             Available on request (mail to dario.bonino@polito.it),
             ontology in Italian




CR vs Self-similarity

    τ = self-similarity threshold
    Limited to τ > 60% (for lower values R becomes
    too low)




Ranking Similarity - Sider

    Ranking similarity vs Compression Ratio




Ranking Similarity - Passepartout

    Ranking similarity vs Compression Ratio




Query time vs pruning

    [Plots: Sider and Passepartout]

Discussion (1/2)

    Sider
         Quite controlled
         Small
         Smoother behavior
         Quite satisfying performance
             80% similarity @ 30% pruning
    Passepartout
         Medium-sized
         Captured “in the wild”
         Complex behavior
         Fair performance
             65% similarity @ 20% pruning

Open Issues

    Test sets
         Small
         Relatively custom
         Few or no standard sets available for Semantic
         IR systems
             We are working on
                CNN news + KIM ontology
                Acquis corpus + Eurovoc

    Semantic IR system
         Quite simple indexing technique
             Sensitive to composition of the bag of words


Conclusions

    Index pruning is expected to become a critical
    issue for Semantic IR systems (as it already
    is for traditional IR)
    Self-similarity pruning can be applied on-line,
    reaching relatively good performance
    On-line pruning does not prevent off-line pruning,
    which may lead to better results
    Experimentation on bigger and less controlled
    datasets is needed (however, there is a significant
    lack of test data)
    Porting of traditional algorithms to Semantic
    IR systems should be investigated
Thank you!

                                Questions?

             Dario Bonino - dario.bonino@polito.it



