A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces




Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
Presenter: Seokhwan Eom




Contents

•   Introduction
•   Observations
•   Analysis of NN-search
•   VA-file
•   Conclusion








The Similarity Search Paradigm




( Reference: S. Blott, "What's wrong with high-dimensional similarity search?", VLDB 2008 )




The Similarity Search Paradigm




  Locate the point closest to the query object, i.e. its nearest neighbor (NN)



( Reference: S. Blott, "What's wrong with high-dimensional similarity search?", VLDB 2008 )




The conventional approach

• Space-partitioning methods
   - Gridfile   [Nievergelt:1984]
   - K-D-B tree [Robinson:1981]
   - Quad tree [Finkel:1974]

• Data-partitioning index trees
   - R-tree    [Guttman:1984]     - R+-tree  [Sellis:1987]
   - R*-tree   [Beckmann:1990]    - X-tree   [Berchtold:1996]
   - SR-tree   [Katayama:1997]    - M-tree   [Ciaccia:1996]
   - TV-tree   [Lin:1994]         - hB-tree  [Lomet:1990]

Unfortunately, as the number of dimensions increases, their performance degrades:
the "dimensional curse".





Contribution

• Assumptions : initially uniformly-distributed data within unit
  hypercube with independent dimensions

1.   Establish lower bounds on the average performance of NN-
     search for space- and data-partitioning, and clustering
     structures.

2.   Show formally that any partitioning scheme and clustering
     technique must degenerate to a sequential scan through all
     their blocks if the number of dimensions is sufficiently large.

3.   Present performance results which support their analysis, and
     demonstrate that the VA-file offers the best
     performance in practice whenever the number of dimensions is
     larger than 6.





The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme :
 split the data space in each dimension into two halves.




This seems reasonable in low dimensions.
But with $d = 100$ there are $2^{100} \approx 10^{30}$ partitions;
 even with $10^6$ points, almost all of the partitions are empty (about $10^{24}$ partitions per point).
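
The arithmetic behind Observation 1 can be checked directly. The sketch below assumes one binary split per dimension and uses the slide's example values (d = 100, 10^6 points); the function name `partition_stats` is illustrative and not taken from the paper.

```python
def partition_stats(d, n_points):
    """Return (number of partitions, minimum fraction of empty partitions)
    when each of the d dimensions is split into two halves."""
    partitions = 2 ** d
    # At most n_points partitions can hold a point, so at least this
    # fraction of all partitions must be empty.
    occupied_at_most = min(n_points, partitions)
    empty_fraction = 1.0 - occupied_at_most / partitions
    return partitions, empty_fraction


if __name__ == "__main__":
    for d in (2, 10, 20, 100):
        parts, empty = partition_stats(d, 10**6)
        print(f"d={d:3d}  partitions={parts:.3e}  empty>={empty:.6f}")
```

In low dimensions every partition can be occupied; by d = 100 the bound on the empty fraction is indistinguishable from 1.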





The Difficulties of High Dimensionality
• Observation 2 (Data space is sparsely populated)
 Consider a hyper-cube range query with side length s = 0.95
     Data space $\Omega = [0,1]^d$

 [Figure: target region, a hyper-cube of side s inside the data space Ω]

 At d = 100,
         $P^d[s] = s^d = 0.95^{100} \approx 0.0059$
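
As a quick check of Observation 2, the following sketch evaluates s^d and its inverse (the side length needed to capture a given fraction of uniformly distributed data). Both function names are hypothetical helpers introduced for illustration.

```python
def hypercube_hit_probability(s, d):
    """Probability that a uniform point in [0,1]^d falls into a
    hyper-cube range query with side length s (0 <= s <= 1)."""
    return s ** d


def side_for_probability(p, d):
    """Side length a hyper-cube query needs in order to contain a
    fraction p of uniformly distributed data in d dimensions."""
    return p ** (1.0 / d)


if __name__ == "__main__":
    for d in (2, 10, 50, 100):
        print(d, hypercube_hit_probability(0.95, d), side_for_probability(0.01, d))
```

Read the other way round, the second function shows that a query must span almost the full range of every dimension to select even a small fraction of the data when d is large.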





  The Difficulties of High Dimensionality
  • Observation 3 (Spherical range queries)
     The probability that an arbitrary point R lies within the largest
      spherical query.




[Figure: largest range query entirely within the data space]
[Table: probability that a point lies within the largest range query inside Ω, and the expected database size]




The Difficulties of High Dimensionality
• Observation 4 (Exponentially growing DB size)
 The size a data set would need so that, on average, at least one point
  falls into the sphere $sp^d(Q, 0.5)$ (for even $d$):

   $N(d) = \frac{(d/2)! \cdot 2^d}{\pi^{d/2}}$

 [Table: probability that a point lies within the largest range query inside Ω, and the expected database size]
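
The table itself did not survive extraction, but its two columns can be recomputed from the standard volume formula of a d-dimensional sphere, Vol = π^(d/2) · r^d / Γ(d/2 + 1), with r = 0.5. This is a sketch under the slides' uniformity assumption; the function names are illustrative.

```python
import math


def sphere_volume(d, r=0.5):
    """Volume of a d-dimensional hyper-sphere of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)


def required_db_size(d):
    """Number of uniform points needed so that, on average, one point
    falls into the largest sphere (r = 0.5) inside the unit hypercube."""
    return 1.0 / sphere_volume(d, 0.5)


if __name__ == "__main__":
    for d in (2, 4, 10, 20, 40, 100):
        print(f"d={d:3d}  P={sphere_volume(d):.3e}  N={required_db_size(d):.3e}")
```

Because the unit hypercube has volume 1, the sphere volume is also the probability that a uniform point lies inside it, and its reciprocal is the expected database size.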




The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)
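The probability that the NN-distance is at most r (i.e. the probability that the NN of query
   point Q is contained in $sp^d(Q, r)$):

   $P[nn^{dist} \leq r] = 1 - (1 - Vol(sp^d(Q, r) \cap \Omega))^N$

The expected NN-distance for a query point Q:

   $E[nn^{dist}(Q)] = \int_0^\infty r \cdot \frac{\partial P[nn^{dist} \leq r]}{\partial r} \, dr$

The expected NN-distance $E[nn^{dist}]$ for any query point in the data space:

   $E[nn^{dist}] = \int_{Q \in \Omega} E[nn^{dist}(Q)] \, dQ$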







The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)




1.   The NN-distance grows steadily with d
2.   Beyond trivially-small data sets D, NN-distances decrease only
     marginally as the size of D increases.
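
Both statements can be reproduced empirically. The sketch below is a small Monte Carlo simulation, not the closed-form analysis from the slides: it assumes uniform, independent data in the unit hypercube and Euclidean distance, and the sample sizes are arbitrary choices kept small so that it runs quickly.

```python
import math
import random


def nn_distance(query, points):
    """Euclidean distance from query to its nearest neighbor in points."""
    return min(math.dist(query, p) for p in points)


def estimate_expected_nn_distance(d, n_points, n_queries=20, seed=0):
    """Monte Carlo estimate of E[nn_dist] for uniform data in [0,1]^d."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        total += nn_distance(query, points)
    return total / n_queries


if __name__ == "__main__":
    for d in (2, 10, 50, 100):
        small = estimate_expected_nn_distance(d, 500)
        large = estimate_expected_nn_distance(d, 4000)
        print(f"d={d:3d}  N=500: {small:.3f}  N=4000: {large:.3f}")
```

Comparing the two columns for a fixed d gives a feel for how little an eightfold larger data set shortens the NN-distance once d is large.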

Presenter: Jungyeol Lee




Analysis of NN-Search

• The complexity of any partitioning and clustering
  scheme converges to $O(N)$ with increasing
  dimensionality

•   General Cost Model
•   Space-Partitioning Methods
•   Data-Partitioning Methods
•   General Partitioning and Clustering Schemes








General Cost Model

• ‘Cost’ of a query:
  – the number of blocks that must be accessed
• Optimal NN-search algorithm:
  – blocks visited during the search
      = blocks whose MBR (minimum bounding region) intersects the NN-sphere




General Cost Model

• Let $M_{visit}$ be the number of blocks visited.
• $M_{visit}$ = the number of blocks
             that intersect $sp^d(Q, E[nn^{dist}])$
• Transform the spherical query into a point query
• Minkowski sum: $MSum(mbr_i, E[nn^{dist}])$

[Figure: $mbr_i$ enlarged by $E[nn^{dist}]$ on every side, giving $MSum(mbr_i, E[nn^{dist}])$]






General Cost Model

• Transform the spherical query into a point
  query




• Probability that the i-th block must be visited:
        $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega)$
• $M_{visit} = \frac{N}{m} \cdot P^{avg}_{visit}$, where $P^{avg}_{visit} = \frac{m}{N} \sum_{i=1}^{N/m} P_{visit}[i]$
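
The cost model can be made concrete with a Monte Carlo estimate of the Minkowski-sum volume: a uniformly drawn query point hits the enlarged MBR exactly when its distance to the original box is at most the NN-distance. The sketch below assumes axis-aligned boxes in the unit hypercube; the function names, the example box and the radius 0.3 are illustrative values, not figures from the paper.

```python
import math
import random


def dist_to_box(point, low, high):
    """Euclidean distance from point to the axis-aligned box [low, high]."""
    sq = 0.0
    for x, lo, hi in zip(point, low, high):
        gap = max(lo - x, 0.0, x - hi)  # 0 when x lies inside [lo, hi]
        sq += gap * gap
    return math.sqrt(sq)


def estimate_p_visit(low, high, radius, samples=20000, seed=0):
    """Estimate Vol(MSum(box, radius) ∩ Ω) for Ω = [0,1]^d as the fraction
    of uniform points of Ω that lie within `radius` of the box."""
    rng = random.Random(seed)
    d = len(low)
    hits = 0
    for _ in range(samples):
        q = [rng.random() for _ in range(d)]
        if dist_to_box(q, low, high) <= radius:
            hits += 1
    return hits / samples


def expected_blocks_visited(n_points, block_size, p_visit_avg):
    """M_visit = (N / m) * P_visit^avg."""
    return n_points / block_size * p_visit_avg


if __name__ == "__main__":
    d, split = 10, 3  # hypothetical block: 3 of 10 dimensions halved
    low = [0.0] * d
    high = [0.5] * split + [1.0] * (d - split)
    p = estimate_p_visit(low, high, radius=0.3)
    print(p, expected_blocks_visited(10**6, 100, p))
```

Sampling is used here only because the exact volume of a rounded box clipped to Ω is awkward to write down; any other volume estimate could be substituted.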




Space-Partitioning Methods

• Dividing Ω regardless of clusters
• If each dimension is split once,
  the total # of partitions: $2^d$, the space overhead: $O(2^d)$
• To reduce the space overhead, only $d' \leq d$ dimensions
  are split such that, on average, m points are assigned
  to a partition:
      $2^{d'} \approx \frac{N}{m}$,  $d' = \lceil \log_2 \frac{N}{m} \rceil$








Space-Partitioning Methods

• Let $l_{max}$ denote the maximum distance from $mbr_i$ to
  any point in the data space:
      $d' = \lceil \log_2 \frac{N}{m} \rceil$
      $l_{max} = \frac{1}{2} \sqrt{d'} = \frac{1}{2} \sqrt{\lceil \log_2 \frac{N}{m} \rceil}$
• $l_{max} < E[nn^{dist}]$ at some dimensionality
• From that dimensionality on, the Minkowski sum covers the
  entire data space
• $P_{visit}$ converges to 1, the same as a sequential scan
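
The two quantities on this slide are easy to evaluate. The sketch below assumes that d' is rounded up to an integer and that each split halves a dimension, so a point is at most 1/2 away from a block in each split dimension; the function names and the example values of N and m are illustrative.

```python
import math


def split_dimensions(n_points, block_size):
    """d' = ceil(log2(N / m)): number of dimensions that must be split
    once so that each partition holds about m points."""
    return math.ceil(math.log2(n_points / block_size))


def l_max(n_points, block_size):
    """l_max = 0.5 * sqrt(d'): largest distance from any point of the
    data space to a partition with d' halved sides."""
    return 0.5 * math.sqrt(split_dimensions(n_points, block_size))


if __name__ == "__main__":
    for n in (10**5, 10**6, 10**7):
        print(n, split_dimensions(n, 100), round(l_max(n, 100), 3))
```

Note that l_max depends only on N and m, not on d, whereas the expected NN-distance grows with d. That is why the NN-sphere eventually reaches every block.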




Space-Partitioning Methods

• $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega) = 1$
• Fig. 7: Comparison of $l_{max}$ with $E[nn^{dist}]$




Presenter: Rina You




Data-Partitioning Methods

• Data-partitioning methods partition the data
  space hierarchically
   – In order to reduce the search cost from $O(N)$ to $O(\log N)$


• Existing methods are impractical for NN-search
  in high-dimensional vector spaces (HDVSs).
   – A sequential scan outperformed these more sophisticated
     hierarchical methods.








Rectangular MBRs

• Index methods use hyper-cubes to bound the
  region of a block.

• Splitting a node results in two new, equally-full
  partitions of the data space.
• d’ dimensions are split at high dimensionality:
                    $d' = \lceil \log_2 \frac{N}{m} \rceil$






Rectangular MBRs

• Rectangular MBR
  – d’ sides with a length of 1/2
  – d - d’ sides with a length of 1.


• The probability of visiting a block during
  NN-search
  = the volume of the part of the extended box that lies
  inside the data space








Rectangular MBRs

• The probability of accessing a block during an
  NN-search
  – for different database sizes and different values of d’








Spherical MBRs

• Another group of index structures
   – MBRs in the form of hyper-spheres.

• Each block of the optimal structure consists of
   – the center point C
   – its m - 1 nearest neighbors


• The MBR can be described by $nn^{sp,m-1}(C)$





Spherical MBRs

• The probability of accessing a block during
  the search.

• MBRs in the form of hyper-spheres: $nn^{sp,m-1}(C)$
• Use a Minkowski sum:
  $sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}])$
• The probability that block i must be visited
  during an NN-search:
  $P^{sp}_{visit}[i] = Vol(sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}]) \cap \Omega)$





Spherical MBRs

• Another lower bound for this probability
   – replace $nn^{dist,m-1}$ by $nn^{dist,1} = E[nn^{dist}]$

   $P^{sp}_{visit}[i] \geq Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$

• If i increases, $nn^{dist,i}$ does not decrease.
   – $\forall j > i: nn^{dist,j} \geq nn^{dist,i}$







Spherical MBRs

• The probability of accessing a block
  during the search
  – average the above probability over all center
    points $C \in \Omega$:

    $P^{sp,avg}_{visit} \geq \int_{C \in \Omega} Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega) \, dC$
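
The averaged bound has a convenient probabilistic reading: integrating the clipped sphere volume over all centers C equals the probability that two independent uniform points of Ω lie within distance 2·E[nn_dist] of each other. The sketch below estimates that probability by sampling. The function name and the radii in the usage loop are hypothetical; in practice the radius would be twice an estimate of the expected NN-distance.

```python
import math
import random


def estimate_sphere_p_visit(d, radius, samples=20000, seed=0):
    """Estimate the integral over C in Ω of Vol(sp^d(C, radius) ∩ Ω)
    as P[|X - C| <= radius] for X and C uniform in Ω = [0,1]^d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        c = [rng.random() for _ in range(d)]
        x = [rng.random() for _ in range(d)]
        if math.dist(c, x) <= radius:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    # Radii are placeholders for 2 * E[nn_dist] at each dimensionality.
    for d, radius in ((2, 0.05), (10, 1.0), (50, 3.0)):
        print(d, radius, estimate_sphere_p_visit(d, radius))
```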
                     C








Spherical MBRs

• percentage of blocks visited increases rapidly
  with the dimensionality




• sequential scan will perform better in practice


General Partitioning and Clustering
Schemes

• No partitioning or clustering scheme
  can offer efficient NN-search
  – if the number of dimensions becomes large.


• The complexity of these methods: $O(N)$
• A large portion (up to 100%) of data
  blocks must be read
  – In order to determine the nearest neighbor.



General Partitioning and Clustering
Schemes

• Basic assumptions:
  1. A cluster is a geometrical form (MBR) that
    covers all cluster points
  2. Each cluster contains at least two points
  3. The MBR of a cluster is convex.






General Partitioning and Clustering
Schemes

• Average probability of accessing a cluster
  during an NN-search:

   $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$

   $VM(x) = Vol(MSum(x, E[nn^{dist}]) \cap \Omega)$



General Partitioning and Clustering
Schemes
• Lower bound the average probability
  of accessing a line cluster.
• Pick two arbitrary data points $A_i$, $B_i$ of the cluster
  – each cluster contains at least two points
• $line(A_i, B_i)$ is contained in $mbr(C_i)$
   – $mbr(C_i)$ is convex.
• Lower bound the volume of the
  extended $mbr(C_i)$:
  $VM(mbr(C_i)) \geq VM(line(A_i, B_i))$


General Partitioning and Clustering
Schemes

• Lower bound the distance between $A_i$
  and $B_i$:
  $VM(line(A_i, B_i)) \geq VM(line(A_i, P_i)) = \min_{Q_i \in surf(nn^{sp}(A_i))} VM(line(A_i, Q_i))$
  with $P_i \in surf(nn^{sp}(A_i))$
   – Points on the surface of the NN-sphere of $A_i$ give the
     minimal Minkowski sum for $line(A_i, B_i)$
   – $line(A_i, P_i)$ is the optimal line cluster for
     point $A_i$
      • if $P_i$ is a point on the surface of the NN-sphere of $A_i$.


General Partitioning and Clustering
Schemes

• Lower bound the average probability
  of accessing a line cluster:

  $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \geq \int_{A \in \Omega} VM(line(A, P(A))) \, dA$

  – Calculate the average volume of Minkowski
    sums over all possible pairs A and P(A) in
    the data space





General Partitioning and Clustering
Schemes

• Conclusion 1 (Performance)
  – For any clustering and partitioning method,
     a simple sequential scan performs better
     if the number of dimensions exceeds some d.


• Conclusion 2 (Complexity)
  – The complexity of any clustering and
    partitioning method tends towards O(N)
    as dimensionality increases.


General Partitioning and Clustering
Schemes

• Conclusion 3 (Degeneration)
  – All blocks are accessed
   if the number of dimensions exceeds some d




Presenter: Kilho Lee




The VA-file

• Accelerates that unavoidable scan by using object
  approximations to compress the vector data.
• Reduces the amount of data that must be read during
  similarity searches.

• Compressing vector data
• The filtering step
• Accessing the data






The VA-file
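 Compressing vector data

  • For each dimension i, a small number of bits ($b_i$) is assigned
  • Let b be the sum of all $b_i$'s: $b = \sum_{i=1}^{d} b_i$
  • The data space is divided into $2^b$ cells
  • Probability that a point lies in a given cell (same $b_i$ in every dimension):
      $P[\text{"in cell"}] = Vol(cell) = \left(\frac{1}{2^{b_i}}\right)^d = 2^{-b}$
  • Probability that a point shares its cell with another point:
      $P[share] = 1 - (1 - 2^{-b})^{N-1} \approx \frac{N}{2^b}$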
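
A minimal sketch of the compression step follows. It assumes equally wide slices in every dimension, which matches the uniform-data assumption but is only one possible way to place cell boundaries. The function names are illustrative, and `p_share` expects at least one bit in total.

```python
import math


def approximate(vector, bits):
    """Map each coordinate in [0,1] to a cell number, using bits[i] bits
    for dimension i (equal-width slices, an assumption of this sketch)."""
    cells = []
    for x, b in zip(vector, bits):
        slices = 1 << b
        # Clamp so that x == 1.0 falls into the last slice.
        cells.append(min(int(x * slices), slices - 1))
    return cells


def p_share(n_points, total_bits):
    """P[share] = 1 - (1 - 2^-b)^(N-1): probability that a given point
    shares its cell with at least one of the other N - 1 points."""
    # log1p/expm1 keep precision when 2^-b is far below machine epsilon.
    return -math.expm1((n_points - 1) * math.log1p(-2.0 ** -total_bits))


if __name__ == "__main__":
    bits = [6] * 50  # slide example: d = 50, b_i = 6, so b = 300
    print(approximate([0.0, 0.5, 0.99], [2, 2, 2]))
    print(p_share(500_000, sum(bits)), 500_000 / 2 ** sum(bits))
```

With b = 300 bits the sharing probability is vanishingly small, so practically every vector gets its own cell even though each approximation is far smaller than the vector it stands for.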





The VA-file
 Filtering step




  • When searching for the nearest neighbor, the entire approximation file
    is scanned, and upper and lower bounds on the distance to the query
    are derived from each approximation.
  • Let δ be the smallest upper bound found so far.
  • If an approximation's lower bound exceeds δ, the vector is filtered out.
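
The filtering step can be sketched as follows, assuming Euclidean distance and equal-width cells in [0,1] per dimension. The names `cell_bounds` and `filter_candidates` are illustrative. The final re-check of earlier survivors against the last δ is a small addition of this sketch and is not stated on the slide.

```python
import math


def cell_bounds(query, cells, bits):
    """Lower and upper bound on the Euclidean distance between `query`
    and any vector whose approximation is `cells`."""
    lo_sq = hi_sq = 0.0
    for q, c, b in zip(query, cells, bits):
        width = 1.0 / (1 << b)
        left, right = c * width, (c + 1) * width
        near = max(left - q, 0.0, q - right)  # closest possible coordinate gap
        far = max(q - left, right - q)        # farthest possible coordinate gap
        lo_sq += near * near
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def filter_candidates(query, approximations, bits):
    """Scan all approximations once and return (lower_bound, index) pairs
    whose lower bound does not exceed delta, the smallest upper bound."""
    delta = math.inf
    candidates = []
    for idx, cells in enumerate(approximations):
        lower, upper = cell_bounds(query, cells, bits)
        if lower > delta:
            continue  # this vector cannot be the nearest neighbor
        candidates.append((lower, idx))
        delta = min(delta, upper)
    # delta only shrinks during the scan, so re-check earlier survivors.
    return [(lb, i) for lb, i in candidates if lb <= delta]


if __name__ == "__main__":
    approximations = [[0, 0], [3, 3], [1, 0]]
    print(filter_candidates([0.1, 0.1], approximations, bits=[2, 2]))
```

Only the compact approximations are read in this phase; the full vectors are not touched until the surviving candidates are examined.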


( Reference: S. Blott, "What's wrong with high-dimensional similarity search?", VLDB 2008 )


The VA-file
 Filtering step




  • After the filtering step, fewer than 0.1% of the vectors remain.






The VA-file
 Accessing the vector




  • After the filtering step, a small set of candidates remains.
  • Candidates are visited in increasing order of their lower bound.
  • If a lower bound is encountered that exceeds the nearest distance seen
    so far, the VA-file method stops.
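
The second phase can be sketched as below. It assumes the candidates arrive as (lower_bound, index) pairs and that the full vectors can be fetched by index; the in-memory list `vectors` stands in for the vector file, and the function name `refine` is illustrative.

```python
import math


def refine(query, vectors, candidates):
    """Visit candidates in increasing order of lower bound and return
    (nearest_index, nearest_distance, vectors_accessed)."""
    best_idx, best_dist, accessed = None, math.inf, 0
    for lower, idx in sorted(candidates):
        if lower > best_dist:
            break  # no remaining candidate can be closer
        accessed += 1  # one access to a full vector
        dist = math.dist(query, vectors[idx])
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, accessed


if __name__ == "__main__":
    vectors = [[0.2, 0.2], [0.9, 0.9], [0.3, 0.1]]
    candidates = [(0.15, 2), (0.0, 0)]
    print(refine([0.1, 0.1], vectors, candidates))
```

Sorting by lower bound is what allows the early stop: once one lower bound exceeds the best exact distance, all later ones do too.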




The VA-file
 Accessing the vector




  • Fewer than 1% of the vector blocks are visited.
  • For d = 50, $b_i$ = 6 and N = 500,000, only 20 vectors are accessed.








Performance




  • The figure depicts the percentage of blocks visited.








Conclusion




  • Conventional indexing methods are outperformed by a
    simple sequential scan at moderate dimensionality (d = 10).
  • At moderate and high dimensionality (d ≥ 6), the VA-file method
    can outperform any other method.

08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces

  • 1. A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces Group 4 Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee,
  • 2. Presenter: Seokhwan Eom Contents • Introduction • Observations • Analysis of NN-search • VA-file • Conclusion
  • 3. Presenter: Seokhwan Eom The Similarity Search Paradigm (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 4. Presenter: Seokhwan Eom The Similarity Search Paradigm Locate the closest point to the query object, i.e. its nearest neighbor (NN). (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 5. Presenter: Seokhwan Eom The conventional approach • Space-partitioning methods - Gridfile [Nievergelt:1984] - K-D-B tree [Robinson:1981] - Quad tree [Finkel:1974] • Data-partitioning index trees - R-tree [Guttman:1984] - R+-tree [Sellis:1987] - R*-tree [Beckmann:1990] - X-tree [Berchtold:1996] - SR-tree [Katayama:1997] - M-tree [Ciaccia:1996] - TV-tree [Lin:1994] - hB-tree [Lomet:1990] • Unfortunately, as the number of dimensions increases, their performance degrades (the dimensional curse).
  • 6. Presenter: Seokhwan Eom Contribution • Assumptions: initially uniformly distributed data within the unit hypercube, with independent dimensions 1. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures. 2. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all their blocks if the number of dimensions is sufficiently large. 3. Present performance results which support their analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6.
  • 7. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 1 (Number of partitions) A simple partitioning scheme: split the data space in each dimension into two halves. This seems reasonable in low dimensions, but with $d = 100$ there are $2^{100} \approx 10^{30}$ partitions; even with $10^6$ points, almost all of the partitions ($10^{24}$) are empty.
  • 8. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 2 (Data space is sparsely populated) Consider a hyper-cube range query with side length $s = 0.95$ in the data space $\Omega = [0,1]^d$. At $d = 100$, the probability that a point lies in the target region is $P^d[s] = s^d = 0.95^{100} \approx 0.0059$.
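The arithmetic behind Observations 1 and 2 can be checked with a short sketch. The function names below are illustrative and not taken from the slides; the only assumption is uniformly distributed data in the unit hypercube, as stated on the Contribution slide.

```python
def partitions(d):
    """Number of cells when each of the d dimensions is split into two halves."""
    return 2 ** d


def max_nonempty_fraction(d, n_points):
    """Upper bound on the fraction of cells that can contain at least one point."""
    return min(1.0, n_points / partitions(d))


def cube_query_probability(s, d):
    """Probability that a uniform point lies in a hyper-cube query of side s."""
    return s ** d


print(partitions(100))                      # exact integer, about 1.27e30
print(max_nonempty_fraction(100, 10 ** 6))  # at most this share of cells is occupied
print(cube_query_probability(0.95, 100))    # compare with the 0.0059 on the slide
```

Even a query that spans 95% of every dimension captures well under 1% of a uniformly distributed data set at d = 100.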
  • 9. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 3 (Spherical range queries) The probability that an arbitrary point R lies within the largest spherical query. Figure: Largest range query entirely within the data space. Table: Probability that a point lies within the largest range query inside Ω, and the expected database size.
  • 10. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 4 (Exponentially growing DB size) The size a data set would need to have so that, on average, at least one point falls into the sphere $sp^d(Q, 0.5)$ (for even $d$). Table: Probability that a point lies within the largest range query inside Ω, and the expected database size.
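The formulas for Observations 3 and 4 did not survive in this transcript. The sketch below uses the standard volume of a $d$-dimensional ball, $\pi^{d/2} r^d / \Gamma(d/2 + 1)$, as a stand-in. It assumes the largest sphere has radius 0.5 and lies entirely inside $\Omega$, whose volume is 1, so the probability equals the ball volume. The function names are illustrative only.

```python
import math


def sphere_probability(d, radius=0.5):
    """Probability that a uniform point in [0,1]^d lies in a ball of the given
    radius that is fully inside the cube (equals the ball's volume)."""
    return math.pi ** (d / 2) * radius ** d / math.gamma(d / 2 + 1)


def required_db_size(d):
    """Data set size at which, on average, one point falls into that ball."""
    return 1.0 / sphere_probability(d)


for d in (2, 10, 20, 40, 100):
    print(d, sphere_probability(d), required_db_size(d))
```

The required size grows faster than exponentially with d, which is the point of Observation 4.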
  • 11. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 5 (Expected NN-distance) Three quantities are defined: the probability that the NN-distance is at most $r$ (i.e. the probability that the NN to query point $Q$ is contained in $sp^d(Q, r)$); the expected NN-distance for a query point $Q$; and the expected NN-distance $E[nn^{dist}]$ for any query point in the data space.
  • 12. Presenter: Seokhwan Eom The Difficulties of High Dimensionality • Observation 5 (Expected NN-distance) 1. The NN-distance grows steadily with $d$. 2. Beyond trivially small data sets $D$, NN-distances decrease only marginally as the size of $D$ increases.
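Both parts of Observation 5 can be illustrated empirically. The following is a Monte Carlo sketch under the uniform-data assumption, not the analytical derivation from the slides. The function name, sample sizes, and seed are arbitrary choices.

```python
import math
import random


def mean_nn_distance(d, n_points, n_queries=20, seed=0):
    """Average Euclidean NN-distance of random queries against uniform data."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(query, point) for point in data)
    return total / n_queries


# Compare how the estimate changes with d and with the data set size.
for d in (2, 10, 50):
    print(d, mean_nn_distance(d, 1000), mean_nn_distance(d, 4000))
```

In such a run, quadrupling the data set should shrink the NN-distance noticeably at d = 2 but only slightly at d = 50.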
  • 13. Presenter: Jungyeol Lee Analysis of NN-Search • The complexity of any partitioning and clustering scheme converges to $O(N)$ with increasing dimensionality • General Cost Model • Space-Partitioning Methods • Data-Partitioning Methods • General Partitioning and Clustering Schemes
  • 14. Presenter: Jungyeol Lee General Cost Model • ‘Cost’ of a query: the number of blocks which must be accessed • Optimal NN-search algorithm: blocks visited during the search = blocks whose MBR (minimum bounding region) intersects the NN-sphere
  • 15. Presenter: Jungyeol Lee General Cost Model • Let $M_{visit}$ be the number of blocks visited. • $M_{visit}$ = the number of blocks which intersect $sp^d(Q, E[nn^{dist}])$ • Transform the spherical query into a point query • Minkowski sum: $MSum(mbr_i, E[nn^{dist}])$, i.e. $mbr_i$ enlarged by $E[nn^{dist}]$
  • 16. Presenter: Jungyeol Lee General Cost Model • Transform the spherical query into a point query • Probability that the $i$-th block must be visited: $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega)$ • $M_{visit} = \frac{N}{m} P^{avg}_{visit}$, where $P^{avg}_{visit} = \frac{m}{N} \sum_{i=1}^{N/m} P_{visit}[i]$
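The visit probability of a single block can be estimated numerically. A point lies in the Minkowski sum of a box and a sphere of radius $r$ exactly when its distance to the box is at most $r$. The sketch below assumes an axis-aligned rectangular MBR and uses Monte Carlo sampling over $\Omega$. The function names, sample count, and example box are illustrative assumptions.

```python
import math
import random


def distance_to_box(point, low, high):
    """Euclidean distance from a point to an axis-aligned box (0 if inside)."""
    squared = 0.0
    for x, lo, hi in zip(point, low, high):
        gap = max(lo - x, 0.0, x - hi)
        squared += gap * gap
    return math.sqrt(squared)


def visit_probability(low, high, radius, samples=20000, seed=0):
    """Estimate Vol(MSum(box, radius) intersected with [0,1]^d) by sampling."""
    rng = random.Random(seed)
    d = len(low)
    hits = 0
    for _ in range(samples):
        x = [rng.random() for _ in range(d)]
        if distance_to_box(x, low, high) <= radius:
            hits += 1
    return hits / samples


# Example: d = 10, a box that covers half of the first 3 dimensions.
d = 10
low = [0.0] * d
high = [0.5] * 3 + [1.0] * (d - 3)
for radius in (0.0, 0.2, 0.5, 1.0):
    print(radius, visit_probability(low, high, radius))
```

Multiplying the average of these probabilities by the number of blocks $N/m$ gives the expected number of blocks visited.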
  • 17. Presenter: Jungyeol Lee Space-Partitioning Methods • Dividing $\Omega$ regardless of clusters • If each dimension is split once, the total number of partitions is $2^d$ and the space overhead is $O(2^d)$ • To reduce the space overhead, only $d' \le d$ dimensions are split such that, on average, $m$ points are assigned to a partition: $2^{d'} = \frac{N}{m}$, $d' = \log_2 \frac{N}{m}$
  • 18. Presenter: Jungyeol Lee Space-Partitioning Methods • Let $l_{max}$ denote the maximum distance from $mbr_i$ to any point in the data space: with $d' = \log_2 \frac{N}{m}$, $l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\log_2 \frac{N}{m}}$ • $l_{max} < E[nn^{dist}]$ at some dimensionality • From that dimensionality on, the Minkowski sum covers the entire data space • $P_{visit}$ converges to 1, the same as a sequential scan
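A small worked example of these two quantities follows. Rounding $d'$ up to a whole number of split dimensions is an assumption made here, and the function names are illustrative.

```python
import math


def split_dimensions(n_points, block_capacity):
    """d': number of dimensions split once so each cell holds about m points.
    Rounded up to an integer (an assumption of this sketch)."""
    return math.ceil(math.log2(n_points / block_capacity))


def l_max(n_points, block_capacity):
    """Largest distance from a cell to any point of the unit hypercube:
    each of the d' split dimensions contributes at most 1/2."""
    return 0.5 * math.sqrt(split_dimensions(n_points, block_capacity))


print(split_dimensions(1_000_000, 100), l_max(1_000_000, 100))
```

Since $l_{max}$ depends only on $N$ and $m$ while $E[nn^{dist}]$ grows with $d$, the NN-sphere eventually reaches every cell.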
  • 19. Presenter: Jungyeol Lee Space-Partitioning Methods • $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega) = 1$ • Fig. 7: Comparison of $l_{max}$ with $E[nn^{dist}]$
  • 20. Presenter: Rina You Data-Partitioning Methods • Data-partitioning methods partition the data space hierarchically – in order to reduce the search cost from $O(N)$ to $O(\log N)$ • Existing methods are impractical for NN-search in high-dimensional vector spaces (HDVSs) – a sequential scan out-performed these more sophisticated hierarchical methods.
  • 21. Presenter: Rina You Rectangular MBRs • Index methods use hyper-cubes to bound the region of a block. • Splitting a node results in two new, equally full partitions of the data space. • At high dimensionality, $d'$ dimensions are split: $d' = \log_2 \frac{N}{m}$
  • 22. Presenter: Rina You Rectangular MBRs • Rectangular MBR – $d'$ sides with a length of 1/2 – $d - d'$ sides with a length of 1. • The probability of visiting a block during NN-search: the volume of the part of the extended box that lies in the data space
  • 23. Presenter: Rina You Rectangular MBRs • The probability of accessing a block during an NN-search, for different database sizes and different values of $d'$
  • 24. Presenter: Rina You Spherical MBRs • Another group of index structures uses MBRs in the form of hyper-spheres. • Each block of the optimal structure consists of – the center point $C$ – its $m - 1$ nearest neighbors • The MBR can be described by $nn^{sp,m-1}(C)$
  • 25. Presenter: Rina You Spherical MBRs • The probability of accessing a block during the search. • MBRs in the form of hyper-spheres: $nn^{sp,m-1}(C)$ • Use a Minkowski sum: $sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}])$ • The probability that block $i$ must be visited during an NN-search: $P^{sp}_{visit}[i] = Vol(sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}]) \cap \Omega)$
  • 26. Presenter: Rina You Spherical MBRs • Another lower bound for this probability – replace $nn^{dist,m-1}$ by $nn^{dist,1} = E[nn^{dist}]$: $P^{sp}_{visit}[i] \ge Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$ • If $i$ increases, $nn^{dist,i}$ does not decrease – $\forall j > i: nn^{dist,j} \ge nn^{dist,i}$
  • 27. Presenter: Rina You Spherical MBRs • The probability of accessing a block during the search – average the above probability over all center points $C \in \Omega$: $P^{sp,avg}_{visit} \ge \int_{C \in \Omega} Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)\, dC$
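The averaged bound can be approximated by sampling. The integral equals the probability that two independent uniform points $C$ and $X$ in $\Omega$ are at most the given radius apart. The sketch below is a Monte Carlo illustration under that reading. The caller supplies the radius (for the bound above, $2 \cdot E[nn^{dist}]$), and the function name is an assumption.

```python
import math
import random


def avg_sphere_volume_in_cube(d, radius, samples=20000, seed=0):
    """Estimate the average over centers C in [0,1]^d of
    Vol(sphere(C, radius) intersected with [0,1]^d)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        center = [rng.random() for _ in range(d)]
        point = [rng.random() for _ in range(d)]
        if math.dist(center, point) <= radius:
            hits += 1
    return hits / samples


for radius in (0.5, 1.0, 1.5, 2.0):
    print(radius, avg_sphere_volume_in_cube(20, radius))
```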
  • 28. Presenter: Rina You Spherical MBRs • The percentage of blocks visited increases rapidly with the dimensionality • A sequential scan will perform better in practice
  • 29. Presenter: Rina You General Partitioning and Clustering Schemes • No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large. • The complexity of such methods: $O(N)$ • A large portion (up to 100%) of data blocks must be read in order to determine the nearest neighbor.
  • 30. Presenter: Rina You General Partitioning and Clustering Schemes • Basic assumptions: 1. A cluster is a geometrical form (MBR) that covers all cluster points. 2. Each cluster contains at least two points. 3. The MBR of a cluster is convex.
  • 31. Presenter: Rina You General Partitioning and Clustering Schemes • Average probability of accessing a cluster during an NN-search: $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$, where $VM(x) = Vol(MSum(x, E[nn^{dist}]) \cap \Omega)$
  • 32. Presenter: Rina You General Partitioning and Clustering Schemes • Lower bound the average probability of accessing a line cluster. • Pick two arbitrary data points $A_i$ and $B_i$ of the cluster – each cluster contains at least two points • $line(A_i, B_i)$ is contained in $mbr(C_i)$ – $mbr(C_i)$ is convex. • Lower bound the volume of the extended $mbr(C_i)$: $VM(mbr(C_i)) \ge VM(line(A_i, B_i))$
  • 33. Presenter: Rina You General Partitioning and Clustering Schemes • Lower bound the distance between $A_i$ and $B_i$: $VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) = \min_{Q_i \in surf(nn^{sp}(A_i))} VM(line(A_i, Q_i))$ with $P_i \in surf(nn^{sp}(A_i))$ – Points on the surface of the NN-sphere of $A_i$ have the minimal Minkowski sum for $line(A_i, B_i)$ – $line(A_i, P_i)$ is the optimal line cluster for point $A_i$ if $P_i$ is a point on the surface of the NN-sphere of $A_i$.
  • 34. Presenter: Rina You General Partitioning and Clustering Schemes • Lower bound the average probability of accessing a line cluster: $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{A \in \Omega} VM(line(A, P(A)))\, dA$ – Calculate the average volume of the Minkowski sums over all possible pairs $A$ and $P(A)$ in the data space
  • 35. Presenter: Rina You General Partitioning and Clustering Schemes • Conclusion 1 (Performance) – For any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some $d$. • Conclusion 2 (Complexity) – The complexity of any clustering and partitioning method tends towards $O(N)$ as dimensionality increases.
  • 36. Presenter: Rina You General Partitioning and Clustering Schemes • Conclusion 3 (Degeneration) – All blocks are accessed if the number of dimensions exceeds some $d$.
  • 37. Presenter: Kilho Lee The VA-file • Accelerates the unavoidable scan by using object approximations to compress the vector data. • Reduces the amount of data that must be read during similarity searches. • Compressing vector data • The filtering step • Accessing the data
  • 38. Presenter: Kilho Lee The VA-file Compressing vector data • For each dimension $i$, a small number of bits ($b_i$) is assigned • Let $b$ be the sum of all $b_i$: $b = \sum_{i=1}^{d} b_i$ • The data space is divided into $2^b$ cells • $P[\text{in cell}] = Vol(cell) = \left(\frac{1}{2^{b_i}}\right)^d = 2^{-b}$ • $P[share] = 1 - (1 - 2^{-b})^{N-1} \approx \frac{N}{2^b}$
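The compression step can be sketched as a simple quantizer. The version below assumes an equal-width grid on $[0,1]$ with the same number of bits in every dimension, which is a simplification for illustration and may differ from how the slides place the partition points. The function names are hypothetical.

```python
import math


def approximate(vector, bits_per_dim):
    """Map each coordinate in [0, 1] to the index of its grid cell."""
    cells = 2 ** bits_per_dim
    return [min(int(x * cells), cells - 1) for x in vector]


def share_probability(n_points, d, bits_per_dim):
    """P[share] = 1 - (1 - 2^-b)^(N-1) with b = d * bits_per_dim.
    Uses log1p/expm1 so that tiny probabilities do not round to zero."""
    b = d * bits_per_dim
    return -math.expm1((n_points - 1) * math.log1p(-(2.0 ** -b)))


print(approximate([0.03, 0.49, 0.5, 0.99], 2))
print(share_probability(500_000, 50, 6), 500_000 / 2 ** 300)
```

With 6 bits in each of 50 dimensions, the chance that two vectors share a cell is negligible, so each approximation identifies its vector almost uniquely.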
  • 39. Presenter: Kilho Lee The VA-file Filtering step • When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are determined for each vector from its approximation. • Let δ be the smallest upper bound found so far. • If an approximation has a lower bound that exceeds δ, it is filtered out. (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 40. Presenter: Kilho Lee The VA-file Filtering step • After the filtering step, fewer than 0.1% of the vectors remain.
  • 41. Presenter: Kilho Lee The VA-file Accessing the vector • After the filtering step, a small set of candidates remains. • Candidates are sorted by lower bound and visited in that order. • If a lower bound is encountered that exceeds the nearest distance seen so far, the VA-file method stops.
  • 42. Presenter: Kilho Lee The VA-file Accessing the vector • Less than 1% of the vector blocks are visited. • For $d = 50$, $b_i = 6$, and $N = 500,000$, only 20 vectors are accessed.
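The two phases described on the VA-file slides (filtering on the approximations, then visiting candidates in order of lower bound) can be put together as follows. This is a minimal in-memory sketch for a single nearest neighbor, not the authors' implementation. It assumes Euclidean distance, an equal-width grid with the same number of bits per dimension, and hypothetical function names.

```python
import math
import random


def quantize(vector, bits):
    """Cell index per dimension on an equal-width grid over [0, 1]."""
    cells = 2 ** bits
    return [min(int(x * cells), cells - 1) for x in vector]


def bounds(query, approx, bits):
    """Lower and upper bound on the distance from query to any point in the cell."""
    cells = 2 ** bits
    lower_sq = upper_sq = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells
        gap = max(lo - q, 0.0, q - hi)          # nearest side of the cell
        far = max(abs(q - lo), abs(q - hi))     # farthest side of the cell
        lower_sq += gap * gap
        upper_sq += far * far
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


def va_file_nn(query, vectors, approximations, bits):
    """Return (index, distance, number of full vectors visited)."""
    # Phase 1: scan all approximations, keep those that may hold the NN.
    delta = math.inf                            # smallest upper bound so far
    candidates = []
    for idx, approx in enumerate(approximations):
        lower, upper = bounds(query, approx, bits)
        if lower > delta:
            continue                            # filtered out
        delta = min(delta, upper)
        candidates.append((lower, idx))

    # Phase 2: visit candidates by increasing lower bound, stop early.
    candidates.sort()
    best_dist, best_idx, visited = math.inf, None, 0
    for lower, idx in candidates:
        if lower > best_dist:
            break
        visited += 1
        dist = math.dist(query, vectors[idx])
        if dist < best_dist:
            best_dist, best_idx = dist, idx
    return best_idx, best_dist, visited


rng = random.Random(0)
vectors = [[rng.random() for _ in range(16)] for _ in range(2000)]
approximations = [quantize(v, 4) for v in vectors]
query = [rng.random() for _ in range(16)]
print(va_file_nn(query, vectors, approximations, 4))
```

The third returned value counts how many full vectors were read in phase 2, which is the quantity the slides report as a small fraction of the data set.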
  • 43. Presenter: Kilho Lee Performance • The figure depicts the percentage of blocks visited.
  • 44. Presenter: Kilho Lee Conclusion • Conventional indexing methods are out-performed by a simple sequential scan at moderate dimensionality ($d = 10$). • At moderate and high dimensionality ($d \ge 6$), the VA-file method can out-perform any other method.