1. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
3. Presenter: Seokhwan Eom
The Similarity Search Paradigm
(Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
4. Presenter: Seokhwan Eom
The Similarity Search Paradigm
Locate the closest point to the query object, i.e., its nearest neighbor (NN).
(Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
5. Presenter: Seokhwan Eom
The conventional approach
• Space-partitioning methods
- Gridfile [Nievergelt:1984]
- K-D-B tree [Robinson:1981]
- Quad tree [Finkel:1974]
• Data-partitioning index trees
- R-tree [Guttman:1984]
- R+-tree [Sellis:1987]
- R*-tree [Beckmann:1990]
- X-tree [Berchtold:1996]
- SR-tree [Katayama:1997]
- M-tree [Ciaccia:1996]
- TV-tree [Lin:1994]
- hB-tree [Lomet:1990]
Unfortunately, as the number of dimensions increases, their performance degrades:
- the curse of dimensionality.
6. Presenter: Seokhwan Eom
Contribution
• Assumptions: initially uniformly distributed data within the unit hypercube, with independent dimensions.
1. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.
2. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all of its blocks if the number of dimensions is sufficiently large.
3. Present performance results which support this analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6.
7. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme: split the data space in each dimension into two halves.
This seems reasonable in low dimensions, but with d = 100 there are 2^100 ≈ 10^30 partitions;
even with 10^6 points, almost all of the partitions are empty (at most a 10^-24 fraction can be non-empty; see the sketch below).
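A quick back-of-the-envelope check of these numbers (an illustrative Python sketch, not part of the paper):

```python
# Check of Observation 1: one split per dimension yields 2**d partitions.
d, n_points = 100, 10**6

n_partitions = 2 ** d                        # ~1.27e+30 partitions
nonempty_fraction = n_points / n_partitions  # at most this fraction is non-empty

print(f"partitions: {n_partitions:.2e}")                 # ~1.27e+30
print(f"non-empty fraction <= {nonempty_fraction:.1e}")  # ~7.9e-25
```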
8. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 2 (Data space is sparsely populated)
Consider a hyper-cube range query with side length s = 0.95 in the data space Ω = [0,1]^d.
(Figure: the target region, a cube with side s, inside Ω.)
At d = 100: P^d[s] = s^d = 0.95^100 ≈ 0.0059 (checked in the sketch below).
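An illustrative computation of this selectivity; the "conversely" part below is an added illustration, not from the slides:

```python
# Observation 2: selectivity of a hyper-cube range query with side s in [0,1]^d.
d, s = 100, 0.95
print(f"P[s] = {s ** d:.4f}")   # ~0.0059

# Conversely: the side length needed to cover even a fraction p = 0.1%
# of the space approaches 1 as d grows.
p = 0.001
print(f"required s: {p ** (1 / d):.4f}")  # ~0.933
```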
9. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 3 (Spherical range queries)
The probability that an arbitrary point R lies within the largest spherical query entirely contained in the data space.
(Figure: largest range query entirely within the data space. Table: probability that a point lies within the largest range query inside Ω, and the expected database size.)
10. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 4 (Exponentially growing DB size)
The size which a data set would have to have such that, on average, at least one point falls into the sphere sp^d(Q, 0.5) (for even d) is N ≈ 1 / Vol(sp^d(Q, 0.5)); see the sketch below.
(Table: probability that a point lies within the largest range query inside Ω, and the expected database size.)
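Both observations follow from the standard formula for the volume of a d-dimensional ball; the radius 0.5 matches the slides, while the sample dimensionalities below are chosen for illustration:

```python
import math

# Observations 3 and 4: the largest sphere inside Omega = [0,1]^d has radius 0.5.
def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 10, 20, 40, 100):
    p = ball_volume(d, 0.5)  # P[a uniform point lies in the inscribed sphere]
    print(f"d={d:3d}  P={p:.3e}  expected DB size ~ {1 / p:.3e}")
```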
11. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)
The probability that the NN-distance is at most r (i.e., the probability that the NN of a query point Q is contained in sp^d(Q, r)):
P[nndist ≤ r] = 1 − (1 − Vol(sp^d(Q, r) ∩ Ω))^N
The expected NN-distance for a query point Q follows by integrating this distribution over r; the expected NN-distance E[nndist] for any query point is obtained by averaging over all Q in the data space (simulated in the sketch below).
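Since the closed-form integral is not reproduced here, a small Monte Carlo simulation (an illustrative sketch; the parameters are chosen arbitrarily) shows how the NN-distance behaves:

```python
import math, random

# Monte Carlo estimate of the expected NN-distance for uniform data in [0,1]^d.
def expected_nn_distance(d, n, trials=30):
    total = 0.0
    for _ in range(trials):
        data = [[random.random() for _ in range(d)] for _ in range(n)]
        query = [random.random() for _ in range(d)]
        total += min(math.dist(query, p) for p in data)  # distance to the NN
    return total / trials

for d in (2, 10, 50, 100):
    print(f"d={d:3d}  E[nndist] ~ {expected_nn_distance(d, n=1000):.3f}")
```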
12. Presenter: Seokhwan Eom
The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)
1. The NN-distance grows steadily with d.
2. Beyond trivially small data sets D, NN-distances decrease only marginally as the size of D increases.
13. Presenter: Jungyeol Lee
Analysis of NN-Search
• The complexity of any partitioning and clustering scheme converges to O(N) with increasing dimensionality.
• General Cost Model
• Space-Partitioning Methods
• Data-Partitioning Methods
• General Partitioning and Clustering Schemes
14. Presenter: Jungyeol Lee
General Cost Model
• ‘Cost’ of a query:
– the number of blocks which must be accessed
• Optimal NN-search algorithm:
– the blocks visited during the search are exactly those whose MBR 1) intersects the NN-sphere
1) MBR: minimum bounding region
15. Presenter: Jungyeol Lee
General Cost Model
• Let M_visit be the number of blocks visited.
• M_visit = the number of blocks which intersect sp^d(Q, E[nndist]).
• Transform the spherical query into a point query using the Minkowski sum MSum(mbr_i, E[nndist]): the MBR enlarged by E[nndist] in every direction (sketched in code below).
(Figure: mbr_i, enlarged by E[nndist] into MSum(mbr_i, E[nndist]).)
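A minimal Monte Carlo sketch of this transformation: the fraction of query positions within E[nndist] of a box-shaped MBR is the volume of the Minkowski sum clipped to the unit cube. The box bounds and the radius are assumed values, not the paper's:

```python
import math, random

# P_visit[i] ~ fraction of uniform query points whose NN-sphere hits the MBR.
def visit_probability(lo, hi, radius, samples=100_000):
    hits = 0
    for _ in range(samples):
        q = [random.random() for _ in lo]
        # squared distance from q to the box [lo, hi]
        dist2 = sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi))
        hits += dist2 <= radius ** 2
    return hits / samples

d = 20
lo = [0.0] * d
hi = [0.5] * (d // 2) + [1.0] * (d - d // 2)  # half of the dimensions split once
print(visit_probability(lo, hi, radius=0.8))  # radius stands in for E[nndist]
```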
16. Presenter: Jungyeol Lee
General Cost Model
• Transform the spherical query into a point query.
• Probability that the i-th block must be visited:
P_visit[i] = Vol(MSum(mbr_i, E[nndist]) ∩ Ω)
• With N/m blocks in total:
M_visit = (N/m) · P_visit^avg, where P_visit^avg = (m/N) · Σ_{i=0}^{N/m−1} P_visit[i]
17. Presenter: Jungyeol Lee
Space-Partitioning Methods
• Divides the data space regardless of clusters.
• If each dimension is split once, the total number of partitions is 2^d, and the space overhead is O(2^d).
• To reduce the space overhead, only d' ≤ d dimensions are split such that, on average, m points are assigned to a partition:
2^{d'} = N/m, i.e., d' = log2(N/m)
18. Presenter: Jungyeol Lee
Space-Partitioning Methods
• Let l_max denote the maximum distance from mbr_i to any point in the data space. With d' = log2(N/m) split dimensions (each of side 1/2):
l_max = (1/2)·√(d') = (1/2)·√(log2(N/m))
• l_max ≤ E[nndist] from some dimensionality on (compared in the sketch below).
• From that dimensionality, the Minkowski sum covers the entire data space.
• P_visit converges to 1: the same as a sequential scan.
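An illustrative computation of the crossover. The E[nndist] estimate below is a crude stand-in, assuming the NN-sphere holds 1/N of the cube's volume; it is not the paper's exact value:

```python
import math

N, m = 1_000_000, 100
d_split = math.log2(N / m)        # d' ~ 13.3 split dimensions
l_max = 0.5 * math.sqrt(d_split)

def nn_dist_estimate(d, n):
    # radius r with Vol(ball(r)) = 1/n (rough proxy for E[nndist])
    return (math.gamma(d / 2 + 1) / (n * math.pi ** (d / 2))) ** (1 / d)

for d in (10, 20, 50, 100):
    print(f"d={d:3d}  l_max={l_max:.2f}  E[nndist]~{nn_dist_estimate(d, N):.2f}")
```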
20. Presenter: Rina You
Data-Partitioning Methods
• Data-partitioning methods partition the data space hierarchically, in order to reduce the search cost from O(N) to O(log N).
• Existing methods are impractical for NN-search in HDVSs (high-dimensional vector spaces):
– a sequential scan out-performed these more sophisticated hierarchical methods.
21. Presenter: Rina You
Rectangular MBRs
• These index methods use hyper-rectangles to bound the region of a block.
• Splitting a node results in two new, equally full partitions of the data space.
• At high dimensionality, only d' dimensions are split:
d' = log2(N/m)
22. Presenter: Rina You
Rectangular MBRs
• A rectangular MBR then has:
– d' sides with a length of 1/2
– d − d' sides with a length of 1
• The probability of visiting a block during NN-search is the volume of the part of the extended box that lies within the data space.
23. Presenter: Rina You
Rectangular MBRs
• The probability of accessing a block during an NN-search, shown for different database sizes and different values of d'.
24. Presenter: Rina You
Spherical MBRs
• Another group of index structures uses MBRs in the form of hyper-spheres.
• Each block of the optimal structure consists of:
– the center point C
– its m − 1 nearest neighbors
• The MBR can thus be described by the sphere sp(C, nn_dist,m−1(C)).
25. Presenter: Rina You
Spherical MBRs
• The probability of accessing a block during the search.
• MBRs in the form of hyper-spheres: sp(C, nn_dist,m−1(C)).
• Use a Minkowski sum: the extended sphere sp^d(C, nn_dist,m−1(C) + E[nndist]).
• The probability that block i must be visited during an NN-search (sketched below):
P_visit^sp[i] = Vol(sp^d(C, nn_dist,m−1(C) + E[nndist]) ∩ Ω)
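A minimal Monte Carlo sketch of this probability; the center, MBR radius, and E[nndist] values below are assumptions for illustration:

```python
import math, random

# Block i is visited when the query lies within nn_dist,m-1(C) + E[nndist] of C.
def sphere_visit_probability(center, mbr_radius, nn_dist, samples=100_000):
    extended = mbr_radius + nn_dist   # radius of the Minkowski-extended sphere
    hits = 0
    for _ in range(samples):
        q = [random.random() for _ in center]
        hits += math.dist(q, center) <= extended
    return hits / samples

d = 20
print(sphere_visit_probability([0.5] * d, mbr_radius=0.6, nn_dist=0.8))
```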
26. Presenter: Rina You
Spherical MBRs
• Another lower bound for this probability: replace nn_dist,m−1 by nn_dist,1 = E[nndist]:
P_visit^sp[i] ≥ Vol(sp^d(C, 2·E[nndist]) ∩ Ω)
• This is valid because nn_dist,i does not decrease as i increases:
– ∀ j ≥ i: nn_dist,j ≥ nn_dist,i
27. Presenter: Rina You
Spherical MBRs
• The probability of accessing a block during the search:
– average the above probability over all center points C:
P_visit^{sp,avg} ≥ ∫_C Vol(sp^d(C, 2·E[nndist]) ∩ Ω) dC
28. Presenter: Rina You
Spherical MBRs
• The percentage of blocks visited increases rapidly with the dimensionality.
• Hence a sequential scan will perform better in practice.
29. Presenter: Rina You
General Partitioning and Clustering Schemes
• No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large.
• The complexity of these methods: O(N).
• A large portion (up to 100%) of the data blocks must be read in order to determine the nearest neighbor.
30. Presenter: Rina You
General Partitioning and Clustering Schemes
• Basic assumptions:
1. A cluster is a geometrical form (MBR) that covers all cluster points.
2. Each cluster contains at least two points.
3. The MBR of a cluster is convex.
31. Presenter: Rina You
General Partitioning and Clustering Schemes
• Average probability of accessing a cluster during an NN-search:
P_visit^avg = (1/l) · Σ_{i=1}^{l} VM(mbr(C_i)), where VM(x) = Vol(MSum(x, E[nndist]) ∩ Ω)
32. Presenter: Rina You
General Partitioning and Clustering Schemes
• Lower-bound the average probability of accessing a cluster by a line cluster:
– Pick two arbitrary data points A_i and B_i (each cluster contains at least two points).
– The line(A_i, B_i) is contained in mbr(C_i), since mbr(C_i) is convex.
– Lower-bound the volume of the extended mbr(C_i): VM(mbr(C_i)) ≥ VM(line(A_i, B_i)).
33. Presenter: Rina You
General Partitioning and Clustering Schemes
• Lower-bound the distance between A_i and B_i:
VM(line(A_i, B_i)) ≥ VM(line(A_i, P_i)) = min_{Q_i ∈ surf(nn_sp(A_i))} VM(line(A_i, Q_i)), with P_i ∈ surf(nn_sp(A_i))
– Points on the surface of the NN-sphere of A_i have the minimal Minkowski sum for line(A_i, B_i).
– line(A_i, P_i) is the optimal line cluster for point A_i if P_i is a point on the surface of the NN-sphere of A_i.
34. Presenter: Rina You
General Partitioning and Clustering Schemes
• Lower-bound the average probability of accessing the line clusters:
P_visit^avg = (1/l) · Σ_{i=1}^{l} VM(mbr(C_i)) ≥ ∫_A VM(line(A, P(A))) dA
– i.e., calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space.
35. Presenter: Rina You
General Partitioning and Clustering Schemes
• Conclusion 1 (Performance)
– For any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some d.
• Conclusion 2 (Complexity)
– The complexity of any clustering and partitioning method tends towards O(N) as dimensionality increases.
36. Presenter: Rina You
General Partitioning and Clustering Schemes
• Conclusion 3 (Degeneration)
– All blocks are accessed if the number of dimensions exceeds some d.
37. Presenter: Kilho Lee
The VA-file
• Accelerates the unavoidable sequential scan by using object approximations to compress the vector data.
• Reduces the amount of data that must be read during similarity searches.
• Compressing vector data
• The filtering step
• Accessing the data
38. Presenter: Kilho Lee
The VA-file
Compressing vector data
• For each dimension i, a small number of bits b_i is assigned.
• Let b be the sum of all the b_i: b = Σ_{i=1}^{d} b_i.
• The data space is divided into 2^b cells (a quantization sketch follows below):
P["in cell"] = Vol(cell) = Π_{i=1}^{d} (1/2^{b_i}) = 2^{−b}
P[Share] = 1 − (1 − 2^{−b})^{N−1}
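A minimal sketch of the quantization step. Uniform cell boundaries are assumed here; the actual VA-file also supports non-uniformly spaced marks:

```python
# Map each vector in [0,1]^d to a tuple of cell indices (b_i bits each).
def build_approximations(vectors, bits_per_dim):
    approximations = []
    for v in vectors:
        cells = tuple(
            min(int(x * (1 << b)), (1 << b) - 1)  # cell index in [0, 2^b_i)
            for x, b in zip(v, bits_per_dim)
        )
        approximations.append(cells)
    return approximations

print(build_approximations([[0.12, 0.98, 0.50]], bits_per_dim=[6, 6, 6]))
# -> [(7, 62, 32)]
```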
39. Presenter: Kilho Lee
The VA-file
Filtering step
• When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are computed for each approximation.
• Let δ be the smallest upper bound found so far.
• If an approximation's lower bound exceeds δ, its vector is filtered out.
(Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
40. Presenter: Kilho Lee
The VA-file
Filtering step
• After the filtering step, fewer than 0.1% of the vectors remain.
41. Presenter: Kilho Lee
The VA-file
Accessing the vector
• After the filtering step, a small set of candidates remains.
• The candidates are visited in increasing order of their lower bounds.
• If a lower bound is encountered that exceeds the nearest distance seen so far, the VA-file method stops (sketched below).
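A compact sketch of the two phases. The uniform-cell bounds match the quantization sketch above; this is an illustration under those assumptions, not the paper's implementation:

```python
import math

def bounds(query, cell, bits):
    """Lower/upper bounds on dist(query, v) for any v inside the given cell."""
    lb2 = ub2 = 0.0
    for q, c, b in zip(query, cell, bits):
        lo, hi = c / (1 << b), (c + 1) / (1 << b)  # cell interval in this dim
        lb2 += max(lo - q, 0.0, q - hi) ** 2       # distance to nearest face
        ub2 += max(q - lo, hi - q) ** 2            # distance to farthest corner
    return math.sqrt(lb2), math.sqrt(ub2)

def va_nn_search(query, approximations, vectors, bits):
    # Phase 1: scan all approximations; keep those whose lower bound does not
    # exceed delta, the smallest upper bound seen so far.
    delta, candidates = math.inf, []
    for i, cell in enumerate(approximations):
        lb, ub = bounds(query, cell, bits)
        if lb <= delta:
            delta = min(delta, ub)
            candidates.append((lb, i))
    # Phase 2: visit candidates in increasing lower-bound order; stop once a
    # lower bound exceeds the best exact distance found so far.
    best_dist, best_id = math.inf, None
    for lb, i in sorted(candidates):
        if lb > best_dist:
            break
        dist = math.dist(query, vectors[i])
        if dist < best_dist:
            best_dist, best_id = dist, i
    return best_id, best_dist
```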
42. Presenter: Kilho Lee
The VA-file
Accessing the vector
• Less than 1% of the vector blocks are visited.
• In the case d = 50, b_i = 6, N = 500,000, only 20 vectors are accessed.
44. Presenter: Kilho Lee
Conclusion
• Conventional indexing methods are out-performed by a simple sequential scan at moderate dimensionality (d = 10).
• At moderate and high dimensionality (d ≥ 6), the VA-file method can out-perform any other method.