CLIQUE Automatic subspace clustering of high dimensional data for data mining application
1. Author
Rakesh Agrawal , Johannes Gehrke, Dimitrios Gunopulos,
Prabhakar Raghavan
Prepared by : Raed T Aldahdooh
2. Introduction
Motivation
Contributions Of The Paper
Subspace Clustering
CLIQUE(Clustering in Quest)
Performance Experiments
Conclusions
3. Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
CLIQUE can be considered as both density-based and grid-
based
Clustering high-dimensional data.
Automatically identifying subspaces of a high dimensional data space that
allow better clustering than original space
4. Many irrelevant dimensions may mask clusters.
Distance measure becomes meaningless—due to
equi-distance.
Clusters may exist only in some subspaces.
5. Only data in one dimension is relatively packed.
Adding a dimension “stretch” the points across that dimension, making
them further apart.
Density decrease dramatically.
Distance measure becomes meaningless—due to equi-distance.
6. Methods
◦ Feature transformation: only effective if most dimensions are relevant
PCA “Principal component analysis” & SVD “Singular
value decomposition” useful only when features are highly
correlated/redundant
◦ Feature selection: wrapper or filter approaches
useful to find a subspace where the data have nice clusters
◦ Subspace-clustering: find clusters in all the possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering
7. The need for developing new algorithms
Effective treatment of high dimensionality:
◦ To effectively extract information from a huge amount of data in databases. In
other words. The running time of algorithms must be predictable and usable in
large database.
Interpretability of results:
◦ User expect clustering results in the high dimensional data to be interpretable,
comprehensible.
Scalability and usability:
◦ Many clustering algorithms don’t well in a large database may contain millions
of objects, Clustering on a sample of a given data set may lead to biased results.
In other words, The clustering technique should be fast and scale with the
number of dimensions and the size of input and insensitive to the order of input
data.
8. CLIQUE satisfies the above desiderata
( Effective , interpretability, Scalability and Usability).
CLIQUE can automatically finds subspaces with high density
clusters.
CLIQUE generates a minimal description for each cluster in
DNF expressions.
Empirical evaluation shows that CLIQUE scales linearly with
the number of input records and has good scalability as the
number of dimension in the dimensionality of the hidden
cluster.
9. a disjunctive normal form (DNF) is a
standardization (or normalization) of a logical
formula which is a disjunction of conjunctive clauses.
A disjunction of conjunctions where every variable or
its negation is represented once in each conjunction
(a minterm)
◦ each minterm appears only once
Example: DNF of pq is
(pq)(pq).
10. Clusters may exist only in some
subspaces.
Subspace-clustering: find clusters in
all the subspaces.
11. What’s (a)unit (b)dense unit (c)a cluster (d)a minimal description of a cluster.
In Figure 1,the two dim space(age , salary) has been partitioned by a 10x10 grid.ξ=10
The unit u=(30≤age<35)Λ(1≤salary<2)
A and B are both region
A=(30≤age<50)Λ(4≤salary<8)
B =(40≤age<60)Λ(2≤salary<6)
Assuming the dense units have been shaded,
AUB is a cluster( A,B are connected regions)
A∩B is not a maximal region.
The minimal description for this cluster AUB is the
DNF expression: ( (30≤age<50)Λ(4≤salary<8))v
( (40≤age<60)Λ(2≤salary<6))
12. In Figure2. Assuming T=20%
(density threshold _ 3 point) If
selectivity(u)>T then u is a dense
unit.
Where selectivity in the fraction of
total data points contained in the
unit.
No 2-dimen unit is dense and
there are no clusters In the
original data space.
The points are projected on the salary dimension , there are three 1-dim dense
units, and there are two clusters in the 1-dim salary subspace,
C=(5≤salary<7 )and D=(2≤salary<3)
But there is no dense unit and cluster in 1-dim age subspace
13. 3.
Generation of
minimal
description for
the clusters.
CLIQUE consists of the following three steps:
1) Identification of subspace that contain clusters.
2) Identification of clusters .
3) Generation of minimal description for the clusters.
Title in here
2.
Identification of
clusters.
1.
Identification of
subspace that
contain clusters.
CLIQUE consists
of the following
three steps:
14. Downward closure (DC) property: If a cluster
is satisfied in a k-dimensional space, it is
also satisfied in all of its (k-1)-dimensional
subspaces.
Due to the DC property, identification of
subspaces is carried out in an iterative
bottom-up fashion (from lower to higher
dimensional subspaces).
15. The difficulty in identifying subspaces that contain clusters
lies in finding dense units in different subspaces.
A. using a bottom-up algorithm to find dense units that
exploits the monotonicity of the clustering criterion with
respect to dimensionality to prune the search space.
Lemma1 (monotonicity):If k-dim unit is dense ,then so are
it’s projections in (k-1)-dim space.
The bottom-up algorithm process
Determines 1-dim dense unit and interaction(self-join) to get 2-dim dense unit.
Until having (k-1)dim dense units, We can self-join DK-1 to get the candidate k-dim units.
we discard those dense units from Ck which have a projection (k-1)-dim that
isn't included in Ck-1 .
B. Making the bottom-up algorithm faster with MDL-base
pruning.
16. A. Determination of dense units
◦ Determine the set D1 of all one-dimensional dense units.
◦ k=1
◦ While Dk ≠ do
k=k+1
Determine the set Dk as the set of all the k-dimensional dense units
all of whose (k-1)-dimensional projections, belong to Dk-1.
◦ End while
17. B. Determination of high coverage subspaces.
◦ Determine all the subspaces that contain at least one dense
unit.
◦ Sort these subspaces in descending order according to their
coverage (fraction of the num. of points of the original data set
they contain).
◦ Optimize a suitably defined Minimum Description Length
criterion function and determine a threshold under which a
coverage is considered “low”.
◦ Select the subspaces with “high” coverage.
18. The input to the step of Finding Clusters is a set of dense units
D all in the same k-dim space.
Depth-first search algorithm
◦ Using a Depth –first search algorithm to find the connected components
of the graph, By starting with some U in D, Assign it the first cluster
number and find all the units it is connected to, then if there still are
units in D that have not yet been visited , we find one and repeat the
procedure.
19. For each high coverage subspace S do
◦ Consider the set E of all the dense units in S.
◦ While E ≠
◦ m´ =1
◦ Select a randomly chosen unit u from E.
◦ Assign to Cm´, u and all units of E that are connected to u.
◦ E=E-Cm´
◦ End while
End for
20. The input to this step consists of disjoint clusters in k-
dim subspace.
The goal is to generate a minimal description of each
cluster with two steps:
◦ Covering with maximal region.
◦ Minimal cover.
21. The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the Last
procedure, is the minimum possible union of hyper rectangular
regions.
For example
A B is the minimum cluster description of the shaded region.
C D E is a non-minimal cluster description of the same
region.
22. The CLIQUE Algorithm (cont.)
3. Minimal description of clusters (algorithm)
For each cluster C do
1st stage
• c=0
• While C ≠
c=c+1
Choose a dense unit in C
For i=1 to l
o Grow the unit in both directions along the i-th dimension, trying to cover as
many units in C as possible (boxes that are not belong to C should not be
covered).
End for
Define the set I containing all the units covered by the above procedure
C=C-I
• End while
2nd stage
• Remove all covers whose units are covered by at least another cover.
23.
24. A two dimensional grid of lines of edge size ξ applied in the
two-dimensional feature space.
Two-dimensional and one-dimensional units are defined:
◦ ui
q denotes the i-th one dimensional unit along xq
◦ uij denotes the two dimensional unit resulting from the Cartesian
product of the i-th and j-th intervals along x1 and x2, respectively.
ξ=10 and τ=8% (thus, each unit containing more than
5 points is considered to be dense).
25. The points in u48 and u58, u75 and u76, u83 and u93 are
collinear.
26. One-dimensional dense units:
D1={u2
1, u3
1, u4
1, u5
1, u8
1, u9
1, u1
2, u2
2, u3
2, u5
2, u6
2}
Two-dimensional dense units:
D2={u21, u22, u32, u33, u83, u93}
Notes:
•Although each one of the u48, u75, u76
contains more that 5 points, they are not
included in D2.
•Although it seems unnatural for u83 and
u93 to be included in D2, they are
included since u3
2 is dense.
• All subspaces of the two-dimensional
space contain clusters.
28. C1={(x1): 1 x1<5}
C2={(x1): 7 x1<9}
C3={(x2): 0 x2<3}
C4={(x2): 4 x2<6}
C5={(x1, x2): 1 x1<2, 0 x2<2}{(x1, x2): 2 x1<3, 1 x2<3}
C6={(x1, x2): 7 x1<9, 2 x2<3}
Note that C2 and C6 are
essentially the same cluster,
which is reported twice by
the algorithm.
29. We now empirically evaluate CLIQUE using synthetic data (Generator
from M.Zait and H.Messatfa. a comparative study of clustering methods)
The goals of the experiments are to assess the efficiency of
CLIQUE:
Efficiency :Determine how the running time scales with
◦ Dimensionality of the data space.
◦ Dimensionality of clusters.
◦ Size of data.
Accuracy:Test if CLIQUE recovers known clusters in some
subspaces of a high dimensional data space.
30.
31.
32. Using clusters embedded in 5-dim subspaces while varying
the dimensional of the space from 5 to50.
CLIQUE was able to recover all clusters in every case.
33.
34. Strength
◦ automatically finds subspaces of the highest dimensionality such that
high density clusters exist in those subspaces
◦ insensitive to the order of records in input and does not presume some
canonical data distribution
◦ scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases
Weakness
◦ The accuracy of the clustering result may be degraded at the expense of
simplicity of the method
35. The problem of high dimensionality is often tackled by requiring the
user to specify the subspace for cluster analysis. But user-identification
of quite error-prone. CLIQUE can find clusters embedded in subspaces of
high dimensional data without requiring the user to guess subspaces that
might have interesting clusters.
CLIQUE generates cluster descriptions in the form of DNF expressions
that are minimized for ease of comprehension.
CLIQUE is insensitive to the order of input records, Some clustering
algorithms are sensitive to the order of input data.
Empirical evolution shows that CLIQUE scales linearly with the size of
input and has good scalability as the number of dimension in the data.
CLIQUE can accurately discover clusters embedded in lower dimensional
subspaces.