2. Clustering
• Clustering is the process of partitioning a set of
data (objects) into a set of meaningful subclasses,
called clusters.
• A cluster is a collection of objects that are
similar to each other.
• Clustering is unsupervised classification (no predefined
classes).
4. Clustering Algorithms
• Clustering algorithms are attractive for the task of class
identification. They fall into five main families:
1. Partitioning Methods
2. Hierarchical Methods
3. Density Based Methods
4. Grid Based Methods
5. Model Based Methods
5. Density Based Methods
• Based on the notion of density.
• A density-based clustering algorithm grows regions
of sufficiently high density into clusters.
• The idea is to continue growing a given cluster as
long as the density (number of data points) in the
neighborhood exceeds some threshold; that is, the
neighborhood of a given radius has to contain at
least a minimum number of objects.
• Discover clusters of arbitrary shape
• Handle noise
6. Density Based Methods
• Clustering based on density (a local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan of the data
– Need density parameters as a termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD'96)
– OPTICS: Ankerst, et al. (SIGMOD'99)
– DENCLUE: Hinneburg & Keim (KDD'98)
– CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
7. Density based Notion of Clusters
• Def:1 (Eps-neighborhood of a point)
• The Eps-neighborhood of a point p, denoted by
NEps(p), is defined as:
NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
• A naive approach would require, for each point in a
cluster, that there are at least a minimum number
(MinPts) of points in the Eps-neighborhood of that
point.
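As an illustration, here is a minimal Python sketch of Def:1, assuming Euclidean distance; the function name eps_neighborhood and the (n, d) array layout are illustrative choices, not from the original paper:

```python
import numpy as np

def eps_neighborhood(D, p_idx, eps):
    """Indices of all points q in D with dist(p, q) <= Eps (Def:1).

    D is an (n, d) array of points; p_idx indexes the query point p.
    The neighborhood always contains p itself, since dist(p, p) = 0.
    """
    dists = np.linalg.norm(D - D[p_idx], axis=1)  # Euclidean distance to p
    return np.where(dists <= eps)[0]
```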
9. Def:2 (directly density-reachable)
• A point p is directly density-reachable from a point q
wrt. Eps, MinPts if
• 1) p ∈ NEps(q), and
• 2) |NEps(q)| ≥ MinPts (core point condition).
10. Def:3 (density-reachable)
• A point p is density-reachable from a point q wrt. Eps and
MinPts if there is a chain of points p1, ..., pn, with p1 = q and
pn = p, such that pi+1 is directly density-reachable from pi.
• Def:4 (density-connected)
• A point p is density-connected to a point q wrt. Eps and
MinPts if there is a point o such that both p and q are
density-reachable from o wrt. Eps and MinPts. Density-
connectivity is a symmetric relation. Now we are able to
define our density-based notion of a cluster: a cluster is
defined to be a set of density-connected points which is
maximal wrt. density-reachability. Noise is simply the set
of points in D not belonging to any of its clusters.
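The chain-following in Def:3 (and through it, the density-connectivity of Def:4) can be sketched as a breadth-first expansion. This reuses the illustrative eps_neighborhood helper from above; density_connected only checks a given witness point o, matching the existential in Def:4:

```python
from collections import deque

def density_reachable_set(D, q_idx, eps, min_pts):
    """All points density-reachable from q wrt. Eps and MinPts (Def:3)."""
    reached = {q_idx}
    frontier = deque([q_idx])
    while frontier:
        c = frontier.popleft()
        neighbors = eps_neighborhood(D, c, eps)
        if len(neighbors) >= min_pts:     # c satisfies the core point condition
            for nb in neighbors:          # each nb is directly density-reachable
                if nb not in reached:     # from c, hence density-reachable from q
                    reached.add(nb)
                    frontier.append(nb)
    return reached

def density_connected(D, p_idx, q_idx, o_idx, eps, min_pts):
    """True if o witnesses that p and q are density-connected (Def:4)."""
    reachable = density_reachable_set(D, o_idx, eps, min_pts)
    return p_idx in reachable and q_idx in reachable
```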
12. Def:5 (Cluster)
Let D be a database of points. A cluster C wrt. Eps and MinPts is a
non-empty subset of D satisfying the following conditions:
1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and
MinPts, then q ∈ C. (Maximality)
2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts.
(Connectivity)
Def:6 (Noise)
Let C1, ..., Ck be the clusters of the database D wrt. parameters Epsi
and MinPtsi, i = 1, ..., k. Then we define the noise as the set of points
in the database D not belonging to any cluster Ci, i.e.
noise = {p ∈ D | ∀ i: p ∉ Ci}
16. Lemmas for validating the correctness of
our clustering algorithm
Lemma 1: Let p be a point in D and |NEps(p)| ≥ MinPts. Then the
set O = {o | o ∈ D and o is density-reachable from p wrt. Eps
and MinPts} is a cluster wrt. Eps and MinPts.
• It is not obvious that a cluster C wrt. Eps and MinPts is
uniquely determined by any of its core points. However,
each point in C is density-reachable from any of the core
points of C and, therefore, a cluster C contains exactly the
points which are density-reachable from an arbitrary
core point of C.
17. Lemmas for validating the correctness of our
clustering algorithm
Lemma 2:
• Let C be a cluster wrt. Eps and MinPts and let p be
any point in C with |NEps(p)| ≥ MinPts.
• Then C equals the
set O = {o | o is density-reachable from p wrt. Eps and
MinPts}.
18. Algorithm
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p wrt. Eps and
MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p,
and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been
processed.
• If a spatial index is used, the computational complexity of
DBSCAN is O(n log n), where n is the number of database
objects. Otherwise, the complexity is O(n²).
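A compact, runnable Python sketch of this procedure is given below. It follows the steps just listed; the brute-force region_query and the label conventions (cluster ids from 1, noise as -1) are illustrative choices, not the paper's original code:

```python
import numpy as np

UNCLASSIFIED, NOISE = 0, -1

def dbscan(D, eps, min_pts):
    """Sketch of DBSCAN over an (n, d) point array D. Returns per-point
    labels: cluster ids starting at 1, or NOISE for points in no cluster."""
    labels = np.full(len(D), UNCLASSIFIED)
    cluster_id = 0
    for p in range(len(D)):
        if labels[p] != UNCLASSIFIED:
            continue                           # p was already processed
        seeds = list(region_query(D, p, eps))
        if len(seeds) < min_pts:
            labels[p] = NOISE                  # noise for now; p may later
            continue                           # turn out to be a border point
        cluster_id += 1                        # p is a core point: new cluster
        labels[p] = cluster_id
        i = 0
        while i < len(seeds):                  # expand the cluster outward
            q = seeds[i]
            if labels[q] == NOISE:             # border point reclaimed from noise
                labels[q] = cluster_id
            elif labels[q] == UNCLASSIFIED:
                labels[q] = cluster_id
                neighbors = region_query(D, q, eps)
                if len(neighbors) >= min_pts:  # q is also a core point, so its
                    seeds.extend(neighbors)    # neighborhood joins the seeds
            i += 1
    return labels

def region_query(D, p, eps):
    """Brute-force Eps-neighborhood: O(n) per call, O(n^2) overall."""
    return np.where(np.linalg.norm(D - D[p], axis=1) <= eps)[0]
```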
21. Comparisons (DBSCAN vs. CLARANS)
• The DBSCAN algorithm is compared to another
clustering algorithm called CLARANS
(Clustering Large Applications based on RANdomized
Search).
• CLARANS is an improvement of the k-medoid algorithms.
• Compared to the k-medoid methods, CLARANS works
efficiently for databases of about a thousand objects.
When the database grows larger, CLARANS falls
behind, because the algorithm temporarily stores all
the objects in main memory, i.e. the run time will
increase.
24. Complexity
• DBSCAN visits each point of the database, possibly multiple
times. For practical purposes, the time complexity is mostly
governed by the number of regionQuery invocations. DBSCAN
executes exactly one such query for each point, and if
an indexing structure is used that answers such
a neighborhood query in O(log n), an overall runtime
complexity of O(n log n) is obtained.
• Without an accelerating index structure, the run
time complexity is O(n²). Often the distance matrix of size
(n² − n)/2 is materialized to avoid distance recomputations. This,
however, also needs O(n²) memory, whereas a non-matrix-based
implementation only needs O(n) memory.
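For instance, a k-d tree answers each neighborhood query in roughly O(log n) for low-dimensional data. The sketch below uses SciPy's cKDTree as an illustrative index; the original paper used an R*-tree in the same role:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
D = rng.random((10_000, 2))       # toy 2-d data set

tree = cKDTree(D)                 # build the index once, in O(n log n)

def region_query_indexed(p_idx, eps):
    """Eps-neighborhood of point p, answered by the spatial index."""
    return tree.query_ball_point(D[p_idx], r=eps)

print(len(region_query_indexed(0, 0.05)))   # neighbors of point 0
```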
25. Advantages
• DBSCAN does not require one to specify the number of
clusters in the data a priori, as opposed to k-means.
• DBSCAN can find arbitrarily shaped clusters.
• DBSCAN requires just two parameters and is mostly
insensitive to the ordering of the points in the database.
• DBSCAN has a notion of noise, and is robust to outliers.
• DBSCAN is designed for use with databases that can
accelerate region queries, e.g. using an R*-tree.
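Both of these properties (just two parameters, an explicit noise label) can be seen in scikit-learn's implementation; the eps and min_samples values below are arbitrary toy settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(1).random((500, 2))             # toy data
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)   # only two parameters

# No cluster count was specified; points labelled -1 are noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```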
26. Disadvantages
• DBSCAN is not entirely deterministic: border points that are
reachable from more than one cluster can be part of either
cluster. Fortunately, this situation does not arise often, and
has little impact on the clustering result: on both core points
and noise points, DBSCAN is deterministic.
• The quality of DBSCAN depends on the distance measure used
in the function regionQuery(P, ε). The most common distance
metric used is Euclidean distance, which, especially for
high-dimensional data, can make it difficult to find an
appropriate value for ε. This effect, however, is also present in
any other algorithm based on Euclidean distance.
• DBSCAN cannot cluster data sets well with large differences in
densities, since one Eps-MinPts combination cannot then be
chosen appropriately for all clusters.
27. Extensions
• Generalized DBSCAN (GDBSCAN) is a generalization by the
same authors to arbitrary "neighborhood" and "dense"
predicates.
• Various extensions to the DBSCAN algorithm have been
proposed, including methods for parallelization, parameter
estimation and support for uncertain data. The basic idea has
been extended to hierarchical clustering by the OPTICS
algorithm.
• HDBSCAN is a hierarchical version of DBSCAN, which is also
faster than OPTICS and from which a flat partition consisting
of the most prominent clusters can be extracted.