This document provides an overview of clustering techniques. It discusses what clustering is, different types of attributes that can be clustered, and major clustering approaches. The major approaches covered are partitioning algorithms, which construct partitions and evaluate them; hierarchical algorithms, which create a hierarchical decomposition; and density-based algorithms, which are based on connectivity and density. Examples of applications are also provided.
1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 9
Clustering
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
1
2. Topics
What is Cluster Analysis?
Types of Attributes in Cluster Analysis
Major Clustering Approaches
Partitioning Algorithms
Hierarchical Algorithms
2 Data Warehousing and Data Mining by Kritsada Sriphaew
3. Classification vs. Clustering
Classification: Supervised learning
Learns a method for predicting the
instance class from pre-labeled
(classified) instances
3 Clustering Analysis
4. Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
4 Clustering Analysis
5. Clustering Methods
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
5 Clustering Analysis
6. What is Cluster Analysis ?
Cluster: a collection of data objects
High similarity of objects within a cluster
Low similarity of objects across clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is an unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
6 Clustering Analysis
7. General Applications of Clustering
Pattern Recognition
In biology, derive plant and animal taxonomies, categorize genes
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (e.g., market research)
discovering distinct groups in customer bases and characterizing customer
groups based on purchasing patterns
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
7 Clustering Analysis
8. Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
8 Clustering Analysis
9. Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
9 Clustering Analysis
10. Criteria for Clustering
A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on:
both the similarity measure used by the method and its
implementation
the ability to discover some or all of the hidden patterns
10 Clustering Analysis
11. Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
11 Clustering Analysis
12. The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = |A(X) - A(Y)|
Several numeric attributes:
Distance(X,Y) = Euclidean distance between X,Y
Nominal attributes: distance is set to 1 if values are
different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary
12 Clustering Analysis
13. From Data Matrix to Similarity or Dissimilarity Matrices
Data matrix (or object-by-attribute structure)
m objects with n attributes, e.g., relational data
    | x11 ... x1j ... x1n |
    | ...        ...      |
    | xi1 ... xij ... xin |
    | ...        ...      |
    | xm1 ... xmj ... xmn |
Similarity and dissimilarity matrices
a collection of proximities for all pairs of m objects (m x m, lower triangular):
    Similarity matrix:                 Dissimilarity matrix:
    | 1                 |              | 0                 |
    | s21  1            |              | d21  0            |
    | ...               |              | ...               |
    | sm1 ... smj ... 1 |              | dm1 ... dmj ... 0 |
    sii = 1, sij = sji, 0 <= sij <= 1      dii = 0, dij = dji, dij >= 0
13 Clustering Analysis
14. Distance Functions (Overview)
To transform a data matrix to similarity or dissimilarity
matrices, we need a definition of distance.
Some definitions of distance functions depend on the type of
attributes
interval-scaled attributes
Boolean attributes
nominal, ordinal and ratio attributes.
Weights should be associated with different attributes based
on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
14 Clustering Analysis
15. Similarity and Dissimilarity Between Objects (I)
Distances are normally used to measure the similarity
or dissimilarity between two data objects
Some popular ones include the Minkowski distance:
d(i,j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
d(i,j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
15 Clustering Analysis
16. Similarity and Dissimilarity Between Objects (II)
If q = 2, d is the Euclidean distance:
d(i,j) = sqrt( |xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2 )
Properties
d(i,j) >= 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) <= d(i,k) + d(k,j)
Also one can use a weighted distance, the parametric Pearson product
moment correlation, or other dissimilarity measures. Weighted distance:
d(i,j) = sqrt( w1|xi1 - xj1|^2 + w2|xi2 - xj2|^2 + ... + wp|xip - xjp|^2 )
16 Clustering Analysis
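As an illustration of slides 15-16, here is a minimal Python sketch (function names are mine, not from the lecture) of the Minkowski, Manhattan, Euclidean, and weighted Euclidean distances between two p-dimensional objects.

# Minimal sketch of the distance functions from slides 15-16 (illustrative names).
def minkowski(x, y, q):
    # d(i,j) = (sum_k |x_k - y_k|^q)^(1/q), q a positive integer
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):          # q = 1
    return minkowski(x, y, 1)

def euclidean(x, y):          # q = 2
    return minkowski(x, y, 2)

def weighted_euclidean(x, y, w):
    # d(i,j) = sqrt(sum_k w_k * |x_k - y_k|^2)
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)) ** 0.5

# Example: two 2-dimensional objects
print(manhattan((2, 6), (3, 4)))            # 3
print(round(euclidean((2, 6), (3, 4)), 3))  # 2.236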
17. Types of Attributes in Clustering
Interval-scaled attributes
Continuous measures of a roughly linear scale
Binary attributes
Two-state measures: 0 or 1
Nominal, ordinal, and ratio attributes
More than two states, nominal or ordinal or nonlinear scale
Mixed types
Mixture of interval-scaled, symmetric binary, asymmetric binary,
nominal, ordinal, or ratio-scaled attributes
17 Clustering Analysis: Types of Attributes
18. Interval-valued Attributes
Standardize data
Calculate the mean absolute deviation:
sf = (1/n) ( |x1f - mf| + |x2f - mf| + ... + |xnf - mf| )
where mf = (1/n) ( x1f + x2f + ... + xnf )
Calculate the standardized measurement (mean-absolute-deviation-based z-score):
zif = (xif - mf) / sf
Using mean absolute deviation is more robust than using standard
deviation since the z-scores of outliers do not become too small.
Hence, the outliers remain detectable.
18 Clustering Analysis: Types of Attributes
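A small Python sketch of this standardization (variable names are mine): compute mf, sf, and the mean-absolute-deviation-based z-scores for one attribute. The second example reuses the outlier list from slide 38.

# Sketch of the mean-absolute-deviation z-score from slide 18 (illustrative names).
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                            # attribute mean
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in values]         # zif = (xif - mf) / sf

print(standardize([1, 3, 5, 7, 9]))      # outlier-free example
print(standardize([1, 3, 5, 7, 1009]))   # the outlier remains clearly detectable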
19. Binary Attributes
A binary variable contains two possible outcomes: 1 (positive/present) or 0 (negative/absent).
• If there is no preference for which outcome should be coded as 0 and which as 1,
the binary variable is called symmetric.
• If the outcomes of a binary variable are not equally important, the binary variable is
called asymmetric, such as "is color-blind" for a human being. The most important
outcome is usually coded as 1 (present) and the other is coded as 0 (absent).
A contingency table for binary data:
                     Object j
                 1       0      sum
  Object i  1    a       b      a+b
            0    c       d      c+d
          sum   a+c     b+d      p
Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i,j) = (b + c) / (a + b + c + d)
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d(i,j) = (b + c) / (a + b + c)
19 Clustering Analysis: Types of Attributes
20. Dissimilarity on Binary Attributes
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
Here, let the values Y and P be set to 1, and the value N be set to 0.
Then calculate the dissimilarity over the asymmetric binary attributes only:
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
20 Clustering Analysis: Types of Attributes
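The example above can be reproduced with a short sketch (assumed encoding: Y/P = 1, N = 0, asymmetric attributes only; the helper name is mine).

# Jaccard-style dissimilarity over asymmetric binary attributes (slide 20): d = (b+c)/(a+b+c).
def asym_binary_dissim(x, y):
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75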
21. Nominal Attributes
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
d(i,j) = (p - m) / p
where m is the # of matches and p is the total # of nominal attributes
Method 2: Use a large number of binary variables
creating a new binary variable for each of the M nominal
states
21 Clustering Analysis: Types of Attributes
22. Ordinal Attributes
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled attributes:
replace xif by its rank rif in {1, ..., Mf}
map the range of each variable onto [0, 1] by replacing the rank of the
i-th object in the f-th variable by
zif = (rif - 1) / (Mf - 1)
compute the dissimilarity using methods for interval-scaled
variables
22 Clustering Analysis: Types of Attributes
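A minimal sketch of this rank-based mapping (the function name and the bronze/silver/gold example are mine, for illustration only).

# Map ordinal values to [0, 1] via zif = (rif - 1) / (Mf - 1), as on slide 22.
def ordinal_to_interval(values, ordered_states):
    M_f = len(ordered_states)
    rank = {state: r + 1 for r, state in enumerate(ordered_states)}   # rif in {1, ..., Mf}
    return [(rank[v] - 1) / (M_f - 1) for v in values]

print(ordinal_to_interval(["bronze", "gold", "silver"],
                          ["bronze", "silver", "gold"]))  # [0.0, 1.0, 0.5]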
23. Ratio-Scaled Attributes
Ratio-scaled variable: a positive measurement on a nonlinear
scale, approximately at an exponential scale, such as A*e^(Bt)
or A*e^(-Bt)
Methods:
(1) treat them like interval-scaled attributes
not a good choice!
(2) apply logarithmic transformation
yif = log(xif)
(3) treat them as continuous ordinal data and
treat their rank as interval-scaled.
23 Clustering Analysis: Types of Attributes
24. Mixed Types
A database may contain all the six types of attributes
symmetric binary, asymmetric binary, nominal, ordinal,
interval and ratio.
One may use a weighted formula to combine their effects:
d(i,j) = ( sum over f = 1..p of delta_ij(f) * dij(f) ) / ( sum over f = 1..p of delta_ij(f) )
where delta_ij(f) is an indicator weight for attribute f and dij(f) is the
attribute-specific dissimilarity:
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled
compute ranks rif and
treat (normalized) zif as interval-scaled
24 Clustering Analysis: Types of Attributes
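A sketch of the weighted combination above, assuming each attribute's dissimilarity dij(f) has already been computed and delta_ij(f) is a 0/1 indicator of whether attribute f is usable for the pair (names are mine).

# Combined dissimilarity for mixed attribute types (slide 24):
# d(i,j) = sum_f delta_f * d_f / sum_f delta_f
def mixed_dissimilarity(per_attribute_d, indicators):
    total_weight = sum(indicators)
    if total_weight == 0:
        return None  # no comparable attributes for this pair
    return sum(delta * d for delta, d in zip(indicators, per_attribute_d)) / total_weight

# Example: three attributes, the second one not comparable for this pair
print(mixed_dissimilarity([0.5, 0.9, 1.0], [1, 0, 1]))  # 0.75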
25. Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters
and the idea is to find the best fit of the data to the given model
25 Clustering Analysis: Clustering Approaches
26. Partitioning Approach
Construct a partition of a database D of n objects into a set of k
clusters
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
26 Clustering Analysis: Partitioning Algorithms
27. The K-Means Clustering Method
(Overview)
Given k, the k-means algorithm is implemented in 4 steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of
the current partition. The centroid is the center (mean
point) of the cluster.
Assign each object to the cluster with the nearest seed
point.
Go back to Step 2; stop when no more new assignments are made.
27 Clustering Analysis: Partitioning Algorithms
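A minimal pure-Python sketch of these four steps (names are mine; here the initialization picks k random points as seeds rather than a random partition, and real use would rely on a library such as scikit-learn).

import random

def kmeans(points, k, max_iter=100):
    # pick k initial cluster centers (random seed points)
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute the centroid (mean point) of each cluster
        new_centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:   # stop when nothing changes any more
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)], k=2)
print(centers)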
29. K-means example, step 1
Given k = 3, pick 3 initial cluster centers k1, k2, k3 (randomly).
(Figure: scatter of points in the X-Y plane with the three initial centers marked.)
29 Clustering Analysis: Partitioning Algorithms
30. K-means example, step 2
Assign each point to the closest cluster center.
(Figure: points in the X-Y plane grouped around k1, k2, k3.)
30 Clustering Analysis: Partitioning Algorithms
31. K-means example, step 3
Move each cluster center to the mean of its assigned points.
(Figure: centers k1, k2, k3 shifted from their old positions to the cluster means.)
31 Clustering Analysis: Partitioning Algorithms
32. K-means example, step 4
Reassign points that are now closest to a different cluster center.
Q: Which points are reassigned?
(Figure: same scatter with the updated centers k1, k2, k3.)
32 Clustering Analysis: Partitioning Algorithms
33. K-means example, step 4 …
A: three points change cluster.
(Figure: the three reassigned points highlighted.)
33 Clustering Analysis: Partitioning Algorithms
34. K-means example, step 4b
Re-compute the cluster means.
(Figure: updated means for k1, k2, k3.)
34 Clustering Analysis: Partitioning Algorithms
35. K-means example, step 5
Move the cluster centers to the new cluster means; repeat until no assignment changes.
(Figure: final centers k1, k2, k3 and their clusters.)
35 Clustering Analysis: Partitioning Algorithms
36. Problems to be considered
What can be the problems with K-means clustering?
Result can vary significantly depending on initial choice of seeds
(number and position)
Can get trapped in a local minimum, depending on the initial cluster centers
(Figure: example instances with poorly placed initial cluster centers.)
Q: What can be done?
A: To increase chance of finding global optimum: restart with
different random seeds.
What can be done about outliers?
36 Clustering Analysis: Partitioning Algorithms
37. The K-Means Clustering Method
(Strength and Weakness)
Strength
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n
Good for finding clusters with spherical shapes
Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
Weakness
Applicable only when mean is defined, then what about categorical
data?
Need to specify k, no. of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
37 Clustering Analysis: Partitioning Algorithms
38. The K-Means Clustering Method
(Variations – I)
A few variants of the k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
(Side note: the mean of 1, 3, 5, 7, 9 is 5; the mean of 1, 3, 5, 7, 1009 is 205;
the median of 1, 3, 5, 7, 1009 is 5. Median advantage: not affected by extreme values.)
K-medoids – instead of mean, use medians of each cluster
For large databases, use sampling
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical/numerical data: k-prototype method
38 Clustering Analysis: Partitioning Algorithms
39. The K-Medoids Clustering Method
(Overview)
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not
scale well for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
39 Clustering Analysis: Partitioning Algorithms
40. The K-Medoids Clustering Method
(PAM - Partitioning Around Medoids)
PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost Sih
3. For each pair of i and h, if Sih < 0, i is replaced by h;
then assign each non-selected object to the most similar
representative object
4. Repeat steps 2-3 until there is no change
40 Clustering Analysis: Partitioning Algorithms
41. PAM example
Cluster the following data set of ten objects into two
clusters, i.e., k = 2.
Object   X   Y
X1       2   6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
41 Clustering Analysis
42. PAM example, step 1
Initialise k medoids. Let us assume c1 = (3,4) and c2 = (7,4).
Calculate distances so as to associate each data object with its nearest medoid.
Assume that cost is calculated using the Minkowski distance metric with r = 1
(i.e., Manhattan distance).
  c1     Data object (Xi)   Cost (distance)      c2     Data object (Xi)   Cost (distance)
 (3,4)       (2,6)                3             (7,4)       (2,6)                7
 (3,4)       (3,8)                4             (7,4)       (3,8)                8
 (3,4)       (4,7)                4             (7,4)       (4,7)                6
 (3,4)       (6,2)                5             (7,4)       (6,2)                3
 (3,4)       (6,4)                3             (7,4)       (6,4)                1
 (3,4)       (7,3)                5             (7,4)       (7,3)                1
 (3,4)       (8,5)                6             (7,4)       (8,5)                2
 (3,4)       (7,6)                6             (7,4)       (7,6)                2
Each object is assigned to its nearer medoid; total cost = 3+4+4+3+1+1+2+2 = 20.
42 Clustering Analysis
44. PAM example, step 2
Select a non-medoid object O′ randomly. Let us assume O′ = (7,3). So now the
medoids are c1 = (3,4) and O′ = (7,3).
Calculate the cost of the new medoid configuration by using the formula in step 1.
  c1     Data object (Xi)   Cost (distance)      O′     Data object (Xi)   Cost (distance)
 (3,4)       (2,6)                3             (7,3)       (2,6)                8
 (3,4)       (3,8)                4             (7,3)       (3,8)                9
 (3,4)       (4,7)                4             (7,3)       (4,7)                7
 (3,4)       (6,2)                5             (7,3)       (6,2)                2
 (3,4)       (6,4)                3             (7,3)       (6,4)                2
 (3,4)       (7,4)                4             (7,3)       (7,4)                1
 (3,4)       (8,5)                6             (7,3)       (8,5)                3
 (3,4)       (7,6)                6             (7,3)       (7,6)                3
Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
44 Clustering Analysis
45. PAM example, step 2b
So the cost of swapping medoid c2 for O′ is
S = current total cost - past total cost = 22 - 20 = 2 > 0
Since S > 0, moving to O′ would be a bad idea, so the previous
choice was good, and the algorithm terminates here (i.e.,
there is no change in the medoids).
Some data points may still shift from one cluster to another,
depending on their closeness to the medoids.
45 Clustering Analysis
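The two total-cost figures above (20 for the medoids {c1, c2} and 22 after the trial swap to O′) can be checked with a short sketch (helper names are mine).

# Recompute the PAM example's configuration costs with Manhattan distance (slides 42-45).
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # each non-medoid object is assigned to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points if p not in medoids)

print(total_cost([(3, 4), (7, 4)]))  # 20  (current medoids c1, c2)
print(total_cost([(3, 4), (7, 3)]))  # 22  (c2 swapped for O' = (7,3))
# Swap cost S = 22 - 20 = 2 > 0, so the swap is rejected.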
46. CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
46 Clustering Analysis: Partitioning Algorithms
47. CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng
and Han’94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a graph where
every node is a potential solution, that is, a set of k medoids
If the local optimum is found, CLARANS starts with a new randomly
selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may further improve
its performance (Ester et al.’95)
47 Clustering Analysis: Partitioning Algorithms
48. The Partition-Based Clustering
(Discussion)
Result can vary significantly based on initial choice of seeds
Algorithm can get trapped in a local minimum
Example: four instances at the vertices of a two-
dimensional rectangle
Local minimum: two cluster centers at the midpoints
of the rectangle’s long sides
Simple way to increase chance of finding a global optimum:
restart with different random seeds
48 Clustering Analysis: Hierarchical Algorithms
49. Hierarchical Clustering
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an
input, but needs a termination condition
(Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds from
step 0 to step 4, merging {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e};
divisive clustering (DIANA) runs in the opposite direction, starting from {a,b,c,d,e}
and splitting until each object is its own cluster.)
49 Clustering Analysis: Hierarchical Algorithms
50. AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
(Figure: three scatter plots on a 0-10 by 0-10 grid showing the closest clusters
being merged step by step.)
50 Clustering Analysis: Hierarchical Algorithms
51. Dendrogram for Hierarchical Clustering
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.
51 Clustering Analysis: Hierarchical Algorithms
52. DIANA - Divisive Analysis
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
(Figure: three scatter plots on a 0-10 by 0-10 grid showing one cluster being split
step by step until each object stands alone.)
52 Clustering Analysis: Hierarchical Algorithms
53. Hierarchical Clustering
Major weakness of agglomerative clustering methods
do not scale well: time complexity of at least O(n2),
where n is the number of total objects
can never undo what was done previously
53 Clustering Analysis: Hierarchical Algorithms
54. Distances - Hierarchical Clustering
(Overview)
Four measures for the distance between clusters are:
Single linkage (minimum distance):
dmin(Ci, Cj) = min over p in Ci, p' in Cj of |p - p'|
Complete linkage (maximum distance):
dmax(Ci, Cj) = max over p in Ci, p' in Cj of |p - p'|
Centroid comparison (mean distance):
dmean(Ci, Cj) = |mi - mj|, where mi and mj are the cluster centroids
Element comparison (average distance):
davg(Ci, Cj) = (1 / (ni * nj)) * sum over p in Ci, p' in Cj of |p - p'|
54 Clustering Analysis: Hierarchical Algorithms
55. Distances - Hierarchical Clustering
(Graphical Representation)
Four measures for distance between clusters are (1) single linkage, (2) complete
linkage, (3) centroid comparison and (4) element comparison
(Figure: three clusters with centroids marked x. Arrows illustrate (1) single linkage
between the closest pair of points in two clusters, (2) complete linkage between the
farthest pair, (3) centroid comparison between the centroid marks, and (4) element
comparison as the average distance among all elements in two clusters.)
55 Clustering Analysis: Hierarchical Algorithms
56. Practice
Use single and complete link agglomerative clustering
to group the data described by the following
distance matrix. Show the dendrograms.
A B C D
A 0 1 4 5
B 0 2 6
C 0 3
D 0
56 Clustering Analysis
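A minimal sketch for this practice exercise, assuming SciPy is available: build single- and complete-link hierarchies from the given distance matrix (variable names are mine).

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D"]
D = np.array([[0, 1, 4, 5],
              [1, 0, 2, 6],
              [4, 2, 0, 3],
              [5, 6, 3, 0]], dtype=float)

condensed = squareform(D)                 # condensed form required by linkage()
for method in ("single", "complete"):
    Z = linkage(condensed, method=method)
    print(method, "merge steps:")
    print(Z)
# scipy.cluster.hierarchy.dendrogram(Z, labels=labels) draws the tree (needs matplotlib).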
58. Advanced Method: BIRCH (Overview)
Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang,
Raghu Ramakrishnan, Miron Livny, 1996]
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
Only works with "metric" attributes
Must have Euclidean coordinates
Designed for very large data sets
Time and memory constraints are explicit
Treats dense regions of data points as sub-clusters
Not all data points are important for clustering
Only one scan of data is necessary
58 Clustering Analysis
59. BIRCH (Merits)
Incremental, distance-based approach
Decisions are made without scanning all data points, or all
currently existing clusters
Does not need the whole data set in advance
Unique approach: Distance-based algorithms generally need
all the data points to work
Make best use of available memory while minimizing I/O
costs
Does not assume that the probability distributions of the
attributes are independent
59 Clustering Analysis
60. BIRCH – Clustering Feature and Clustering Feature Tree
BIRCH introduces two concepts, the clustering feature and the clustering
feature tree (CF tree), which are used to summarize cluster
representations.
These structures help the clustering method achieve good speed and
scalability in large databases and make it effective for incremental and
dynamic clustering of incoming objects.
Given n d-dimensional data objects or points in a cluster, we can
define the centroid x0, radius R and diameter D of the cluster
(see the next slide).
60 Clustering Analysis
61. BIRCH – Centroid, Radius and Diameter
• Given a cluster of instances , we define:
• Centroid: the center of a cluster
• Radius: average distance from member points to centroid
• Diameter: average pair-wise distance within a cluster
61 Clustering Analysis
62. BIRCH – Centroid Euclidean and Manhattan distances
• The centroid Euclidean distance and centroid
Manhattan distance are defined between any two
clusters.
• Centroid Euclidean distance
• Centroid Manhattan distance
62 Clustering Analysis
63. BIRCH
(Average inter-cluster, Average intra-cluster, Variance increase)
• The average inter-cluster, the average intra-cluster, and the
variance increase distances are defined as follows
• Average inter-cluster
• Average intra-cluster
• Variance increase distances
63 Clustering Analysis
64. Clustering Feature
CF = (N,LS,SS)
N: Number of points in cluster
LS: Sum of points in the cluster
SS: Sum of squares of points in the cluster
CF Tree
Balanced search tree
Each node has a CF triple for each child
A leaf node represents a cluster and has a CF value for each
subcluster in it
Each subcluster in a leaf is bounded by a maximum diameter (threshold)
64 Clustering Analysis
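A sketch of a clustering feature as a simple data structure (my own illustrative code, 1-dimensional for brevity; the centroid/radius/diameter formulas follow the standard BIRCH definitions summarized in words on slides 60-61 and are easy to update incrementally from N, LS, SS).

import math

class CF:
    """Clustering feature CF = (N, LS, SS) for 1-dimensional points."""
    def __init__(self):
        self.N, self.LS, self.SS = 0, 0.0, 0.0

    def add(self, x):                 # incremental update: one data point at a time
        self.N += 1
        self.LS += x
        self.SS += x * x

    def centroid(self):               # x0 = LS / N
        return self.LS / self.N

    def radius(self):                 # average distance from member points to the centroid
        return math.sqrt(self.SS / self.N - (self.LS / self.N) ** 2)

    def diameter(self):               # average pairwise distance within the cluster
        return math.sqrt((2 * self.N * self.SS - 2 * self.LS ** 2) / (self.N * (self.N - 1)))

cf = CF()
for x in [1.0, 2.0, 3.0]:
    cf.add(x)
print(cf.centroid(), cf.radius(), cf.diameter())  # 2.0, ~0.816, ~1.414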
68. Properties of CF-Tree
Each non-leaf node has at most B entries (B is the branching factor)
Each leaf node has at most L CF entries, each of which satisfies the
threshold T
Node size is determined by the dimensionality of the data space and the
input parameter P (page size)
68 Clustering Analysis
69. BIRCH Algorithm (CF-Tree Insertion)
Recurse down from root, find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 / … / D4
Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF entry.
If there is no room for new leaf, split the parent node
Traverse back and up, updating CFs on the path or splitting nodes
69 Clustering Analysis
72. Details of Each Step
Phase 1: Load data into memory
Build an initial in-memory CF-tree with the data (one scan)
Subsequent phases become fast, accurate, less order sensitive
Phase 2: Condense data
Rebuild the CF-tree with a larger T
Condensing is optional
Phase 3: Global clustering
Use existing clustering algorithm on CF entries
Helps fix problem where natural clusters span nodes
Phase 4: Cluster refining
Do additional passes over the dataset & reassign data points to the
closest centroid from phase 3
Refining is optional
72 Clustering Analysis
73. Summary of BIRCH
BIRCH works with very large data sets
Explicitly bounded by computational resources.
The computation complexity is O(n), where n is the number of
objects to be clustered.
Runs with specified amount of memory (P)
Superior to CLARANS and k-MEANS
Quality, speed, stability and scalability
73 Clustering Analysis
74. CURE (Clustering Using REpresentatives)
CURE was proposed by Guha, Rastogi & Shim, 1998
It stops the creation of a cluster hierarchy if a level consists of k
clusters
Each cluster has c representatives (instead of one)
Choose c well scattered points in the cluster
Shrink them towards the mean of the cluster by a fraction α
The representatives capture the physical shape and geometry of the
cluster
It can treat arbitrarily shaped clusters and avoids the single-link effect.
Merge the closest two clusters
Distance of two clusters: the distance between the two closest
representatives
74 Clustering Analysis
79. Data Partitioning and Clustering
(Figure: CURE data partitioning example with a sample of s = 50 points, p = 2
partitions of s/p = 25 points each, and partial clustering into s/pq = 5 clusters
per partition, shown as x-y scatter plots.)
79 Clustering Analysis
80. Cure: Shrinking Representative Points
Shrink the multiple representative points towards the gravity center by
a fraction α.
Multiple representatives capture the shape of the cluster
(Figure: before/after scatter plots showing the representative points moved toward
the cluster center.)
80 Clustering Analysis
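A small sketch of this shrinking step (illustrative only; α and the representative points are hypothetical inputs, and the centroid of the representatives stands in for the cluster mean).

# Shrink representative points toward the gravity center by a fraction alpha (slide 80).
def shrink(representatives, alpha):
    d = len(representatives[0])
    centroid = [sum(p[i] for p in representatives) / len(representatives) for i in range(d)]
    return [tuple(p[i] + alpha * (centroid[i] - p[i]) for i in range(d))
            for p in representatives]

print(shrink([(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)], alpha=0.5))
# each point moves halfway toward the gravity center (2.0, 1.33)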
81. Clustering Categorical Data: ROCK
ROCK: RObust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
Use links to measure similarity/proximity
Not distance-based with categorical attributes
Computational complexity: O(n^2 + n*mm*ma + n^2 log n), where ma and mm are
the average and maximum number of neighbors
Basic ideas (Jaccard coefficient):
Similarity function and neighbors:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
Let T1 = {1,2,3}, T2 = {3,4,5}:
Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
81 Clustering Analysis
82. ROCK: An Example
Links: The number of common neighbors for the two points.
Using Jaccard similarities to determine neighbors:
(pt1,pt4) = 0, (pt1,pt2) = 0, (pt1,pt3) = 0
(pt2,pt3) = 0.6, (pt2,pt4) = 0.2
(pt3,pt4) = 0.2
Use 0.2 as threshold for neighbors
Pt2 and Pt3 have 3 common neighbors
Pt3 and Pt4 have 3 common neighbors
Pt2 and Pt4 have 3 common neighbors
Resulting clusters (1), (2,3,4) which makes more sense
82 Clustering Analysis
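A sketch of the link computation (illustrative names; theta is the neighbor threshold as in the example above, and I assume the common convention that each point counts as its own neighbor).

# ROCK-style links: neighbors via Jaccard similarity, links = number of common neighbors.
def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(items, theta):
    # neighbors: pairs with similarity >= theta (each point is its own neighbor)
    nbrs = {i: {j for j in range(len(items)) if jaccard(items[i], items[j]) >= theta}
            for i in range(len(items))}
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i in range(len(items)) for j in range(i + 1, len(items))}

print(jaccard({1, 2, 3}, {3, 4, 5}))             # 0.2, as on slide 81
print(links([{1, 2, 3}, {1, 2, 4}], theta=0.2))  # link counts depend on the full collection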
83. ROCK: Property & Algorithm
Links: The number of common neighbours for the
two points.
Example: consider the point sets
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
link({1,2,3}, {1,2,4}) = 3, i.e., the two points have 3 common neighbours
Algorithm
Draw random sample
Cluster with links (maybe agglomerative hierarchical)
Label data in disk
83 Clustering Analysis
84. CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G.
Karypis, E.H. Han and V. Kumar’99
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high relative to
the internal interconnectivity of the clusters and closeness of
items within the clusters
A two phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
84 Clustering Analysis
85. Graph-based clustering
Sparsification techniques keep the connections to the most
similar (nearest) neighbors of a point while breaking the
connections to less similar points.
The nearest neighbors of a point tend to belong to the same
class as the point itself.
This reduces the impact of noise and outliers and sharpens
the distinction between clusters.
85 Clustering Analysis
86. Overall Framework of CHAMELEON
(Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge
Partitions → Final Clusters.)
86 Clustering Analysis
87. Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
87 Clustering Analysis
88. Density-Based Clustering (Background)
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable
from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition: |NEps(q)| >= MinPts
(Figure: points p and q with MinPts = 5 and Eps = 1 cm.)
88 Clustering Analysis
89. Density-Based Clustering (Background)
The neighborhood within a radius Eps of a given object is called the
Eps-neighborhood of the object.
If the Eps-neighborhood of an object contains at least a minimum
number, MinPts, of objects, then the object is called a core object.
Given a set of objects, D, we say that an object p is directly density-
reachable from object q if p is within the Eps-neighborhood of q, and q
is a core object.
An object p is density-reachable from object q with respect to Eps and
MinPts in a set of objects D, if there is a chain of objects p1, …, pn,
where p1 = q and pn = p, such that p(i+1) is directly density-reachable from
pi with respect to Eps and MinPts, for 1 <= i < n, pi in D.
An object p is density-connected to object q with respect to Eps and
MinPts if there is an object o in D such that both p and q are
density-reachable from o (see the next slide).
89 Clustering Analysis
90. Density-Based Clustering
(Figure: chain of points q = p1, …, p illustrating density-reachability.)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-
reachable from pi
Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a
point o such that both, p and q are density-reachable from o wrt. Eps and
MinPts.
(Figure: p and q both density-reachable from a point o.)
90 Clustering Analysis
91. DBSCAN: Density Based Spatial Clustering of Applications with
Noise
Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with
noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)
91 Clustering Analysis
92. DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and
DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
92 Clustering Analysis
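As a practical sketch, scikit-learn's DBSCAN implements this algorithm; a minimal usage example follows (the data points and parameter values are illustrative, chosen so that two dense groups and one isolated noise point emerge).

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],      # dense group 1
              [8, 8], [8.1, 7.9], [7.9, 8.2],      # dense group 2
              [50, 50]])                            # isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)  # eps ~ Eps, min_samples ~ MinPts
print(labels)  # two clusters (labels 0 and 1); the isolated point gets -1 (noise/outlier)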