3. Introduction
• Unsupervised learning is a type of machine
learning algorithm used to draw inferences from datasets
consisting of input data without labeled responses. The
most common unsupervised learning method is cluster
analysis, which is used for exploratory data analysis to
find hidden patterns or groupings in data.
4. What is clustering?
• A cluster is a group of objects that are similar to the other
objects in the cluster, and dissimilar to data points in other clusters.
5. Use of clustering
Clustering has been widely used across industries for
years:
• Biology - for genetic and species grouping;
• Medical imaging - for distinguishing between different
kinds of tissues;
• Market research - for differentiating groups of customers
based on some attributes;
• Recommender systems - for giving you better Amazon
purchase suggestions or Netflix movie matches.
6. Clustering algorithms
• Partition-based clustering
  • Relatively efficient
  • E.g. k-means
• Hierarchical clustering
  • Produces trees of clusters
  • E.g. agglomerative, divisive
• Density-based clustering
  • Produces arbitrary-shaped clusters
  • E.g. DBSCAN
7. K-means clustering
• K-means is a partitioning clustering algorithm
• K-means divides the data into non-overlapping subsets
(clusters) without any cluster-internal structure
• Examples within a cluster are very similar
• Examples across different clusters are very different
11. How does k-means clustering work?
1. Randomly place k centroids, one for each cluster
2. Calculate the distance of each point from each centroid
3. Assign each data point (object) to the closest centroid,
creating a cluster
4. Recalculate the positions of the k centroids
5. Repeat steps 2-4 until the centroids no longer move
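The steps above can be sketched in plain Python. This is a minimal illustration (Lloyd's algorithm) rather than a production implementation; the sample points and the fixed random seed are made up for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on a list of coordinate tuples.

    Mirrors the slide's steps: place k centroids, assign each point
    to its closest centroid, recompute centroids, repeat until stable.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # 1. random initial centroids
    for _ in range(max_iter):
        # 2./3. assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # 4. recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:           # 5. stop when centroids stop moving
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs; the centroids converge to the blob means,
# near (1/3, 1/3) and (31/3, 31/3).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

Note that the result can depend on the random initialization in step 1, which is why k-means is often run several times with different seeds.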
14. • K-means is a partitioning algorithm that is relatively efficient
for medium- and large-sized datasets
• Produces sphere-like clusters
• Needs the number of clusters (k) to be specified
15. Hierarchical clustering
• Hierarchical clustering algorithms build a hierarchy of
clusters in which each node is a cluster consisting of the
clusters of its daughter nodes.
• Hierarchical clustering strategies
  • Divisive (top down)
  • Agglomerative (bottom up)
16. Agglomerative algorithm
1. Create n clusters, one for each data point
2. Compute the proximity matrix
3. Repeat
1. Merge the two closest clusters
2. Update the proximity matrix
4. Until only a single cluster remains
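A naive sketch of this loop in plain Python, stopping early at a target number of clusters instead of merging all the way down to one. Single linkage is assumed as the merge criterion (the next slide covers the alternatives), and the proximity "matrix" is recomputed on the fly for simplicity; a real implementation would cache and update it.

```python
import math

def agglomerative(points, target_k=1):
    """Bottom-up clustering following the steps above."""
    clusters = [[p] for p in points]              # 1. one cluster per data point
    while len(clusters) > target_k:               # 3. repeat until done
        # 2. find the closest pair of clusters (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                # merge the two closest clusters
        del clusters[j]
    return clusters

# Two obvious groups; stopping at target_k=2 recovers them.
groups = agglomerative([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)],
                       target_k=2)
```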
18. Distance between clusters
• Single-linkage clustering
  • Minimum distance between clusters
• Complete-linkage clustering
  • Maximum distance between clusters
• Average-linkage clustering
  • Average distance between clusters
• Centroid-linkage clustering
  • Distance between cluster centroids
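The four criteria can be written as small functions over two clusters. This is a plain-Python sketch assuming Euclidean distance; the cluster contents in the example are made up.

```python
import math

def _pairwise(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return [math.dist(a, b) for a in A for b in B]

def single_linkage(A, B):        # minimum distance between clusters
    return min(_pairwise(A, B))

def complete_linkage(A, B):      # maximum distance between clusters
    return max(_pairwise(A, B))

def average_linkage(A, B):       # average distance between clusters
    d = _pairwise(A, B)
    return sum(d) / len(d)

def centroid_linkage(A, B):      # distance between cluster centroids
    ca = tuple(sum(x) / len(A) for x in zip(*A))
    cb = tuple(sum(x) / len(B) for x in zip(*B))
    return math.dist(ca, cb)

A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]
# By construction, single <= average <= complete for any pair of clusters.
```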
19. Advantages
• Doesn't require the number of clusters to be specified
• Easy to implement
• Produces a dendrogram, which helps with understanding the data
20. Disadvantages
• Can never undo any previous step throughout the algorithm
• Generally has long runtimes
• Sometimes difficult to identify the number of clusters from the
dendrogram
21. Hierarchical clustering vs. K-means
• Efficiency: k-means is much more efficient; hierarchical clustering
can be slow for large datasets
• Number of clusters: k-means requires the number of clusters to be
specified; hierarchical clustering does not require it to run
• Partitioning: k-means gives only one partitioning of the data, based
on the predefined number of clusters; hierarchical clustering gives
more than one partitioning, depending on the resolution
• Determinism: k-means potentially returns different clusters each time
it is run, due to random initialization of centroids; hierarchical
clustering always generates the same clusters
22. DBSCAN clustering
• When applied to tasks with arbitrary-shaped clusters, or
clusters within clusters, traditional techniques might not
be able to achieve good results
• Partition-based algorithms have no notion of outliers; that
is, all points are assigned to a cluster even if they do not
belong in any
• In contrast, density-based clustering locates regions
of high density that are separated from one another by
regions of low density. Density in this context is defined as
the number of points within a specified radius.
25. What is DBSCAN?
• DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
  • Is one of the most common clustering algorithms
  • Works based on the density of objects
• R (radius of neighborhood)
  • A radius that, if it includes enough points within it,
defines a dense area
• M (minimum number of neighbors)
  • The minimum number of data points we want in a
neighborhood to define a cluster
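A minimal sketch of the DBSCAN idea in plain Python. The parameter names eps and min_pts correspond to R and M above; this version scans all points for every neighborhood query, whereas a real implementation would use a spatial index, and the example coordinates are made up.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    eps is the neighborhood radius R; min_pts is M, the minimum number
    of neighbors (including the point itself) for a dense region.
    """
    n = len(points)
    labels = [None] * n                       # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:               # not a core point: mark as noise
            labels[i] = -1                    # (may become a border point later)
            continue
        labels[i] = cluster                   # start a new cluster at this core point
        queue = [j for j in nbrs if j != i]
        while queue:                          # grow the cluster through dense areas
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:            # j is itself a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

# Four points form a dense area; the lone point ends up as noise (-1).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike k-means, no cluster count is supplied: the number of clusters falls out of the density parameters, and points in sparse regions are explicitly labelled as outliers.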