1. Introduction to Machine Learning
Lecture 18
Clustering
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 17
Clustering
Hierarchical clustering
3. Today’s Agenda
Partitional clustering: K-means
Applications of clustering
Using Weka
4. Partitional Clustering
Aim
Assign a set of objects into K clusters with no hierarchical structure
How?
First approach: enumerate all partitions and pick the one that minimizes a measure of quality
However
Too expensive as the number of elements grows
E.g.: organizing 30 objects into 3 groups gives about 2·10^14 possible assignments (3^30)
Hence, we need heuristic methods
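The size of the search space for the 30-objects, 3-groups example can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `stirling2` is not from the lecture, and it uses the standard inclusion-exclusion formula for Stirling numbers of the second kind. Counting labeled assignments gives 3^30, about 2·10^14. Counting partitions into non-empty groups with the labels ignored gives a smaller but still intractable number, about 3.4·10^13.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Ways to split n distinct objects into k non-empty, unlabeled groups."""
    # Inclusion-exclusion over the number of groups left empty.
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


labeled = 3 ** 30              # each object independently picks one of 3 labeled clusters
unlabeled = stirling2(30, 3)   # non-empty groups, cluster labels ignored

print(f"labeled assignments:  {labeled:.2e}")
print(f"unlabeled partitions: {unlabeled:.2e}")
```

Either way, evaluating a quality measure on every candidate is out of reach even for this small problem.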
5. Defining the Problem
The problem is
Map N objects into K clusters
Each object belongs to exactly one cluster
Key factors
Criterion function
Algorithm process
We’ll see
Squared error algorithms
6. Squared Error Algorithms
Definition of squared error
Assume a collection of objects x1, x2, … xN
We want to organize them in K clusters c1, c2, … cK
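The squared error criterion is defined as
$$J = \sum_{i=1}^{K} \sum_{x_j \in c_i} \lVert x_j - m_i \rVert^2$$
where $m_i = \frac{1}{|c_i|} \sum_{x_j \in c_i} x_j$ is the mean (prototype) of cluster $c_i$, and $M = [m_1, \dots, m_K]$ is the cluster prototype matrix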
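As a small worked example, the squared error of a partition is the sum, over all objects, of the squared Euclidean distance between the object and the mean of its cluster. The sketch below assumes that standard form. The function name `squared_error` and the toy data are illustrative, not taken from the lecture.

```python
def squared_error(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # Accumulate per-cluster coordinate sums to obtain the means.
    for x, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += x[d]
    means = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((x[d] - means[c][d]) ** 2 for d in range(dim))
        for x, c in zip(points, labels)
    )


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = squared_error(points, [0, 0, 1, 1], k=2)  # groups the nearby points together
bad = squared_error(points, [0, 1, 0, 1], k=2)   # pairs up distant points
print(good, bad)
```

The partition that groups nearby points yields the smaller value, which is what the criterion is meant to reward.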
7. Formulation of the Problem
Goal
Find the clustering that minimizes the squared error over all possible clusterings
Characteristics of k-means
It was discovered by several researchers across different disciplines
Requires the user to specify the number of clusters, k
This sidesteps the problem of determining the number of clusters automatically
Uses an iterative heuristic procedure to search for good prototypes
8. K-means
The procedure
1. Initialize a k-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix M
2. Assign each object of the data set to the nearest cluster center (ci)
3. Recalculate the cluster prototype matrix based on the current partition
4. Repeat steps 2 and 3 until there is no change for each cluster
Will this lead to the best solution?
Not necessarily
At least, it will lead to a locally optimal solution
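The four steps of the procedure can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation. It makes one simplifying assumption for step 1: instead of drawing a random partition, it picks k distinct objects as the initial prototypes. The names `kmeans`, `dist2`, `seed`, and `max_iter` are illustrative.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1 (assumed variant): use k distinct objects as initial prototypes.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: dist2(p, centers[c])) for p in points
        ]
        # Step 4: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each prototype as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

On this toy data the two well-separated groups are recovered. On harder data the result depends on the initial prototypes, which is why only a locally optimal solution can be expected.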
13. Conservative k-means alg.
Lloyd's algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost
The smaller the clustering cost of a partition of data points is, the better that clustering is
Different methods (e.g., the squared error distortion) can be used to measure this clustering cost
14. Greedy k-means alg.
1. Select an arbitrary partition P into k clusters
2. while forever
   1. bestChange ← 0
   2. for every cluster C
      1. for every element i not in C
         1. if moving i to cluster C reduces its clustering cost
            1. if cost(P) - cost(P_{i→C}) > bestChange
               bestChange ← cost(P) - cost(P_{i→C})
               i* ← i
               C* ← C
   3. if bestChange > 0
      1. Change partition P by moving i* to C*
   4. else
      1. return P
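The greedy procedure can be turned into a short runnable sketch. It assumes the squared error distortion as the clustering cost, which is one of several possible choices, and it recomputes the cost from scratch for every candidate move, which is simple but slow. The names `cost`, `greedy_kmeans`, and `eps` are illustrative. The small tolerance `eps` is an addition that guards against floating-point noise.

```python
def cost(points, labels, k):
    """Squared error distortion of a partition (one possible clustering cost)."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in members)
    return total


def greedy_kmeans(points, labels, k, eps=1e-12):
    labels = list(labels)  # work on a copy of the initial partition P
    while True:
        best_change, best_i, best_c = eps, None, None
        current = cost(points, labels, k)
        for c in range(k):
            for i, old in enumerate(labels):
                if old == c:
                    continue  # element i is already in cluster C
                labels[i] = c  # tentatively move i to C
                change = current - cost(points, labels, k)
                labels[i] = old  # undo the tentative move
                if change > best_change:
                    best_change, best_i, best_c = change, i, c
        if best_i is None:
            return labels  # no single move improves the cost
        labels[best_i] = best_c  # apply the single best move


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
result = greedy_kmeans(points, [0, 0, 0, 1], k=2)
print(result, cost(points, result, 2))
```

Starting from a partition that places 10 with the low values, the single best move transfers it to the other cluster, after which no move lowers the cost.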
15. Some Remarks
Further comments about k-means
No efficient and universal method for identifying the initial partitions
Run the algorithm many times with random initial partitions
The iterative approach cannot guarantee convergence to the global optimum
Incorporate techniques such as genetic algorithms (GAs) or simulated annealing (SA) to steer the search toward the global optimum
It is sensitive to outliers and noise
Some approaches such as ISODATA and PAM consider the
effect of outliers
The definition of “means” restricts the application to continuous
variables
New dissimilarity measures to deal with categorical variables
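To illustrate the last remark, one commonly used dissimilarity for categorical data is simple matching, which counts the attributes on which two objects disagree. With it, the per-attribute mode can stand in for the mean as the cluster prototype. This is the idea behind k-modes-style methods, and it is offered here only as an example, not necessarily the measure the lecture had in mind. The function names and toy data are illustrative.

```python
from collections import Counter


def simple_matching(a, b):
    """Number of attributes on which two categorical objects disagree."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def mode_prototype(members):
    """Most frequent value per attribute; plays the role of the mean."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
prototype = mode_prototype(cluster)
distances = [simple_matching(obj, prototype) for obj in cluster]
print(prototype, distances)
```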
17. Traveling Salesman Problem
Up to millions of cities
First organize cities in clusters
[Figures: results for 10k cities, 100k cities, and 1M cities]
18. Bioinformatics – Gene Expression Data
Application to
Genome sequencing projects
DNA microarray technologies
DNA microarray technology
Effective and efficient way to measure gene expression levels
of thousands of genes simultaneously
Investigation of the role of the genes
Clustering: Reveal hidden structures of biological data
Assumption: Functionally similar genes or proteins usually
share similar patterns or primary sequence structures
21. Next Class
Genetic Fuzzy Systems