Lecture18

Introduction to Machine Learning
Lecture 18
Clustering

Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull

Recap of Lecture 17
Clustering

Hierarchical clustering

Slide 2

Today’s Agenda

Partitional clustering: K-means
Applications of clustering
Using Weka

Slide 3

Partitional Clustering
Aim
Assign a set of objects to K clusters with no hierarchical structure
How?
First approach: enumerate all partitions and pick the one that minimizes a measure of quality
However
Too expensive when the number of elements increases
E.g.: organizing 30 objects into 3 groups gives about 2·10^14 partitions
Hence, we need heuristic methods
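
To get a feel for why enumeration does not scale, the following sketch counts the candidate solutions for small inputs. It is only an illustration: the function names are made up for this example, and two counting conventions are shown because the count depends on whether clusters are labeled and whether empty clusters are allowed. For 30 objects and 3 groups, the labeled count is 3^30, roughly 2·10^14.

```python
from math import comb, factorial

def labeled_assignments(n, k):
    """Ways to give each of n objects one of k cluster labels (empty clusters allowed)."""
    return k ** n

def stirling2(n, k):
    """Ways to split n objects into k non-empty, unlabeled groups.

    Uses the inclusion-exclusion formula for Stirling numbers of the second kind.
    """
    total = sum((-1) ** (k - m) * comb(k, m) * m ** n for m in range(k + 1))
    return total // factorial(k)

# Growth with the number of objects for K = 3 groups.
for n in (5, 10, 20, 30):
    print(n, labeled_assignments(n, 3), stirling2(n, 3))
```

Either way the count grows exponentially with the number of objects, so exhaustive search is only feasible for toy problems.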

Slide 4

Defining the Problem
The problem is
Map N objects into K clusters
Each object belongs to exactly one cluster
Key factors
Criterion function
Algorithm process

We’ll see
Squared error algorithms

Slide 5

Squared Error Algorithms
Definition of squared error
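Assume a collection of objects x_1, x_2, …, x_N
We want to organize them in K clusters c_1, c_2, …, c_K
The squared error criterion is defined as
$$J = \sum_{i=1}^{K} \sum_{x_j \in c_i} \lVert x_j - m_i \rVert^2$$
where m_i is the mean (prototype) of cluster c_i:
$$m_i = \frac{1}{|c_i|} \sum_{x_j \in c_i} x_j$$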

Slide 6

Formulation of the Problem
Goal
Find the clusterization that minimizes the squared error over all possible clusterizations
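
As a small illustration of the criterion being minimized, the sketch below evaluates the squared error of two candidate partitions of the same four points. It assumes the usual definition (sum of squared Euclidean distances from each object to the mean of its cluster); the function names and the toy data are invented for this example.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_error(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        m = centroid(points)
        total += sum(sum((a - b) ** 2 for a, b in zip(x, m)) for x in points)
    return total

# Two ways of splitting the same four points into K = 2 clusters.
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(squared_error(compact))    # groups nearby points together
print(squared_error(stretched))  # pairs up distant points
```

The partition that groups nearby points has the lower squared error, which is the one the criterion prefers.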

Characteristics of k-means
It was discovered independently by several researchers across different disciplines
Requires the user to specify the number of clusters, which is k
In this way, we avoid the problem of determining the number of
clusters
Uses a heuristic procedure to search for good prototypes

Slide 7

K-means
The procedure
1. Initialize a k-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix M
2. Assign each object of the data set to the nearest cluster center (ci)
3. Recalculate the cluster prototype matrix based on the current partition
4. Repeat steps 2 and 3 until there is no change for each cluster

Will this lead to the best solution?
Not necessarily
At least, it will lead to a locally optimal solution
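
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, starts from prototypes supplied by the caller instead of from an initial partition, and leaves an empty cluster's prototype unchanged. The names `kmeans`, `sq_dist`, `mean` and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(data, init_centers, max_iter=100):
    centers = list(init_centers)          # step 1 (prototypes given by the caller)
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center (ties go to the lower index).
        new_labels = [min(range(k), key=lambda i: sq_dist(x, centers[i])) for x in data]
        if new_labels == labels:          # step 4: no object changed cluster
            break
        labels = new_labels
        # Step 3: recompute each prototype as the mean of its members.
        for i in range(k):
            members = [x for x, lab in zip(data, labels) if lab == i]
            if members:                   # an empty cluster keeps its old center
                centers[i] = mean(members)
    return labels, centers

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
good_labels, good_centers = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
poor_labels, poor_centers = kmeans(data, [(0.0, 1.0), (1.0, 0.0)])
print(good_labels, good_centers)
print(poor_labels, poor_centers)
```

With the first initialization the two groups of four points are recovered. With the second, both starting prototypes lie in the same group and, on this symmetric toy data, the procedure stops at a partition that mixes the two groups, which illustrates the local-optimum caveat.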

Slide 8

Example of k-means

Slide 9

Example of k-means

Slide 10

Example of k-means

Slide 11

Example of k-means

Slide 12

Conservative k-means algorithm
Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost
The smaller the clustering cost of a partition of data points is, the better that clustering is
Different methods (e.g., the squared error distortion) can be used to measure this clustering cost

Slide 13

Greedy k-means algorithm
Select an arbitrary partition P into k clusters
while forever
    bestChange ← 0
    for every cluster C
        for every element i not in C
            if moving i to cluster C reduces its clustering cost
                if cost(P) - cost(P_{i→C}) > bestChange
                    bestChange ← cost(P) - cost(P_{i→C})
                    i* ← i
                    C* ← C
    if bestChange > 0
        Change partition P by moving i* to C*
    else
        return P
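
A possible Python rendering of the greedy procedure is sketched below. It assumes the squared error distortion as the clustering cost and represents the partition P as a list of cluster labels, one per object; both choices, as well as the function names, are assumptions made for this example. The cost is recomputed from scratch for every candidate move, which is simple but slow.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def cost(data, labels, k):
    """Squared error distortion of the partition encoded by labels."""
    total = 0.0
    for c in range(k):
        members = [x for x, lab in zip(data, labels) if lab == c]
        if members:
            m = centroid(members)
            total += sum(sum((a - b) ** 2 for a, b in zip(x, m)) for x in members)
    return total

def greedy_kmeans(data, labels, k):
    labels = list(labels)                     # work on a copy of the partition
    while True:
        best_change, best_i, best_c = 0.0, None, None
        current = cost(data, labels, k)
        for c in range(k):
            for i, old in enumerate(labels):
                if old == c:                  # only elements not already in C
                    continue
                labels[i] = c                 # tentatively move i to C
                change = current - cost(data, labels, k)
                labels[i] = old               # undo the tentative move
                if change > best_change:
                    best_change, best_i, best_c = change, i, c
        if best_change > 0:
            labels[best_i] = best_c           # apply the single best move
        else:
            return labels                     # no move improves the cost

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
result = greedy_kmeans(data, [0, 1, 0, 1], k=2)
print(result, cost(data, result, 2))
```

Starting from a partition that pairs distant points, the procedure moves one point per iteration until no single move lowers the cost.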

Slide 14

Some Remarks
Further comments about k-means
No efficient and universal method for identifying the initial partitions
Run the algorithm many times with random initial partitions
The iterative approach cannot guarantee convergence to the global optimum
Incorporate techniques such as genetic algorithms (GAs) or simulated annealing (SA) to steer the search toward the global optimum
It is sensitive to outliers and noise
Some approaches such as ISODATA and PAM consider the effect of outliers
The definition of “means” restricts the application to continuous variables
New dissimilarity measures are needed to deal with categorical variables
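
The sensitivity to outliers can be seen on a single one-dimensional cluster. The sketch below compares the mean, which k-means uses as prototype, with a medoid, i.e. the cluster member with the smallest total distance to the other members, which is the kind of prototype PAM relies on. The data and function names are invented for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def medoid(values):
    """Member of the cluster with the smallest total distance to all members."""
    return min(values, key=lambda v: sum(abs(v - u) for u in values))

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 plays the role of an outlier

print(mean(cluster))    # pulled toward the outlier
print(medoid(cluster))  # stays among the bulk of the points
```

A single extreme value drags the mean far from the bulk of the data, while the medoid remains one of the typical points.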

Slide 15

APPLICATIONS

Slide 16

Traveling Salesman Problem
Up to millions of cities
First organize cities in clusters
Results of
10k cities
100k cities
1M cities

Slide 17

Bioinformatics – Gene Expression Data

Application to
Genome sequencing projects
DNA microarray technologies
DNA microarray technology
Effective and efficient way to measure gene expression levels
of thousands of genes simultaneously
Investigation of the role of the genes
Clustering: Reveal hidden structures of biological data
Assumption: Functionally similar genes or proteins usually
share similar patterns or primary sequence structures

Slide 18


Slide 19


Slide 20

Next Class

Genetic Fuzzy Systems

Slide 21
