Selection of K in K-means Clustering

Junghoon Kim

Why I chose this paper
• The k-means algorithm always assumes that k is given in advance,
but I want to run it without relying on human intuition or insight.
• This paper first reviews existing automatic methods for selecting
the number of clusters for the k-means algorithm.

Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper

K-means Algorithm
• The k-means algorithm is a clustering method that originated in
signal processing and is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters
in which each observation belongs to the cluster with the nearest mean.
The iteration continues until the cluster means move less than a
threshold.

K-means Algorithm
1) Pick k points at random as the initial cluster centers
2) Assign every node to its nearest cluster center
3) Move each cluster center to the mean of its
assigned nodes
4) Repeat 2-3 until convergence
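
The four steps above can be sketched in plain Python. This is a minimal illustration, not the presenter's or the paper's code: the function names, the `threshold`, `max_iter` and `seed` parameters, and the sample points are all assumptions made for the example. It stops once no center moves farther than `threshold`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, threshold=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k of the data points at random as initial centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            j = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[j].append(p)
            labels.append(j)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its old center; this is an assumption).
        new_centers = [
            mean(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Step 4: repeat until the largest center movement is below threshold.
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold ** 2:
            break
    return centers, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

Note that k still has to be supplied by the caller, which is the problem the paper addresses.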

Clustering: Example 2, Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: plot of the data points and cluster centers k1, k2 and k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: plot of the data points and cluster centers k1, k2 and k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: plot of the data points and cluster centers k1, k2 and k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: plot of the data points and cluster centers k1, k2 and k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: plot of the data points and cluster centers k1, k2 and k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5).]

Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # instances, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may
be found using techniques such as simulated annealing or
genetic algorithms.

• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes

What’s the problem?
• Initialization problem

• Randomly chosen starting points tend to fall mostly in high-density regions
and rarely in low-density regions, so the result depends on the initialization
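
The dependence on the starting centers can be shown with a small constructed example. The sketch below is only an illustration under stated assumptions: the four points, the two sets of starting centers, and the helper names `sq_dist` and `lloyd` are hypothetical and do not come from the paper. The same data converge to two different solutions, one of which is a poor local optimum.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Run the assign/update loop from the given starting centers.

    Returns the final centers and the total squared distance of every
    point to its nearest center.
    """
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, total


# Hypothetical data: the four corners of a 4 x 1 rectangle, k = 2.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start near the left and right edges: splits the data left/right.
_, good = lloyd(corners, [(0.0, 0.5), (4.0, 0.5)])

# Start on the bottom and top edges: splits the data bottom/top and stays there.
_, bad = lloyd(corners, [(2.0, 0.0), (2.0, 1.0)])

print(good, bad)
```

With the first set of starting centers the total squared distance is 1.0. With the second set the centers never move and the total stays at 16.0, even though a much better partition exists.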

What’s the problem?
• Hard to find clusters with non-convex shapes

Existing Method
• Values of K determined by human judgment

• Using probabilistic theory
• Akaike’s information criterion
• if the data sets are constructed by a set of Gaussian distributions

• Hardy method
• if the data sets are constructed by a set of Poisson distributions

• Monte Carlo techniques (associated with the null hypothesis)

Research Method
• The method has been validated on
15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the
UCI Repository of Machine Learning Databases.
• The 15 artificial data sets are meant to be a
representative sample of the distributions that
commonly occur.

Recommendation Example
If f(X) < 0.85, recommend K = X;
otherwise recommend K = 1
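
The recommendation rule can be written as a short function. This is a sketch under assumptions: the name `recommend_k` and the f values in the example are made up for illustration, and the evaluation measure f itself is not computed here. Only the 0.85 threshold and the fallback to K = 1 come from the slide.

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K whose f(K) is below the threshold, or [1] if none is.

    f_values maps each candidate K to an already-computed f(K).
    """
    candidates = [k for k, f in sorted(f_values.items()) if f < threshold]
    return candidates if candidates else [1]


# Hypothetical f values, for illustration only.
print(recommend_k({1: 1.0, 2: 0.40, 3: 0.90, 4: 0.80}))
print(recommend_k({1: 1.0, 2: 0.95, 3: 0.97}))
```

Returning a list rather than a single value reflects the point that several values of K may be suggested when different levels of detail are acceptable.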

Conclusion
• The new method is closely related to the approach
of K-means clustering because it takes into account
information reflecting the performance of the
algorithm
• The proposed method can suggest multiple values
of K to users for cases when different clustering
results could be obtained with various required
levels of detail
• This method is computationally expensive if used
with large data sets

Improvement
• The paper does not explain how the threshold
(e.g., f(x) < 0.85) should be calculated. Given many
data sets, a learning algorithm could be applied to
determine it.
• The experimental data sets are biased: they are too
ideal and do not reflect the complexity of real data
at all. Evaluating on randomly generated data could
be one way to address this.
• Knowing the range, or the maximum value, of K is an
important issue.
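
The range of K matters because every candidate value needs its own full k-means run. The sketch below illustrates that cost under assumptions: `distortion_for_k`, `sweep_k`, the `k_max` parameter and the sample points are hypothetical, and the recorded quantity is the plain sum of squared distances, not the paper's evaluation measure.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distortion_for_k(points, k, seed=0, max_iter=100):
    """Run k-means once and return the sum of squared distances
    from each point to its nearest final center."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def sweep_k(points, k_max):
    """One complete k-means run per candidate K from 1 to k_max."""
    return {k: distortion_for_k(points, k) for k in range(1, k_max + 1)}


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
distortions = sweep_k(points, k_max=3)
for k, value in distortions.items():
    print(k, value)
```

The number of runs grows with `k_max`, so without a known upper bound on K the search has no natural stopping point.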

1. 2013 KSE Seminar 2013/10/11 Jung hoon Kim

2. TOPIC

3. Selection of K in K-means clustering

6. Small introduction

15. What’s the problem?

18. What’s the problem? • Selection of K

20. Paper proposed

21. Formula

23. Sample

24. Sample

25. Sample

26. Sample

30. Do you have any question?

31. thank you
