K-means++ Seeding Algorithm,
Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
K-means
•  K-means: widely used clustering technique
•  Initialization: blind random selection from the input data
•  Drawback: very sensitive to the choice of initial cluster
   centers (seeds)
•  A local optimum can be arbitrarily bad with respect to the
   objective function, compared to the globally optimal clustering
K-means++
•  A seeding technique for k-means
   from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from
   each other
•  O(log k)-competitive with the optimal clustering (in expectation)
•  Substantial convergence-time speedups (empirical)
Algorithm
c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest c_k that has already been chosen
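The seeding procedure that uses these symbols can be sketched as follows. This is a minimal illustration of the D(x)² weighting described by Arthur and Vassilvitskii [2007], not the MLDemos or Apache Commons Math code. The function names, the toy points and the fallback for degenerate inputs are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (sequences of floats)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Choose k initial centers from `points` using D(x)^2 weighting."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: uniformly at random from the data.
    first = rng.randrange(len(points))
    chosen = [first]
    # d2[i] holds D(x_i)^2 with respect to the centers chosen so far.
    d2 = [sq_dist(p, points[first]) for p in points]

    while len(chosen) < k:
        total = sum(d2)
        if total == 0.0:
            # Every point coincides with a chosen center: pick uniformly.
            remaining = [i for i in range(len(points)) if i not in chosen]
            nxt = rng.choice(remaining)
        else:
            # Pick index i with probability d2[i] / total (cumulative scan).
            threshold = rng.random() * total
            cumulative = 0.0
            nxt = None
            for i, weight in enumerate(d2):
                cumulative += weight
                if weight > 0 and cumulative >= threshold:
                    nxt = i
                    break
            if nxt is None:
                # Guard against floating-point round-off at the end of the scan.
                nxt = max(range(len(points)), key=lambda i: d2[i])
        chosen.append(nxt)
        # Update D(x)^2: keep the smaller of the old value and the
        # squared distance to the newly chosen center.
        d2 = [min(old, sq_dist(p, points[nxt])) for old, p in zip(d2, points)]

    return [points[i] for i in chosen]


# Hypothetical toy data: four small groups of four points each.
points = [(x + dx, y + dy)
          for (x, y) in [(-5.0, -5.0), (-5.0, 5.0), (5.0, -5.0), (5.0, 5.0)]
          for (dx, dy) in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]]
seeds = kmeanspp_seeds(points, k=4, rng=random.Random(42))
print(seeds)
```

Points that already serve as a center have weight zero, so the same point is never picked twice.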
  	
  
	
  
Implementation
•  Based on Apache Commons Math’s
   KMeansPlusPlusClusterer and
   Arthur’s [2007] implementation
•  Implemented directly in MLDemos’ core
Implementation Test Dataset: 4 squares (n=16)
Expected: 4 nice clusters
Sample Output
    1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
    1: initial minDist for 0 [-1.0;-1.0] = 10.0
    1: initial minDist for 1 [ 2.0; 1.0] = 17.0
    1: initial minDist for 2 [ 1.0;-1.0] = 18.0
    1: initial minDist for 3 [-1.0;-2.0] = 17.0
    1: initial minDist for 5 [ 2.0; 2.0] = 16.0
    1: initial minDist for 6 [ 2.0;-2.0] = 32.0
    1: initial minDist for 7 [-1.0; 2.0] =  1.0
    1: initial minDist for 8 [-2.0;-2.0] = 16.0
    1: initial minDist for 9 [ 1.0; 1.0] = 10.0
    1: initial minDist for 10[ 2.0;-1.0] = 25.0
    1: initial minDist for 11[-2.0;-1.0] =  9.0
       […]
    2: picking cluster center 1 --------------
    3:   distSqSum=3345.0
    3:   random index 1532.706909
    4:  new cluster point: x=6 [2.0;-2.0]
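The numbers in this log can be checked by hand. The short script below recomputes them for the rows that are visible. It suggests that each "initial minDist" value is the squared Euclidean distance to the first center, and that the reported distSqSum equals the sum of the squares of those logged values. This is an observation about the visible rows only. Rows hidden behind […] are not accounted for, and the variable names are made up for the check.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


first_center = (-2.0, 2.0)  # x=4 in the log

# index -> (point, logged "initial minDist"), copied from the log above
logged = {
    0: ((-1.0, -1.0), 10.0),
    1: ((2.0, 1.0), 17.0),
    2: ((1.0, -1.0), 18.0),
    3: ((-1.0, -2.0), 17.0),
    5: ((2.0, 2.0), 16.0),
    6: ((2.0, -2.0), 32.0),
    7: ((-1.0, 2.0), 1.0),
    8: ((-2.0, -2.0), 16.0),
    9: ((1.0, 1.0), 10.0),
    10: ((2.0, -1.0), 25.0),
    11: ((-2.0, -1.0), 9.0),
}

# Every logged value equals the squared distance to the first center.
all_match = all(sq_dist(p, first_center) == v for p, v in logged.values())

# Summing the squares of the logged values reproduces distSqSum.
sum_of_squares = sum(v ** 2 for _, v in logged.values())

print(all_match)       # True
print(sum_of_squares)  # 3345.0
```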
  	
  
Sample Output (2)
    4:   updating minDist for 0 [-1.0;-1.0] = 10.0
    4:   updating minDist for 1 [ 2.0; 1.0] =  9.0
    4:   updating minDist for 2 [ 1.0;-1.0] =  2.0
    4:   updating minDist for 3 [-1.0;-2.0] =  9.0
    4:   updating minDist for 5 [ 2.0; 2.0] = 16.0
    4:   updating minDist for 7 [-1.0; 2.0] = 25.0
    4:   updating minDist for 8 [-2.0;-2.0] = 16.0
    4:   updating minDist for 9 [ 1.0; 1.0] = 10.0
    4:   updating minDist for 10[2.0 ;-1.0] =  1.0
    4:   updating minDist for 11[-2.0;-1.0] = 17.0
       […]
    2: picking cluster center 2 -----------------
    3:   distSqSum=961.0
    3:   random index 103.404701
    4:   new cluster point: x=1 [2.0;1.0]
    4:   updating minDist for 0 [-1.0;-1.0] = 13.0
    […]
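The update step in this part of the log can be checked the same way. For the visible rows, each "updating minDist" value matches the squared distance to the newly picked center [2.0;-2.0]. Taking the minimum of that value and the earlier "initial minDist" value, then summing the squares, gives the reported distSqSum of 961.0. The script below is only a consistency check on the visible rows. The earlier values are copied from the first part of the sample output, and the names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


new_center = (2.0, -2.0)  # x=6, picked as cluster center 1

# index -> (point, earlier "initial minDist", logged "updating minDist")
rows = {
    0: ((-1.0, -1.0), 10.0, 10.0),
    1: ((2.0, 1.0), 17.0, 9.0),
    2: ((1.0, -1.0), 18.0, 2.0),
    3: ((-1.0, -2.0), 17.0, 9.0),
    5: ((2.0, 2.0), 16.0, 16.0),
    7: ((-1.0, 2.0), 1.0, 25.0),
    8: ((-2.0, -2.0), 16.0, 16.0),
    9: ((1.0, 1.0), 10.0, 10.0),
    10: ((2.0, -1.0), 25.0, 1.0),
    11: ((-2.0, -1.0), 9.0, 17.0),
}

# The logged value is the squared distance to the new center.
logged_matches = all(sq_dist(p, new_center) == upd for p, _, upd in rows.values())

# Distance to the nearest chosen center: minimum of old and new value.
min_dist = {i: min(old, upd) for i, (_, old, upd) in rows.items()}

# Sum of squares of the updated minima reproduces distSqSum.
sum_of_squares = sum(v ** 2 for v in min_dist.values())

print(logged_matches)  # True
print(min_dist[7])     # 1.0 (point 7 stays closest to the first center)
print(sum_of_squares)  # 961.0
```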
  
Evaluation on Test Dataset
•  200 clustering runs each, with and without k-means++
   initialization
•  Measure: RSS (residual sum of squares, i.e. intra-cluster variance)
•  K-means:
   optimal clustering 115 times (57.5%)
•  K-means++:
   optimal clustering 182 times (91%)
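For reference, the RSS measure and the reported success rates can be written out in a few lines. This sketch assumes RSS is computed as the sum of squared distances from each point to its nearest cluster center. The toy points and centers are hypothetical and are not the 4-squares dataset from the slides. Only the run counts (115, 182, 200) come from the evaluation above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def success_rate(optimal_runs, total_runs):
    """Percentage of runs that reached the optimal clustering."""
    return 100.0 * optimal_runs / total_runs


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_centers = [(0.0, 0.5), (5.0, 5.5)]  # one center per pair
bad_centers = [(0.0, 0.0), (0.0, 1.0)]   # both centers in the same pair

print(rss(points, good_centers))  # 1.0
print(rss(points, bad_centers))   # 91.0

# Run counts reported on the slide.
print(success_rate(115, 200))  # 57.5
print(success_rate(182, 200))  # 91.0
```

A poor seeding, such as the second set of centers, shows up directly as a much larger RSS. This is why the RSS distribution over many runs is a useful way to compare initialization strategies.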
Comparison of the frequency distribution of
RSS values between k-means and k-means++
on the evaluation dataset (n=200)
Evaluation on Real Dataset
•  UCI’s Water Treatment Plant data set:
   daily measures of sensors in an urban waste water
   treatment plant (n=396, d=38)
•  Sampled 500 clustering runs each for k-means
   and k-means++ with k=13, and recorded the RSS
•  Difference highly significant (P < 0.0001)
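The slides do not say which statistical test produced this P value. As one possible way to compare two samples of RSS values, the sketch below uses a two-sided permutation test on the difference of means. The choice of test, the function names and the RSS numbers are all assumptions made for illustration. None of them are the recorded results.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_perm=2000, rng=None):
    """Two-sided permutation test on the difference of means of a and b."""
    rng = rng or random.Random()
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled values to two groups of the
        # original sizes and compare the resulting difference of means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical RSS samples (NOT the values recorded in the evaluation).
rss_random_init = [float(v) for v in range(120, 140)]
rss_kmeanspp = [float(v) for v in range(100, 120)]

p = permutation_p_value(rss_random_init, rss_kmeanspp, rng=random.Random(0))
print(p)
```

With n_perm permutations the smallest P value this estimate can report is 1 / (n_perm + 1), so resolving a threshold such as 0.0001 would need well over 10,000 permutations.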
Comparison of the frequency distribution of
RSS values between k-means and k-means++
on the UCI real-world dataset (n=500)
Alternative Seeding Algorithms
•  Extensive research into seeding techniques for
   k-means.
•  Steinley [2007] evaluated 12 different techniques
   (omitting k-means++). Recommends multiple
   random starting points for general use.
•  Maitra [2011] evaluated 11 techniques (including
   k-means++). Unable to provide recommendations
   when evaluating nine standard real-world datasets.
•  Maitra analyzed simulated datasets and
   recommends using Milligan’s [1980] or Mirkin’s
   [2005] seeding technique, and Bradley’s [1998]
   when the dataset is very large.
Conclusions and Future Work
•  Using a synthetic test dataset and a real-world
   dataset, we showed that our implementation of
   the k-means++ seeding procedure in the
   MLDemos software package yields a significant
   reduction of the RSS.
•  A short literature survey revealed that many
   seeding procedures exist for k-means, and that
   some alternatives to k-means++ might yield
   even larger improvements.
References
•    Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful
     seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on
     Discrete algorithms, 1027–1035 (2007).
•    Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable
     K-Means++”. Unpublished working paper available at
     http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•    Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means
     clustering”. Proc. 15th International Conf. on Machine Learning, 91–99
     (1998).
•    Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of
     different methods for initializing the K-means clustering algorithm”.
     Unpublished working paper available at
     http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•    Milligan, G. W.: “The validation of four ultrametric clustering algorithms”.
     Pattern Recognition, vol. 12, 41–50 (1980).
•    Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman
     and Hall (2005).
•    Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical
     evaluation of several techniques”. Journal of Classification 24, 99–121
     (2007).

Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Kmeans plusplus

  • 1. K-means++ Seeding Algorithm, Implementation in MLDemos
      Renaud Richardet
      Brain Mind Institute
      Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
      renaud.richardet@epfl.ch
  • 2. K-means
      •  K-means: a widely used clustering technique
      •  Initialization: blind random selection from the input data
      •  Drawback: very sensitive to the choice of initial cluster centers (seeds)
      •  A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering
  • 3. K-means++
      •  A seeding technique for k-means from Arthur and Vassilvitskii [2007]
      •  Idea: spread the k initial cluster centers away from each other
      •  O(log k)-competitive with the optimal clustering
      •  Substantial convergence time speedups (empirical)
  • 4. Algorithm
      c ∈ C: cluster center
      x ∈ X: data point
      D(x): distance between x and the nearest c_k that has already been chosen
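The algorithm figure on this slide did not survive in the transcript; only the legend did. The following is a minimal Python sketch of D²-weighted seeding as described by Arthur and Vassilvitskii [2007], using the legend's notation. It is not the MLDemos code: the function names, the `rng` parameter, and the fallback for the all-zero-distance case are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from `points` by D^2-weighted sampling (sketch)."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Step 1: the first center is drawn uniformly at random from the data.
    chosen = [rng.randrange(len(points))]
    # min_dist[i] holds D(x_i)^2, the squared distance to the nearest chosen center.
    min_dist = [squared_distance(p, points[chosen[0]]) for p in points]

    while len(chosen) < k:
        total = sum(min_dist)
        if total == 0:
            # Assumption: if every point coincides with a chosen center,
            # fall back to a uniform draw among the unchosen indices.
            nxt = rng.choice([i for i in range(len(points)) if i not in chosen])
        else:
            # Step 2: roulette-wheel draw with probability D(x)^2 / sum of D(x)^2.
            r = rng.random() * total
            cumulative = 0.0
            nxt = None
            for i, d in enumerate(min_dist):
                cumulative += d
                if d > 0 and cumulative >= r:
                    nxt = i
                    break
            if nxt is None:
                # Guard against floating-point round-off at the end of the wheel.
                nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Step 3: refresh D(x)^2 against the newly added center.
        center = points[nxt]
        min_dist = [min(d, squared_distance(p, center))
                    for p, d in zip(points, min_dist)]

    return [points[i] for i in chosen]


if __name__ == "__main__":
    # Hypothetical toy data: four tight groups of two points each.
    data = [(-2.0, 2.0), (-1.0, 2.0), (2.0, 2.0), (1.0, 2.0),
            (-2.0, -2.0), (-1.0, -2.0), (2.0, -2.0), (1.0, -2.0)]
    print(kmeans_pp_seeds(data, 4, random.Random(42)))
```

The returned seeds would then be handed to ordinary k-means iterations in place of blindly chosen random centers.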
  • 5. Implementation
      •  Based on Apache Commons Math’s KMeansPlusPlusClusterer and Arthur’s [2007] implementation
      •  Implemented directly in MLDemos’ core
  • 6. Implementation Test Dataset: 4 squares (n=16)
  • 7. Expected: 4 nice clusters
  • 8. Sample Output
       1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
       1: initial minDist for 0 [-1.0;-1.0] = 10.0
       1: initial minDist for 1 [ 2.0; 1.0] = 17.0
       1: initial minDist for 2 [ 1.0;-1.0] = 18.0
       1: initial minDist for 3 [-1.0;-2.0] = 17.0
       1: initial minDist for 5 [ 2.0; 2.0] = 16.0
       1: initial minDist for 6 [ 2.0;-2.0] = 32.0
       1: initial minDist for 7 [-1.0; 2.0] =  1.0
       1: initial minDist for 8 [-2.0;-2.0] = 16.0
       1: initial minDist for 9 [ 1.0; 1.0] = 10.0
       1: initial minDist for 10[ 2.0;-1.0] = 25.0
       1: initial minDist for 11[-2.0;-1.0] =  9.0
       […]
       2: picking cluster center 1 --------------
       3:   distSqSum=3345.0
       3:   random index 1532.706909
       4:   new cluster point: x=6 [2.0;-2.0]
  • 9. Sample Output (2)
       4:   updating minDist for 0 [-1.0;-1.0] = 10.0
       4:   updating minDist for 1 [ 2.0; 1.0] =  9.0
       4:   updating minDist for 2 [ 1.0;-1.0] =  2.0
       4:   updating minDist for 3 [-1.0;-2.0] =  9.0
       4:   updating minDist for 5 [ 2.0; 2.0] = 16.0
       4:   updating minDist for 7 [-1.0; 2.0] = 25.0
       4:   updating minDist for 8 [-2.0;-2.0] = 16.0
       4:   updating minDist for 9 [ 1.0; 1.0] = 10.0
       4:   updating minDist for 10[ 2.0;-1.0] =  1.0
       4:   updating minDist for 11[-2.0;-1.0] = 17.0
       […]
       2: picking cluster center 2 -----------------
       3:   distSqSum=961.0
       3:   random index 103.404701
       4:   new cluster point: x=1 [2.0;1.0]
       4:   updating minDist for 0 [-1.0;-1.0] = 13.0
       […]
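The numbers in the sample output can be checked by hand. The sketch below is one reading that reproduces them, inferred from the logged values only and not taken from the MLDemos source: the initial minDist values match the squared Euclidean distance to the first center, each "updating minDist" line matches the squared distance to the newly picked center, distSqSum matches the sum of the squares of the running minimum over the twelve points listed on the slides, and the new center is the first index whose cumulative weight reaches the logged random index. The helper names `sq_dist` and `pick` are hypothetical.

```python
# Points exactly as listed in the sample output, keyed by their logged index.
points = {
    0: (-1.0, -1.0), 1: (2.0, 1.0), 2: (1.0, -1.0), 3: (-1.0, -2.0),
    4: (-2.0, 2.0), 5: (2.0, 2.0), 6: (2.0, -2.0), 7: (-1.0, 2.0),
    8: (-2.0, -2.0), 9: (1.0, 1.0), 10: (2.0, -1.0), 11: (-2.0, -1.0),
}


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick(weights, r):
    """Return the first index (in index order) whose cumulative weight reaches r."""
    cumulative = 0.0
    for i in sorted(weights):
        cumulative += weights[i]
        if cumulative >= r:
            return i
    raise ValueError("r exceeds the total weight")


# Round 1: first center is x=4; the weight is the square of the logged minDist.
min_dist = {i: sq_dist(p, points[4]) for i, p in points.items()}
weights = {i: d * d for i, d in min_dist.items()}
total_1 = sum(weights.values())          # 3345.0, the logged distSqSum
second = pick(weights, 1532.706909)      # 6, the logged new cluster point

# Round 2: keep the running minimum against the new center, then pick again.
min_dist = {i: min(d, sq_dist(points[i], points[second]))
            for i, d in min_dist.items()}
weights = {i: d * d for i, d in min_dist.items()}
total_2 = sum(weights.values())          # 961.0, the logged distSqSum
third = pick(weights, 103.404701)        # 1, the logged new cluster point

print(total_1, second, total_2, third)   # 3345.0 6 961.0 1
```

Under this reading the selection weight is the square of the logged minDist value, and the value printed on an "updating minDist" line (for example 25.0 for point 7) can be larger than the running minimum that enters distSqSum.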
  • 10. Evaluation on Test Dataset
      •  200 clustering runs, each with and without k-means++ initialization
      •  Measure RSS (intra-class variance)
      •  K-means: optimal clustering 115 times (57.5%)
      •  K-means++: optimal clustering 182 times (91%)
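As a rough illustration of the measurement, the sketch below computes RSS, taken here to mean the residual sum of squares (the sum of squared distances from each point to its nearest center), and the share of runs that reach the lowest RSS. Treating the lowest observed RSS as the optimal clustering is an assumption, as are the function names and the toy data; none of this is the evaluation code behind the slide.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def fraction_at_best(rss_values, tol=1e-9):
    """Share of runs whose RSS matches the lowest RSS observed (within tol)."""
    best = min(rss_values)
    hits = sum(1 for v in rss_values if v - best <= tol)
    return hits / len(rss_values)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]   # one center per pair
bad_centers = [(5.0, 0.0), (5.0, 2.0)]     # a poor local optimum

print(rss(points, good_centers))                    # 4.0
print(rss(points, bad_centers))                     # 100.0
print(fraction_at_best([4.0, 4.0, 100.0, 4.0]))     # 0.75
```

With the counts reported on the slide, the same ratio gives 115 / 200 = 57.5% for k-means and 182 / 200 = 91% for k-means++.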
  • 11. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
  • 12. Evaluation on Real Dataset
      •  UCI’s Water Treatment Plant data set: daily sensor measurements from an urban waste water treatment plant (n=396, d=38)
      •  500 clustering runs each for k-means and k-means++ with k=13, recording the RSS of every run
      •  Difference highly significant (P < 0.0001)
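The slide does not say which significance test produced the P value. As one possible way to compare two samples of RSS values, the sketch below runs a two-sided permutation test on the difference of means. The choice of test, the function name `permutation_p_value`, and the sample values are all assumptions made for illustration.

```python
import random


def permutation_p_value(a, b, n_perm=10000, rng=None):
    """Two-sided permutation test on the difference of sample means (sketch)."""
    rng = rng or random.Random()
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical RSS samples, not the values from the slides.
    rss_kmeans = [30.0, 31.5, 29.0, 33.0, 32.0, 30.5, 31.0, 34.0]
    rss_kmeans_pp = [25.0, 26.5, 24.0, 27.0, 25.5, 26.0, 24.5, 27.5]
    print(permutation_p_value(rss_kmeans, rss_kmeans_pp, 5000, random.Random(0)))
```

The smallest P value this estimate can report is 1 / (n_perm + 1), so resolving a threshold such as 0.0001 would need more than 10000 permutations.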
  • 13. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real world dataset (n=500)
  • 14. Alternative Seeding Algorithms
      •  Extensive research into seeding techniques for k-means.
      •  Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
      •  Maitra [2011] evaluated 11 techniques (including k-means++) and could not give a recommendation based on nine standard real-world datasets.
      •  On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large.
  • 15. Conclusions and Future Work
      •  Using a synthetic test dataset and a real world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS.
      •  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
  • 16. References
      •  Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035 (2007).
      •  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable K-Means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
      •  Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
      •  Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
      •  Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
      •  Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
      •  Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).