K-Means Clustering with Scikit-Learn
Sarah Guido
PyData SV 2014
About Me
• Today: graduated from the University of Michigan!
• Soon: data scientist at Reonomy
• PyLadies co-organizer
• @sarah_guido
Outline
• What is k-means clustering?
• How it works
• When to use it
• K-means clustering in scikit-learn
• Basic implementation
• Implementation with tuned parameters
Clustering
• Unsupervised learning
• Unlabeled data
• Split observations into groups
• Distance between data points
• Exploring the data
K-means clustering
• Formally: a method of vector quantization
• Partition space into Voronoi cells
• Separate samples into n groups of equal variance
• Uses the Euclidean distance metric
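Written out, the grouping described above is usually stated as minimizing the within-cluster sum of squared Euclidean distances. The notation below is an assumption added for illustration and does not come from the slides: x_i are the samples, C_j the clusters, mu_j their centroids, and k the number of groups (the "n" in the bullet above).

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```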
K-means clustering
• Iterative refinement
• Three basic steps
• Step 1: Choose k
• Iterate over:
• Step 2: Assignment
• Step 3: Update
• Repeats until convergence has been reached
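The three steps can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name kmeans_sketch, the random choice of starting rows, and the synthetic two-blob data are all hypothetical.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means: choose k, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k and pick k distinct rows as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2 (assignment): each point joins its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): each centroid moves to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated blobs of synthetic points, for illustration only.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids.round(1))
```

Because the starting rows are random, a different seed can end in a different local minimum on harder data.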
K-means clustering
• Assignment
• Update
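In symbols, the two steps are commonly written as follows. The notation is an assumption added here for illustration: C_j^{(t)} is cluster j at iteration t and mu_j^{(t)} is its centroid.

```latex
% Assignment: each sample joins the cluster with the nearest centroid
C_j^{(t)} = \left\{ x_i : \lVert x_i - \mu_j^{(t)} \rVert^2 \le \lVert x_i - \mu_l^{(t)} \rVert^2 \;\; \forall\, l = 1,\dots,k \right\}

% Update: each centroid becomes the mean of the samples assigned to it
\mu_j^{(t+1)} = \frac{1}{\lvert C_j^{(t)} \rvert} \sum_{x_i \in C_j^{(t)}} x_i
```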
K-means clustering
• Advantages
• Scales well
• Efficient
• Will always converge
• Disadvantages
• Choosing the wrong k
• Convergence to local minimum
K-means clustering
• When to use
• Normally distributed data
• Large number of samples
• Not too many clusters
• Distance can be measured in a linear fashion
Scikit-Learn
• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning
Scikit-Learn
• Model = EstimatorObject()
• Unsupervised:
• Model.fit(dataset.data)
• dataset.data = the dataset's feature array (no labels)
• Supervised would use the labels as a second parameter
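A small sketch of the two fit() patterns, assuming the built-in iris dataset as a stand-in. The choice of KMeans and LogisticRegression as the example estimators is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

dataset = load_iris()  # one of scikit-learn's built-in datasets

# Unsupervised: fit() receives only the feature array.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(dataset.data)

# Supervised: fit() also receives the labels as a second argument.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(dataset.data, dataset.target)

print(dataset.data.shape)   # (n_samples, n_features)
print(model.labels_[:10])   # cluster index assigned to the first ten rows
```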
K-means in scikit-learn
• Efficient and fast
• You: pick n clusters, kmeans: finds n initial centroids
• Run clustering jobs in parallel
Dataset
• UCI Machine Learning Repository (University of California, Irvine)
• Individual household power consumption
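A hedged sketch of turning such a file into a numeric array. The file layout is an assumption (semicolon-separated columns, "?" for missing readings, Date and Time as the first two columns), the sample rows are made up for illustration, and load_power_rows is a hypothetical helper.

```python
import csv
import io

import numpy as np

# Illustrative rows in the layout the file is assumed to use.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;?\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000\n"
)


def load_power_rows(handle):
    """Return the numeric columns as a float array, skipping incomplete rows."""
    reader = csv.reader(handle, delimiter=";")
    next(reader)  # skip the header line
    rows = []
    for record in reader:
        values = record[2:]  # drop Date and Time, keep the measurements
        if "?" in values or "" in values:
            continue  # k-means cannot handle missing values
        rows.append([float(v) for v in values])
    return np.array(rows)


X = load_power_rows(sample)
print(X.shape)
```

With the real file, the same function could be called on an open file handle instead of the in-memory sample.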
K-means in scikit-learn
K-means in scikit-learn
• Results
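A fitted KMeans model exposes its results as attributes. The sketch below uses synthetic stand-in data with seven features, not the talk's dataset, so the numbers it prints say nothing about the household data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: 300 rows, 7 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))

# n_clusters is left at its default of 8.
kmeans = KMeans(n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_.shape)  # one centroid per cluster
print(kmeans.labels_[:10])            # cluster index for the first ten rows
print(kmeans.inertia_)                # sum of squared distances to the nearest centroid
```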
K-means parameters
• n_clusters
• max_iter
• n_init
• init
• precompute_distances
• tol
• n_jobs
• random_state
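As a sketch, most of these parameters can be set in the constructor as below. The values are illustrative. precompute_distances and n_jobs are left out on purpose, because newer scikit-learn releases no longer accept them on KMeans.

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=8,      # number of clusters (k)
    max_iter=300,      # cap on assignment/update iterations per run
    n_init=10,         # number of runs from different starting centroids; the best is kept
    init="k-means++",  # how the starting centroids are chosen
    tol=1e-4,          # tolerance used to declare convergence
    random_state=0,    # seed, so repeated runs give the same result
)
print(kmeans.get_params()["n_clusters"])
```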
n_clusters: choosing k
• Graphing the variance
• Information criterion
• Cross-validation
n_clusters: choosing k
• Graphing the variance
• from scipy.spatial.distance import cdist, pdist
• cdist: distance computation between sets of observations
• pdist: pairwise distances between observations in the same set
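One way to compute the variance numbers for such a graph, sketched on synthetic data. The three-blob data and the range of k are assumptions. The fraction of variance explained is then plotted against k, and the "elbow" suggests a value.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

# Three synthetic blobs, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in (0.0, 4.0, 8.0)])

# pdist: squared pairwise distances summed and divided by n give the
# total sum of squares around the overall mean.
total_ss = (pdist(X) ** 2).sum() / len(X)

explained = {}
for k in range(1, 8):
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
    # cdist: distance from every observation to every centroid.
    nearest = cdist(X, centers, "euclidean").min(axis=1)
    within_ss = (nearest ** 2).sum()
    explained[k] = 1.0 - within_ss / total_ss

for k, fraction in explained.items():
    print(k, round(fraction, 3))
```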
n_clusters: choosing k
[Figure: clustering results for n_clusters = 4 and n_clusters = 7]
n_clusters: choosing k
• n_clusters = 8 (default)
init
• k-means++
• Default
• Selects initial cluster centers in a way that speeds up convergence
• random
• Choose k rows at random for initial centroids
• ndarray that gives the initial centers
• (n_clusters, n_features)
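The three init options side by side, as a sketch on synthetic data. The blob positions and the hand-picked starting centers are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0.0, 5.0, 10.0)])

# Option 1: k-means++ (the default).
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# Option 2: k rows chosen at random as the initial centroids.
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# Option 3: an explicit ndarray of shape (n_clusters, n_features).
# With fixed starting centers a single run is enough, hence n_init=1.
start = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
km_arr = KMeans(n_clusters=3, init=start, n_init=1).fit(X)

for name, km in [("k-means++", km_pp), ("random", km_rand), ("ndarray", km_arr)]:
    print(name, km.n_iter_, round(km.inertia_, 2))
```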
K-means revised
• Set n_clusters
• 7, 8
• Set init
• k-means++, random
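The four combinations can be fitted in one loop. This is a sketch on synthetic stand-in data, so the printed inertia values are not the talk's results.

```python
from itertools import product

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: 500 rows, 7 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))

# Fit one model per (n_clusters, init) combination.
models = {}
for n_clusters, init in product((7, 8), ("k-means++", "random")):
    models[(n_clusters, init)] = KMeans(
        n_clusters=n_clusters, init=init, n_init=10, random_state=0
    ).fit(X)

for (n_clusters, init), model in models.items():
    print(n_clusters, init, round(model.inertia_, 1))
```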
K-means revised
[Figure: n_clusters = 8, init = k-means++ and n_clusters = 8, init = random]
K-means revised
[Figure: n_clusters = 7, init = k-means++ and n_clusters = 7, init = random]
Comparing results: silhouette score
• Silhouette coefficient
• No ground truth
• Mean distance between an observation and all other points in its cluster
• Mean distance between an observation and all other points in the next nearest cluster
• Silhouette score in scikit-learn
• Mean of silhouette coefficient for all of the observations
• Closer to 1, the better the fit
• Large dataset == long time
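With a as the first mean distance and b as the second, the silhouette coefficient of one observation is (b - a) / max(a, b). A sketch of scoring a clustering follows, using synthetic blobs as an assumption. The sample_size argument scores a random subset, which is one way to keep the cost down on a large dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic blobs, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in (0.0, 5.0, 10.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over every observation.
score_full = silhouette_score(X, labels, metric="euclidean")

# Same score estimated from a random subset of 150 observations.
score_sampled = silhouette_score(
    X, labels, metric="euclidean", sample_size=150, random_state=0
)
print(round(score_full, 3), round(score_sampled, 3))
```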
Comparing results: silhouette score
• n_clusters=8, init=k-means++
• 0.8117
• n_clusters=8, init=random
• 0.6511
• n_clusters=7, init=k-means++
• 0.7719
• n_clusters=7, init=random
• 0.7037
What does this tell us?
• Patterns exist
• Groups of similar observations exist
• Sometimes, the defaults work
• We need more exploration!
A few tips
• Clustering is a good way to explore your data
• Intuition fails in high dimensions
• Use dimensionality reduction
• Combine with other models
• Know your data
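One way to combine dimensionality reduction with k-means is a pipeline. This is a sketch under assumptions: PCA as the reduction step, standardization beforehand, and synthetic 20-feature data containing two shifted groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 400 rows, 20 features, with half the rows shifted
# in the first three features so that two groups exist.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
X[:200, :3] += 6.0

pipeline = make_pipeline(
    StandardScaler(),                     # put features on a common scale
    PCA(n_components=2, random_state=0),  # reduce 20 features to 2 components
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))  # number of rows assigned to each cluster
```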
Materials and resources
• Scikit-learn documentation
• scikit-learn.org/stable/documentation.html
• Datasets
• http://archive.ics.uci.edu/ml/datasets.html
• Mldata.org
• Blogs
• http://datasciencelab.wordpress.com/
Contact me!
• Twitter: @sarah_guido
• www.linkedin.com/in/sarahguido/
• https://github.com/sarguido
