SlideShare uma empresa Scribd logo
1 de 31
www.edureka.in/data-science
Slide 1
Clustering
www.edureka.in/data-science
Slide 2
Clustering: Scenarios
The following scenarios implement Clustering:
 A telephone company needs to establish its network by putting its towers in a particular region
it has acquired. The location of putting these towers can be found by clustering algorithm so
that all its users receive maximum signal strength.
 Cisco wants to open its new office in California. The management wants to be cordial to its
employees and want their office in a location so that its employees’ commutation is reduced to
minimum.
 The Miami DEA wants to make its law enforcement more stringent and hence have decided to
make their patrol vans stationed across the area so that the areas of high crime rates are in
vicinity to the patrol vans.
 A Hospital Care chain wants to open a series of Emergency-Care wards, keeping in mind the
factor of maximum accident prone areas in a region.
www.edureka.in/data-science
Slide 3
What is Clustering?
Slide 3
Organizing data into clusters such that there is:
 High intra-cluster similarity
 Low inter-cluster similarity
 Informally, finding natural groupings among objects
Why Clustering?
www.edureka.in/data-science
Slide 4
Why Clustering?
Slide 4
 Organizing data into clusters shows internal structure of the data
Ex. Clusty and clustering genes
 Sometimes the partitioning is the goal
Ex. Market segmentation
 Prepare for other AI techniques
Ex. Summarize news (cluster and then find centroid)
 Discovery in data
Ex. Underlying rules, reoccurring patterns, topics, etc.
www.edureka.in/data-science
Slide 5
Clustering algorithms may be classified as:
Exclusive Clustering:
Data is grouped in an exclusive way, so that if a certain
datum belongs to a definite cluster then it could not be
included in another cluster.
E.g. K-means
Overlapping Clustering:
The overlapping clustering, uses fuzzy sets to cluster data,
so that each point may belong to two or more clusters with
different degrees of membership.
E.g. Fuzzy C-means
www.edureka.in/data-science
Slide 6
Hierarchical Clustering:
It is based on the union between the two
nearest clusters. The beginning condition is
realized by setting every datum as a cluster.
There are certain properties which one cluster
receives in hierarchy from another cluster.
Clustering algorithms may be classified as:
www.edureka.in/data-science
Slide 7
“Clustering is in the eye of the beholder."
The most appropriate clustering algorithm for a particular problem often needs to be chosen
experimentally, unless there is a mathematical reason to prefer one cluster model over
another.
It should be noted that an algorithm that is designed for one kind of model has no chance on a
data set that contains a radically different kind of model.
For example, k-means cannot find non-convex clusters.
www.edureka.in/data-science
Similarity/Dissimilarity Measurement
Slide 8
To achieve Clustering, a similarity/dissimilarity
measure must be determined so as to cluster the
data points based either on :
1. Similarity in the data or
2. Dissimilarity in the data
The measure reflects the degree of closeness or
separation of the target objects and should
correspond to the characteristics that are
believed to distinguish the clusters embedded in
the data.
Measurement
Similarity Dissimilarity
www.edureka.in/data-science
Slide 9
Similarity Measurement
Similarity measures the degree to which a pair of
objects are alike.
Concerning structural patterns represented as strings
or sequences of symbols, the concept of pattern
resemblance has typically been viewed from three
main perspectives:
 Similarity as matching, according to which
patterns are seen as different viewpoints,
possible instantiations or noisy versions of the
same object;
 Structural resemblance, based on the similarity of
their composition rules and primitives;
 Content-based similarity.
www.edureka.in/data-science
Dissimilarity Measurement: Distance Measures
Slide 10
Similarity can also be measured in
terms of the placing of data points.
By finding the distance between the
data points , the distance/difference of
the point to the cluster can be found.
Distance
Measures
Euclidean Distance Measure
Manhattan Distance Measure
Cosine Distance Measure
Tanimoto Distance Measure
Squared Euclidean Distance
Measure
www.edureka.in/data-science
Slide 11
Difference between Euclidean and Manhattan
From this image we can say that, The Euclidean distance measure gives 5.65 as the distance
between (2, 2) and (6, 6) whereas the Manhattan distance is 8.0
Slide 11
Mathematically, Euclidean distance between two
n-dimensional vectors
(a1, a2, ... , an) and (b1,b2,...,bn) is:
d = |a1 – b1| + |a2 – b2| + ... + |an – bn|
Manhattan distance between two n-
dimensional vectors
www.edureka.in/data-science
Cosine Distance Measure
The formula for the cosine distance between n-dimensional vectors
(a1, a2, ... , an) and (b1, b2, ...,bn) is
Slide 12
www.edureka.in/data-science
Slide 13
Slide 13
K-Means Clustering
www.edureka.in/data-science
Slide 14
Slide 14
K-Means Clustering
The process by which objects are classified into
a number of groups so that they are as much
dissimilar as possible from one group to another
group, but as much similar as possible within
each group.
The objects in group 1 should be as similar as
possible.
But there should be much difference between an
object in group 1 and group 2.
The attributes of the objects are allowed to
determine which objects should be grouped
together.
Total population
Group 1
Group 2 Group 3
Group 4
www.edureka.in/data-science
Slide 15
Slide 15
Current
Balance
High
High
Medium
Medium
Low
Low
Gross Monthly Income
Example Cluster 1
High Balance
Low Income
Example Cluster 2
High Income
Low Balance
 Cluster 1 and Cluster 2 are being differentiated by Income and Current Balance.
 The objects in Cluster 1 have similar characteristics (High Income and Low balance), on the other hand
the objects in Cluster 2 have the same characteristic (High Balance and Low Income).
 But there are much differences between an object in Cluster 1 and an object in Cluster 2.
Basic concepts of Cluster Analysis using two variables
K-Means Clustering
www.edureka.in/data-science
Slide 16
Process Flow of K-means
Iterate until stable (cluster centers converge):
1. Determine the centroid coordinate.
2. Determine the distance of each object to the
centroids.
3. Group the object based on minimum
distance (find the closest centroid)
Start
Number of
Cluster K
Centroid
Distance objects to
centroids
Grouping based on
minimum distance
End
No object
move
group?
+
www.edureka.in/data-science
Slide 17
K-Means Clustering Use-Case:
Problem Statement:
The newly appointed Governor has finally decided to do something for the society and wants to open
a chain of schools across a particular region, keeping in mind the distance travelled by children is
minimum, so that the percentage turnout is more.
Poor fella cannot decide himself and has asked its Data Science team to come up with the solution.
Bet, these guys have the solution to almost everything!!
www.edureka.in/data-science
Slide 18
Slide 18
K-Means Clustering Steps
1. If k=4, we select 4 random points in
the 2d space and assume them to be
cluster centers for the clusters to be
created.
www.edureka.in/data-science
Slide 19
Slide 19
2. We take up a random data point from
the space and find out its distance from
all the 4 clusters centers.
If the data point is closest to the pink
cluster center, it is colored pink.
K-Means Clustering Steps
www.edureka.in/data-science
Slide 20
Slide 20
3. Now we calculate the centroid of all
the pink points and assign that point
as the cluster center for that cluster.
Similarly, we calculate centroids for all
the 4 colored(clustered) points and
assign the new centroids as the
cluster centers.
K-Means Clustering Steps
www.edureka.in/data-science
Slide 21
Slide 21
4. Step-2 and step-3 are run iteratively, unless the cluster centers converge at a point and no
longer move.
Iteration-1 Iteration-2
K-Means Clustering Steps
www.edureka.in/data-science
Slide 22
Iteration-3 Iteration-4
5. We can see that the cluster centers are still not converged so we go ahead and iterate it more.
K-Means Clustering Steps
www.edureka.in/data-science
Slide 23
Finally, after multiple iterations, we reach a
stage where the cluster centers coverge and
the clusters look like as:
Here we have performed:
Iterations: 5
K-Means Clustering Steps
Slide 24 www.edureka.in/data-science
Q1. In cluster analysis objects are classified into a number of groups so that
1. They are as much dissimilar as possible from one group to another group,
but as much similar as possible within each group.
2. They are as much similar as possible from one group to another group, but
as much dissimilar as possible within each group.
Annie’s Question
Slide 25 www.edureka.in/data-science
Correct Answer.
Option 1: They are as much dissimilar as possible from one group to another
group, but as much similar as possible within each group.
Annie’s Answer
Slide 26 www.edureka.in/data-science
K-Means Mathematical Formulation
Distortion = =
(within cluster sum of squares)



m
i
i
i c
x
1
2
)
(  
 

k
j OwnedBy
i
j
i
j
x
1 )
(
2
)
(


Owned By(.): set of records that belong to the specified cluster center
D={x1,x2,…,xi,…,xm}  data set of m records
xi=(xi1,xi2,…,xin)  each record is an n-dimensional vector
ci =
cluster(xi)=
Slide 27 www.edureka.in/data-science
Goal: Find cluster centers that minimize Distortion
Solution can be found by setting the partial derivative of Distortion w.r.t. each cluster center to zero








)
(
2
)
(
Distortion
j
OwnedBy
i
j
i
j
j
x









)
(
)
(
2
j
OwnedBy
i
j
i
x

 minimum)
(for
0





)
(
|
)
(
|
1
j
OwnedBy
i
i
j
j x
OwnedBy 


K-Means Mathematical Formulation
Slide 28 www.edureka.in/data-science
Will we find the Optimal Solution?
Not necessarily!
Try to come up with a converged solution, but does not have minimum distortion:
We might get stuck in local minimum, and not a global minimum
Slide 29 www.edureka.in/data-science
 Choose first center at random
 Choose second center that is far away from the first center
 … Choose jth center as far away as possible from the closest of centers 1 through
(j-1)
Idea 1: careful about where we start
Idea 2: Do many runs of K-means, each with different random starting point
How to find Optimal Solution?
Slide 30 www.edureka.in/data-science
Choosing the Number of Clusters
Elbow method
Objective
Function
Value
i.e.,
Distortion
K means Clustering

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Clustering
ClusteringClustering
Clustering
 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning Clustering
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
Random forest
Random forestRandom forest
Random forest
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Decision trees for machine learning
Decision trees for machine learningDecision trees for machine learning
Decision trees for machine learning
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 

Destaque

Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
Association Analysis
Association AnalysisAssociation Analysis
Association Analysisguest0edcaf
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
phase rule & phase diagram
phase rule & phase diagramphase rule & phase diagram
phase rule & phase diagramYog's Malani
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionAdnan Masood
 
The phase rule
The phase ruleThe phase rule
The phase ruleJatin Garg
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesGilad Barkan
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Coacervation Phase Separation Techniques
Coacervation Phase Separation TechniquesCoacervation Phase Separation Techniques
Coacervation Phase Separation TechniquesGargi Nanda
 
Clustering training
Clustering trainingClustering training
Clustering trainingGabor Veress
 
Phase Diagrams and Phase Rule
Phase Diagrams and Phase RulePhase Diagrams and Phase Rule
Phase Diagrams and Phase RuleRuchi Pandey
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithmDarshak Mehta
 

Destaque (18)

Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Association Analysis
Association AnalysisAssociation Analysis
Association Analysis
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
phase rule & phase diagram
phase rule & phase diagramphase rule & phase diagram
phase rule & phase diagram
 
Phase rule
Phase rulePhase rule
Phase rule
 
MOLECULAR DOCKING
MOLECULAR DOCKINGMOLECULAR DOCKING
MOLECULAR DOCKING
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
The phase rule
The phase ruleThe phase rule
The phase rule
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummies
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Coacervation Phase Separation Techniques
Coacervation Phase Separation TechniquesCoacervation Phase Separation Techniques
Coacervation Phase Separation Techniques
 
Clustering training
Clustering trainingClustering training
Clustering training
 
Phase Diagrams and Phase Rule
Phase Diagrams and Phase RulePhase Diagrams and Phase Rule
Phase Diagrams and Phase Rule
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 

Semelhante a K means Clustering

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Survey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in DataminingSurvey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in DataminingIOSR Journals
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
Predicting Students Performance using K-Median Clustering
Predicting Students Performance using  K-Median ClusteringPredicting Students Performance using  K-Median Clustering
Predicting Students Performance using K-Median ClusteringIIRindia
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataKathleneNgo
 
Dynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c meansDynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c meansWrishin Bhattacharya
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmIJERA Editor
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisIOSR Journals
 
For iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxFor iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxSureshPolisetty2
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationChristopher Peter Makris
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringIJERD Editor
 

Semelhante a K means Clustering (20)

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Survey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in DataminingSurvey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in Datamining
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Predicting Students Performance using K-Median Clustering
Predicting Students Performance using  K-Median ClusteringPredicting Students Performance using  K-Median Clustering
Predicting Students Performance using K-Median Clustering
 
Visualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLABVisualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLAB
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
47 292-298
47 292-29847 292-298
47 292-298
 
Dynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c meansDynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c means
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
 
For iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxFor iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptx
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image Segmentation
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 

Mais de Edureka!

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaEdureka!
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaEdureka!
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaEdureka!
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaEdureka!
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaEdureka!
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaEdureka!
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaEdureka!
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaEdureka!
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaEdureka!
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaEdureka!
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | EdurekaEdureka!
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEdureka!
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEdureka!
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaEdureka!
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaEdureka!
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaEdureka!
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Edureka!
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaEdureka!
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaEdureka!
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | EdurekaEdureka!
 

Mais de Edureka! (20)

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | Edureka
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Último

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 

Último (20)

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 

K means Clustering

  • 2. www.edureka.in/data-science Slide 2 Clustering: Scenarios The following scenarios implement Clustering:  A telephone company needs to establish its network by putting its towers in a particular region it has acquired. The location of putting these towers can be found by clustering algorithm so that all its users receive maximum signal strength.  Cisco wants to open its new office in California. The management wants to be cordial to its employees and want their office in a location so that its employees’ commutation is reduced to minimum.  The Miami DEA wants to make its law enforcement more stringent and hence have decided to make their patrol vans stationed across the area so that the areas of high crime rates are in vicinity to the patrol vans.  A Hospital Care chain wants to open a series of Emergency-Care wards, keeping in mind the factor of maximum accident prone areas in a region.
  • 3. www.edureka.in/data-science Slide 3 What is Clustering? Slide 3 Organizing data into clusters such that there is:  High intra-cluster similarity  Low inter-cluster similarity  Informally, finding natural groupings among objects Why Clustering?
  • 4. www.edureka.in/data-science Slide 4 Why Clustering? Slide 4  Organizing data into clusters shows internal structure of the data Ex. Clusty and clustering genes  Sometimes the partitioning is the goal Ex. Market segmentation  Prepare for other AI techniques Ex. Summarize news (cluster and then find centroid)  Discovery in data Ex. Underlying rules, reoccurring patterns, topics, etc.
  • 5. www.edureka.in/data-science Slide 5 Clustering algorithms may be classified as: Exclusive Clustering: Data is grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. E.g. K-means Overlapping Clustering: The overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. E.g. Fuzzy C-means
  • 6. www.edureka.in/data-science Slide 6 Hierarchical Clustering: It is based on the union between the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. There are certain properties which one cluster receives in hierarchy from another cluster. Clustering algorithms may be classified as:
  • 7. www.edureka.in/data-science Slide 7 “Clustering is in the eye of the beholder." The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. It should be noted that an algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model. For example, k-means cannot find non-convex clusters.
  • 8. www.edureka.in/data-science Similarity/Dissimilarity Measurement Slide 8 To achieve Clustering, a similarity/dissimilarity measure must be determined so as to cluster the data points based either on : 1. Similarity in the data or 2. Dissimilarity in the data The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. Measurement Similarity Dissimilarity
  • 9. www.edureka.in/data-science Slide 9 Similarity Measurement Similarity measures the degree to which a pair of objects are alike. Concerning structural patterns represented as strings or sequences of symbols, the concept of pattern resemblance has typically been viewed from three main perspectives:  Similarity as matching, according to which patterns are seen as different viewpoints, possible instantiations or noisy versions of the same object;  Structural resemblance, based on the similarity of their composition rules and primitives;  Content-based similarity.
  • 10. www.edureka.in/data-science Dissimilarity Measurement: Distance Measures Slide 10 Similarity can also be measured in terms of the placing of data points. By finding the distance between the data points , the distance/difference of the point to the cluster can be found. Distance Measures Euclidean Distance Measure Manhattan Distance Measure Cosine Distance Measure Tanimoto Distance Measure Squared Euclidean Distance Measure
  • 11. www.edureka.in/data-science Slide 11 Difference between Euclidean and Manhattan From this image we can say that, The Euclidean distance measure gives 5.65 as the distance between (2, 2) and (6, 6) whereas the Manhattan distance is 8.0 Slide 11 Mathematically, Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2,...,bn) is: d = |a1 – b1| + |a2 – b2| + ... + |an – bn| Manhattan distance between two n- dimensional vectors
  • 12. www.edureka.in/data-science Cosine Distance Measure The formula for the cosine distance between n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ...,bn) is Slide 12
  • 14. www.edureka.in/data-science Slide 14 Slide 14 K-Means Clustering The process by which objects are classified into a number of groups so that they are as much dissimilar as possible from one group to another group, but as much similar as possible within each group. The objects in group 1 should be as similar as possible. But there should be much difference between an object in group 1 and group 2. The attributes of the objects are allowed to determine which objects should be grouped together. Total population Group 1 Group 2 Group 3 Group 4
  • 15. www.edureka.in/data-science Slide 15 Slide 15 Current Balance High High Medium Medium Low Low Gross Monthly Income Example Cluster 1 High Balance Low Income Example Cluster 2 High Income Low Balance  Cluster 1 and Cluster 2 are being differentiated by Income and Current Balance.  The objects in Cluster 1 have similar characteristics (High Income and Low balance), on the other hand the objects in Cluster 2 have the same characteristic (High Balance and Low Income).  But there are much differences between an object in Cluster 1 and an object in Cluster 2. Basic concepts of Cluster Analysis using two variables K-Means Clustering
  • 16. www.edureka.in/data-science Slide 16 Process Flow of K-means Iterate until stable (cluster centers converge): 1. Determine the centroid coordinate. 2. Determine the distance of each object to the centroids. 3. Group the object based on minimum distance (find the closest centroid) Start Number of Cluster K Centroid Distance objects to centroids Grouping based on minimum distance End No object move group? +
  • 17. www.edureka.in/data-science Slide 17 K-Means Clustering Use-Case: Problem Statement: The newly appointed Governor has finally decided to do something for the society and wants to open a chain of schools across a particular region, keeping in mind the distance travelled by children is minimum, so that the percentage turnout is more. Poor fella cannot decide himself and has asked its Data Science team to come up with the solution. Bet, these guys have the solution to almost everything!!
  • 18. www.edureka.in/data-science Slide 18 Slide 18 K-Means Clustering Steps 1. If k=4, we select 4 random points in the 2d space and assume them to be cluster centers for the clusters to be created.
  • 19. www.edureka.in/data-science Slide 19 Slide 19 2. We take up a random data point from the space and find out its distance from all the 4 clusters centers. If the data point is closest to the pink cluster center, it is colored pink. K-Means Clustering Steps
  • 20. www.edureka.in/data-science Slide 20 Slide 20 3. Now we calculate the centroid of all the pink points and assign that point as the cluster center for that cluster. Similarly, we calculate centroids for all the 4 colored(clustered) points and assign the new centroids as the cluster centers. K-Means Clustering Steps
  • 21. www.edureka.in/data-science Slide 21 Slide 21 4. Step-2 and step-3 are run iteratively, unless the cluster centers converge at a point and no longer move. Iteration-1 Iteration-2 K-Means Clustering Steps
  • 22. www.edureka.in/data-science Slide 22 Iteration-3 Iteration-4 5. We can see that the cluster centers are still not converged so we go ahead and iterate it more. K-Means Clustering Steps
  • 23. www.edureka.in/data-science Slide 23 Finally, after multiple iterations, we reach a stage where the cluster centers coverge and the clusters look like as: Here we have performed: Iterations: 5 K-Means Clustering Steps
  • 24. Slide 24 www.edureka.in/data-science Q1. In cluster analysis objects are classified into a number of groups so that 1. They are as much dissimilar as possible from one group to another group, but as much similar as possible within each group. 2. They are as much similar as possible from one group to another group, but as much dissimilar as possible within each group. Annie’s Question
  • 25. Slide 25 www.edureka.in/data-science Correct Answer. Option 1: They are as much dissimilar as possible from one group to another group, but as much similar as possible within each group. Annie’s Answer
  • 26. Slide 26 www.edureka.in/data-science K-Means Mathematical Formulation Distortion = = (within cluster sum of squares)    m i i i c x 1 2 ) (      k j OwnedBy i j i j x 1 ) ( 2 ) (   Owned By(.): set of records that belong to the specified cluster center D={x1,x2,…,xi,…,xm}  data set of m records xi=(xi1,xi2,…,xin)  each record is an n-dimensional vector ci = cluster(xi)=
  • 27. Slide 27 www.edureka.in/data-science Goal: Find cluster centers that minimize Distortion Solution can be found by setting the partial derivative of Distortion w.r.t. each cluster center to zero         ) ( 2 ) ( Distortion j OwnedBy i j i j j x          ) ( ) ( 2 j OwnedBy i j i x   minimum) (for 0      ) ( | ) ( | 1 j OwnedBy i i j j x OwnedBy    K-Means Mathematical Formulation
  • 28. Slide 28 www.edureka.in/data-science Will we find the Optimal Solution? Not necessarily! Try to come up with a converged solution, but does not have minimum distortion: We might get stuck in local minimum, and not a global minimum
  • 29. Slide 29 www.edureka.in/data-science  Choose first center at random  Choose second center that is far away from the first center  … Choose jth center as far away as possible from the closest of centers 1 through (j-1) Idea 1: careful about where we start Idea 2: Do many runs of K-means, each with different random starting point How to find Optimal Solution?
  • 30. Slide 30 www.edureka.in/data-science Choosing the Number of Clusters Elbow method Objective Function Value i.e., Distortion