mlcourse.ai. Clustering
Yury Kashnitskiy, Dmitry Ignatov
Higher School of Economics
November 16, 2018
(Higher School of Economics) Clustering 16.11.2018 1 / 24
Plan
1 Clustering
Problem formulation
Applications
2 Clustering methods
k-Means
Hierarchical methods
Agglomerative clustering
Density-based methods
Problem formulation
The main task of cluster analysis is to group instances into subgroups (clusters) of
similar ones.
These groups can be
Partitions
Hierarchies
Fuzzy partitions
Biclusters
Mixtures of distributions
Applications
Biology and medicine
    Gene expression analysis
    Tomography clustering
Humanities and social sciences
    Sociology and anthropology
    Psychology
Technical systems
    Telemetry
    Image segmentation
Marketing
    Customer segmentation
    Subgroup behavioral analysis
Text analytics
    News clustering
Social networks
    Community detection
Clustering methods
How to measure dissimilarity of instances
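Instances $x \in \mathbb{R}^m$ are represented as the rows of a feature matrix.
$$
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\iff
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\cdots & \cdots & \cdots & \cdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
$$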
Minkowski distance
$$ d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{\frac{1}{p}} $$
Cosine distance
$$ d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle} \sqrt{\langle y, y \rangle}} $$
Hamming distance
$$ d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i] $$
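The three dissimilarity measures can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample vectors are chosen here and do not come from the slides, and instances are assumed to be equal-length sequences of numbers.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y (both non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def hamming(x, y):
    """Share of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

x, y = [0.0, 3.0], [4.0, 0.0]          # illustrative vectors
print(minkowski(x, y, p=1))            # 7.0
print(minkowski(x, y, p=2))            # 5.0
print(cosine_distance(x, y))           # 1.0 (orthogonal vectors)
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```

Hamming distance is mostly useful for categorical or binary features, where a coordinate-wise "equal or not" comparison is all that makes sense.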
k-Means
k-Means is an iterative algorithm to split data into k clusters.
The mean of each cluster $C_j$ (called a centroid) is denoted $c_j$ and is defined as
$$ c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i $$
The objective is the sum of squared distances between the instances and the
centroids of the clusters to which these instances belong:
$$ J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2 $$
The algorithm
Input: data; the number of clusters k (a hyperparameter)
Output: a partition of the data into k clusters
* * *
1. Initialization: choose k points as the initial centroids.
2. Update clusters: given k centroids, assign each instance to its nearest
centroid. All instances assigned to a centroid $c_j$
(j = 1, ..., k) form the cluster $C_j$.
3. Update centroids: for each cluster $C_j$, compute the new centroid as the
mean of all instances in this cluster.
Steps 2-3 are repeated until convergence.
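A minimal pure-Python sketch of these three steps follows. Several choices are assumptions made for the illustration and are not taken from the slides: the initial centroids are k randomly sampled instances, the distance is Euclidean, "convergence" means that the assignments stop changing, and an empty cluster simply keeps its old centroid. The toy data is invented.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance d(x, c)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coord) / len(points) for coord in zip(*points)]

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: k distinct instances serve as initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # 2. Update clusters: assign each instance to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                      for x in data]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # 3. Update centroids: mean of the instances in each cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    # Objective J(C): sum of squared distances to the assigned centroids.
    J = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
    return labels, centroids, J

# Invented toy data: two well-separated groups of three points.
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
        [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centroids, J = k_means(data, k=2)
print(labels, centroids, J)
```

The result depends on the initialization in general, which is why practical implementations usually restart from several random initial centroid sets and keep the run with the smallest J.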
k-Means. Example
[Figure sequence: k-Means example]
Clustering quality and the number of clusters
Elbow method
For each k we can compute the objective J(C) of the resulting clustering, denoted J(k).
Then we look for a k such that increasing it further does not decrease J "too much".
Formally, we look for the k that minimizes the following D(k):
$$ D(k) = \frac{|J(k) - J(k + 1)|}{|J(k - 1) - J(k)|} $$
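As a hedged illustration of the criterion, the sketch below computes D(k) from a table of objective values and returns the minimizing k. The helper name `elbow_k` and the numbers in `J` are made up for this example; in practice J(k) would come from running k-Means for each k. The code assumes consecutive integer values of k and that J changes between neighbouring k (otherwise the denominator is zero).

```python
def elbow_k(J):
    """J maps each tried k to the objective value J(k).

    Returns the interior k that minimizes
    D(k) = |J(k) - J(k+1)| / |J(k-1) - J(k)|, together with all D(k).
    """
    ks = sorted(J)
    D = {}
    for k in ks[1:-1]:  # D(k) needs both J(k-1) and J(k+1)
        D[k] = abs(J[k] - J[k + 1]) / abs(J[k - 1] - J[k])
    return min(D, key=D.get), D

# Made-up objective values for k = 1..6, for illustration only.
J = {1: 4000.0, 2: 2500.0, 3: 600.0, 4: 500.0, 5: 450.0, 6: 420.0}
best_k, D = elbow_k(J)
print(best_k)  # 3
print(D)
```

Here J drops sharply up to k = 3 and only slightly afterwards, so D(3) is the smallest ratio.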
[Figure: Elbow Method. The objective J plotted against the number of clusters k, for k = 2, ..., 10.]
Silhouette
Silhouette for an instance $x_i$ in a cluster C is the function
$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, $$
where a(i) is the mean distance from $x_i$ to all other instances of C, and b(i)
is the mean distance from $x_i$ to the instances of the nearest other cluster.
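The definition can be checked on a tiny example. The sketch below is an illustration under assumptions: it uses Euclidean distance via `math.dist`, takes b(i) over the nearest other cluster (the usual convention), and assumes that the cluster of instance i has at least two members. The function name and the four 1D points are invented.

```python
import math

def silhouette(points, labels, i, dist=math.dist):
    """Silhouette s(i) of instance i under the given cluster labels."""
    own = labels[i]
    # a(i): mean distance to the other instances of the same cluster.
    same = [dist(points[i], p)
            for j, (p, lab) in enumerate(zip(points, labels))
            if lab == own and j != i]
    a = sum(same) / len(same)
    # b(i): smallest mean distance to the instances of another cluster.
    b = min(
        sum(dist(points[i], p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]    # invented 1D data
print(silhouette(points, [0, 0, 1, 1], 0))     # close to 1: sensible split
print(silhouette(points, [0, 1, 0, 1], 0))     # negative: poor split
```

Averaging s(i) over all instances gives a single score that can be compared across different numbers of clusters.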
[Figure: Silhouette. Acceptable number of clusters]
[Figure: Silhouette. Bad number of clusters]
Hierarchical methods
From a feature matrix we can move to a pairwise distance matrix.
$$
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\cdots & \cdots & \cdots & \cdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\Rightarrow
\begin{pmatrix}
d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\
d(x_2, x_1) & \ddots & & d(x_2, x_n) \\
\vdots & & \ddots & \vdots \\
d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n)
\end{pmatrix}
$$
Because $d(x_i, x_i) = 0$ and d is symmetric, it is enough to keep the upper triangle of the distance matrix:
$$
\begin{pmatrix}
0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\
  & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\
  &   & \ddots & \cdots & \cdots \\
  &   &        & 0 & d(x_{n-1}, x_n) \\
  &   &        &   & 0
\end{pmatrix}
$$
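A small sketch of building the pairwise distance matrix, assuming Euclidean distance through `math.dist`; the function name and the three sample points are chosen for illustration only. Only the upper triangle is computed, and the lower one is filled in by symmetry.

```python
import math

def pairwise_distances(X, dist=math.dist):
    """Return the n x n matrix D with D[i][j] = d(x_i, x_j)."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]      # the diagonal stays zero
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle only
            D[i][j] = D[j][i] = dist(X[i], X[j])  # mirror by symmetry
    return D

X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # invented sample points
for row in pairwise_distances(X):
    print(row)
```

The matrix has n(n - 1)/2 distinct entries, so both time and memory grow quadratically with the number of instances.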
Agglomerative clustering
Sequential merging of similar clusters
0 Start with every instance in its own cluster
1 Find the two closest clusters
2 Merge them
Repeat steps 1-2 until all instances are in the same cluster
How to define distance between clusters?
Linkage
1 Single Linkage
$$ d(A, B) = \min_{x \in A,\, y \in B} d(x, y) $$
2 Complete Linkage
$$ d(A, B) = \max_{x \in A,\, y \in B} d(x, y) $$
3 Average Linkage
$$ d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j) $$
4 Weighted Average Linkage
Let cluster A be the union of clusters p and q. Then
$$ d(A, B) = \frac{d(p, B) + d(q, B)}{2} $$
5 Centroid Linkage
$$ d(A, B) = \| c_A - c_B \|_2 $$
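Four of these linkages can be written directly from the formulas. The sketch below assumes clusters are lists of points, uses Euclidean distance via `math.dist`, and reads the centroid formula as the Euclidean norm of the centroid difference; the function names and the two example clusters are invented. Weighted average linkage is left out because it depends on the merge history (the clusters p and q that formed A), not only on the current members.

```python
import math

def single(A, B, d=math.dist):
    """Distance between the closest pair of points."""
    return min(d(x, y) for x in A for y in B)

def complete(A, B, d=math.dist):
    """Distance between the farthest pair of points."""
    return max(d(x, y) for x in A for y in B)

def average(A, B, d=math.dist):
    """Mean distance over all |A||B| pairs."""
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

def centroid(A, B):
    """Euclidean distance between the cluster centroids."""
    c_a = [sum(c) / len(A) for c in zip(*A)]
    c_b = [sum(c) / len(B) for c in zip(*B)]
    return math.dist(c_a, c_b)

A = [(0.0, 0.0), (0.0, 2.0)]   # invented clusters
B = [(3.0, 0.0), (3.0, 2.0)]
for linkage in (single, complete, average, centroid):
    print(linkage.__name__, linkage(A, B))
```

For any pair of clusters the single linkage distance is the smallest of the first three and the complete linkage distance is the largest, which is why single linkage tends to chain clusters together while complete linkage favours compact ones.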
Merging clusters can be depicted with a dendrogram.
Let us take a look at a 1D sample: { 1, 2, 3, 7, 10, 12, 25, 29 }
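[Figure: dendrogram of the sample. Horizontal axis: objects 1, 2, 3, 7, 10, 12, 25, 29. Vertical axis: cluster distances, from 0 to 25. Clusters A, B and C are marked, with the distance between clusters A and B annotated.]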
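The merge sequence that a dendrogram draws can be reproduced with a naive agglomerative loop. This is a sketch under assumptions: it uses single linkage, which is not necessarily the linkage behind the figure on the slide, and it recomputes all cluster distances at every step, so it is only suitable for tiny samples. The function names are invented.

```python
def agglomerate(values, linkage):
    """Merge clusters until one remains.

    Returns one (left, right, distance) tuple per merge.
    """
    clusters = [[v] for v in values]       # step 0: singleton clusters
    merges = []
    while len(clusters) > 1:
        # step 1: find the two closest clusters under the given linkage
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j],
                       linkage(clusters[i], clusters[j])))
        # step 2: merge them (j > i, so removing j keeps index i valid)
        right = clusters.pop(j)
        clusters[i] = clusters[i] + right
    return merges

def single_linkage(A, B):
    """Single linkage for 1D values."""
    return min(abs(x - y) for x in A for y in B)

sample = [1, 2, 3, 7, 10, 12, 25, 29]
merges = agglomerate(sample, single_linkage)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The third element of each tuple is the height at which the corresponding two branches join in the dendrogram.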
Density-based methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
DBSCAN algorithm
All points can be divided into elements of dense regions (core points), border points, and noise
(the formal definitions are skipped here).
DBSCAN. Example
Hyperparams: M = 4, Eps > 0
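A compact sketch of the DBSCAN procedure with these two hyperparameters is given below. It rests on assumptions that the slides do not spell out: M is read as the minimum number of points in an Eps-neighbourhood (counting the point itself) for a point to be a core point, the distance is Euclidean, and neighbourhoods are found by a linear scan, which gives the quadratic running time. The function name, the Eps value and the sample points are invented.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts, dist=math.dist):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)

    def neighbours(i):
        # Eps-neighbourhood of point i (includes i itself); O(n) per query.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    labels = [None] * n                    # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        cluster += 1                       # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # j is also core: keep expanding
                queue.extend(nbrs)
    return labels

# Invented data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.0, 20.0)]
print(dbscan(points, eps=1.5, min_pts=4))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is not an input: it follows from the density structure of the data under the chosen M and Eps.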
DBSCAN. Pros and cons
Pros
+ Can find clusters of any shape
+ Easy to implement
+ Can find noise in data
+ Nice complexity: O(n log n) with a good data structure such as a spatial index
(otherwise O(n^2))
Cons
- Parametric: the hyperparameters M and Eps have to be chosen
- Doesn’t work well when clusters differ in density
- Depends on the chosen metric
Contacts
Questions
Thanks!
Please ask your questions in the OpenDataScience Slack team.
http://ods.ai
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

mlcourse.ai. Clustering

  • 1. mlcourse.ai. Clustering Yury Kashnitskiy, Dmitry Ignatov Higher School of Economics November 16, 2018 (Higher School of Economics) Clustering 16.11.2018 1 / 24
  • 2. Plan. 1. Clustering: problem formulation, applications. 2. Clustering methods: k-Means, hierarchical methods, agglomerative clustering, density-based methods.
  • 4. Problem formulation. The main task of cluster analysis is to group instances into subgroups (clusters) of similar ones. These groups can be partitions, hierarchies, fuzzy partitions, biclusters, or mixtures of distributions.
  • 5. Applications. Biology and medicine: gene expression analysis, tomography clustering. Humanities and social sciences: sociology and anthropology, psychology. Technical systems: telemetry, image segmentation. Marketing: customer segmentation, subgroup behavioral analysis. Text analytics: news clustering. Social networks: community detection.
  • 7. How to measure dissimilarity of instances. Instances $x \in \mathbb{R}^m$ are represented as rows of a feature matrix: $\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \iff \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \cdots & \cdots & \cdots & \cdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix}$. Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}$. Cosine distance: $d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}$. Hamming distance: $d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]$.
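The three distances on this slide can be sketched in plain Python. This is only an illustration: the function names are not from the slides, and instances are assumed to be equal-length sequences (with nonzero norm for the cosine distance).

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (nonzero vectors assumed)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def hamming(x, y):
    """Fraction of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

For example, `minkowski([0, 0], [3, 4])` is the Euclidean distance 5.0, and `hamming("karolin", "kathrin")` is 3/7 because the strings differ in three of seven positions.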
  • 8. k-Means. k-Means is an iterative algorithm that splits data into k clusters. The mean of the instances in each cluster $C_j$ (called the centroid and denoted $c_j$) is defined as $c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$. The objective is the sum of squared distances between instances and the centroids of the clusters to which these instances belong: $J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2$.
  • 9. k-Means. The algorithm. Input: data; k is a hyperparameter. Output: a partition of the data into k clusters. 1. Initialization: choose k points as the initial centroids. 2. Update clusters: given k centroids, each instance is assigned to the nearest centroid. Thus, all instances assigned to a centroid $c_j$ ($j = 1, \dots, k$) form a cluster $C_j$. 3. Update centroids: for each cluster $C_j$, a new centroid is calculated as the mean of all instances in this cluster. Steps 2-3 are repeated until convergence.
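The three steps above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: it assumes squared Euclidean distance, random initialization from the data points, and a stopping rule of "assignments no longer change". The names `k_means`, `objective`, `max_iter`, and `seed` are illustrative.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(data, k, max_iter=100, seed=0):
    """data: list of tuples. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: k data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids

def objective(data, labels, centroids):
    """J(C): sum of squared distances from instances to their centroids."""
    return sum(squared_distance(x, centroids[l]) for x, l in zip(data, labels))
```

The result depends on the initial centroids, so in practice the procedure is often restarted from several initializations and the run with the smallest J is kept.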
  • 10-16. k-Means. Example (sequence of figures showing successive iterations)
  • 17. Clustering quality and the number of clusters. Elbow method. For each k we can calculate J(C). Then we find a k such that increasing it further does not decrease J "too much". Formally, we look for the k that minimizes the following D(k): $D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}$.
  • 18. Clustering quality and the number of clusters. Elbow method. (Figures: scatter plot of the data; "Elbow Method" plot of J against k for k = 2, ..., 10.)
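The criterion D(k) can be evaluated directly once J has been computed for a range of k. The sketch below assumes the J values are given as a dictionary over consecutive integers k and that J strictly decreases (otherwise the denominator can be zero); the function name `elbow_k` is illustrative.

```python
def elbow_k(inertia):
    """inertia: dict mapping consecutive integers k to J(k).

    Returns (best_k, d) where d maps each interior k to D(k).
    """
    ks = sorted(inertia)
    d = {}
    # D(k) needs J(k - 1) and J(k + 1), so the first and last k are skipped.
    for k in ks[1:-1]:
        d[k] = abs(inertia[k] - inertia[k + 1]) / abs(inertia[k - 1] - inertia[k])
    best_k = min(d, key=d.get)
    return best_k, d
```

With J values 1000, 400, 100, 90, 85 for k = 1, ..., 5, D(2) = 0.5, D(3) = 1/30 and D(4) = 0.5, so the sketch picks k = 3, the point after which J barely decreases.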
  • 19. Clustering quality and the number of clusters. Silhouette. The silhouette of an instance $x_i$ in a cluster C is $s(i) = \frac{b_i - a_i}{\max(a_i, b_i)}$, where $a_i$ is the mean distance from $x_i$ to all other instances in C, and $b_i$ is the mean distance from $x_i$ to the instances of the nearest other cluster.
  • 20. Silhouette. Acceptable number of clusters (figure)
  • 21. Silhouette. Bad number of clusters (figure)
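A small sketch of the silhouette computation for a single instance. It assumes Euclidean distance and takes b as the mean distance to the nearest other cluster, which is the usual convention; with only two clusters this is the same as the mean distance to all instances outside the own cluster. Returning 0 for a single-instance cluster is also a convention, not something stated on the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette(data, labels, i):
    """Silhouette s(i) of instance i; data is a list of tuples."""
    own = labels[i]
    # a: mean distance to the other instances of the same cluster.
    same = [euclidean(data[i], x)
            for j, (x, l) in enumerate(zip(data, labels))
            if l == own and j != i]
    if not same:
        return 0.0  # convention for a cluster with a single instance
    a = sum(same) / len(same)
    # b: smallest mean distance to the instances of any other cluster.
    b = min(
        sum(euclidean(data[i], x) for x in members) / len(members)
        for members in (
            [x for x, l in zip(data, labels) if l == other]
            for other in set(labels) if other != own
        )
    )
    return (b - a) / max(a, b)
```

Averaging s(i) over all instances gives one score per candidate number of clusters; values close to 1 indicate compact, well-separated clusters.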
  • 22. Hierarchical methods. From a feature matrix we can move to a pairwise distance matrix: $\begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \cdots & \cdots & \cdots & \cdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix} \Rightarrow \begin{pmatrix} d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & \ddots & & d(x_2, x_n) \\ \vdots & & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n) \end{pmatrix}$.
  • 23. Hierarchical methods. Because the distance matrix is symmetric with a zero diagonal, only its upper triangle is needed: $\begin{pmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ & & \ddots & \cdots & \cdots \\ & & & 0 & d(x_{n-1}, x_n) \\ & & & & 0 \end{pmatrix}$.
  • 24. Agglomerative clustering. Sequential merging of similar clusters. 0. Start with each cluster containing only one instance. 1. Find the two closest clusters. 2. Merge them. Repeat steps 1-2 until all instances are in the same cluster. How do we define the distance between clusters?
  • 25. Agglomerative clustering. Linkage. 1. Single linkage: $d(A, B) = \min_{x \in A, y \in B} d(x, y)$. 2. Complete linkage: $d(A, B) = \max_{x \in A, y \in B} d(x, y)$.
  • 26. Agglomerative clustering. Linkage. 3. Average linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)$. 4. Weighted average linkage: let cluster A be the union of clusters q and p; then $d(A, B) = \frac{d(p, B) + d(q, B)}{2}$. 5. Centroid linkage: $d(A, B) = \|c_A - c_B\|_2$.
  • 27. Agglomerative clustering. Merging clusters can be depicted with a dendrogram. Let us take a look at a 1D sample: {1, 2, 3, 7, 10, 12, 25, 29}. (Figure: dendrogram with the objects 1, 2, 3, 7, 10, 12, 25, 29 on the horizontal axis and cluster distances from 0 to 25 on the vertical axis; clusters A, B, and C are marked, together with the distance between clusters A and B.)
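The merge sequence behind such a dendrogram can be reproduced with a naive agglomerative sketch. The code is an illustration only: it recomputes all cluster distances on every step (fine for a handful of points, far too slow for real data), breaks ties by taking the first closest pair found, and covers just the single, complete, and average linkages. All function names are illustrative.

```python
def single_linkage(A, B, d):
    return min(d(x, y) for x in A for y in B)

def complete_linkage(A, B, d):
    return max(d(x, y) for x in A for y in B)

def average_linkage(A, B, d):
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

def agglomerative(points, linkage, d):
    """Merge clusters until one remains; returns a list of
    (merge distance, first cluster, second cluster) tuples."""
    clusters = [[p] for p in points]  # step 0: one instance per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 1: find the two closest clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j], d)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((dist, list(clusters[i]), list(clusters[j])))
        # Step 2: merge them (j > i, so deleting j keeps index i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

sample = [1, 2, 3, 7, 10, 12, 25, 29]
merges = agglomerative(sample, single_linkage, lambda x, y: abs(x - y))
heights = [dist for dist, _, _ in merges]  # [1, 1, 2, 3, 4, 4, 13]
```

The merge distances are the heights at which branches join in the dendrogram. With single linkage the last merge of this sample happens at distance 13 (between 12 and 25); with complete linkage it would happen at 28, the distance between 1 and 29.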
  • 28. Density-based methods. DBSCAN. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
  • 29. DBSCAN algorithm. All points can be divided into core points (elements of dense regions), border points, and noise (the formal definition is skipped here).
  • 30-33. DBSCAN. Example. Hyperparameters: M = 4, Eps > 0 (sequence of figures)
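A compact DBSCAN sketch, with the definitions skipped on the slides filled in by the standard ones as an assumption: a point is a core point if its Eps-neighborhood (counting the point itself) contains at least M points, a border point is a non-core point within Eps of a core point, and everything else is noise. The parameters `eps` and `min_pts` correspond to Eps and M; the neighbor search is a brute-force scan, so this version runs in O(n^2).

```python
NOISE = -1

def dbscan(data, eps, min_pts, dist):
    """Returns one label per instance: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(data)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force Eps-neighborhood; includes the point i itself.
        return [j for j in range(len(data)) if dist(data[i], data[j]) <= eps]

    cluster = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: expand through it
                queue.extend(j_nbrs)
        cluster += 1
    return labels
```

For the 1D points 0, 0.5, 1, 1.5, 10, 10.5, 11, 11.5, 50 with `eps=1.1`, `min_pts=4` and absolute difference as the distance, the two groups of four become clusters 0 and 1, and the isolated point 50 is labeled as noise.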
  • 34. DBSCAN. Pros and cons. Pros: can find clusters of any shape; easy to implement; can find noise in data; good complexity, $O(n \log n)$ with a suitable data structure (otherwise $O(n^2)$). Cons: parametric (M and Eps have to be chosen); does not work well when clusters differ in density; depends on the chosen metric.
  • 35. Contacts. Questions? Thanks! Please ask your questions in the OpenDataScience Slack team: http://ods.ai