Teaching k-Means New Tricks
Sergei Vassilvitskii
Google
k-Means Algorithm
The k-Means Algorithm [Lloyd ’57]
– Clusters points into groups
– Remains a workhorse of machine learning even in the age of deep networks
MR ML Algorithmics Sergei Vassilvitskii
Lloyd’s Method: k-means
– Initialize with random clusters
– Assign each point to the nearest center
– Recompute optimum centers (means)
– Repeat: assign points to the nearest center
– Repeat: recompute centers
– Repeat... until the clustering does not change
Total error is reduced at every step, so the method is guaranteed to converge.
Minimizes:
φ(X, C) = Σ_{x ∈ X} d(x, C)²
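The loop above can be sketched concretely. A minimal NumPy illustration (not the talk's implementation; `n_iter` and the explicit convergence check are assumptions):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's method: alternate assignment and mean steps until the
    clustering stops changing."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct input points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None and (new_assign == assign).all():
            break  # clustering did not change: converged
        assign = new_assign
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, assign
```

Each step can only lower the total squared error, which is why the loop terminates.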
New Tricks for k-Means
Initialization:
– Is random initialization a good idea?
Large data:
– Clustering many points (in parallel)
– Clustering into many clusters
k-means Initialization
Random? A bad idea
Even with many random restarts!
Easy Fix
Select centers using a furthest-point algorithm (a 2-approximation to k-Center clustering).
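The furthest-point (Gonzalez) rule can be sketched as follows (an illustrative NumPy version, not the talk's code):

```python
import numpy as np

def furthest_point_init(X, k, seed=0):
    """Pick the first center at random, then repeatedly take the point
    furthest from all centers chosen so far (Gonzalez's 2-approximation
    for k-Center)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    # d2[i] = squared distance from X[i] to its nearest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = X[d2.argmax()]          # the furthest point becomes a center
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centers)
```

Note that `d2.argmax()` is exactly why this rule is fragile: the furthest point is often an outlier, which motivates the next slides.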
Sensitive to Outliers
k-means++
Interpolate between two methods: give preference to further points.
Let D(p) be the distance between p and the nearest cluster center.
Sample the next center proportionally to D^α(p):
– Original Lloyd’s: α = 0
– Furthest Point: α = ∞
– k-means++: α = 2

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability D^α(p) / Σ_x D^α(x);
    UpdateDistances();
  }
k-means++
Theorem [AV ’07]: k-means++ guarantees a Θ(log k) approximation.
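The pseudocode above translates almost line for line into NumPy; α = 2 gives k-means++ seeding (a sketch under those assumptions, not the production code behind the theorem):

```python
import numpy as np

def kmeanspp_init(X, k, alpha=2, seed=0):
    """k-means++ seeding: sample each next center with probability
    proportional to D(p)**alpha (alpha=2 is the k-means++ choice)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniform
    d = np.linalg.norm(X - centers[0], axis=1)   # D(p) for every point
    for _ in range(k - 1):
        w = d ** alpha
        p = w / w.sum()                          # D^alpha(p) / sum_x D^alpha(x)
        nxt = X[rng.choice(len(X), p=p)]
        centers.append(nxt)
        # UpdateDistances(): keep the distance to the nearest chosen center.
        d = np.minimum(d, np.linalg.norm(X - nxt, axis=1))
    return np.array(centers)
```

Setting `alpha=0` recovers uniform random seeding, so the same routine covers both ends of the interpolation.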
New Tricks for k-Means
Initialization:
– Is random initialization a good idea?
Large data:
– Clustering many points (in parallel)
– Clustering into many clusters
Dealing with large data
The new initialization approach:
– Leads to very good clusterings
– But is very sequential!
• Must select one center at a time, then update the distribution we are sampling from
– How to adapt it to the world of parallel computing?
Speeding up initialization
Initialization:
kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability D²(p) / Σ_x D²(x);
    UpdateDistances();
  }
Improving the speed:
– Instead of selecting a single point, sample many points at a time
– Oversample: select more than k centers, and then select the best k out of them.
k-means||
k-means||:
  Select first point c uniformly at random
  for (int i = 1; i < log_ℓ(φ(X, c)); ++i) {
    Select each point p independently with probability k · ℓ · D²(p) / Σ_x D²(x)
    UpdateDistances();
  }
  Prune to k points total by clustering the clusters
Notes:
– Independent selection: easy MR (each round is a single MapReduce pass)
– ℓ is the oversampling parameter
– The final prune is a re-clustering step over the sampled centers
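A sequential simulation of the k-means|| rounds (each round would be one MapReduce pass over the data; for brevity the final prune here uses a furthest-point pass instead of the weighted k-means++ re-clustering, so this is an illustrative sketch, not the paper's algorithm verbatim):

```python
import numpy as np

def kmeans_parallel_init(X, k, ell=2, rounds=5, seed=0):
    """k-means|| seeding: each round, every point joins the center set
    independently with probability ~ k*ell*D^2(p)/sum_x D^2(x); the
    oversampled set is then pruned back to k centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(rounds):
        if d2.sum() == 0:
            break                                 # every point already covered
        p = np.minimum(1.0, k * ell * d2 / d2.sum())
        picked = X[rng.random(len(X)) < p]        # independent Bernoulli draws
        for c in picked:
            centers.append(c)
            d2 = np.minimum(d2, ((X - c) ** 2).sum(axis=1))
    # Prune to k: a simple furthest-point pass over the sampled centers
    # stands in for the weighted re-clustering step.
    C = np.array(centers)
    keep = [C[0]]
    dd = ((C - keep[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        keep.append(C[dd.argmax()])
        dd = np.minimum(dd, ((C - keep[-1]) ** 2).sum(axis=1))
    return np.array(keep)
```

Because the Bernoulli draws are independent, the inner selection parallelizes trivially across machines; only the small sampled set has to be gathered for the prune.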
k-means||: Analysis
How Many Rounds?
– Theorem: After O(log_ℓ(n·φ)) rounds, guarantees an O(1) approximation
– In practice: fewer iterations are needed
– Need to re-cluster the O(k · ℓ · log_ℓ(n·φ)) intermediate centers
Discussion:
– Number of rounds is independent of k
– Tradeoff between the number of rounds and memory
How well does this work?
[Plot: clustering cost vs. number of rounds (log scales) on the KDD dataset, k = 65, comparing random initialization, k-means++, and k-means|| with oversampling ℓ/k = 1, 2, 4.]
Performance vs. k-means++
– Even better on small datasets: 4,600 points, 50 dimensions (SPAM)
– Accuracy: [table omitted]
– Time (iterations): [table omitted]
New Tricks for k-Means
Initialization:
– Is random initialization a good idea?
Large data:
– Clustering many points (in parallel)
– Clustering into many clusters
Large k
How do you run k-means when k is large?
– For every point, need to find the nearest center
– Naive approach: linear scan
– Better approach [Elkan]:
• Use the triangle inequality to see if a center could have possibly gotten closer
• Still expensive when k is large
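The triangle-inequality test can be sketched in one rule: if d(c, c') ≥ 2·d(x, c), then c' cannot be closer to x than c, so d(x, c') never needs to be computed. A minimal illustration of that single bound (Elkan's algorithm maintains several more bounds than this):

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign each point to its nearest center, skipping centers ruled out
    by the triangle inequality: d(x, c') >= d(c, c') - d(x, c)."""
    # Pairwise center-to-center distances, computed once per iteration.
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if cc[best, j] >= 2 * best_d:
                skipped += 1          # pruned: c_j cannot beat the current best
                continue
            dj = np.linalg.norm(x - centers[j])
            if dj < best_d:
                best, best_d = j, dj
        labels[i] = best
    return labels, skipped
```

The pruning is exact (the assignment is unchanged), but each point still loops over all k centers to apply the test, which is why this remains expensive for very large k.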
Using Nearest Neighbor Data Structures
Expensive step of k-Means:
– For every point, find the nearest center
But we have many algorithms for nearest neighbors!
First idea:
– Index the centers, then do a query into this data structure for every point
– Need to rebuild the NN data structure every time the centers move
Better idea:
– Index the points!
– For every center, query for the nearest points
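The "index the points" idea can be sketched as follows. Here a plain `argpartition` stands in for a real NN index over the static points (the WSDM 2014 paper uses ranked retrieval over inverted indexes for sparse high-dimensional data); `m`, the number of points retrieved per center, is an assumed tuning knob:

```python
import numpy as np

def assign_by_center_queries(X, centers, m):
    """For every center, retrieve its m nearest points and let each
    retrieved point keep the closest center that claimed it. Points
    claimed by no center fall back to a full scan."""
    n = len(X)
    best = np.full(n, -1)
    best_d = np.full(n, np.inf)
    for j, c in enumerate(centers):
        d = np.linalg.norm(X - c, axis=1)
        near = np.argpartition(d, min(m, n - 1))[:m]   # "query the index"
        upd = d[near] < best_d[near]
        best[near[upd]] = j
        best_d[near[upd]] = d[near][upd]
    # Fallback for unclaimed points: exact scan (rare if m is large enough).
    miss = best == -1
    if miss.any():
        d_all = np.linalg.norm(X[miss, None] - centers[None], axis=2)
        best[miss] = d_all.argmin(axis=1)
    return best
```

The index over the points is built once and never rebuilt, since the points do not move between iterations; only the k queries change.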
Performance
Two large datasets:
– 1M points in each
– 7-25M features in each (very high dimensionality)
– Clustering into k = 1000 clusters
Index-based k-means:
– Simple implementation: 2-7x faster than traditional k-means
– No degradation in quality (same objective function value)
– More complex implementation: an additional 8-50x speed improvement!
K-Means Algorithm
Almost 60 years on, still an incredibly popular and useful approach.
It has gotten better with age:
– Better initialization approaches that are fast and accurate
– Parallel implementations to handle large datasets
– New implementations that handle points in many dimensions and clustering into many clusters
– New approaches for online clustering
More work remains!
– Non-spherical clusters
– Other metric spaces
– Dealing with outliers
Thank You.
Arthur, D., Vassilvitskii, S. k-means++: The advantages of careful seeding. SODA 2007.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012.
Broder, A., Garcia, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16

  • 1. Teaching k-Means New Tricks Sergei Vassilvitskii Google
  • 2. k-Means Algorithm The k-Means Algorithm [Lloyd ’57] – Clusters points into groups – Remains a workhorse of machine learning even in the age of deep networks
  • 3. MR ML Algorithmics Sergei Vassilvitskii Lloyd’s Method: k-means Initialize with random clusters
  • 4. Lloyd’s Method: k-means Assign each point to the nearest center
  • 5. Lloyd’s Method: k-means Recompute the optimum centers (means)
  • 6. Lloyd’s Method: k-means Repeat: assign points to the nearest center
  • 7. Lloyd’s Method: k-means Repeat: recompute centers
  • 8. Lloyd’s Method: k-means Repeat...
  • 9. Lloyd’s Method: k-means Repeat... until the clustering does not change
  • 10. Lloyd’s Method: k-means Repeat... until the clustering does not change. The total error is reduced at every step, so the algorithm is guaranteed to converge.
  • 11. Lloyd’s Method: k-means Repeat... until the clustering does not change. The total error is reduced at every step, so the algorithm is guaranteed to converge. Minimizes: φ(X, C) = Σ_{x ∈ X} d(x, C)²
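The alternation described on the slides above can be sketched in a few lines of NumPy. This is a minimal illustration under my own naming (`lloyd`, `cost`), not any of the production implementations discussed later in the deck:

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Plain Lloyd's method: alternate assignment and mean steps
    until the clustering stops changing."""
    assign = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: the clustering did not change
        assign = new_assign
        # Update step: recompute each center as the mean of its points.
        for j in range(len(centers)):
            pts = X[assign == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    return centers, assign

def cost(X, centers):
    """phi(X, C) = sum over x in X of d(x, C)^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```

Each iteration can only lower φ(X, C) (or leave it unchanged), which is exactly why convergence is guaranteed.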
  • 12. New Tricks for k-Means Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  • 13. k-means Initialization Random?
  • 15. k-means Initialization Random? A bad idea, even with many random restarts!
  • 17. Easy Fix Select centers using a furthest-point algorithm (a 2-approximation to k-Center clustering).
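The furthest-point rule can be sketched as follows (my own minimal version of Gonzalez's algorithm; the function name is an assumption, not from the slides):

```python
import numpy as np

def furthest_point_init(X, k, rng):
    """Pick an arbitrary first center, then repeatedly take the point
    furthest from all centers chosen so far (Gonzalez's 2-approximation
    for k-Center)."""
    centers = [X[rng.integers(len(X))]]
    # D[i] = distance from X[i] to its nearest chosen center.
    D = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(k - 1):
        centers.append(X[D.argmax()])  # the furthest point wins
        D = np.minimum(D, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```

This spreads centers out, but as the next slides show, any outlier is by construction the furthest point at some step, which is exactly the sensitivity being criticized.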
  • 22. Sensitive to Outliers
  • 27. k-means++ Interpolate between the two methods, giving preference to further points. Let D(p) be the distance between p and the nearest cluster center chosen so far. Sample the next center proportionally to D^α(p).
  • 28. k-means++ kmeans++: Select the first point uniformly at random; for (int i = 1; i < k; ++i) { Select the next point p with probability D^α(p) / Σ_x D^α(x); UpdateDistances(); }
  • 29. k-means++ Original Lloyd’s: α = 0 (uniform random); Furthest Point: α = ∞; k-means++: α = 2
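The D^α sampling rule above can be sketched like this (a hedged illustration with my own naming; α = 2 gives k-means++, α = 0 uniform random selection, and large α approaches the furthest-point rule):

```python
import numpy as np

def dalpha_init(X, k, alpha, rng):
    """Sample centers with probability proportional to D(p)^alpha,
    where D(p) is the distance to the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    D = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(k - 1):
        w = D ** alpha
        probs = w / w.sum()  # D^alpha(p) / sum_x D^alpha(x)
        centers.append(X[rng.choice(len(X), p=probs)])
        D = np.minimum(D, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```

With α = 2, a point sitting on an already-chosen center has D = 0 and therefore zero probability, so sampling naturally moves mass toward far-away regions without deterministically chasing the single furthest outlier.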
  • 30. k-means++
  • 34. k-means++ Theorem [AV ’07]: k-means++ guarantees a Θ(log k) approximation.
  • 35. New Tricks for k-Means Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  • 36. Dealing with large data The new initialization approach: – Leads to very good clusterings – But is very sequential! • Must select one cluster at a time, then update the distribution we are sampling from – How to adapt it in the world of parallel computing?
  • 37. Speeding up initialization Initialization: kmeans++: Select the first point uniformly at random; for (int i = 1; i < k; ++i) { Select the next point p with probability D²(p) / Σ_x D²(x); UpdateDistances(); } Improving the speed: – Instead of selecting a single point, sample many points at a time – Oversample: select more than k centers, then select the best k out of them
  • 38. k-means|| Recall kmeans++: Select the first point uniformly at random; for (int i = 1; i < k; ++i) { Select the next point p with probability D²(p) / Σ_p D²(p); UpdateDistances(); }
  • 39. k-means|| kmeans||: Select the first point c uniformly at random; for (int i = 1; i < log_ℓ(φ(X, c)); ++i) { Select each point p independently with probability k · ℓ · D^α(p) / Σ_x D^α(x); UpdateDistances(); } Prune to k points total by clustering the clusters. Independent selection makes each round an easy MapReduce job; ℓ is the oversampling parameter; the final prune is a re-clustering step.
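The oversampling rounds and the final prune can be sketched as follows. This is a simplified illustration under my own naming: it uses α = 2, a fixed round count instead of log_ℓ(φ), and a plain k-means++-style pass over the candidate set as the re-clustering step (the paper uses weighted clustering there):

```python
import numpy as np

def kmeans_parallel_init(X, k, ell, rounds, rng):
    """k-means|| sketch: in each round, every point joins the candidate
    set independently with probability min(1, ell * D^2(p) / phi), then
    the candidates are pruned down to k centers."""
    C = [X[rng.integers(len(X))]]
    D2 = ((X - C[0]) ** 2).sum(axis=1)
    for _ in range(rounds):
        phi = D2.sum()
        if phi == 0:
            break  # every point already coincides with a candidate
        p = np.minimum(1.0, ell * D2 / phi)
        picked = X[rng.random(len(X)) < p]  # independent -> parallelizable
        for c in picked:
            C.append(c)
            D2 = np.minimum(D2, ((X - c) ** 2).sum(axis=1))
    C = np.array(C)
    if len(C) <= k:
        return C
    # Re-clustering step: prune the (small) candidate set to k centers
    # with D^2 sampling over the candidates.
    keep = [C[rng.integers(len(C))]]
    d2 = ((C - keep[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        keep.append(C[rng.choice(len(C), p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((C - keep[-1]) ** 2).sum(axis=1))
    return np.array(keep)
```

Because each round's selections are independent coin flips per point, a round maps cleanly onto a single MapReduce pass, and only the tiny candidate set needs centralized processing.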
  • 43. k-means||: Analysis How Many Rounds? – Theorem: after O(log_ℓ(nφ)) rounds, guarantees an O(1) approximation – In practice: fewer iterations are needed – Need to re-cluster the O(k · ℓ · log_ℓ(nφ)) intermediate centers Discussion: – Number of rounds is independent of k – Tradeoff between the number of rounds and memory
  • 44. How well does this work? [Plots: clustering cost vs. number of rounds on the KDD dataset (k = 65) for ℓ/k ∈ {1, 2, 4}, compared against random initialization and k-means++.]
  • 45. Performance vs. k-means++ – Even better on small datasets: 4600 points, 50 dimensions (SPAM) – [Table: accuracy and time (iterations) comparisons]
  • 46. New Tricks for k-Means Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  • 47. Large k How do you run k-means when k is large? – For every point, need to find the nearest center – Naive approach: a linear scan over all k centers – Better approach [Elkan]: use the triangle inequality to see if a center could possibly have gotten closer; still expensive when k is large
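The triangle-inequality trick rests on a simple lemma: if d(c, c′) ≥ 2·d(x, c), then d(x, c′) ≥ d(c, c′) − d(x, c) ≥ d(x, c), so c′ cannot beat the current best center and its distance need not be computed. A sketch of just this pruning test (my own minimal version, not Elkan's full bound bookkeeping):

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign each point to a nearest center, skipping centers ruled out
    by the triangle inequality: if d(c, c') >= 2 * d(x, c), then
    d(x, c') >= d(x, c) and c' can be skipped."""
    # Pairwise center-to-center distances, computed once per iteration.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    assign = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if cc[best, j] >= 2 * best_d:
                skipped += 1  # pruned without a distance computation
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        assign[i] = best
    return assign, skipped
```

The center-to-center distances cost k² work but are shared by all n points, so with well-separated clusters most of the n·k point-to-center distances are never computed; when k is very large, even the k² table becomes a burden, which motivates the indexing ideas on the next slides.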
  • 50. Using Nearest Neighbor Data Structures The expensive step of k-means: for every point, find the nearest center. But we have many algorithms for nearest neighbors! First idea: – Index the centers, then query this data structure for every point – But the NN data structure must be rebuilt every time the centers move Better idea: – Index the points! – For every center, query for the nearest points
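The "index the points" inversion can be sketched as below. The stand-in "index query" here is just a brute-force top-m retrieval (a real system would use an inverted index or similar, as in the ranked-retrieval paper); the point of the sketch is the inverted control flow: the point set is static so it is indexed once, queries run per center, and any point no center claims falls back to a full scan. All names and the parameter m are my own:

```python
import numpy as np

def assign_via_point_index(X, centers, m):
    """For each center, retrieve its m nearest points from a (static)
    point index and let them claim that center; points claimed by
    several centers keep the closest one; unclaimed points fall back
    to a full linear scan."""
    n = len(X)
    best_d = np.full(n, np.inf)
    assign = np.full(n, -1, dtype=int)
    for j, c in enumerate(centers):
        d = np.linalg.norm(X - c, axis=1)  # stand-in for an index query
        nearest = np.argpartition(d, min(m, n - 1))[:m]
        idx = nearest[d[nearest] < best_d[nearest]]
        best_d[idx] = d[idx]
        assign[idx] = j
    # Fallback: any point no center claimed gets a full scan.
    for i in np.where(assign == -1)[0]:
        assign[i] = np.linalg.norm(centers - X[i], axis=1).argmin()
    return assign
```

With m set large enough the assignment is exact; smaller m trades a little accuracy for far fewer distance computations, which is where the reported speedups come from.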
  • 53. Performance Two large datasets: – 1M points in each – 7–25M features in each (very high dimensionality) – Clustering into k = 1000 clusters. Index-based k-means: – Simple implementation: 2–7x faster than traditional k-means – No degradation in quality (same objective function value) – More complex implementation: an additional 8–50x speed improvement!
  • 55. K-Means Algorithm Almost 60 years on, still an incredibly popular and useful approach. It has gotten better with age: – Better initialization approaches that are fast and accurate – Parallel implementations to handle large datasets – New implementations that handle points in many dimensions and clustering into many clusters – New approaches for online clustering More work remains! – Non-spherical clusters – Other metric spaces – Dealing with outliers
  • 57. Thank You. Arthur, D., Vassilvitskii, S. k-means++: the advantages of careful seeding. SODA 2007. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012. Broder, A., Garcia, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.