SlideShare uma empresa Scribd logo
1 de 26
Baixar para ler offline
Determining the k in k-means
with MapReduce
Thibault Debatty, Pietro Michiardi,
Wim Mees & Olivier Thonnard
Algorithms for MapReduce and Beyond 2014
Determining the k in k-means with MapReduce 2
Clustering & k-means
● Clustering
● K-means
[Stuart P. Lloyd. Least squares quantization in pcm. IEEE
Transactions on Information Theory, 28:129–137, 1982.]
– 1982 (a great year!)
– But still largely used
– Drawbacks (amongst others):
● Local minimum
● K is a parameter!
Determining the k in k-means with MapReduce 3
Clustering & k-means
● Determine k:
– VERY difficult
[Anil K Jain. Data Clustering : 50 Years Beyond K-Means.
Pattern Recognition Letters, 2009]
– Using cluster evaluation metrics:
Dunn's index, Elbow, Silhouette, “jump method” (based on
information theory), “Gap statistic”,...
O(k²)
Determining the k in k-means with MapReduce 4
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-
means. In Neural Information Processing Systems. MIT
Press, 2003]
● K-means : points in each cluster are
spherically distributed around the center
Source:scikit-learn
Determining the k in k-means with MapReduce 5
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-
means. In Neural Information Processing Systems. MIT
Press, 2003]
● K-means : points in each cluster are
spherically distributed around the center
normality test &
recursion
Determining the k in k-means with MapReduce 6
G-means
Dataset
Determining the k in k-means with MapReduce 7
G-means
1. Pick 2 centers
Determining the k in k-means with MapReduce 8
G-means
2. k-means
Determining the k in k-means with MapReduce 9
G-means
3. Project
Determining the k in k-means with MapReduce 10
G-means
3. Project
Determining the k in k-means with MapReduce 11
G-means
Normal?
No
=> recursion
4. Normality test
Determining the k in k-means with MapReduce 12
G-means
5. Recursion
Determining the k in k-means with MapReduce 13
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
Determining the k in k-means with MapReduce 14
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
Determining the k in k-means with MapReduce 15
MapReduce G-means
2. Reduce number of jobs
PickInitialCenters
while Not ClusteringCompleted do
KMeans
KMeansAndFindNewCenters
TestClusters
end while
Determining the k in k-means with MapReduce 16
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
3. Maximize
parallelism
4. Limit memory
usage
Determining the k in k-means with MapReduce 17
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
Bottleneck
3. Maximize
parallelism
4. Limit memory
usage (risk of crash)
Determining the k in k-means with MapReduce 18
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
In memory combiner
Determining the k in k-means with MapReduce 19
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
#clusters > #reducers
&
Estimated required memory < Java heap
Determining the k in k-means with MapReduce 20
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
#clusters > #reducers
&
Estimated required memory < Java heap
Experimentally:
64 Bytes / point
Determining the k in k-means with MapReduce 21
Comparison
MR multi-k-means MR G-means
Speed
Quality
all possible values of k
in a single job
Determining the k in k-means with MapReduce 22
Comparison
MR multi-k-means MR G-means
Speed O(nk²) computations O(nk) computations
But:
● more iterations
● more dataset reads
● log2
(k)
Quality New centers added if and
where needed
But:
tends to overestimate k!
Determining the k in k-means with MapReduce 23
Experimental results : Speed
● Hadoop
● Synthetic dataset
● 10M points in R10
● Euclidean distance
● 8 machines
Determining the k in k-means with MapReduce 24
Experimental results : Quality
● Hadoop
● Synthetic dataset
● 10M points in R10
● Euclidean distance
● 8 machines
k 100 200 400
kfound
150 279 639
Within Cluster Sum of Square
(less is better)
MR G-means 3.34 3.33 3.23
multi-k-means 3.71 3.6 3.39
(with same k)
x ~1.5
Determining the k in k-means with MapReduce 25
Conclusions & future work...
● MapReduce algorithm to determine k
● Running time proportional to k
● Future:
– Overestimation of k
– Test on real data
– Test scalability
– Reduce I/O (using Spark)
– Consider skewed data
– Consider impact of machine failure
Determining the k in k-means with MapReduce 26
Thank you!

Mais conteúdo relacionado

Mais procurados

3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream NetworkHartanto Sanjaya
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field TheoryIndraneel Pole
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTMHartanto Sanjaya
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusterskazuma_sato
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingAlexander Schätzle
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentationPavel Popa
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...GISRUK conference
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, PadangHartanto Sanjaya
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, TomohonHartanto Sanjaya
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting techniqueUday Vakalapudi
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed LombokHartanto Sanjaya
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFishAnushree Prasanna Kumar
 

Mais procurados (20)

3D Watershed Celebes
3D Watershed Celebes3D Watershed Celebes
3D Watershed Celebes
 
3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network
 
3D Analyst - Cut and Fill
3D Analyst - Cut and Fill3D Analyst - Cut and Fill
3D Analyst - Cut and Fill
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field Theory
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP Processing
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentation
 
3D Analyst - Watershed
3D Analyst - Watershed3D Analyst - Watershed
3D Analyst - Watershed
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
On Skyline Groups
On Skyline GroupsOn Skyline Groups
On Skyline Groups
 
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed Lombok
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFish
 

Semelhante a Determining the k in k-means with MapReduce

Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Austin Benson
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblasMIT
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblasgraphulo
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization TechniquesAjay Bidyarthy
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph MotifAMR koura
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem Eunice Lin
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithmDarshak Mehta
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Vissarion Fisikopoulos
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Austin Benson
 

Semelhante a Determining the k in k-means with MapReduce (20)

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Dp idp exploredb
Dp idp exploredbDp idp exploredb
Dp idp exploredb
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
 
Mypreson 27
Mypreson 27Mypreson 27
Mypreson 27
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
LSH
LSHLSH
LSH
 
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
 

Mais de Thibault Debatty

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsThibault Debatty
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessThibault Debatty
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsThibault Debatty
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...Thibault Debatty
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT DetectionThibault Debatty
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataThibault Debatty
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopThibault Debatty
 

Mais de Thibault Debatty (15)

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphs
 
Blockchain for dummies
Blockchain for dummiesBlockchain for dummies
Blockchain for dummies
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation Awareness
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithms
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...
 
Cyber Range
Cyber RangeCyber Range
Cyber Range
 
Easy Server Monitoring
Easy Server MonitoringEasy Server Monitoring
Easy Server Monitoring
 
Data diode
Data diodeData diode
Data diode
 
USB Portal
USB PortalUSB Portal
USB Portal
 
Smart Router
Smart RouterSmart Router
Smart Router
 
Web shell detector
Web shell detectorWeb shell detector
Web shell detector
 
Graph based APT detection
Graph based APT detectionGraph based APT detection
Graph based APT detection
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT Detection
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text Data
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with Hadoop
 

Último

Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 

Último (20)

Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 

Determining the k in k-means with MapReduce

  • 1. Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard Algorithms for MapReduce and Beyond 2014
  • 2. Determining the k in k-means with MapReduce 2 Clustering & k-means ● Clustering ● K-means [Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.] – 1982 (a great year!) – But still largely used – Drawbacks (amongst others): ● Local minimum ● K is a parameter!
  • 3. Determining the k in k-means with MapReduce 3 Clustering & k-means ● Determine k: – VERY difficult [Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009] – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”,... O(k²)
  • 4. Determining the k in k-means with MapReduce 4 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center Source:scikit-learn
  • 5. Determining the k in k-means with MapReduce 5 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center normality test & recursion
  • 6. Determining the k in k-means with MapReduce 6 G-means Dataset
  • 7. Determining the k in k-means with MapReduce 7 G-means 1. Pick 2 centers
  • 8. Determining the k in k-means with MapReduce 8 G-means 2. k-means
  • 9. Determining the k in k-means with MapReduce 9 G-means 3. Project
  • 10. Determining the k in k-means with MapReduce 10 G-means 3. Project
  • 11. Determining the k in k-means with MapReduce 11 G-means Normal? No => recursion 4. Normality test
  • 12. Determining the k in k-means with MapReduce 12 G-means 5. Recursion
  • 13. Determining the k in k-means with MapReduce 13 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  • 14. Determining the k in k-means with MapReduce 14 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  • 15. Determining the k in k-means with MapReduce 15 MapReduce G-means 2. Reduce number of jobs PickInitialCenters while Not ClusteringCompleted do KMeans KMeansAndFindNewCenters TestClusters end while
  • 16. Determining the k in k-means with MapReduce 16 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure 3. Maximize parallelism 4. Limit memory usage
  • 17. Determining the k in k-means with MapReduce 17 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure Bottleneck 3. Maximize parallelism 4. Limit memory usage (risk of crash)
  • 18. Determining the k in k-means with MapReduce 18 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure In memory combiner
  • 19. Determining the k in k-means with MapReduce 19 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap
  • 20. Determining the k in k-means with MapReduce 20 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap Experimentally: 64 Bytes / point
  • 21. Determining the k in k-means with MapReduce 21 Comparison MR multi-k-means MR G-means Speed Quality all possible values of k in a single job
  • 22. Determining the k in k-means with MapReduce 22 Comparison MR multi-k-means MR G-means Speed O(nk²) computations O(nk) computations But: ● more iterations ● more dataset reads ● log2 (k) Quality New centers added if and where needed But: tends to overestimate k!
  • 23. Determining the k in k-means with MapReduce 23 Experimental results : Speed ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines
  • 24. Determining the k in k-means with MapReduce 24 Experimental results : Quality ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines k 100 200 400 kfound 150 279 639 Within Cluster Sum of Square (less is better) MR G-means 3.34 3.33 3.23 multi-k-means 3.71 3.6 3.39 (with same k) x ~1.5
  • 25. Determining the k in k-means with MapReduce 25 Conclusions & future work... ● MapReduce algorithm to determine k ● Running time proportional to k ● Future: – Overestimation of k – Test on real data – Test scalability – Reduce I/O (using Spark) – Consider skewed data – Consider impact of machine failure
  • 26. Determining the k in k-means with MapReduce 26 Thank you!