Clustering
 Group a set of objects
 Objects in the same group should be similar
 Each group has a representative point called its centre
 Goal: minimise the distance from each object to the centre of its group

 Unsupervised learning:
 Un-labelled data
 No training data
Lloyd’s k-means algorithm
 Centres ← Randomly pick k points
 Iterate:
 Assign each point to the closest centre
 Calculate the new centre points: centroids of each cluster

 Problems:
 Every iteration scans the whole list of points, so it is not suitable for vast amounts of data.
 The result is sensitive to a bad initialization of the centres.
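The two alternating steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function names, the fixed iteration cap and the rule for empty clusters (keep the old centre) are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, k, iterations=20, seed=0):
    """Lloyd's k-means: random initial centres, then assign and recompute."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # Centres <- randomly pick k points
    for _ in range(iterations):
        # Assignment step: every point goes to its closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the centroid of its cluster.
        # Assumption: an empty cluster keeps its previous centre.
        new_centres = [centroid(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:  # nothing moved, so we have converged
            break
        centres = new_centres
    return centres
```

Note that the assignment step touches every point in every iteration, which is the scalability problem listed above.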
K-means++
 Centres ← Randomly pick ONE point from X
 Until we have enough centres:
 Choose the next centre p from X with probability
$P(p) = \frac{D(p)^2}{\sum_{x \in X} D(x)^2}$
where D(x) is the distance from x to its closest centre chosen so far.
 The probability increases with the distance to the closest centre.
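The seeding rule above can be sketched as follows. It is a minimal illustration under stated assumptions: the function names are made up for the example, and the input is assumed to contain at least k distinct points (otherwise all weights become zero and the weighted draw fails).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, seed=0):
    """k-means++ seeding: draw each new centre with probability ~ D(x)^2."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]  # the first centre is uniform over X
    # d2[i] holds D(x_i)^2, the squared distance to the closest centre so far.
    d2 = [sq_dist(p, centres[0]) for p in points]
    while len(centres) < k:
        # random.choices normalises the weights, so index i is drawn
        # with probability d2[i] / sum(d2).
        i = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centres.append(points[i])
        # Only the new centre can shrink a point's closest-centre distance.
        d2 = [min(old, sq_dist(p, points[i])) for old, p in zip(d2, points)]
    return centres
```

Points that coincide with an existing centre have weight zero, so they are never picked again.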
K-means#
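 Centres ← Randomly pick 3 log k points from X
 Until we have enough centres:
 Choose from X the next 3 log k centres, each point p with probability
$P(p) = \frac{D(p)^2}{\sum_{x \in X} D(x)^2}$
where D(x) is the distance from x to its closest centre chosen so far.
 Picking several centres per round improves the coverage of the clusters of the optimal solution.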
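A minimal sketch of this batched seeding follows. Several details are assumptions, because the slide does not state them: the logarithm is taken as the natural log, the batch size is rounded up, "enough centres" is interpreted as k batches in total (the `target` parameter), and duplicates within a batch are kept.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sharp_seed(points, k, target=None, seed=0):
    """k-means#-style seeding: like k-means++, but 3 log k centres per round."""
    rng = random.Random(seed)
    # Assumption: natural logarithm, rounded up, and at least one per round.
    batch = max(1, math.ceil(3 * math.log(k)))
    # Assumption: stop after k rounds, i.e. batch * k centres in total.
    target = batch * k if target is None else target
    # First round: uniform picks (with replacement) from X.
    centres = rng.choices(points, k=batch)
    d2 = [min(sq_dist(p, c) for c in centres) for p in points]
    while len(centres) < target and sum(d2) > 0:
        # Later rounds: every pick is drawn with probability ~ D(x)^2.
        picked = rng.choices(points, weights=d2, k=batch)
        centres.extend(picked)
        d2 = [min(old, min(sq_dist(p, c) for c in picked))
              for old, p in zip(d2, points)]
    return centres
```

The output holds more than k centres, so a second step still has to reduce them to k.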
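Divide and conquer
[Diagram: the input is split into chunks POINTS1, POINTS2 and POINTS3. k-means# runs on each chunk and outputs CENTERS1 to CENTERS3 with WEIGHTS1 to WEIGHTS3. k-means++ then runs on these weighted centres and outputs the final CENTERS.]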
Fast streaming k-means
 One pass over the points, selecting those that are far away from the centres already selected.
 When there is not enough space, we remove the centres that are less interesting.
 Finally, we run Lloyd’s algorithm on the centres, using the weights.
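The one-pass step can be sketched as below. This is a simplified illustration, not the algorithm from the referenced papers: the probabilistic rule for opening a centre, the `cutoff` and `growth` parameters, and the way "less interesting" centres are removed (re-inserting all centres with a larger cutoff so that nearby ones merge) are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def add_point(p, w, centres, weights, cutoff, rng):
    """Open a new centre at p, or merge p (with weight w) into the nearest."""
    if centres:
        i = min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
        d2 = sq_dist(p, centres[i])
        # Far-away points are more likely to become centres themselves.
        if rng.random() >= min(1.0, w * d2 / cutoff):
            total = weights[i] + w
            centres[i] = tuple((weights[i] * c + w * x) / total
                               for c, x in zip(centres[i], p))
            weights[i] = total
            return
    centres.append(tuple(p))
    weights.append(w)


def streaming_pass(stream, max_centres, cutoff=1.0, growth=2.0, seed=0):
    """Single pass over the stream that keeps at most max_centres centres."""
    rng = random.Random(seed)
    centres, weights = [], []
    for p in stream:
        add_point(p, 1.0, centres, weights, cutoff, rng)
        while len(centres) > max_centres:
            # Not enough space: raise the cutoff and re-insert the centres,
            # so that close centres collapse into each other.
            cutoff *= growth
            old = list(zip(centres, weights))
            rng.shuffle(old)
            centres, weights = [], []
            for c, w in old:
                add_point(c, w, centres, weights, cutoff, rng)
    return centres, weights
```

The returned weights record how many input points each centre stands for, which is what the final weighted Lloyd step needs.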
Basic Method
 Step 1: single-pass k-means (explained before)
 Output: a not-so-good clustering, but a good set of candidate centres

 Step 2: cluster the weighted centres (facilities) from Step 1
 Output: a good clustering with fewer clusters

 Finding the nearest neighbour is the most time-consuming step
 Simple option: nearest-neighbour search based on random projection
 Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements [1]
 Empirically, projection search is a bit better than 64-bit LSH [4]
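One way to read "NN based on random projection" is sketched below: project every point onto a few random directions, keep each projection sorted, and only compute true distances for points whose projections fall near the query's. The class name, the number of projections and the window size are assumptions for the example; this is not the Compact Projection method from [1], and the result is approximate.

```python
import bisect
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class ProjectionSearch:
    """Approximate nearest-neighbour search using random 1-D projections."""

    def __init__(self, points, num_projections=4, window=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.window = window
        # Random directions with Gaussian components.
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # Per direction: list of (projected value, point index), sorted.
        self.tables = [sorted((dot(p, d), i) for i, p in enumerate(points))
                       for d in self.directions]

    def nearest(self, query):
        """Return the index of the best candidate found for the query."""
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(query, d), -1))
            lo = max(0, pos - self.window)
            hi = min(len(table), pos + self.window)
            candidates.update(i for _, i in table[lo:hi])
        # True distances are only computed for the few candidates.
        return min(candidates, key=lambda i: sq_dist(query, self.points[i]))
```

A larger `window` or more projections trades speed for accuracy; with `window=len(points)` the search degenerates to an exact scan.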
Scaling
 Map:
 Roughly cluster input data using Streaming k-means
 Output: Weighted Centers (Cluster’s Center and the
number of points it contains)

 Reduce:
 All centres are passed to a single reducer
 Apply batch k-means, or one-pass k-means again if there are too many centres
 A combiner can be used, but it is not necessary
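The map and reduce roles can be sketched in plain Python without any MapReduce framework. Everything here is illustrative: `mapper`, `reducer` and `run_job` are hypothetical names, the mapper uses a simple radius rule as a stand-in for streaming k-means, and the reducer starts its weighted batch k-means from the k heaviest centres, which is an assumption.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mapper(split, radius):
    """Rough one-pass clustering of one split; emits (centre, weight) pairs.

    Stand-in for streaming k-means: a point joins the nearest centre if it
    lies within `radius`, otherwise it opens a new centre.
    """
    centres, weights = [], []
    for p in split:
        if centres:
            i = min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            if sq_dist(p, centres[i]) <= radius ** 2:
                w = weights[i]
                # Running mean keeps the centre at the cluster's centroid.
                centres[i] = tuple((w * c + x) / (w + 1)
                                   for c, x in zip(centres[i], p))
                weights[i] = w + 1
                continue
        centres.append(tuple(float(x) for x in p))
        weights.append(1)
    return list(zip(centres, weights))


def reducer(weighted, k, iterations=20):
    """Weighted batch k-means over the centres from all mappers."""
    # Assumption: start from the k heaviest centres (needs at least k inputs).
    centres = [c for c, _ in sorted(weighted, key=lambda cw: -cw[1])[:k]]
    dim = len(centres[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for c, w in weighted:
            i = min(range(k), key=lambda j: sq_dist(c, centres[j]))
            mass[i] += w
            for d in range(dim):
                sums[i][d] += w * c[d]  # each centre counts w times
        centres = [tuple(s / mass[i] for s in sums[i]) if mass[i] else centres[i]
                   for i in range(k)]
    return centres


def run_job(splits, k, radius):
    """Run every mapper, then send all weighted centres to one reducer."""
    weighted = [cw for split in splits for cw in mapper(split, radius)]
    return reducer(weighted, k)
```

Because each mapper emits only a handful of weighted centres, the single reducer sees a small input regardless of how large the original splits were.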
References
 [1] Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements, by Kerui Min et al.
 [2] Fast and Accurate k-means for large datasets, by Shindler et al.
 [3] Streaming k-Means Approximation, by Jaiswal et al.
 [4] Large Scale Single pass k-Means Clustering at Scale, by Ted Dunning
 [5] Apache Mahout
Questions?

Distributed streaming k means
