SlideShare uma empresa Scribd logo
1 de 53
Distributed Computing Seminar

Lecture 4: Clustering – an Overview
     and Sample MapReduce
          Implementation

  Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
                          Summer 2007

       Except as otherwise noted, the content of this presentation is © 2007
       Google Inc. and licensed under the Creative Commons Attribution 2.5
                                    License.
Outline
   Clustering
     Intuition
     Clustering   Algorithms
        The Distance Measure
        Hierarchical vs. Partitional
        K-Means Clustering

     Complexity
     Canopy Clustering
     MapReducing a large data set with K-Means
      and Canopy Clustering
Clustering
   What is clustering?
Google News

               They didn’t pick
                all 3,400,217
                related articles
                by hand…
               Or Amazon.com
               Or Netflix…
Other less glamorous things...
 Hospital Records
 Scientific Imaging
     Related   genes, related stars, related sequences
   Market Research
     Segmenting    markets, product positioning
 Social Network Analysis
 Data mining
 Image segmentation…
The Distance Measure
   How the similarity of two elements in a set
    is determined, e.g.
     Euclidean Distance
     Manhattan Distance
     Inner Product Space
     Maximum Norm
     Or any metric you define   over the space…
Types of Algorithms
   Hierarchical Clustering vs.
   Partitional Clustering
Hierarchical Clustering




   Builds or breaks up a hierarchy of clusters.
Partitional Clustering




   Partitions set into all clusters simultaneously.
Partitional Clustering




   Partitions set into all clusters simultaneously.
K-Means Clustering
   Super simple Partitional Clustering

 Choose the number of clusters, k
 Choose k points to be cluster centers
 Then…
K-Means Clustering
iterate {
  Compute distance from all points to all k-
    centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to
    all specific k-centers
  Replace the k-centers with the new averages
}
But!
   The complexity is pretty high:
    k   * n * O ( distance metric ) * num (iterations)

   Moreover, it can be necessary to send
    tons of data to each Mapper Node.
    Depending on your bandwidth and memory
    available, this could be impossible.
Furthermore
   There are three big ways a data set can be
    large:
     There   are a large number of elements in the
      set.
     Each element can have many features.
     There can be many clusters to discover
   Conclusion – Clustering can be huge, even
    when you distribute it.
Canopy Clustering
 Preliminary step to help parallelize
  computation.
 Clusters data into overlapping Canopies
  using super cheap distance metric.
 Efficient
 Accurate
Canopy Clustering
While there are unmarked points {
  pick a point which is not strongly marked
     call it a canopy center
  mark all points within some threshold of
     it as in it’s canopy
  strongly mark all points within some
     stronger threshold
}
After the canopy clustering…
 Resume hierarchical or partitional
  clustering as usual.
 Treat objects in separate clusters as being
  at infinite distances.
MapReduce Implementation:
   Problem – Efficiently partition a large data
    set (say… movies with user ratings!) into a
    fixed number of clusters using Canopy
    Clustering, K-Means Clustering, and a
    Euclidean distance measure.
The Distance Metric
   The Canopy Metric ($)
   The K-Means Metric ($$$)
Steps!
   Get Data into a form you can use (MR)
   Picking Canopy Centers (MR)
   Assign Data Points to Canopies (MR)
   Pick K-Means Cluster Centers
   K-Means algorithm (MR)
     Iterate!
Data Massage
   This isn’t interesting, but it has to be done.
Selecting Canopy Centers
Assigning Points to Canopies
K-Means Map
Elbow Criterion
   Choose a number of clusters s.t. adding a
    cluster doesn’t add interesting information.

 Rule of thumb to determine what number of
  Clusters should be chosen.
 Initial assignment of cluster seeds has
  bearing on final model performance.
 Often required to run clustering several
  times to get maximal performance
Conclusions
   Clustering is slick
   And it can be done super efficiently
   And in lots of different ways
   Tomorrow! Graph algorithms.

Mais conteúdo relacionado

Mais procurados

3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning ProjectAdeyemi Fowe
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data miningZHAO Sam
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningHouw Liong The
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
 
Clustering
ClusteringClustering
ClusteringMeme Hei
 

Mais procurados (20)

Clustering
ClusteringClustering
Clustering
 
Clique
Clique Clique
Clique
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning Project
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text mining
 
Birch
BirchBirch
Birch
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
Clustering
ClusteringClustering
Clustering
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 
Birch
BirchBirch
Birch
 
K means clustring @jax
K means clustring @jaxK means clustring @jax
K means clustring @jax
 
Clustering
ClusteringClustering
Clustering
 

Destaque

Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreducemakoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceFarzan Hajian
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 

Destaque (20)

Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduce
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 

Semelhante a Lec4 Clustering

CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
multiarmed bandit.ppt
multiarmed bandit.pptmultiarmed bandit.ppt
multiarmed bandit.pptLPrashanthi
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster AnalysisSuman Mia
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptxGandhiMathy6
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptxssuser2023c6
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 

Semelhante a Lec4 Clustering (20)

My8clst
My8clstMy8clst
My8clst
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
multiarmed bandit.ppt
multiarmed bandit.pptmultiarmed bandit.ppt
multiarmed bandit.ppt
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster Analysis
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
clustering.pptx
clustering.pptxclustering.pptx
clustering.pptx
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 

Mais de mobius.cn

Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerankmobius.cn
 
Parallel Programming Primer 1
Parallel Programming Primer 1Parallel Programming Primer 1
Parallel Programming Primer 1mobius.cn
 
Advanced Lighting Techniques Dan Baker (Meltdown 2005)
Advanced Lighting Techniques   Dan Baker (Meltdown 2005)Advanced Lighting Techniques   Dan Baker (Meltdown 2005)
Advanced Lighting Techniques Dan Baker (Meltdown 2005)mobius.cn
 
Influence map
Influence mapInfluence map
Influence mapmobius.cn
 

Mais de mobius.cn (7)

Lec3 Dfs
Lec3 DfsLec3 Dfs
Lec3 Dfs
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Lec2 Mapred
Lec2 MapredLec2 Mapred
Lec2 Mapred
 
Parallel Programming Primer 1
Parallel Programming Primer 1Parallel Programming Primer 1
Parallel Programming Primer 1
 
Lec1 Intro
Lec1 IntroLec1 Intro
Lec1 Intro
 
Advanced Lighting Techniques Dan Baker (Meltdown 2005)
Advanced Lighting Techniques   Dan Baker (Meltdown 2005)Advanced Lighting Techniques   Dan Baker (Meltdown 2005)
Advanced Lighting Techniques Dan Baker (Meltdown 2005)
 
Influence map
Influence mapInfluence map
Influence map
 

Lec4 Clustering

  • 1. Distributed Computing Seminar Lecture 4: Clustering – an Overview and Sample MapReduce Implementation Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet Summer 2007 Except as otherwise noted, the content of this presentation is © 2007 Google Inc. and licensed under the Creative Commons Attribution 2.5 License.
  • 2. Outline  Clustering  Intuition  Clustering Algorithms  The Distance Measure  Hierarchical vs. Partitional  K-Means Clustering  Complexity  Canopy Clustering  MapReducing a large data set with K-Means and Canopy Clustering
  • 3. Clustering  What is clustering?
  • 4. Google News  They didn’t pick all 3,400,217 related articles by hand…  Or Amazon.com  Or Netflix…
  • 5. Other less glamorous things...  Hospital Records  Scientific Imaging  Related genes, related stars, related sequences  Market Research  Segmenting markets, product positioning  Social Network Analysis  Data mining  Image segmentation…
  • 6. The Distance Measure  How the similarity of two elements in a set is determined, e.g.  Euclidean Distance  Manhattan Distance  Inner Product Space  Maximum Norm  Or any metric you define over the space…
  • 7. Types of Algorithms  Hierarchical Clustering vs.  Partitional Clustering
  • 8. Hierarchical Clustering  Builds or breaks up a hierarchy of clusters.
  • 9. Partitional Clustering  Partitions set into all clusters simultaneously.
  • 10. Partitional Clustering  Partitions set into all clusters simultaneously.
  • 11. K-Means Clustering  Super simple Partitional Clustering  Choose the number of clusters, k  Choose k points to be cluster centers  Then…
  • 12. K-Means Clustering iterate { Compute distance from all points to all k- centers Assign each point to the nearest k-center Compute the average of all points assigned to all specific k-centers Replace the k-centers with the new averages }
  • 13. But!  The complexity is pretty high: k * n * O ( distance metric ) * num (iterations)  Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.
  • 14. Furthermore  There are three big ways a data set can be large:  There are a large number of elements in the set.  Each element can have many features.  There can be many clusters to discover  Conclusion – Clustering can be huge, even when you distribute it.
  • 15. Canopy Clustering  Preliminary step to help parallelize computation.  Clusters data into overlapping Canopies using super cheap distance metric.  Efficient  Accurate
  • 16. Canopy Clustering While there are unmarked points { pick a point which is not strongly marked call it a canopy center mark all points within some threshold of it as in it’s canopy strongly mark all points within some stronger threshold }
  • 17. After the canopy clustering…  Resume hierarchical or partitional clustering as usual.  Treat objects in separate clusters as being at infinite distances.
  • 18. MapReduce Implementation:  Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
  • 19. The Distance Metric  The Canopy Metric ($)  The K-Means Metric ($$$)
  • 20. Steps!  Get Data into a form you can use (MR)  Picking Canopy Centers (MR)  Assign Data Points to Canopies (MR)  Pick K-Means Cluster Centers  K-Means algorithm (MR)  Iterate!
  • 21. Data Massage  This isn’t interesting, but it has to be done.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52. Elbow Criterion  Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.  Rule of thumb to determine what number of Clusters should be chosen.  Initial assignment of cluster seeds has bearing on final model performance.  Often required to run clustering several times to get maximal performance
  • 53. Conclusions  Clustering is slick  And it can be done super efficiently  And in lots of different ways  Tomorrow! Graph algorithms.