Super-fast Online Clustering


whoami – Ted Dunning




Clustering? Why?

     Because other people do it
       –   Really!


     Because cluster distances make great model features
       –   Better


     Because good clusters help with really fast nearest neighbor search
       –   Very nice


     Because we can use clusters as a surrogate for all the data
       –   And that lets us train models or do visualization


Agenda

     Nearest neighbor models
       –   Colored dots; need good distance metric; projection, LSH and k-means
           search
     K-means algorithms
       –   O(k d log n) per point for Lloyd’s algorithm
                   … not good for k = 2000, n = 10⁸
       –   Surrogate methods
            •    fast, sloppy single pass clustering with κ = k log n
             •    fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
            •    fast, in-memory, high-quality clustering of κ weighted centroids
            •    result consists of k high-quality centroids for the original data
     Results


Nearest Neighbor Models

     Find the k nearest training examples
     Use the average value of the target variable from them
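The model itself is just an average; all of the work is in finding the neighbors. A minimal sketch in Java (method and parameter names are illustrative):

    // Given the indices of the k nearest training examples,
    // predict by averaging their target values.
    static double predict(int[] nearestIndices, double[] targets) {
        double sum = 0;
        for (int i : nearestIndices) {
            sum += targets[i];
        }
        return sum / nearestIndices.length;
    }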


     This is easy … but hard
       –   easy because it is so conceptually simple and you don’t have knobs to turn
           or models to build
       –   hard because of the stunning amount of math
       –   also hard because we need top 50,000 results


     Initial rapid prototype was massively too slow
       –   3K queries x 200K examples takes hours
       –   needed 20M x 25M in the same time

K-Nearest Neighbor Example




Comparison to Other Modeling Approaches

     Logistic regression
       –   Depends on linear separability
       –   k-nn works very well if logistic regression works
       –   k-nn can work very well even if logistic regression fails due to interactions
           producing non-linear decision surface
     Tree-based methods
       –   mostly roughly equivalent in accuracy with k-nn




Required Scale and Speed and Accuracy

     Want 20 million queries against 25 million references in 10,000 s
     Should be able to search > 100 million references
     Should be linearly and horizontally scalable
     Must have >50% overlap with the exact (brute-force) reference search results
     Evaluation by sub-sampling is viable, but tricky




How Hard is That?

     20 M x 25 M x 100 Flop = 50 P Flop


     1 CPU = 5 Gflops


     We need 10 M CPU-seconds => 1,000 CPUs to finish in 10,000 s


     Real-world efficiency losses may increase that by 10x => 10,000 CPUs


     Not good!



How Can We Search Faster?

     First rule: don’t do it
       –   If we can eliminate most candidates, we can do less work
       –   Projection search and k-means search


     Second rule: don’t do it
       –   We can convert big floating point math to clever bit-wise integer math
       –   Locality sensitive hashing


     Third rule: reduce dimensionality
       –   Projection search
       –   Random projection for very high dimension


Note the Circularity

     Clustering helps nearest neighbor search



     But clustering needs nearest neighbor search internally



     How droll!




Projection Search


                                         java.util.TreeSet!
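The punchline is literal: sort reference vectors by their scalar projection onto a random direction, and near neighbors in space tend to land near each other in the sorted order. A minimal sketch of the idea, using java.util.TreeMap so each projection value can carry its vector; the class and methods are illustrative, not Mahout's actual ProjectionSearch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.TreeMap;

    public class ProjectionSearchSketch {
        private final double[] direction;                        // random direction
        private final TreeMap<Double, double[]> index = new TreeMap<>();

        public ProjectionSearchSketch(int dim, Random rand) {
            direction = new double[dim];
            for (int i = 0; i < dim; i++) {
                direction[i] = rand.nextGaussian();
            }
        }

        public void add(double[] v) {
            // ties on the projection overwrite here; real code would
            // keep a list of vectors per key
            index.put(dot(direction, v), v);
        }

        // Collect up to `window` candidates on each side of the query's
        // projection, then brute-force only those candidates.
        public double[] search(double[] query, int window) {
            double q = dot(direction, query);
            List<double[]> candidates = new ArrayList<>();
            for (double[] v : index.headMap(q, true).descendingMap().values()) {
                if (candidates.size() >= window) break;
                candidates.add(v);
            }
            for (double[] v : index.tailMap(q, false).values()) {
                if (candidates.size() >= 2 * window) break;
                candidates.add(v);
            }
            double[] best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (double[] v : candidates) {
                double d = distanceSq(v, query);
                if (d < bestDist) { bestDist = d; best = v; }
            }
            return best;
        }

        private static double dot(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
            return sum;
        }

        private static double distanceSq(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }
    }

One random direction is a weak filter on its own; the next slide asks how many independent projections are needed.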




How Many Projections?




LSH Search

     Each random projection produces an independent sign bit
     If two vectors have the same projected sign bits, they probably
      point in the same direction (i.e. cos θ ≈ 1)
     Distance in L2 is closely related to cosine

                           ‖x − y‖² = ‖x‖² − 2(x · y) + ‖y‖²
                                    = ‖x‖² − 2‖x‖‖y‖ cos θ + ‖y‖²

     We can replace (some) vector dot products with long integer XOR
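A sketch of that trick, with illustrative names: hash each vector to 64 sign bits from 64 random projections, then count matching bits with one XOR and a popcount (Long.bitCount) instead of a floating point dot product:

    import java.util.Random;

    public class LshSketch {
        private final double[][] projections = new double[64][];

        public LshSketch(int dim, Random rand) {
            for (int i = 0; i < 64; i++) {
                projections[i] = new double[dim];
                for (int j = 0; j < dim; j++) {
                    projections[i][j] = rand.nextGaussian();
                }
            }
        }

        // Pack one sign bit per random projection into a single long.
        public long hash(double[] v) {
            long bits = 0;
            for (int i = 0; i < 64; i++) {
                double dot = 0;
                for (int j = 0; j < v.length; j++) {
                    dot += projections[i][j] * v[j];
                }
                if (dot > 0) {
                    bits |= 1L << i;
                }
            }
            return bits;
        }

        // Number of differing sign bits; small values mean the vectors
        // very probably point in nearly the same direction.
        public static int bitDistance(long h1, long h2) {
            return Long.bitCount(h1 ^ h2);
        }
    }

Ranking candidates by bitDistance and computing exact distances only for the best few replaces most of the floating point work; the next slide shows how tightly bit matches track cosine.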


LSH Bit-match Versus Cosine
[Figure: cosine similarity (y axis, −1 to 1) plotted against the number of matching sign bits out of 64 (x axis, 0 to 64)]
Results




K-means Search

     First do clustering with lots (thousands) of clusters


     Then search nearest clusters to find nearest points


     We win if we find >50% overlap with “true” answer


     We lose if we can’t cluster super-fast
       –   more on this later
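A hedged sketch of the search side, assuming the clustering has already produced centroids with per-cluster member lists (the class is illustrative, not Mahout's KmeansSearch):

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    public class KmeansSearchSketch {
        private final double[][] centroids;            // cluster centers
        private final List<List<double[]>> members;    // points per cluster

        public KmeansSearchSketch(double[][] centroids, List<List<double[]>> members) {
            this.centroids = centroids;
            this.members = members;
        }

        // Brute-force only the members of the `probes` nearest clusters
        // instead of all n reference points.
        public double[] search(double[] query, int probes) {
            Integer[] order = new Integer[centroids.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order,
                Comparator.comparingDouble(i -> distanceSq(centroids[i], query)));

            double[] best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int p = 0; p < probes && p < order.length; p++) {
                for (double[] v : members.get(order[p])) {
                    double d = distanceSq(v, query);
                    if (d < bestDist) { bestDist = d; best = v; }
                }
            }
            return best;
        }

        private static double distanceSq(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }
    }

With thousands of clusters, probing a handful of them touches only a small fraction of the reference points, which is where the speedups discussed later come from.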




Lots of Clusters Are Fine




Lots of Clusters Are Fine




Some Details

     Clumpy data works better
       –   Real data is clumpy 


     Speedups of 100-200x seem practical with 50% overlap
       –   Projection search and LSH can be used to accelerate that (some)


     More experiments needed


     Definitely need fast search




So Now Some Clustering




Lloyd’s Algorithm

     Part of CS folklore
     Developed in the late 1950s for signal quantization, published in the 1980s

       initialize k cluster centroids somehow
       for each of many iterations:
         for each data point:
               assign point to nearest cluster
         recompute cluster centroids from points assigned to clusters


         Highly variable quality, several restarts recommended
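The pseudocode above translates almost line for line into Java. A minimal sketch with plain arrays and crude random initialization (no Mahout types):

    import java.util.Random;

    public class LloydsSketch {
        // One full Lloyd's run: assign every point to its nearest
        // centroid, then recompute each centroid as the mean.
        public static double[][] cluster(double[][] points, int k,
                                         int iterations, Random rand) {
            int dim = points[0].length;
            double[][] centroids = new double[k][];
            for (int i = 0; i < k; i++) {
                centroids[i] = points[rand.nextInt(points.length)].clone();
            }
            for (int iter = 0; iter < iterations; iter++) {
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (double[] p : points) {
                    int nearest = 0;
                    double best = Double.POSITIVE_INFINITY;
                    for (int c = 0; c < k; c++) {        // the O(k d) inner loop
                        double d = distanceSq(p, centroids[c]);
                        if (d < best) { best = d; nearest = c; }
                    }
                    counts[nearest]++;
                    for (int j = 0; j < dim; j++) sums[nearest][j] += p[j];
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] > 0) {
                        for (int j = 0; j < dim; j++) {
                            centroids[c][j] = sums[c][j] / counts[c];
                        }
                    }
                }
            }
            return centroids;
        }

        private static double distanceSq(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }
    }

The inner loop over all k centroids for every point is exactly the cost that the surrogate methods on the following slides attack.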



Ball k-means

     Provably better for highly clusterable data
     Tries to find initial centroids in the “core” of real clusters
     Avoids outliers in centroid computation

       initialize centroids randomly with distance maximizing tendency
       for each of a very few iterations:
         for each data point:
               assign point to nearest cluster
         recompute centroids using only points much closer than closest cluster
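The interesting step is the trimmed recomputation. A hedged sketch of just that step; the half-distance cutoff and the helpers assignedTo and distanceToSecondClosest are illustrative assumptions, not the exact rule from the literature:

    // Recompute each centroid using only points that sit well inside
    // the cluster: here, points whose distance to their own centroid
    // is less than half their distance to the runner-up centroid.
    // Outliers and boundary points simply drop out of the mean.
    for (int c = 0; c < k; c++) {
        double[] sum = new double[dim];
        int count = 0;
        for (double[] p : assignedTo(c)) {                   // hypothetical helper
            double d1 = distance(p, centroids[c]);
            double d2 = distanceToSecondClosest(p);          // hypothetical helper
            if (d1 < 0.5 * d2) {                             // "much closer" rule
                for (int j = 0; j < dim; j++) sum[j] += p[j];
                count++;
            }
        }
        if (count > 0) {
            for (int j = 0; j < dim; j++) centroids[c][j] = sum[j] / count;
        }
    }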




Surrogate Method

     Start with sloppy clustering into κ = k log n clusters
     Use these clusters as a weighted surrogate for the data
     Cluster surrogate data using ball k-means


     Results are provably high quality for highly clusterable data
     Sloppy clustering can be done online
     Surrogate can be kept in memory
     Ball k-means pass can be done at any time




Algorithm Costs

     O(k d log n) per point for Lloyd’s algorithm
                   … not so good for k = 2000, n = 10⁸
     Surrogate methods
       –   fast, sloppy single pass clustering with κ = k log n
       –   fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n))
           per point
       –   fast, in-memory, high-quality clustering of κ weighted centroids
       –   result consists of k high-quality centroids
     This is a big deal:
        –   k log n = 2000 x 26 ≈ 52,000 (the factor of d cancels in the ratio)
        –   log k + log log n = 11 + 5 = 17
        –   roughly 3000 times faster makes the grade as a bona fide big deal


The Internals

     Mechanism for extending Mahout Vectors
       –   DelegatingVector, WeightedVector, Centroid


     Searcher interface
       –   ProjectionSearch, KmeansSearch, LshSearch, Brute


     Super-fast clustering
       –   Kmeans, StreamingKmeans




How It Works

     For each point
       –   Find approximately nearest centroid (distance = d)
        –   If d > threshold, start a new centroid
        –   Else start a new centroid anyway with probability proportional to d
        –   Else add the point to the nearest centroid
     If the number of centroids exceeds κ ≈ k log n
       –   Recursively cluster centroids with higher threshold


     Result is large set of centroids
        –   these provide an approximation of the original distribution
        –   we can cluster the centroids to get a close approximation of clustering the
            original data
       –   or we can just use the result directly
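A single-threaded sketch of this loop, with plain arrays, a brute-force nearest-centroid search standing in for the fast searchers above, and illustrative names (this is the shape of the idea, not StreamingKmeans itself):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class SloppyClustererSketch {
        private List<double[]> centroids = new ArrayList<>();
        private List<Double> weights = new ArrayList<>();
        private final int maxCentroids;                 // roughly k log n
        private double threshold;
        private final Random rand = new Random();

        public SloppyClustererSketch(int maxCentroids, double initialThreshold) {
            this.maxCentroids = maxCentroids;
            this.threshold = initialThreshold;
        }

        public void add(double[] point, double weight) {
            int nearest = nearest(point);
            double d = nearest < 0 ? Double.POSITIVE_INFINITY
                                   : Math.sqrt(distanceSq(point, centroids.get(nearest)));
            if (d > threshold || rand.nextDouble() < d / threshold) {
                // far points always start new centroids, near ones occasionally
                centroids.add(point.clone());
                weights.add(weight);
            } else {
                // fold the point into the nearest centroid as a weighted mean
                double[] c = centroids.get(nearest);
                double w = weights.get(nearest);
                for (int j = 0; j < c.length; j++) {
                    c[j] = (c[j] * w + point[j] * weight) / (w + weight);
                }
                weights.set(nearest, w + weight);
            }
            if (centroids.size() > maxCentroids) {
                collapse();
            }
        }

        // Too many centroids: raise the threshold and re-cluster the
        // centroids themselves, treating them as weighted points.
        private void collapse() {
            List<double[]> oldCentroids = centroids;
            List<Double> oldWeights = weights;
            centroids = new ArrayList<>();
            weights = new ArrayList<>();
            threshold *= 1.5;        // growth factor is an illustrative choice
            for (int i = 0; i < oldCentroids.size(); i++) {
                add(oldCentroids.get(i), oldWeights.get(i));
            }
        }

        private int nearest(double[] point) {
            // brute force for clarity; the real speed comes from using
            // projection, LSH or k-means search here
            int best = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < centroids.size(); i++) {
                double d = distanceSq(point, centroids.get(i));
                if (d < bestDist) { bestDist = d; best = i; }
            }
            return best;
        }

        private static double distanceSq(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }
    }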

Parallel Speedup?

[Figure: time per point (μs) versus number of threads (1-20); the threaded version, measured at 2-20 threads, tracks the perfect-scaling line, while the non-threaded version stays near 200 μs per point]
What About Map-Reduce?

     Map-reduce implementation is nearly trivial (sketched below)
       –   Compute surrogate on each split
       –   Total surrogate is union of all partial surrogates
       –   Do in-memory clustering on total surrogate
     Threaded version shows linear speedup already
       –   Map-reduce speedup is likely, not entirely guaranteed
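In plain Java, the shape is just this (a sketch reusing the SloppyClustererSketch from earlier; centroids() and weights() are hypothetical accessors, and real code would sit inside Hadoop mapper and reducer classes):

    // Map phase: each split builds its own surrogate independently.
    List<double[]> totalCentroids = new ArrayList<>();
    List<Double> totalWeights = new ArrayList<>();
    for (List<double[]> split : splits) {
        SloppyClustererSketch c = new SloppyClustererSketch(maxCentroids, initialThreshold);
        for (double[] p : split) {
            c.add(p, 1.0);
        }
        // Reduce phase: the union of the partial surrogates is itself
        // just a small weighted data set.
        totalCentroids.addAll(c.centroids());    // hypothetical accessor
        totalWeights.addAll(c.weights());        // hypothetical accessor
    }
    // Finish with in-memory ball k-means on the total surrogate:
    // double[][] finalCentroids = ballKmeans(totalCentroids, totalWeights, k);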




How Well Does it Work?

     Theoretical guarantees for well clusterable data
       –   Shindler, Wong and Meyerson, NIPS, 2011


     Evaluation on held-out data
       –   Need results here




Summary

     Nearest neighbor algorithms can be blazing fast


     But you need blazing fast clustering
       –   Which we now have




Contact Us!

     We’re hiring at MapR in California


     Contact Ted at tdunning@maprtech.com or @ted_dunning


     For slides and other info


http://www.mapr.com/company/events/speaking/la-hug-9-25-12







Editor's Notes

  1. The sub-bullets are just for reference and should be deleted later
  2. The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn’t green or cyan and probably is a bit more likely to be red than purple.
  3. This slide is red to indicate missing data