Large-scale Single-pass k-Means
Clustering at Scale
Ted Dunning


©MapR Technologies - Confidential   1
Large-scale Single-pass k-Means
Clustering

Large-scale k-Means Clustering


Goals

     Cluster very large data sets
     Facilitate large nearest neighbor search
     Allow very large number of clusters
     Achieve good quality
       –   low average distance to nearest centroid on held-out data
     Based on Mahout Math
     Runs on Hadoop (really MapR) cluster
     FAST – cluster tens of millions in minutes




Non-goals

     Use map-reduce (but it is there)
     Minimize the number of clusters
     Support metrics other than L2




Anti-goals

     Multiple passes over original data
     Scale as O(k n)




Why?




K-nearest Neighbor with
  Super Fast k-means




What's that?

     Find the k nearest training examples
     Use the average value of the target variable from them


     This is easy 
 but hard
       –   easy because it is so conceptually simple and you don't have knobs to turn or models to build
       –   hard because of the stunning amount of math
       –   also hard because we need top 50,000 results


     Initial prototype was massively too slow
       –   3K queries x 200K examples takes hours
       –   needed 20M x 25M in the same time
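
A minimal brute-force sketch of the k-nearest-neighbor idea on this slide, assuming L2 distance and plain Python lists. The names `squared_l2` and `knn_predict` are illustrative and not from the talk. It makes the cost visible: one distance computation per query-example pair.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query.

    Brute force: every query is compared with every example, so q queries
    against n examples cost q * n distance computations.
    """
    nearest = heapq.nsmallest(
        k, range(len(examples)), key=lambda i: squared_l2(query, examples[i])
    )
    return sum(targets[i] for i in nearest) / len(nearest)


# Tiny illustrative data set (made up for this sketch).
examples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
targets = [1.0, 2.0, 3.0, 100.0]
prediction = knn_predict((0.1, 0.1), examples, targets, k=3)
```

With k = 3 the far-away example at (5, 5) is ignored and the prediction is the mean of the three nearby targets.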

How We Did It

     2-week hackathon with 6 developers from customer bank
     Agile-ish development
     To avoid IP issues
       –   all code is Apache Licensed (no ownership question)
       –   all data is synthetic (no question of private data)
       –   all development done on individual machines, hosting on GitHub
       –   open is easier than closed (in this case)
     Goal is new open technology to facilitate new closed solutions


     Ambitious goal of ~ 1,000,000 x speedup
       –   well, really only 100-1000x after basic hygiene
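
One way to read the ~1,000,000x figure is as the ratio of the two workloads quoted earlier in the talk (3K x 200K for the prototype, 20M x 25M for the target), assuming cost grows with the number of query-example pairs and the time budget stays the same. That reading is an assumption, not something the slides state.

```python
# Workload sizes quoted in the talk, counted as query-example pairs.
prototype_pairs = 3_000 * 200_000           # 3K queries x 200K examples
target_pairs = 20_000_000 * 25_000_000      # 20M x 25M

# Factor by which the work grows if the time budget is unchanged.
speedup_needed = target_pairs / prototype_pairs
print(f"speedup needed: about {speedup_needed:,.0f}x")
```

The ratio comes out a little under one million, which matches the order of magnitude on the slide.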


What We Did

     Mechanism for extending Mahout Vectors
       –   DelegatingVector, WeightedVector, Centroid


     Shared memory matrix
       –   FileBasedMatrix uses mmap to share very large dense matrices


     Searcher interface
       –   ProjectionSearch, KmeansSearch, LshSearch, Brute


     Super-fast clustering
       –   Kmeans, StreamingKmeans

Projection Search


                                         java.util.TreeSet!
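
The slide text does not spell out the mechanism, so the following is only a sketch of one common form of projection search, under stated assumptions: points are projected onto a few random directions, each projection is kept in sorted order (a sorted list with `bisect` stands in for the role a Java TreeSet would play), and a query inspects only its neighbors in each sorted order before ranking them by true distance. The class name `ProjectionIndex` and the parameters `num_projections` and `window` are hypothetical; this is not Mahout's ProjectionSearch.

```python
import bisect
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ProjectionIndex:
    """Approximate nearest-neighbor search via random 1-D projections."""

    def __init__(self, points, num_projections=3, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        # Random Gaussian directions to project onto.
        self.directions = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_projections)
        ]
        # One sorted list of (projected value, point index) per direction.
        self.sorted_projections = [
            sorted((dot(d, p), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    def search(self, query, k, window=10):
        """Return up to k (squared distance, index) pairs, nearest first.

        Candidates are the points within `window` positions of the query
        in each sorted projection; only those are compared exactly.
        """
        candidates = set()
        for d, proj in zip(self.directions, self.sorted_projections):
            pos = bisect.bisect_left(proj, (dot(d, query), -1))
            lo, hi = max(0, pos - window), min(len(proj), pos + window)
            candidates.update(i for _, i in proj[lo:hi])
        ranked = sorted((squared_l2(query, self.points[i]), i) for i in candidates)
        return ranked[:k]


# Illustrative usage on made-up data.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
index = ProjectionIndex(points)
hits = index.search(points[17], k=1)
```

More projections and a wider window raise recall at the cost of more exact distance computations per query.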




How Many Projections?




K-means Search

     Simple Idea
       –   pre-cluster the data
       –   to find the nearest points, search the nearest clusters


     Recursive application
       –   to search a cluster, use a Searcher!
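
A small sketch of the k-means search idea above: pre-cluster once, then answer a query by scanning only the clusters whose centers are nearest to it. The pre-clustering here is a few plain Lloyd iterations, and each cluster is scanned by brute force where the slide suggests plugging in any Searcher. The names `build_index`, `search` and `probes` are assumptions for illustration.

```python
import random


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centers):
    return min(range(len(centers)), key=lambda j: squared_l2(point, centers[j]))


def assign(points, centers):
    """Group point indices by their nearest center."""
    members = [[] for _ in centers]
    for i, p in enumerate(points):
        members[nearest_index(p, centers)].append(i)
    return members


def build_index(points, num_clusters, iterations=5, seed=0):
    """Pre-cluster the data; return (centers, members per cluster)."""
    centers = random.Random(seed).sample(points, num_clusters)
    dim = len(points[0])
    for _ in range(iterations):
        for j, idx in enumerate(assign(points, centers)):
            if idx:  # keep the old center if a cluster went empty
                centers[j] = tuple(
                    sum(points[i][d] for i in idx) / len(idx) for d in range(dim)
                )
    return centers, assign(points, centers)


def search(query, points, centers, members, k, probes=2):
    """Scan only the `probes` clusters whose centers are nearest to query."""
    order = sorted(range(len(centers)), key=lambda j: squared_l2(query, centers[j]))
    candidates = [i for j in order[:probes] for i in members[j]]
    ranked = sorted((squared_l2(query, points[i]), i) for i in candidates)
    return ranked[:k]


# Illustrative usage on made-up data.
rng = random.Random(7)
points = [(rng.random(), rng.random()) for _ in range(300)]
centers, members = build_index(points, num_clusters=10)
hits = search(points[42], points, centers, members, k=3)
```

Raising `probes` trades speed for recall, since true neighbors can sit just across a cluster boundary.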




But This Requires k-means!

     Need a new k-means algorithm to get speed
       –   Hadoop is very slow at iterative map-reduce
       –   Maybe Pregel clones like Giraph would be better
       –   Or maybe not


     Streaming k-means is
       –   One pass (through the original data)
       –   Very fast (20 ÎŒs per data point with threads)
       –   Very parallelizable




Basic Method

     Use a single pass of k-means with very many clusters
       –   output is a bad-ish clustering but a good surrogate
     Use weighted centroids from step 1 to do in-memory clustering
       –   output is a good clustering with fewer clusters
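
The second step can be sketched as ordinary Lloyd iterations in which each input centroid counts as many times as its weight. This is an illustrative version only: the function name `weighted_kmeans`, the seeding rule (start from the k heaviest inputs) and the fixed iteration count are assumptions, not the talk's implementation.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(centroids, weights, k, iterations=10):
    """Cluster weighted centroids in memory; return (centers, mass per center)."""
    # Seed with the k heaviest inputs (simple deterministic choice, an assumption).
    order = sorted(range(len(centroids)), key=lambda i: -weights[i])
    centers = [tuple(centroids[i]) for i in order[:k]]
    dim = len(centroids[0])
    mass = [0.0] * k
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for c, w in zip(centroids, weights):
            j = min(range(k), key=lambda j: squared_l2(c, centers[j]))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * c[d]
        # Weighted mean of each group; keep the old center if a group is empty.
        centers = [
            tuple(s / m for s in sums[j]) if m > 0 else centers[j]
            for j, m in enumerate(mass)
        ]
    return centers, mass


# Four sketch centroids standing in for the output of the first pass (made up).
sketch = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.2, 10.0)]
sketch_weights = [30, 10, 5, 15]
centers, mass = weighted_kmeans(sketch, sketch_weights, k=2)
```

Because the weights carry the point counts from the first pass, the final centers land near where they would if the original points had been clustered directly.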




Algorithmic Details

For each data point xn
           compute distance to nearest centroid, ∂
           sample u uniformly from [0, 1); if u > ∂/ß, add xn to nearest centroid
           else create new centroid at xn

           if number of centroids > 10 log n
                       recursively cluster centroids
                       set ß = 1.5 ß if number of centroids did not decrease
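
The steps above can be sketched in Python as follows. Several details are assumptions made for the sketch rather than facts from the slide: "log" is taken as the natural logarithm, ß starts at 1.0, and "recursively cluster centroids" is done by feeding the weighted centroids back through the same merge-or-create rule. The names `absorb` and `streaming_kmeans` are illustrative.

```python
import math
import random


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def absorb(centroids, weights, point, weight, beta, rng):
    """Merge a weighted point into its nearest centroid or start a new one."""
    if not centroids:
        centroids.append(list(point))
        weights.append(weight)
        return
    j = min(range(len(centroids)), key=lambda i: squared_l2(point, centroids[i]))
    delta = math.sqrt(squared_l2(point, centroids[j]))
    if rng.random() > delta / beta:
        # Close enough: move the centroid to the weighted mean.
        total = weights[j] + weight
        centroids[j] = [
            (c * weights[j] + x * weight) / total
            for c, x in zip(centroids[j], point)
        ]
        weights[j] = total
    else:
        centroids.append(list(point))
        weights.append(weight)


def streaming_kmeans(points, beta=1.0, seed=0):
    """Single pass over points; returns (centroids, weights, final beta)."""
    rng = random.Random(seed)
    centroids, weights = [], []
    for n, x in enumerate(points, start=1):
        absorb(centroids, weights, x, 1.0, beta, rng)
        # Taken literally, the limit is 0 at n = 1, so beta grows once at the start.
        if len(centroids) > 10 * math.log(n):
            before = len(centroids)
            old = list(zip(centroids, weights))
            centroids, weights = [], []
            for c, w in old:  # re-cluster the centroids themselves
                absorb(centroids, weights, c, w, beta, rng)
            if len(centroids) >= before:
                beta *= 1.5
    return centroids, weights, beta


# Two made-up Gaussian blobs as input.
data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy in [(0.0, 0.0), (20.0, 20.0)]
    for _ in range(500)
]
centroids, weights, beta = streaming_kmeans(points)
```

Every input point ends up counted in exactly one centroid weight, so the weights always sum to the number of points seen.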




How It Works


     Result is large set of centroids
       –   these provide approximation of original distribution
       –   we can cluster centroids to get a close approximation of clustering original
       –   or we can just use the result directly




Parallel Speedup?

[Figure: time per point (ÎŒs) versus number of threads, comparing the non-threaded version and the threaded version against a perfect-scaling line]
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!


                       (Let's not use k-means searcher, though)


     Empirically, projection search beats 64-bit LSH by a bit



Moving to Scale

     Map-reduce implementation nearly trivial


     Map: rough-cluster input data, output ß, weighted centroids


     Reduce:
       –   single reducer gets all centroids
       –   if too many centroids, merge using recursive clustering
       –   optionally do final clustering in-memory


     Combiner possible, but essentially never important
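
The data flow above can be simulated in a single process. To keep the sketch short and deterministic, `rough_cluster` is a stand-in for the streaming step, not the real thing: a point joins its nearest centroid if it lies within a distance cutoff and otherwise starts a new centroid. The cutoff plays the role of ß here, and the function names and the `max_centroids` parameter are assumptions for illustration.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rough_cluster(weighted_points, cutoff):
    """Deterministic stand-in for the streaming step on (point, weight) pairs."""
    centroids = []  # each entry is [position, weight]
    for x, w in weighted_points:
        best = min(centroids, key=lambda c: squared_l2(x, c[0]), default=None)
        if best is not None and squared_l2(x, best[0]) <= cutoff ** 2:
            total = best[1] + w
            best[0] = tuple(
                (a * best[1] + b * w) / total for a, b in zip(best[0], x)
            )
            best[1] = total
        else:
            centroids.append([tuple(x), w])
    return [(c, w) for c, w in centroids]


def mapper(shard, cutoff=1.0):
    """Map: rough-cluster one input split; emit (cutoff, weighted centroids)."""
    return cutoff, rough_cluster([(x, 1.0) for x in shard], cutoff)


def reducer(mapper_outputs, max_centroids):
    """Reduce: one reducer gathers all centroids and merges if there are too many."""
    cutoff = max(c for c, _ in mapper_outputs)
    merged = [wc for _, cents in mapper_outputs for wc in cents]
    while len(merged) > max_centroids:
        cutoff *= 1.5  # loosen the cutoff and re-cluster the centroids
        merged = rough_cluster(merged, cutoff)
    return cutoff, merged


# Two made-up input splits, far apart from each other.
shard_a = [(i * 0.1, 0.0) for i in range(50)]
shard_b = [(100.0 + i * 0.1, 0.0) for i in range(50)]
outputs = [mapper(shard_a), mapper(shard_b)]
final_cutoff, final_centroids = reducer(outputs, max_centroids=4)
total_weight = sum(w for _, w in final_centroids)
```

Only weighted centroids cross the shuffle, so the reducer handles a small amount of data no matter how large the input splits are.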



     Contact:
       –   tdunning@maprtech.com
       –   @ted_dunning


     Slides and such:
       –   http://info.mapr.com/ted-mlconf

       Hash tags: #mlconf #mahout #mapr




04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Graphlab dunning-clustering

  • 1. Large-scale Single-pass k-Means Clustering at Scale Ted Dunning ©MapR Technologies - Confidential 1
  • 3. Large-scale k-Means Clustering
  • 4. Goals
      • Cluster very large data sets
      • Facilitate large nearest neighbor search
      • Allow very large number of clusters
      • Achieve good quality
          – low average distance to nearest centroid on held-out data
      • Based on Mahout Math
      • Runs on Hadoop (really MapR) cluster
      • FAST: cluster tens of millions in minutes
  • 5. Non-goals
      • Use map-reduce (but it is there)
      • Minimize the number of clusters
      • Support metrics other than L2
  • 6. Anti-goals
      • Multiple passes over original data
      • Scale as O(k n)
  • 8. K-nearest Neighbor with Super Fast k-means
  • 9. What’s that?
      • Find the k nearest training examples
      • Use the average value of the target variable from them
      • This is easy ... but hard
          – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
          – hard because of the stunning amount of math
          – also hard because we need top 50,000 results
      • Initial prototype was massively too slow
          – 3K queries x 200K examples takes hours
          – needed 20M x 25M in the same time
  • 11. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
          – all code is Apache Licensed (no ownership question)
          – all data is synthetic (no question of private data)
          – all development done on individual machines, hosting on Github
          – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~1,000,000x speedup
          – well, really only 100-1000x after basic hygiene
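The brute-force baseline from slide 9 and the size of the speedup target can be sketched in a few lines. This is a minimal illustration, not the prototype's code: the function name `knn_predict` and the tiny data set are hypothetical, and the work estimate simply counts one distance evaluation per (query, example) pair.

```python
import heapq
import math

def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query (L2)."""
    dists = ((math.dist(query, x), t) for x, t in zip(examples, targets))
    nearest = heapq.nsmallest(k, dists, key=lambda pair: pair[0])
    return sum(t for _, t in nearest) / len(nearest)

# Brute force needs one distance evaluation per (query, example) pair.
prototype_work = 3_000 * 200_000          # 3K queries x 200K examples
target_work = 20_000_000 * 25_000_000     # 20M x 25M
ratio = target_work / prototype_work      # roughly 8.3e5, i.e. close to 1,000,000x

prediction = knn_predict((0, 0), [(0, 1), (5, 5), (1, 0)], [1.0, 10.0, 3.0], k=2)
```

Doing the larger workload in the same time as the prototype therefore means close to a millionfold increase in throughput, which is where the ~1,000,000x figure comes from.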
  • 12. What We Did
      • Mechanism for extending Mahout Vectors
          – DelegatingVector, WeightedVector, Centroid
      • Shared memory matrix
          – FileBasedMatrix uses mmap to share very large dense matrices
      • Searcher interface
          – ProjectionSearch, KmeansSearch, LshSearch, Brute
      • Super-fast clustering
          – Kmeans, StreamingKmeans
  • 13. Projection Search: java.util.TreeSet!
  • 14. How Many Projections?
  • 15. K-means Search
      • Simple Idea
          – pre-cluster the data
          – to find the nearest points, search the nearest clusters
      • Recursive application
          – to search a cluster, use a Searcher!
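The idea on this slide can be sketched as follows. This is an illustration only, not the Mahout `KmeansSearch` class: the names `build_index`, `kmeans_search` and the `probes` parameter are assumptions, centroids are taken as given, and each probed cluster is scanned by brute force where the slide suggests plugging in another Searcher.

```python
import math

def build_index(points, centroids):
    """Assign every point to its nearest centroid, giving one bucket per cluster."""
    buckets = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        buckets[nearest].append(p)
    return buckets

def kmeans_search(query, centroids, buckets, k, probes):
    """Return the k closest points found in the `probes` clusters nearest to query."""
    order = sorted(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    # Only points in the probed clusters are candidates (brute force inside each).
    candidates = [p for j in order[:probes] for p in buckets[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.3, 0.3), (10.3, 10.3)]
buckets = build_index(points, centroids)
found = kmeans_search((0.2, 0.1), centroids, buckets, k=2, probes=1)
```

The search is approximate: a true neighbor is missed whenever it lies in a cluster that was not probed, so `probes` trades accuracy against the number of distance evaluations.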
  • 21. But This Requires k-means!
      • Need a new k-means algorithm to get speed
          – Hadoop is very slow at iterative map-reduce
          – Maybe Pregel clones like Giraph would be better
          – Or maybe not
      • Streaming k-means is
          – One pass (through the original data)
          – Very fast (20 μs per data point with threads)
          – Very parallelizable
  • 22. Basic Method
      • Use a single pass of k-means with very many clusters
          – output is a bad-ish clustering but a good surrogate
      • Use weighted centroids from step 1 to do in-memory clustering
          – output is a good clustering with fewer clusters
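The second step, clustering the weighted centroids in memory, might look like the sketch below. It uses plain Lloyd iterations with weights; the function name `weighted_kmeans`, the fixed iteration count, and seeding with the k heaviest centroids are assumptions made for illustration, not details taken from the slides.

```python
import math

def weighted_kmeans(centroids, weights, k, iterations=20):
    """Lloyd's iterations over weighted points; returns (centers, center_weights)."""
    # Assumption: seed with the k heaviest sketch centroids (a deterministic stand-in).
    order = sorted(range(len(centroids)), key=lambda i: -weights[i])
    centers = [list(centroids[i]) for i in order[:k]]
    center_weights = [0.0] * len(centers)
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centers]
        center_weights = [0.0] * len(centers)
        for p, w in zip(centroids, weights):
            j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            center_weights[j] += w
            for d, value in enumerate(p):
                sums[j][d] += w * value
        # Weighted mean per cluster; an empty cluster keeps its previous center.
        centers = [
            [s / cw for s in row] if cw > 0 else old
            for row, cw, old in zip(sums, center_weights, centers)
        ]
    return centers, center_weights

sketch = [(0, 0), (0, 2), (10, 10), (10, 12)]
sketch_weights = [3, 1, 1, 3]
centers, center_weights = weighted_kmeans(sketch, sketch_weights, k=2)
```

Because each sketch centroid carries the number of original points it absorbed, the weighted means approximate what clustering the original data would produce, while touching only the much smaller set of centroids.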
  • 23. Algorithmic Details
      For each data point x_n:
          compute distance to nearest centroid, ∂
          sample u; if u > ∂/ß, add to nearest centroid, else create new centroid
          if number of centroids > 10 log n:
              recursively cluster centroids
              set ß = 1.5 ß if number of centroids did not decrease
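A minimal Python sketch of the steps on this slide follows. Several details are assumptions, since the slide does not state them: u is drawn uniformly from [0, 1), the logarithm is natural, the starting value of ß is 1.0, the recursive clustering step re-runs the same merge rule over the existing weighted centroids, and the limit is floored at 1 so the first point does not trigger a collapse. The names `streaming_kmeans` and `_absorb` are hypothetical.

```python
import math
import random

def _absorb(centroids, point, weight, beta, rng):
    """Merge a weighted point into its nearest centroid, or start a new centroid."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i][0]))
        center, w = centroids[j]
        distance = math.dist(point, center)
        if rng.random() > distance / beta:      # u > distance / beta: merge
            total = w + weight
            merged = [(c * w + p * weight) / total for c, p in zip(center, point)]
            centroids[j] = (merged, total)
            return
    centroids.append((list(point), weight))     # otherwise: new centroid

def streaming_kmeans(points, beta=1.0, seed=0):
    """One pass over points; returns (list of (centroid, weight), final beta)."""
    rng = random.Random(seed)
    centroids = []
    for n, point in enumerate(points, start=1):
        _absorb(centroids, point, 1.0, beta, rng)
        limit = max(1.0, 10 * math.log(n))      # assumption: floor the limit at 1
        if len(centroids) > limit:
            before = len(centroids)
            old, centroids = centroids, []
            for center, w in old:               # recursively cluster the centroids
                _absorb(centroids, center, w, beta, rng)
            if len(centroids) >= before:        # no reduction: loosen the threshold
                beta *= 1.5
    return centroids, beta

data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]
sketch, final_beta = streaming_kmeans(data)
```

Each merge is a weighted average, so the total weight of the sketch equals the number of points seen and the weighted mean of the centroids equals the mean of the data. This sketch finds the nearest centroid by linear scan; that lookup is the inner loop that the Searcher classes are meant to accelerate.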
  • 24. How It Works
      • Result is large set of centroids
          – these provide approximation of original distribution
          – we can cluster centroids to get a close approximation of clustering original
          – or we can just use the result directly
  • 25. Parallel Speedup? [Figure: time per point (μs) versus number of threads, showing the non-threaded version, the threaded version, and a perfect-scaling reference line.]
  • 28. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
      • Empirically, projection search beats 64-bit LSH by a bit
  • 29. Moving to Scale
      • Map-reduce implementation nearly trivial
      • Map: rough-cluster input data, output ß, weighted centroids
      • Reduce:
          – single reducer gets all centroids
          – if too many centroids, merge using recursive clustering
          – optionally do final clustering in-memory
      • Combiner possible, but essentially never important
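The shape of this map and reduce split can be sketched in plain Python, without Hadoop. Everything named here is hypothetical: `rough_cluster` is a simple radius-based stand-in for the streaming clustering step (not the algorithm from slide 23), `radius` plays the role of the threshold that a mapper would pass along, and `max_centroids` stands for the "too many centroids" limit.

```python
import math

def rough_cluster(weighted_points, radius):
    """Stand-in rough clustering: merge a point into the nearest centroid within radius."""
    centroids = []                       # each entry is [center, weight]
    for point, weight in weighted_points:
        best = min(centroids, key=lambda cw: math.dist(point, cw[0]), default=None)
        if best is not None and math.dist(point, best[0]) <= radius:
            total = best[1] + weight
            best[0][:] = [(c * best[1] + p * weight) / total for c, p in zip(best[0], point)]
            best[1] = total
        else:
            centroids.append([list(point), weight])
    return centroids

def mapper(split, radius):
    """Map: rough-cluster one input split into weighted centroids."""
    return rough_cluster(((p, 1.0) for p in split), radius)

def reducer(mapper_outputs, radius, max_centroids):
    """Reduce: one reducer gathers all centroids and merges them if there are too many."""
    merged = [(c, w) for output in mapper_outputs for c, w in output]
    while len(merged) > max_centroids:   # radius must be positive for this to terminate
        radius *= 1.5
        merged = rough_cluster(merged, radius)
    return merged

splits = [[(0, 0), (0.1, 0)], [(10, 10), (10, 10.1)], [(0, 0.1), (10.1, 10)]]
outputs = [mapper(split, radius=1.0) for split in splits]
final = reducer(outputs, radius=1.0, max_centroids=2)
```

Because centroids carry weights, the reducer can treat mapper outputs exactly like input points, which is why a single pass over the original data is enough.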
  • 30. Contact
      • Contact:
          – tdunning@maprtech.com
          – @ted_dunning
      • Slides and such:
          – http://info.mapr.com/ted-mlconf
      • Hashtags: #mlconf #mahout #mapr