Super-Fast Clustering
                Report from MapR workshop
©MapR Technologies - Confidential   1
     Contact:
       –   tdunning@maprtech.com
       –   @ted_dunning


     Twitter for this talk
       –   #mapr_uk


     Slides and such:
       –   http://info.mapr.com/ted-uk-05-2012



Company Background
     MapR provides the industry’s best Hadoop Distribution
       –    Combines the best of the Hadoop community
            contributions with significant internally
            financed infrastructure development
     Background of Team
       – Deep management bench with extensive analytic,
         storage, virtualization, and open source experience
       – Google, EMC, Cisco, VMware, Network Appliance, IBM,
         Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
     Proven
       –    MapR used across industries (Financial Services, Media,
            Telecom, Health Care, Internet Services, Government)
       –    Strategic OEM relationship with EMC and Cisco
       –    Over 1,000 installs

We Also Do …

     Open source development
       –   Zookeeper
       –   Hadoop
       –   Mahout
       –   Stuff
     Partner workshops
       –   Machine learning
       –   Information architecture
       –   Cluster design




The Problem

     A certain bank
       –   had lots of customers
       –   had lots of prospective customers
       –   had a non-trivial number of fraudulent customers
       –   had a non-trivial number of fraudulent merchants


     They also
       –   collected data
       –   built models
       –   collected more data
       –   built more models


But …

     These models were arduous to build


     And hard to test


     So people suggested something simpler


     Like k-nearest neighbor




What’s that?

     Find the k nearest training examples
     Use the average value of the target variable from them


     This is easy … but hard
       –   easy because it is so conceptually simple and you don’t have knobs to turn
           or models to build
       –   hard because of the stunning amount of math
       –   also hard because we need top 50,000 results


     Initial prototype was massively too slow
       –   3K queries x 200K examples takes hours
       –   needed 20M x 25M in the same time
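
The brute-force version of this idea can be sketched in a few lines. This is a minimal illustration, not the workshop's code: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions. Every query scans every training example, so the cost grows as queries x examples: 3K x 200K is 6 x 10^8 distance computations, while 20M x 25M is 5 x 10^14, roughly 800,000 times as many.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target over the k training examples nearest to query.

    examples is a list of (vector, target) pairs. Euclidean distance is an
    assumption; the slides do not say which metric was used.
    """
    if k <= 0 or not examples:
        raise ValueError("need k > 0 and at least one example")
    # Full scan: one distance computation per training example.
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical toy data: two examples near the origin, two far away.
train = [((0.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((5.0, 5.0), 0.0), ((6.0, 5.0), 0.0)]
prediction = knn_predict((0.2, 0.1), train, k=2)
```

Here the two nearest examples both have target 1.0, so their average is the prediction. The full scan inside `knn_predict` is the part that the searcher classes on the next slide are meant to speed up.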

What We Did

     Mechanism for extending Mahout Vectors
       –   DelegatingVector, WeightedVector, Centroid


     Searcher interface
       –   ProjectionSearch, KmeansSearch, LshSearch, Brute


     Super-fast clustering
       –   Kmeans, StreamingKmeans




Projection Search




K-means Search




But These Require k-means!

     Need a new k-means algorithm to get speed


     Streaming k-means is
       –   One pass (through the original data)
       –   Very fast (20 μs per data point with threads)
       –   Very parallelizable




How It Works

     For each point
       –   Find approximately nearest centroid (distance = d)
       –   If d > threshold, new centroid
       –   Else possibly new centroid anyway, chosen at random
       –   Else add to nearest centroid
     If number of centroids > K ≈ C log N
       –   Recursively cluster centroids with higher threshold


     Result is large set of centroids
       –   these provide approximation of original distribution
       –   we can cluster centroids to get a close approximation of clustering original
       –   or we can just use the result directly
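
The steps on this slide can be sketched roughly as follows. This is a hedged illustration rather than the Mahout implementation: the names `streaming_kmeans` and `_add`, the `growth` factor, the Euclidean distance, and the rule that a point within the threshold starts a new centroid with probability d / threshold are all assumptions. The nearest-centroid lookup is a plain scan here, where the slide calls for an approximate search.

```python
import math
import random


def _add(centroids, point, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid or start a new one."""
    if centroids:
        # Brute-force nearest centroid (the slides use an approximate search).
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        d = math.dist(nearest[0], point)
        # d > threshold always starts a new centroid; closer points do so
        # with probability d / threshold (assumed rule).
        if rng.random() >= d / threshold:
            total = nearest[1] + weight
            nearest[0] = [(a * nearest[1] + b * weight) / total
                          for a, b in zip(nearest[0], point)]
            nearest[1] = total
            return
    centroids.append([list(point), weight])


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over points; returns a list of [centroid_vector, weight]."""
    if threshold <= 0 or max_centroids < 1:
        raise ValueError("need threshold > 0 and max_centroids >= 1")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _add(centroids, p, 1.0, threshold, rng)
        # Too many centroids: raise the threshold and re-cluster the centroids.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, w in old:
                _add(centroids, c, w, threshold, rng)
    return centroids


# Hypothetical data: two Gaussian blobs, with K = C log N and C = 10.
rng = random.Random(42)
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (10, 10)] for _ in range(1000)]
sketch = streaming_kmeans(data, max_centroids=int(10 * math.log(len(data))),
                          threshold=0.1)
```

Each entry in `sketch` carries the weight of the points merged into it, so the weights sum to the number of input points and the weighted centroids can be passed to an ordinary k-means step or used directly.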

Parallel Speedup?

     [Figure: time per point (μs) versus number of threads, showing the non-threaded version, the threaded version at 2 to 16 threads, and a perfect-scaling reference line.]


Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!


                       (Let’s not use k-means searcher, though)




Thank You





Editor's Notes

  1. MapR combines the best of the open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop. MapR’s team has a deep bench of enterprise software experience with proven success across storage, networking, virtualization, analytics, and open source technologies. Our CEO has driven multiple companies to successful outcomes in the analytic, storage, and virtualization spaces. Our CTO and co-founder M.C. Srivas was most recently at Google in BigTable. He understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks, which came out of stealth with the fastest NAS storage on the market and was acquired quickly by NetApp. The team includes experience with enterprise storage at Cisco, VMware, IBM and EMC. Our VP of Engineering led emerging technologies and a 600-person team in EMC’s NAS engineering group. We also have experience in Business Intelligence and Analytic companies and open source committers in Hadoop, Zookeeper and Mahout, including PMC members. MapR is proven technology, with installs at leading Hadoop sites across industries and OEM relationships with EMC and Cisco.