Fast Single-pass k-means Clustering
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
• Contact me at
tdunning@maprtech.com
tdunning@apache.com
ted.dunning@gmail.com
@ted_dunning
Agenda
• Rationale
• Theory
– clusterable data, k-means failure modes, sketches
• Algorithms
– ball k-means, surrogate methods
• Implementation
– searchers, vectors, clusterers
• Results
• Application
RATIONALE
Why k-means?
• Clustering allows fast search
– k-nn models allow agile modeling
– lots of data points, 10^8 typical
– lots of clusters, 10^4 typical
• Model features
– Distance to nearest centroids
– Poor man’s manifold discovery
What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization to unseen data critical
– number of points per cluster
– distance distributions
– target function distributions
– model performance stability
An Example
The Problem
• Spirals are a classic “counter” example for k-means
• Classic low dimensional manifold with added noise
• But clustering still makes modeling work well
An Example
The Cluster Proximity Features
• Every point can be described by the nearest
cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point)
by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2
x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
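The proximity encoding described above can be sketched roughly as follows. The function name and the toy centroids are made up for illustration. The 4.3 bits figure would correspond to about 20 clusters (log2 20 ≈ 4.3), which is an inference, not something the slides state.

```python
import math

def proximity_features(point, centroids):
    """Return (index, distance) for the nearest and second-nearest centroid."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return (i1, d1), (i2, d2)

# Toy centroids, invented for this example.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
nearest, second = proximity_features((0.9, 0.2), centroids)

# Bits needed to name one cluster out of len(centroids).
bits_per_index = math.log2(len(centroids))
```

Here the point is described by two small integers and two distances instead of its raw coordinates, which is the "unwinding" the slide refers to.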
Diagonalized Cluster Proximity
Lots of Clusters Are Fine
The Limiting Case
• Too many clusters lead to over-fitting
• Which we mitigate by averaging over several nearby clusters
• In the limit we get k-nn modeling
– and probably use k-means to speed up search
THEORY
Intuitive Theory
• Traditionally, minimize over all distributions
– optimization is NP-complete
– that isn’t like real data
• Recently, assume well-clusterable data
• Interesting approximation bounds provable
$$\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$$
$$1 + O(\sigma^2)$$
For Example
Grouping these two clusters seriously hurts squared distance
$$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$$
ALGORITHMS
Lloyd’s Algorithm
• Part of CS folklore
• Developed in the late 50’s for signal quantization, published in the 80’s
initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters
• Highly variable quality, several restarts recommended
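The pseudocode above can be made runnable along these lines. This is a minimal sketch, not the Mahout implementation: "somehow" is taken to mean picking k random data points, and the function name, iteration count, and toy data are assumptions.

```python
import math
import random

def lloyd(points, k, iterations=20, seed=0):
    """Plain Lloyd iteration on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # "somehow": k random data points
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: each centroid becomes the mean of its points.
        for j, members in enumerate(groups):
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

# Two well-separated toy clumps.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
found = sorted(lloyd(points, 2))
```

On data this cleanly separated any seeding recovers the two clump means; the variable quality mentioned on the slide shows up when clusters are closer together or k is large.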
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s
Result is that these two clusters get glued together
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
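A rough sketch of the ball k-means steps above. Two things here are assumptions rather than details from the slides: the "distance maximizing tendency" is implemented as k-means++-style squared-distance sampling, and "much closer" is read as "distance to own centroid at most `trim` times the distance to the next closest centroid", with `trim = 0.5` chosen arbitrarily.

```python
import math
import random

def seed_far_apart(points, k, rng):
    """Assumed seeding: k-means++-style sampling that favors far-away points."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Weight each point by squared distance to its closest existing seed.
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

def ball_step(points, centroids, trim=0.5):
    """One trimmed centroid update; needs at least two centroids."""
    balls = [[] for _ in centroids]
    for p in points:
        ranked = sorted((math.dist(p, c), j) for j, c in enumerate(centroids))
        (d_own, j_own), (d_other, _) = ranked[0], ranked[1]
        # Keep only points well inside their own cluster; outliers are skipped.
        if d_own <= trim * d_other:
            balls[j_own].append(p)
    return [
        tuple(sum(x) / len(members) for x in zip(*members)) if members else c
        for c, members in zip(centroids, balls)
    ]

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    centroids = seed_far_apart(points, k, random.Random(seed))
    for _ in range(iterations):
        centroids = ball_step(points, centroids, trim)
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (5.0, 6.0)]                      # the last point is an outlier
trimmed = ball_step(points, [(0.0, 0.0), (10.0, 10.0)])
```

In the toy data the outlier at (5, 6) is nearly as close to one centroid as to the other, so it is left out of both means and does not drag either centroid toward the gap.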
Still Not a Win
• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops
exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
Surrogate Method
• Start with sloppy clustering into κ = k log n
clusters
• Use this sketch as a weighted surrogate for the
data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time
Algorithm Costs
• O(k d) per point per iteration for Lloyd’s algorithm
• Number of iterations not well known
• Assuming more than log n iterations is reasonable, so about O(k d log n) per point overall
Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000
– k d log n = 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) = 10(11 + 5) = 160
– 3,000 times faster is a bona fide big deal
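The comparison above can be checked with a few lines of arithmetic. This sketch assumes base-2 logarithms and n = 10^8, the value for which log n ≈ 26 as used on the slide; the variable names are illustrative.

```python
import math

k, d, n = 2000, 10, 10**8   # n = 10^8 is assumed so that log2(n) is about 26

# Per-point cost of Lloyd's algorithm over roughly log n iterations.
lloyd_cost = k * d * math.log2(n)

# Per-point cost of the sketch phase: d * (log k + log log n).
sketch_cost = d * (math.log2(k) + math.log2(math.log2(n)))

speedup = lloyd_cost / sketch_cost
```

With unrounded logarithms the ratio comes out a little above 3,000, in line with the slide's estimate.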
Pragmatics
• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity
How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold), with u uniform in [0, 1), new centroid
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
• Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering the original data
– or we can just use the result directly
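A minimal single-pass sketch of the procedure above. It is not the code from the knn repository: the nearest-centroid search is brute force rather than approximate, a new centroid is assumed to be created with probability d/threshold (u uniform in [0, 1)), and the starting threshold and growth factor are arbitrary assumptions.

```python
import math
import random

def add_point(sketch, point, weight, threshold, rng):
    """Insert one weighted point into the sketch, a list of [centroid, weight]."""
    if sketch:
        entry = min(sketch, key=lambda e: math.dist(e[0], point))  # brute force
        dist = math.dist(entry[0], point)
        # Merge unless the point is far away or the random draw says "new".
        if dist <= threshold and rng.random() >= dist / threshold:
            total = entry[1] + weight
            entry[0] = [(c * entry[1] + x * weight) / total
                        for c, x in zip(entry[0], point)]
            entry[1] = total
            return
    sketch.append([list(point), weight])

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over the data; returns weighted centroids (the surrogate)."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        add_point(sketch, p, 1.0, threshold, rng)
        while len(sketch) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            old, sketch = sketch, []
            for centroid, weight in old:
                add_point(sketch, centroid, weight, threshold, rng)
    return sketch

data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
sketch = streaming_sketch(points, max_centroids=20)
```

The weights in the result always add up to the number of points seen, which is what lets the sketch stand in for the original data in a later weighted clustering pass.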
IMPLEMENTATION
How Can We Search Faster?
• First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
• Second rule: don’t do it
– We can convert big floating point math to clever bit-wise
integer math
– Locality sensitive hashing
• Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension
Projection Search
total ordering!
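One way to read "total ordering": projecting every point onto a single random direction gives each point a scalar key, and scalars can be sorted. A query then only needs to look at points whose keys are close to its own. The class below is an illustrative sketch under that reading, with a single projection and a fixed window; it is not the ProjectionSearch class listed later in the deck.

```python
import bisect
import math
import random

class ProjectionIndex:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        self.direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
        # Sorting by the projected value gives a total ordering of the points.
        self.entries = sorted((self._project(p), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, p):
        return sum(a * b for a, b in zip(self.direction, p))

    def search(self, query, n_results, window=8):
        # Only points within `window` positions of the query's key are scanned.
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo, hi = max(0, pos - window), min(len(self.entries), pos + window)
        candidates = [p for _, p in self.entries[lo:hi]]
        return sorted(candidates, key=lambda p: math.dist(query, p))[:n_results]

points = [(float(i), float(i * i)) for i in range(10)]
index = ProjectionIndex(points)
hits = index.search((2.2, 4.5), n_results=3, window=10)
```

Points that are close in space are close in the projection, but not the other way around, so a narrow window can miss true neighbors; using several projections reduces that risk.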
How Many Projections?
LSH Search
• Each random projection produces independent sign bit
• If two vectors have the same projected sign bits, they
probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine
• We can replace (some) vector dot products with long
integer XOR
$$\|x - y\|^2 = \|x\|^2 - 2(x \cdot y) + \|y\|^2 = \|x\|^2 - 2\|x\|\|y\|\cos\theta + \|y\|^2$$
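The sign-bit idea above can be sketched as follows. The function names are illustrative. The cosine estimate relies on the standard property of sign random projections that two vectors disagree on a bit with probability θ/π; it is an estimate, and it gets noisier with fewer bits.

```python
import math
import random

def random_planes(dim, bits, seed=0):
    """One Gaussian random direction per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def sign_bits(vector, planes):
    """Pack one sign bit per random projection into a Python int."""
    code = 0
    for plane in planes:
        dot = sum(a * b for a, b in zip(plane, vector))
        code = (code << 1) | int(dot >= 0.0)
    return code

def estimated_cosine(code_x, code_y, bits):
    """XOR plus popcount replaces the dot product."""
    differing = bin(code_x ^ code_y).count("1")
    return math.cos(math.pi * differing / bits)

planes = random_planes(dim=3, bits=64)
x = [1.0, 2.0, 3.0]
code_x = sign_bits(x, planes)
code_neg = sign_bits([-v for v in x], planes)
```

Comparing two 64-bit codes costs one XOR and one bit count, regardless of the dimension of the original vectors.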
LSH Bit-match Versus Cosine
[Figure: plot with X axis from 0 to 64 and Y axis from -1 to 1]
Results with 32 Bits
The Internals
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
• Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
– Kmeans, StreamingKmeans
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, with curves for the threaded version, the non-threaded version, and perfect scaling]
What About Map-Reduce?
• Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
already
– Map-reduce speedup is likely, not entirely
guaranteed
How Well Does it Work?
• Theoretical guarantees for well clusterable
data
– Shindler, Wong and Meyerson, NIPS, 2011
• Evaluation on synthetic data
– Rough clustering produces correct surrogates
– Ball k-means strategy 1 performance is very good
with large k
APPLICATION
The Business Case
• Our customer has 100 million cards in
circulation
• Quick and accurate decision-making is key.
– Marketing offers
– Fraud prevention
Opportunity
• Demand for modeling is increasing rapidly
• So they are testing something simpler and
more agile
• Like k-nearest neighbor
What’s that?
• Find the k nearest training examples – lookalike
customers
• This is easy … but hard
– easy because it is so conceptually simple and you don’t
have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
• Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
K-Nearest Neighbor Example
Required Scale and Speed and Accuracy
• Want 20 million queries against 25 million
references in 10,000 s
• Should be able to search > 100 million
references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
search
How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop
• 1 CPU = 5 Gflops
• We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s
• Real-world efficiency losses may increase that by
10x
• Not good!
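The brute-force estimate above can be reproduced step by step. The variable names are illustrative, and the 10,000 s budget is taken from the requirements slide.

```python
queries = 20e6            # 20 million queries
references = 25e6         # 25 million reference points
flop_per_pair = 100       # floating point operations per comparison
cpu_flops = 5e9           # 1 CPU = 5 Gflops
time_budget_s = 10_000    # target wall-clock time from the requirements

total_flop = queries * references * flop_per_pair   # 5e16 = 50 PFlop
cpu_seconds = total_flop / cpu_flops                # 1e7 = 10 M CPU seconds
cpus_needed = cpu_seconds / time_budget_s           # before efficiency losses
cpus_with_losses = 10 * cpus_needed                 # with the 10x allowance
```

Dividing 10 M CPU seconds by the 10,000 s budget gives about 1,000 CPUs, and about 10,000 once the 10x real-world efficiency loss is included.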
K-means Search
• First do clustering with lots (thousands) of clusters
• Then search nearest clusters to find nearest points
• We win if we find >50% overlap with “true” answer
• We lose if we can’t cluster super-fast
– more on this later
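The search strategy above can be sketched like this: group the reference points by nearest centroid once, then answer each query by scanning only the few clusters closest to it. The function names and the `n_probe` parameter are assumptions for illustration, and the centroids are given rather than learned to keep the example short.

```python
import math

def build_index(points, centroids):
    """Group reference points by their nearest centroid."""
    index = {j: [] for j in range(len(centroids))}
    for p in points:
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        index[j].append(p)
    return index

def kmeans_search(query, centroids, index, n_results, n_probe=2):
    """Approximate nearest neighbors: scan only the n_probe closest clusters."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))[:n_probe]
    candidates = [p for j in probe for p in index[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:n_results]

def overlap(approx, exact):
    """Fraction of the exact answer that the approximate search also found."""
    return len(set(approx) & set(exact)) / len(exact)

centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (9.0, 0.0), (10.0, 1.0), (11.0, 0.0),
          (0.0, 9.0), (1.0, 10.0), (0.0, 11.0)]
index = build_index(points, centroids)
query = (0.5, 0.5)
approx = kmeans_search(query, centroids, index, n_results=3, n_probe=1)
exact = sorted(points, key=lambda p: math.dist(query, p))[:3]
score = overlap(approx, exact)
```

Raising `n_probe` trades speed for overlap with the exact answer, which is the quantity the 50% requirement is stated in.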
Lots of Clusters Are Fine
Some Details
• Clumpy data works better
– Real data is clumpy
• Speedups of 100-200x seem practical with
50% overlap
– Projection search and LSH give additional 100x
• More experiments needed
Summary
• Nearest neighbor algorithms can be blazing
fast
• But you need blazing fast clustering
– Which we now have
Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Come get the slides at
http://www.slideshare.net/tdunning/oxford-05oct2012
• Get the code at
https://github.com/tdunning/knn
• Contact me at tdunning@maprtech.com or @ted_dunning

Fast Single-pass K-means Clustering at Oxford

  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill • Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • Rationale • Theory – clusterable data, k-mean failure modes, sketches • Algorithms – ball k-means, surrogate methods • Implementation – searchers, vectors, clusterers • Results • Application
  • 5. Why k-means? • Clustering allows fast search – k-nn models allow agile modeling – lots of data points, 108 typical – lots of clusters, 104 typical • Model features – Distance to nearest centroids – Poor man’s manifold discovery
  • 6. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization to unseen data critical – number of points per cluster – distance distributions – target function distributions – model performance stability
  • 8. The Problem • Spirals are a classic “counter” example for k-means • Classic low dimensional manifold with added noise • But clustering still makes modeling work well
  • 11. The Cluster Proximity Features • Every point can be described by the nearest cluster – 4.3 bits per point in this case – Significant error that can be decreased (to a point) by increasing number of clusters • Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – Error is negligible – Unwinds the data into a simple representation
  • 13. Lots of Clusters Are Fine
  • 14. The Limiting Case • Too many clusters lead to over-fitting • Which we mediate by averaging over several nearby clusters • In the limit we get k-nn modeling – and probably use k-means to speed up search
  • 16. Intuitive Theory • Traditionally, minimize over all distributions – optimization is NP-complete – that isn’t like real data • Recently, assume well-clusterable data: $\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$ • Interesting approximation bounds provable: $1 + O(\sigma^2)$
  • 17. For Example • Grouping these two clusters seriously hurts squared distance: $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$
  • 19. Lloyd’s Algorithm • Part of CS folk-lore • Developed in the late 50’s for signal quantization, published in 80’s • Pseudocode: initialize k cluster centroids somehow; for each of many iterations: (for each data point: assign point to nearest cluster), then recompute cluster centroids from points assigned to clusters • Highly variable quality, several restarts recommended
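The pseudocode on the Lloyd's algorithm slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the Mahout implementation: the function name `lloyd_kmeans` is hypothetical, and the "initialize somehow" step is left to the caller by passing in starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, initial_centroids, iterations=20):
    """Plain Lloyd iterations starting from caller-supplied centroids."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        members = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            members[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        for i, group in enumerate(members):
            if group:  # an empty cluster keeps its previous centroid
                dim = len(group[0])
                centroids[i] = [sum(p[j] for p in group) / len(group) for j in range(dim)]
    return centroids


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(lloyd_kmeans(points, [(0, 0), (10, 10)]))
```

Because the result depends on the starting centroids, a different choice of `initial_centroids` can converge to a worse solution, which is why several restarts are recommended.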
  • 20. Typical k-means Failure • Selecting two seeds here cannot be fixed with Lloyd’s • Result is that these two clusters get glued together
  • 21. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in the “core” of each real cluster • Avoids outliers in centroid computation • Pseudocode: initialize centroids randomly with distance maximizing tendency; for each of a very few iterations: (for each data point: assign point to nearest cluster), then recompute centroids using only points much closer than closest cluster
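The distinctive part of ball k-means is the trimmed centroid update. The sketch below shows one such update under an assumption the slide does not spell out: a point counts as "much closer" when its distance to its own centroid is less than `trim_fraction` times the distance from that centroid to the nearest other centroid. The function name and the `trim_fraction` parameter are hypothetical, and the code needs at least two centroids.

```python
import math


def ball_kmeans_step(points, centroids, trim_fraction=0.5):
    """One trimmed update: only points well inside each cluster's ball count."""
    k = len(centroids)
    # Distance from each centroid to its closest other centroid.
    gap = [
        min(math.dist(centroids[i], centroids[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    sums = [[0.0] * len(c) for c in centroids]
    counts = [0] * k
    for p in points:
        i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Points outside the ball (likely outliers) are ignored in the mean.
        if math.dist(p, centroids[i]) < trim_fraction * gap[i]:
            counts[i] += 1
            for j, x in enumerate(p):
                sums[i][j] += x
    # A centroid with no points inside its ball is left where it was.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
        for i in range(k)
    ]
```

With a small `trim_fraction`, a stray point lying between two clusters no longer drags either centroid toward it.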
  • 22. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
  • 23. Surrogate Method • Start with sloppy clustering into κ = k log n clusters • Use this sketch as a weighted surrogate for the data • Cluster surrogate data using ball k-means • Results are provably good for highly clusterable data • Sloppy clustering is on-line • Surrogate can be kept in memory • Ball k-means pass can be done at any time
  • 24. Algorithm Costs • O(k d log n) per point per iteration for Lloyd’s algorithm • Number of iterations not well known • Iteration > log n reasonable assumption
  • 25. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids: O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality; O(κ d log k) or O(d log κ log k) for larger k, looser quality – result is k high-quality centroids • Even the sloppy clusters may suffice
  • 26. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000,000 (log n ≈ 26, base 2) – k d log n = 2000 x 10 x 26 ≈ 500,000 – d (log k + log log n) = 10(11 + 5) = 160 – roughly 3,000 times faster is a bona fide big deal
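The per-point cost comparison can be checked with a short calculation. This sketch assumes base-2 logarithms and n = 10^8, the size for which log n comes out near the 26 used on the slide. The function names are hypothetical.

```python
import math


def lloyd_cost_per_point(k, d, n):
    """k d log n: Lloyd's algorithm, assuming about log n iterations."""
    return k * d * math.log2(n)


def sketch_cost_per_point(k, d, n):
    """d (log k + log log n): approximate nearest-centroid search in the sketch."""
    return d * (math.log2(k) + math.log2(math.log2(n)))


k, d, n = 2000, 10, 10**8
lloyd = lloyd_cost_per_point(k, d, n)
sketch = sketch_cost_per_point(k, d, n)
print(f"Lloyd: {lloyd:,.0f}  sketch: {sketch:,.0f}  ratio: {lloyd / sketch:,.0f}")
```

Without rounding the logarithms, the ratio still lands in the low thousands, consistent with the slide's order-of-magnitude claim.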
  • 27. Pragmatics • But this requires a fast search internally • Have to cluster on the fly for sketch • Have to guarantee sketch quality • Previous methods had very high complexity
  • 28. How It Works • For each point – Find approximately nearest centroid (distance = d) – If (d > threshold) new centroid – Else if (u < d/threshold), with u uniform random in [0, 1), new cluster – Else add to nearest centroid • If centroids > κ ≈ C log N – Recursively cluster centroids with higher threshold • Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly
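The single-pass sketch described on the "How It Works" slide might look roughly like this in Python. It is a simplified sketch, not the StreamingKmeans code from the knn repository. It makes three assumptions: u is uniform on [0, 1), a new centroid is started with probability d/threshold, and the threshold grows by a fixed factor each time the sketch overflows. The function name, the `growth` parameter, and the brute-force nearest-centroid search (a stand-in for the fast approximate searchers) are all hypothetical.

```python
import math
import random


def streaming_sketch(weighted_points, kappa, threshold, rng, growth=1.5):
    """Single pass over (point, weight) pairs; returns (centroids, threshold).

    Each centroid is a [vector, weight] pair, so the result can serve as a
    weighted surrogate for the original data.
    """
    centroids = []
    for p, w in weighted_points:
        if not centroids:
            centroids.append([list(p), w])
            continue
        # Brute-force stand-in for the approximate nearest-centroid search.
        nearest = min(centroids, key=lambda c: math.dist(p, c[0]))
        d = math.dist(p, nearest[0])
        if d > threshold or rng.random() < d / threshold:
            centroids.append([list(p), w])  # start a new centroid
        else:
            # Fold the point into the nearest centroid (weighted mean).
            total = nearest[1] + w
            nearest[0] = [(c * nearest[1] + x * w) / total for c, x in zip(nearest[0], p)]
            nearest[1] = total
        if len(centroids) > kappa:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            centroids, threshold = streaming_sketch(
                [(c, cw) for c, cw in centroids], kappa, threshold, rng, growth
            )
    return centroids, threshold


data = [((float(i % 7), float((i * 3) % 5)), 1) for i in range(200)]
sketch_centroids, final_threshold = streaming_sketch(data, 10, 0.5, random.Random(0))
```

The total weight of the returned centroids equals the number of input points, which is what lets a later ball k-means pass treat the sketch as a weighted surrogate for the data.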
  • 30. How Can We Search Faster? • First rule: don’t do it – If we can eliminate most candidates, we can do less work – Projection search and k-means search • Second rule: don’t do it – We can convert big floating point math to clever bit-wise integer math – Locality sensitive hashing • Third rule: reduce dimensionality – Projection search – Random projection for very high dimension
  • 33. LSH Search • Each random projection produces independent sign bit • If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1) • Distance in L2 is closely related to cosine: $\|x - y\|^2 = \|x\|^2 - 2(x \cdot y) + \|y\|^2 = \|x\|^2 - 2\|x\|\|y\|\cos\theta + \|y\|^2$ • We can replace (some) vector dot products with long integer XOR
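The sign-bit idea can be illustrated with a small Python sketch. It relies on the standard random-hyperplane property that two vectors at angle θ disagree on any one sign bit with probability θ/π. The function names are hypothetical, and this is not the LshSearch code itself.

```python
import math
import random


def random_projections(dim, bits=64, seed=0):
    """One Gaussian random direction per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vector, projections):
    """Pack the sign of each projection into one integer, one bit per projection."""
    sig = 0
    for row in projections:
        dot = sum(r * v for r, v in zip(row, vector))
        sig = (sig << 1) | (1 if dot > 0 else 0)
    return sig


def matching_bits(sig_a, sig_b, bits=64):
    """XOR marks the differing bits; the rest match."""
    return bits - bin(sig_a ^ sig_b).count("1")


def estimated_cosine(sig_a, sig_b, bits=64):
    """Estimate cos θ from the fraction of differing bits (θ ≈ π * differing / bits)."""
    theta = math.pi * (bits - matching_bits(sig_a, sig_b, bits)) / bits
    return math.cos(theta)


proj = random_projections(dim=3)
a = signature([1.0, 2.0, 3.0], proj)
b = signature([2.0, 4.0, 6.0], proj)     # same direction as a
c = signature([-1.0, -2.0, -3.0], proj)  # opposite direction
print(matching_bits(a, b), matching_bits(a, c))
```

Once signatures are computed, comparing two vectors costs one XOR and a bit count instead of a full dot product, so the expensive distance computation can be reserved for candidates with many matching bits.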
  • 34. LSH Bit-match Versus Cosine [Figure: matching LSH bits (axis from 0 to 64) plotted against cosine (axis from -1 to 1)]
  • 36. The Internals • Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid • Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute • Super-fast clustering – Kmeans, StreamingKmeans
  • 37. Parallel Speedup? [Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling]
  • 38. What About Map-Reduce? • Map-reduce implementation is nearly trivial – Compute surrogate on each split – Total surrogate is union of all partial surrogates – Do in-memory clustering on total surrogate • Threaded version shows linear speedup already – Map-reduce speedup is likely, not entirely guaranteed
  • 39. How Well Does it Work? • Theoretical guarantees for well clusterable data – Shindler, Wong and Meyerson, NIPS, 2011 • Evaluation on synthetic data – Rough clustering produces correct surrogates – Ball k-means strategy 1 performance is very good with large k
  • 41. The Business Case • Our customer has 100 million cards in circulation • Quick and accurate decision-making is key. – Marketing offers – Fraud prevention
  • 42. Opportunity • Demand for modeling is increasing rapidly • So they are testing something simpler and more agile • Like k-nearest neighbor
  • 43. What’s that? • Find the k nearest training examples – lookalike customers • This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results • Initial rapid prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time
  • 45. Required Scale and Speed and Accuracy • Want 20 million queries against 25 million references in 10,000 s • Should be able to search > 100 million references • Should be linearly and horizontally scalable • Must have >50% overlap against reference search
  • 46. How Hard is That? • 20 M x 25 M x 100 Flop = 50 P Flop • 1 CPU = 5 Gflops • We need 10 M CPU seconds => 1,000 CPUs for a 10,000 s run • Real-world efficiency losses may increase that by 10x • Not good!
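The brute-force estimate is simple enough to verify directly. The sketch below uses the figures from the slides (20 M queries, 25 M references, 100 flop per comparison, 5 Gflops per CPU) and the 10,000 s budget from the "Required Scale" slide. The function name is hypothetical.

```python
def brute_force_cost(queries, references, flop_per_pair, cpu_flops, seconds):
    """Return (total flop, CPU-seconds, CPUs needed within the time budget)."""
    total_flop = queries * references * flop_per_pair
    cpu_seconds = total_flop / cpu_flops
    return total_flop, cpu_seconds, cpu_seconds / seconds


total_flop, cpu_seconds, cpus = brute_force_cost(20e6, 25e6, 100, 5e9, 10_000)
print(f"{total_flop:.0e} flop, {cpu_seconds:.0e} CPU-seconds, {cpus:,.0f} CPUs")
print(f"with a 10x efficiency loss: {cpus * 10:,.0f} CPUs")
```

With these inputs, 10 M CPU-seconds works out to 1,000 CPUs before the 10x efficiency allowance and 10,000 after it.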
  • 47. K-means Search • First do clustering with lots (thousands) of clusters • Then search nearest clusters to find nearest points • We win if we find >50% overlap with “true” answer • We lose if we can’t cluster super-fast – more on this later
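A toy version of k-means search, together with the overlap measure used to judge it, could look like this. The names `build_index`, `kmeans_search`, `overlap` and the `probes` parameter are hypothetical, and brute-force scans stand in for the fast searchers.

```python
import math


def build_index(points, centroids):
    """Bucket every point under its nearest centroid."""
    index = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        index[nearest].append(p)
    return index


def kmeans_search(query, centroids, index, probes=2, top=3):
    """Scan only the points in the `probes` clusters nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in index[i]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:top]


def overlap(approx, exact):
    """Fraction of the exact answer that the approximate search recovered."""
    return len(set(approx) & set(exact)) / len(exact)


points = [(0,), (1,), (2,), (10,), (11,), (12,), (20,), (21,), (22,)]
centroids = [(1,), (11,), (21,)]
index = build_index(points, centroids)
query = (6,)
exact = sorted(points, key=lambda p: math.dist(query, p))[:3]
print(overlap(kmeans_search(query, centroids, index, probes=1), exact))
print(overlap(kmeans_search(query, centroids, index, probes=2), exact))
```

Probing more clusters raises the overlap at the cost of scanning more candidates, which is the trade-off behind the ">50% overlap" target.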
  • 48. Lots of Clusters Are Fine
  • 49. Lots of Clusters Are Fine
  • 50. Some Details • Clumpy data works better – Real data is clumpy :-) • Speedups of 100-200x seem practical with 50% overlap – Projection search and LSH give additional 100x • More experiments needed
  • 51. Summary • Nearest neighbor algorithms can be blazing fast • But you need blazing fast clustering – Which we now have
  • 52. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Come get the slides at http://www.slideshare.net/tdunning/oxford-05oct2012 • Get the code at https://github.com/tdunning/knn • Contact me at tdunning@maprtech.com or @ted_dunning

Editor's Notes

  1. The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth, as in the introduction or final slide, where we can each say we are hiring in turn. The overall thrust of the presentation is for you to make these points: Amex does lots of modeling; it is expensive; having a way to quickly test models and new variables would be awesome; so we worked on a new project with MapR. My part will say the following: k-nn basic pictorial motivation (could move to you if you like); describe k-nn quality metric of overlap; show how bad metric breaks k-nn (optional); quick description of LSH and projection search; picture of why k-means search is cool; motivate k-means speed as tool for k-means search; describe single pass k-means algorithm; describe basic data structures; show parallel speedup. Our summary should state that we have achieved: super-fast k-means clustering; initial version of super-fast k-nn search with good overlap.
  2. The sub-bullets are just for reference and should be deleted later
  3. This slide is red to indicate missing data
  4. The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn’t green or cyan and probably is a bit more likely to be red than purple.