LA HUG - Ted Dunning 2012-09-25
3. Clustering? Why?
Because other people do it
– Really!
Because cluster distances make great model features
– Better
Because good clusters help with really fast nearest neighbor search
– Very nice
Because we can use clusters as a surrogate for all the data
– And that lets us train models or do visualization
4. Agenda
Nearest neighbor models
– Colored dots; need good distance metric; projection, LSH and k-means search
K-means algorithms
– O(k d log n) per point for Lloyd’s algorithm
… not good for k = 2000, n = 10⁸
– Surrogate methods
• fast, sloppy single pass clustering with κ = k log n
• fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
• fast, in-memory, high-quality clustering of κ weighted centroids
• result consists of k high-quality centroids for the original data
Results
5. Nearest Neighbor Models
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
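To make the "easy" part concrete, here is a minimal brute-force sketch of a k-NN regressor in numpy. The function name predict_knn and the toy sizes are made up for illustration; this is the conceptually simple (and slow) prototype idea, not the system described in the talk.

import numpy as np

def predict_knn(queries, examples, targets, k=50):
    # For each query, average the target variable over the k nearest examples.
    preds = np.empty(len(queries))
    for i, q in enumerate(queries):
        dist = np.linalg.norm(examples - q, axis=1)   # distance to every example
        nearest = np.argpartition(dist, k)[:k]        # indices of the k closest
        preds[i] = targets[nearest].mean()
    return preds

# tiny usage example (the real problem is 20 M queries x 25 M examples)
examples = np.random.randn(10_000, 10)
targets = np.random.randn(10_000)
print(predict_knn(np.random.randn(5, 10), examples, targets))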
7. Comparison to Other Modeling Approaches
Logistic regression
– Depends on linear separability
– k-nn works very well if logistic regression works
– k-nn can work very well even if logistic regression fails due to interactions producing a non-linear decision surface
Tree-based methods
– roughly equivalent to k-nn in accuracy
8. Required Scale and Speed and Accuracy
Want 20 million queries against 25 million references in 10,000 s
Should be able to search > 100 million references
Should be linearly and horizontally scalable
Must have >50% overlap against reference search
Evaluation by sub-sampling is viable, but tricky
9. How Hard is That?
20 M × 25 M × 100 flops ≈ 50 Pflops of work
1 CPU ≈ 5 Gflops/s
That is 10 M CPU-seconds => ~1,000 CPUs to finish in the 10,000 s budget
Real-world efficiency losses may increase that by 10x
Not good!
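The same estimate as a runnable back-of-the-envelope check (values copied from the bullets above; this is a sketch of the arithmetic, not a benchmark):

queries = 20e6            # query vectors
references = 25e6         # reference vectors
flops_per_pair = 100      # roughly 100 floating point ops per distance
total_flops = queries * references * flops_per_pair   # 5e16 = 50 Pflops

cpu_rate = 5e9            # ~5 Gflops/s per CPU
cpu_seconds = total_flops / cpu_rate                   # 1e7 = 10 M CPU-seconds
budget = 10_000                                        # seconds allowed
print(cpu_seconds / budget)                            # ~1,000 CPUs before losses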
10. How Can We Search Faster?
First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
Second rule: don’t do it
– We can convert big floating point math to clever bit-wise integer math
– Locality sensitive hashing
Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension
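A minimal sketch of projection search with a single random direction. Real uses combine several projections and merge the candidate lists; the window size and function name here are arbitrary illustration choices.

import numpy as np

def projection_search(query, data, n_candidates=100):
    # Order the data by a 1-d random projection and compute exact distances only
    # for the points whose projections land near the query's projection.
    u = np.random.randn(data.shape[1])
    u /= np.linalg.norm(u)                           # random unit direction
    proj = data @ u
    order = np.argsort(proj)
    pos = np.searchsorted(proj[order], query @ u)
    lo = max(0, pos - n_candidates // 2)
    window = order[lo:lo + n_candidates]             # candidates near the query
    dist = np.linalg.norm(data[window] - query, axis=1)
    return window[np.argmin(dist)]                   # best of the examined candidates

data = np.random.randn(100_000, 10)
print(projection_search(np.random.randn(10), data))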
11. Note the Circularity
Clustering helps nearest neighbor search
But clustering needs nearest neighbor search internally
How droll !
14. LSH Search
Each random projection produces independent sign bit
If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
Distance in L2 is closely related to cosine
‖x − y‖² = ‖x‖² − 2(x · y) + ‖y‖²
         = ‖x‖² − 2‖x‖‖y‖ cos θ + ‖y‖²
We can replace (some) vector dot products with long integer XOR
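A small sketch of the sign-bit trick, with 64 hypothetical random projections; the cosine estimate uses the standard fact that the probability of a sign mismatch under random-hyperplane hashing is θ/π.

import numpy as np

def lsh_signature(x, projections):
    # One sign bit per random projection, packed into a single integer,
    # so comparing two vectors becomes XOR plus a bit count.
    sig = 0
    for bit in (projections @ x) > 0:
        sig = (sig << 1) | int(bit)
    return sig

def estimated_cosine(sig_a, sig_b, n_bits=64):
    mismatches = bin(sig_a ^ sig_b).count("1")   # Hamming distance via XOR
    return np.cos(np.pi * mismatches / n_bits)   # mismatch fraction estimates θ/π

projections = np.random.randn(64, 10)
x, y = np.random.randn(10), np.random.randn(10)
print(estimated_cosine(lsh_signature(x, projections), lsh_signature(y, projections)),
      x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))   # exact cosine, for comparison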
15. LSH Bit-match Versus Cosine
[Chart: cosine of the angle between two vectors (y axis, −1 to 1) versus the number of matching LSH sign bits out of 64 (x axis, 0 to 64)]
17. K-means Search
First do clustering with lots (thousands) of clusters
Then search nearest clusters to find nearest points
We win if we find >50% overlap with “true” answer
We lose if we can’t cluster super-fast
– more on this later
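A sketch of the idea, using scikit-learn's KMeans as a stand-in for the fast clustering discussed later; the cluster count and the number of probed clusters are arbitrary illustration values.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_search(query, data, centers, labels, n_probe=5, k=50):
    # Probe only the few clusters whose centroids are nearest the query,
    # then compute exact distances on just those clusters' members.
    probe = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_probe]
    candidates = np.where(np.isin(labels, probe))[0]
    dist = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dist)[:k]]

data = np.random.randn(100_000, 10)
km = KMeans(n_clusters=200, n_init=3).fit(data)      # lots of clusters, built once
hits = kmeans_search(np.random.randn(10), data, km.cluster_centers_, km.labels_)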
20. Some Details
Clumpy data works better
– Real data is clumpy
Speedups of 100-200x seem practical with 50% overlap
– Projection search and LSH can be used to accelerate that (some)
More experiments needed
Definitely need fast search
21. So Now Some Clustering
22. Lloyd’s Algorithm
Part of CS folk-lore
Developed in the late 1950s for signal quantization, published in the 1980s
initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters
Highly variable quality, several restarts recommended
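A minimal runnable version of the loop above (plain Lloyd's in numpy, random initialization); the nested loops are where the O(k d) work per point per iteration comes from.

import numpy as np

def lloyd(data, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]  # initialize "somehow"
    for _ in range(iterations):
        # assign each point to its nearest centroid
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # recompute cluster centroids from the points assigned to each cluster
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assign

centroids, assign = lloyd(np.random.randn(5_000, 10), k=20)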
23. Ball k-means
Provably better for highly clusterable data
Tries to find initial centroids in the “core” of real clusters
Avoids outliers in centroid computation
initialize centroids randomly with distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute each centroid using only points much closer to it than to any other centroid
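A hedged sketch of the trimmed update in the loop above. The trim ratio and the exact "much closer" test are assumptions for illustration, not the constants of the published algorithm.

import numpy as np

def ball_kmeans_step(data, centroids, trim=0.5):
    # Assign points, then recompute each centroid using only the points that are
    # much closer to it than to the runner-up centroid, so outliers and points on
    # cluster boundaries do not drag the centroid around.
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    rows = np.arange(len(data))
    d1 = dist[rows, nearest]                  # distance to own centroid
    dist[rows, nearest] = np.inf
    d2 = dist.min(axis=1)                     # distance to runner-up centroid
    core = d1 <= trim * d2                    # "much closer" -- assumed rule
    updated = centroids.copy()
    for j in range(len(centroids)):
        members = data[core & (nearest == j)]
        if len(members):
            updated[j] = members.mean(axis=0)
    return updated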
24. Surrogate Method
Start with sloppy clustering into κ = k log n clusters
Use these clusters as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably high quality for highly clusterable data
Sloppy clustering can be done on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time
25. Algorithm Costs
O(k d log n) per point for Lloyd’s algorithm
… not so good for k = 2000, n = 10⁸
Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
– result consists of k high-quality centroids
This is a big deal:
– Lloyd’s: k d log n ≈ 2000 × 10 × 26 ≈ 520,000 operations per point
– surrogate: d (log k + log log n) ≈ 10 × (11 + 5) ≈ 160 operations per point
– roughly 3,000 times faster makes the grade as a bona fide big deal
26. The Internals
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
27. How It Works
For each point
– find the approximately nearest centroid (call the distance d)
– if d > threshold, the point becomes a new centroid
– else, with some probability that grows with d, it may still become a new centroid
– otherwise add it to the nearest centroid
If the number of centroids grows past κ ≈ k log n
– recursively cluster the centroids with a higher threshold
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
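A self-contained numpy sketch of that loop. The threshold growth factor and the exact "possibly a new centroid" probability are illustrative guesses, not the constants used in the Mahout StreamingKmeans code.

import numpy as np

def streaming_cluster(data, kappa, weights=None, threshold=1.0, growth=1.5, seed=0):
    # Single pass: each point either joins its nearest centroid or starts a new one
    # (always when farther than the threshold, sometimes at random in proportion to
    # its distance).  Weights count how much data each centroid has absorbed.  When
    # too many centroids pile up, raise the threshold and collapse them recursively.
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(data))
    centroids, cw = [data[0].astype(float)], [float(weights[0])]
    for x, w in zip(data[1:], weights[1:]):
        d = np.linalg.norm(np.array(centroids) - x, axis=1)
        j = int(d.argmin())
        if d[j] > threshold or rng.random() < d[j] / threshold:
            centroids.append(x.astype(float))            # start a new centroid
            cw.append(float(w))
        else:
            centroids[j] = (cw[j] * centroids[j] + w * x) / (cw[j] + w)
            cw[j] += w
        if len(centroids) > kappa:
            threshold *= growth                           # raise the bar, then merge
            c, new_w = streaming_cluster(np.array(centroids), kappa,
                                         np.array(cw), threshold, growth, seed)
            centroids, cw = list(c), list(new_w)
    return np.array(centroids), np.array(cw)

centroids, weights = streaming_cluster(np.random.randn(10_000, 10), kappa=200)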
28. Parallel Speedup?
[Chart: time per point (μs) versus number of threads (1–16) for the non-threaded and threaded versions, with a perfect-scaling reference line; the threaded version tracks the perfect-scaling line closely]
29. What About Map-Reduce
Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
Threaded version shows linear speedup already
– Map-reduce speedup is likely, not entirely guaranteed
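A sketch of the shape of that job, with scikit-learn's MiniBatchKMeans standing in for the sloppy streaming pass and a weighted KMeans standing in for the ball k-means step; names and sizes are illustrative, not the Mahout code.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

def map_phase(split, kappa):
    # Per-split surrogate: kappa sloppy centroids, weighted by how many points
    # each centroid absorbed.
    mbk = MiniBatchKMeans(n_clusters=kappa, n_init=3).fit(split)
    return mbk.cluster_centers_, np.bincount(mbk.labels_, minlength=kappa)

def reduce_phase(partials, k):
    # The total surrogate is the union of the partial surrogates; the final
    # high-quality clustering runs in memory on the weighted centroids.
    centroids = np.vstack([c for c, _ in partials])
    weights = np.concatenate([w for _, w in partials])
    return KMeans(n_clusters=k, n_init=5).fit(centroids, sample_weight=weights)

splits = [np.random.randn(50_000, 10) for _ in range(4)]   # pretend input splits
model = reduce_phase([map_phase(s, kappa=500) for s in splits], k=20)
print(model.cluster_centers_.shape)                         # (20, 10)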
30. How Well Does it Work?
Theoretical guarantees for well clusterable data
– Shindler, Wong and Meyerson, NIPS, 2011
Evaluation on held-out data
– Need results here
31. Summary
Nearest neighbor algorithms can be blazing fast
But you need blazing fast clustering
– Which we now have
32. Contact Us!
We’re hiring at MapR in California
Contact Ted at tdunning@maprtech.com or @ted_dunning
For slides and other info
http://www.mapr.com/company/events/speaking/la-hug-9-25-12
Editor's Notes
– The sub-bullets are just for reference and should be deleted later.
– The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn't green or cyan, and probably is a bit more likely to be red than purple.
– This slide is red to indicate missing data.