LA HUG - Ted Dunning 2012-09-25
3. Clustering? Why?
Because other people do it
– Really!
Because cluster distances make great model features
– Better
Because good clusters help with really fast nearest neighbor search
– Very nice
Because we can use clusters as a surrogate for all the data
– And that lets us train models or do visualization
4. Agenda
Nearest neighbor models
– Colored dots; need good distance metric; projection, LSH and k-means search
K-means algorithms
– O(k d log n) per point for Lloyd’s algorithm
… not good for k = 2000, n = 10⁸
– Surrogate methods
• fast, sloppy single pass clustering with κ = k log n
• fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
• fast, in-memory, high-quality clustering of κ weighted centroids
• result consists of k high-quality centroids for the original data
Results
5. Nearest Neighbor Models
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
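To make the "easy" part concrete, here is a minimal brute-force sketch of a k-NN regressor in numpy. The function name predict_knn and the toy sizes are made up for illustration; this is the conceptually simple (and slow) prototype idea, not the system described in the talk.

import numpy as np

def predict_knn(queries, examples, targets, k=50):
    # For each query, average the target variable over the k nearest examples.
    preds = np.empty(len(queries))
    for i, q in enumerate(queries):
        dist = np.linalg.norm(examples - q, axis=1)   # distance to every example
        nearest = np.argpartition(dist, k)[:k]        # indices of the k closest
        preds[i] = targets[nearest].mean()
    return preds

# tiny usage example (the real problem is 20 M queries x 25 M examples)
examples = np.random.randn(10_000, 10)
targets = np.random.randn(10_000)
print(predict_knn(np.random.randn(5, 10), examples, targets))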
7. Comparison to Other Modeling Approaches
Logistic regression
– Depends on linear separability
– k-nn works very well if logistic regression works
– k-nn can work very well even if logistic regression fails due to interactions producing a non-linear decision surface
Tree-based methods
– roughly equivalent to k-nn in accuracy
8. Required Scale and Speed and Accuracy
Want 20 million queries against 25 million references in 10,000 s
Should be able to search > 100 million references
Should be linearly and horizontally scalable
Must have >50% overlap against reference search
Evaluation by sub-sampling is viable, but tricky
9. How Hard is That?
20 M × 25 M × 100 flops ≈ 50 Pflops of work
1 CPU ≈ 5 Gflops/s
That is 10 M CPU-seconds => ~1,000 CPUs to finish in the 10,000 s budget
Real-world efficiency losses may increase that by 10x
Not good!
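The same estimate as a runnable back-of-the-envelope check (values copied from the bullets above; this is a sketch of the arithmetic, not a benchmark):

queries = 20e6            # query vectors
references = 25e6         # reference vectors
flops_per_pair = 100      # roughly 100 floating point ops per distance
total_flops = queries * references * flops_per_pair   # 5e16 = 50 Pflops

cpu_rate = 5e9            # ~5 Gflops/s per CPU
cpu_seconds = total_flops / cpu_rate                   # 1e7 = 10 M CPU-seconds
budget = 10_000                                        # seconds allowed
print(cpu_seconds / budget)                            # ~1,000 CPUs before losses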
10. How Can We Search Faster?
First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
Second rule: don’t do it
– We can convert big floating point math to clever bit-wise integer math
– Locality sensitive hashing
Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension
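A minimal sketch of projection search with a single random direction. Real uses combine several projections and merge the candidate lists; the window size and function name here are arbitrary illustration choices.

import numpy as np

def projection_search(query, data, n_candidates=100):
    # Order the data by a 1-d random projection and compute exact distances only
    # for the points whose projections land near the query's projection.
    u = np.random.randn(data.shape[1])
    u /= np.linalg.norm(u)                           # random unit direction
    proj = data @ u
    order = np.argsort(proj)
    pos = np.searchsorted(proj[order], query @ u)
    lo = max(0, pos - n_candidates // 2)
    window = order[lo:lo + n_candidates]             # candidates near the query
    dist = np.linalg.norm(data[window] - query, axis=1)
    return window[np.argmin(dist)]                   # best of the examined candidates

data = np.random.randn(100_000, 10)
print(projection_search(np.random.randn(10), data))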
11. Note the Circularity
Clustering helps nearest neighbor search
But clustering needs nearest neighbor search internally
How droll !
14. LSH Search
Each random projection produces independent sign bit
If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
Distance in L2 is closely related to cosine
‖x − y‖² = ‖x‖² − 2(x · y) + ‖y‖²
         = ‖x‖² − 2‖x‖‖y‖ cos θ + ‖y‖²
We can replace (some) vector dot products with long integer XOR
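A small sketch of the sign-bit trick, with 64 hypothetical random projections; the cosine estimate uses the standard fact that the probability of a sign mismatch under random-hyperplane hashing is θ/π.

import numpy as np

def lsh_signature(x, projections):
    # One sign bit per random projection, packed into a single integer,
    # so comparing two vectors becomes XOR plus a bit count.
    sig = 0
    for bit in (projections @ x) > 0:
        sig = (sig << 1) | int(bit)
    return sig

def estimated_cosine(sig_a, sig_b, n_bits=64):
    mismatches = bin(sig_a ^ sig_b).count("1")   # Hamming distance via XOR
    return np.cos(np.pi * mismatches / n_bits)   # mismatch fraction estimates θ/π

projections = np.random.randn(64, 10)
x, y = np.random.randn(10), np.random.randn(10)
print(estimated_cosine(lsh_signature(x, projections), lsh_signature(y, projections)),
      x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))   # exact cosine, for comparison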
15. LSH Bit-match Versus Cosine
[Chart: cosine of the angle between two vectors (y axis, −1 to 1) versus the number of matching LSH sign bits out of 64 (x axis, 0 to 64)]
17. K-means Search
First do clustering with lots (thousands) of clusters
Then search nearest clusters to find nearest points
We win if we find >50% overlap with “true” answer
We lose if we can’t cluster super-fast
– more on this later
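A sketch of the idea, using scikit-learn's KMeans as a stand-in for the fast clustering discussed later; the cluster count and the number of probed clusters are arbitrary illustration values.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_search(query, data, centers, labels, n_probe=5, k=50):
    # Probe only the few clusters whose centroids are nearest the query,
    # then compute exact distances on just those clusters' members.
    probe = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_probe]
    candidates = np.where(np.isin(labels, probe))[0]
    dist = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dist)[:k]]

data = np.random.randn(100_000, 10)
km = KMeans(n_clusters=200, n_init=3).fit(data)      # lots of clusters, built once
hits = kmeans_search(np.random.randn(10), data, km.cluster_centers_, km.labels_)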
20. Some Details
Clumpy data works better
– Real data is clumpy
Speedups of 100-200x seem practical with 50% overlap
– Projection search and LSH can be used to accelerate that (some)
More experiments needed
Definitely need fast search
21. So Now Some Clustering
22. Lloyd’s Algorithm
Part of CS folk-lore
Developed in the late 1950s for signal quantization, published in the 1980s
initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters
Highly variable quality, several restarts recommended
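A minimal runnable version of the loop above (plain Lloyd's in numpy, random initialization); the nested loops are where the O(k d) work per point per iteration comes from.

import numpy as np

def lloyd(data, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]  # initialize "somehow"
    for _ in range(iterations):
        # assign each point to its nearest centroid
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # recompute cluster centroids from the points assigned to each cluster
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assign

centroids, assign = lloyd(np.random.randn(5_000, 10), k=20)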
23. Ball k-means
Provably better for highly clusterable data
Tries to find initial centroids in the “core” of real clusters
Avoids outliers in centroid computation
initialize centroids randomly with distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute each centroid using only points much closer to it than to any other centroid
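A hedged sketch of the trimmed update in the loop above. The trim ratio and the exact "much closer" test are assumptions for illustration, not the constants of the published algorithm.

import numpy as np

def ball_kmeans_step(data, centroids, trim=0.5):
    # Assign points, then recompute each centroid using only the points that are
    # much closer to it than to the runner-up centroid, so outliers and points on
    # cluster boundaries do not drag the centroid around.
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    rows = np.arange(len(data))
    d1 = dist[rows, nearest]                  # distance to own centroid
    dist[rows, nearest] = np.inf
    d2 = dist.min(axis=1)                     # distance to runner-up centroid
    core = d1 <= trim * d2                    # "much closer" -- assumed rule
    updated = centroids.copy()
    for j in range(len(centroids)):
        members = data[core & (nearest == j)]
        if len(members):
            updated[j] = members.mean(axis=0)
    return updated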
24. Surrogate Method
Start with sloppy clustering into κ = k log n clusters
Use these clusters as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably high quality for highly clusterable data
Sloppy clustering can be done on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time
25. Algorithm Costs
O(k d log n) per point for Lloyd’s algorithm
… not so good for k = 2000, n = 10⁸
Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
– result consists of k high-quality centroids
This is a big deal:
– Lloyd’s: k d log n ≈ 2000 × 10 × 26 ≈ 520,000 operations per point
– surrogate: d (log k + log log n) ≈ 10 × (11 + 5) ≈ 160 operations per point
– roughly 3,000 times faster makes the grade as a bona fide big deal
26. The Internals
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
27. How It Works
For each point
– find the approximately nearest centroid (call the distance d)
– if d > threshold, the point becomes a new centroid
– else, with some probability that grows with d, it may still become a new centroid
– otherwise add it to the nearest centroid
If the number of centroids grows past κ ≈ k log n
– recursively cluster the centroids with a higher threshold
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
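A self-contained numpy sketch of that loop. The threshold growth factor and the exact "possibly a new centroid" probability are illustrative guesses, not the constants used in the Mahout StreamingKmeans code.

import numpy as np

def streaming_cluster(data, kappa, weights=None, threshold=1.0, growth=1.5, seed=0):
    # Single pass: each point either joins its nearest centroid or starts a new one
    # (always when farther than the threshold, sometimes at random in proportion to
    # its distance).  Weights count how much data each centroid has absorbed.  When
    # too many centroids pile up, raise the threshold and collapse them recursively.
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(data))
    centroids, cw = [data[0].astype(float)], [float(weights[0])]
    for x, w in zip(data[1:], weights[1:]):
        d = np.linalg.norm(np.array(centroids) - x, axis=1)
        j = int(d.argmin())
        if d[j] > threshold or rng.random() < d[j] / threshold:
            centroids.append(x.astype(float))            # start a new centroid
            cw.append(float(w))
        else:
            centroids[j] = (cw[j] * centroids[j] + w * x) / (cw[j] + w)
            cw[j] += w
        if len(centroids) > kappa:
            threshold *= growth                           # raise the bar, then merge
            c, new_w = streaming_cluster(np.array(centroids), kappa,
                                         np.array(cw), threshold, growth, seed)
            centroids, cw = list(c), list(new_w)
    return np.array(centroids), np.array(cw)

centroids, weights = streaming_cluster(np.random.randn(10_000, 10), kappa=200)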
28. Parallel Speedup?
[Chart: time per point (μs) versus number of threads (1–16) for the non-threaded and threaded versions, with a perfect-scaling reference line; the threaded version tracks the perfect-scaling line closely]
29. What About Map-Reduce
Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
Threaded version shows linear speedup already
– Map-reduce speedup is likely, not entirely guaranteed
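A sketch of the shape of that job, with scikit-learn's MiniBatchKMeans standing in for the sloppy streaming pass and a weighted KMeans standing in for the ball k-means step; names and sizes are illustrative, not the Mahout code.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

def map_phase(split, kappa):
    # Per-split surrogate: kappa sloppy centroids, weighted by how many points
    # each centroid absorbed.
    mbk = MiniBatchKMeans(n_clusters=kappa, n_init=3).fit(split)
    return mbk.cluster_centers_, np.bincount(mbk.labels_, minlength=kappa)

def reduce_phase(partials, k):
    # The total surrogate is the union of the partial surrogates; the final
    # high-quality clustering runs in memory on the weighted centroids.
    centroids = np.vstack([c for c, _ in partials])
    weights = np.concatenate([w for _, w in partials])
    return KMeans(n_clusters=k, n_init=5).fit(centroids, sample_weight=weights)

splits = [np.random.randn(50_000, 10) for _ in range(4)]   # pretend input splits
model = reduce_phase([map_phase(s, kappa=500) for s in splits], k=20)
print(model.cluster_centers_.shape)                         # (20, 10)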
30. How Well Does it Work?
Theoretical guarantees for well clusterable data
– Shindler, Wong and Meyerson, NIPS, 2011
Evaluation on held-out data
– Need results here
31. Summary
Nearest neighbor algorithms can be blazing fast
But you need blazing fast clustering
– Which we now have
32. Contact Us!
We’re hiring at MapR in California
Contact Ted at tdunning@maprtech.com or @ted_dunning
For slides and other info
http://www.mapr.com/company/events/speaking/la-hug-9-25-12
Editor's Notes
– The sub-bullets are just for reference and should be deleted later.
– The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn't green or cyan, and probably is a bit more likely to be red than purple.
– This slide is red to indicate missing data.