This presentation is part of my work for the course 'Big Data Analytics Projects' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
Distributed streaming k-means
2. Clustering
Group a set of objects
Objects in the same group should be similar
Each group has a representative point called its centre
Minimise the distance from each object to the centre of its group
Unsupervised learning:
Unlabelled data
No training data
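The goal of minimising the distance to the central point is usually written as the k-means cost. A sketch of that objective follows; the symbols X (the data set), C (the set of k centres) and \phi are notation assumed here, not taken from the slides:

```latex
\phi_X(C) = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2
```

Each point contributes the squared distance to its closest centre, so a lower cost means tighter groups.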
3. Lloyd’s k-means algorithm
Centres ← Randomly pick k points
Iterate:
Assign each point to the closest centre
Calculate the new centre points: centroids of each cluster
Problems:
Every iteration scans the whole list of points -> not suitable for vast amounts of data.
Sensitive to bad initialization: random starting centres can lead to a poor local optimum.
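The two steps of Lloyd's algorithm above can be sketched in a few lines. This is a minimal pure-Python illustration, not a tuned implementation; the function names `sq_dist` and `lloyd`, the iteration cap and the fixed seed are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # Centres <- randomly pick k points
    for _ in range(iterations):
        # Assignment step: every point goes to its closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[j].append(p)
        # Update step: each centre becomes the centroid of its cluster.
        new_centres = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centres.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:  # nothing moved: converged
            break
        centres = new_centres
    return centres

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres = lloyd(points, k=2)
```

Note that the assignment step touches every point in every iteration, which is the scaling problem listed above.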
4. K-means++
Centres ← Randomly pick ONE point from X
Until we have enough centres:
Choose from X the next centre p with probability
$\frac{D(p)^2}{\sum_{x \in X} D(x)^2}$
where D(x) is the distance from x to its closest centre chosen so far
The probability increases with the distance to the closest centre.
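A minimal sketch of this seeding rule, assuming points are tuples of floats; the names `kmeans_pp_seed` and `sq_dist` are chosen for the example. It only picks the initial centres; Lloyd's iterations would normally follow.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seed(points, k, seed=0):
    rng = random.Random(seed)
    centres = [rng.choice(points)]  # first centre: uniform at random
    # d2[i] = squared distance from points[i] to its closest chosen centre
    d2 = [sq_dist(p, centres[0]) for p in points]
    while len(centres) < k:
        # Sample the next centre with probability proportional to D^2.
        # random.choices raises ValueError if every weight is zero,
        # i.e. if k exceeds the number of distinct points.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centres.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centres

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
seeds = kmeans_pp_seed(points, k=3)
```

An already chosen point has weight zero, so it cannot be drawn again.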
5. K-means#
Centres ← Randomly pick 3 log k points from X
Until we have enough centres:
Choose from X the next 3 log k centres, each point p with probability
$\frac{D(p)^2}{\sum_{x \in X} D(x)^2}$
where D(x) is the distance from x to its closest centre chosen so far
It improves the coverage of the clusters of the optimal solution.
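A sketch of the batch seeding described above. Several details are assumptions, since the slide does not state them: the logarithm is taken base 2, the batch size is rounded up, and "enough centres" is read as k rounds in total (so roughly 3 k log k centres are returned). The function names are chosen for the example.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sharp_seed(points, k, seed=0):
    rng = random.Random(seed)
    # Batch size 3 log k (base 2 and rounding up are assumptions).
    batch = max(1, math.ceil(3 * math.log2(k))) if k > 1 else 1
    # First round: pick the batch uniformly at random.
    centres = rng.choices(points, k=batch)
    d2 = [min(sq_dist(p, c) for c in centres) for p in points]
    # Remaining k - 1 rounds: pick each batch with probability proportional to D^2.
    for _ in range(k - 1):
        if sum(d2) == 0:  # every point already coincides with a centre
            break
        new = rng.choices(points, weights=d2, k=batch)
        centres.extend(new)
        d2 = [min(old, min(sq_dist(p, c) for c in new)) for old, p in zip(d2, points)]
    return centres

points = [(float(x), float(y)) for x in range(10) for y in range(10)]
seeds = kmeans_sharp_seed(points, k=4)
```

Picking more than k centres is what buys the better coverage: each optimal cluster is more likely to receive at least one centre.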
7. Fast streaming k-means
One pass over the points, selecting those that are far away from the centres already selected
When there is not enough space, we remove those centres that are less interesting
Finally, we run Lloyd’s algorithm on the centres using the weights
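The one-pass phase can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the referenced papers: the probability rule `weight * d2 / cutoff`, the `growth` factor, and the choice to shrink the sketch by merging nearby centres (as a stand-in for removing the "less interesting" ones) are all assumptions, as are the function names.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def absorb(centres, point, weight, cutoff, rng):
    """Add one weighted point: open a new centre or merge into the nearest one."""
    if not centres:
        centres.append([point, weight])
        return
    j = min(range(len(centres)), key=lambda i: sq_dist(point, centres[i][0]))
    d2 = sq_dist(point, centres[j][0])
    # Far-away points are likely to become centres (assumed probability rule).
    if rng.random() < min(1.0, weight * d2 / cutoff):
        centres.append([point, weight])
    else:
        c, cw = centres[j]
        total = cw + weight
        merged = tuple((cw * ci + weight * pi) / total for ci, pi in zip(c, point))
        centres[j] = [merged, total]

def streaming_kmeans(points, max_centres, cutoff=1.0, growth=2.0, seed=0):
    if max_centres < 1:
        raise ValueError("max_centres must be at least 1")
    rng = random.Random(seed)
    centres = []
    for p in points:  # single pass over the data
        absorb(centres, p, 1, cutoff, rng)
        # Not enough space: raise the cutoff and re-stream the centres,
        # which merges centres that are close to each other.
        while len(centres) > max_centres:
            cutoff *= growth
            old, centres = centres, []
            for c, w in old:
                absorb(centres, c, w, cutoff, rng)
    return [(c, w) for c, w in centres]

data_rng = random.Random(1)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(200)]
data_rng.shuffle(points)
sketch = streaming_kmeans(points, max_centres=20)
```

The result is a small list of (centre, weight) pairs whose weights sum to the number of input points; a weighted run of Lloyd's algorithm on that list would be the final step.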
9. Basic Method
Step 1: Single-pass k-means (explained before)
Output: Not-so-good clustering but a good candidate
Step 2: Cluster the weighted centers/facilities from Step 1
Output: Good clustering with fewer clusters
Finding the nearest neighbor: the most time-consuming step
NN search based on random projection: simple
Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements [1]
Empirically, projection search is a bit better than 64-bit LSH [4]
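To illustrate the idea of nearest-neighbor search by random projection, here is a minimal sketch. It is not the Compact Projection method of [1] and not a specific library's implementation; the class name `ProjectionSearch` and the parameters `n_projections` and `window` are hypothetical. Points are sorted by their projection onto a few random directions, and only points that land near the query in one of those orderings are compared exactly.

```python
import bisect
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ProjectionSearch:
    def __init__(self, points, n_projections=3, window=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.window = window
        self.tables = []
        for _ in range(n_projections):
            direction = [rng.gauss(0, 1) for _ in range(dim)]
            keyed = sorted((dot(direction, p), i) for i, p in enumerate(points))
            keys = [key for key, _ in keyed]
            ids = [i for _, i in keyed]
            self.tables.append((direction, keys, ids))

    def nearest(self, query):
        """Return the index of an approximate nearest neighbor of query."""
        candidates = set()
        for direction, keys, ids in self.tables:
            pos = bisect.bisect_left(keys, dot(direction, query))
            lo = max(0, pos - self.window)
            hi = min(len(ids), pos + self.window)
            candidates.update(ids[lo:hi])
        # Exact distances are computed only for the few candidates.
        return min(candidates, key=lambda i: sq_dist(query, self.points[i]))

data_rng = random.Random(1)
points = [tuple(data_rng.random() for _ in range(5)) for _ in range(200)]
index = ProjectionSearch(points)
hit = index.nearest(points[7])
```

The answer is approximate: a larger `window` or more projections trade speed for accuracy, and a window covering all points degenerates into brute-force search.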
10. Scaling
Map:
Roughly cluster the input data using streaming k-means
Output: weighted centers (each cluster's center and the number of points it contains)
Reduce:
All centers are passed to a single reducer
Apply batch k-means, or one-pass again if there are too many centers
A combiner can be used but is not necessary
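The map and reduce phases above can be sketched in plain Python, without any MapReduce framework. Everything here is an assumption made for illustration: `map_chunk` uses a simple deterministic distance threshold as a stand-in for streaming k-means, and `reduce_centres` is a weighted version of batch k-means seeded with the heaviest centers.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_chunk(chunk, cutoff):
    """Map: roughly cluster one chunk, emit (center, weight) pairs."""
    centers = []  # list of [center, weight]
    for p in chunk:
        if centers:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i][0]))
            if sq_dist(p, centers[j][0]) <= cutoff:
                c, w = centers[j]
                # Merge the point into the running mean of its center.
                centers[j] = [tuple((w * ci + pi) / (w + 1) for ci, pi in zip(c, p)), w + 1]
                continue
        centers.append([p, 1])
    return [(c, w) for c, w in centers]

def reduce_centres(weighted, k, iterations=20):
    """Reduce: weighted batch k-means over all emitted centers."""
    # Seed with the k heaviest centers (an assumed, deterministic choice).
    centers = [c for c, _ in sorted(weighted, key=lambda cw: -cw[1])[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for c, w in weighted:
            j = min(range(k), key=lambda i: sq_dist(c, centers[i]))
            groups[j].append((c, w))
        new = []
        for j, members in enumerate(groups):
            total = sum(w for _, w in members)
            if total == 0:
                new.append(centers[j])  # keep the center of an empty group
                continue
            dim = len(members[0][0])
            new.append(tuple(sum(w * c[d] for c, w in members) / total for d in range(dim)))
        if new == centers:
            break
        centers = new
    return centers

chunks = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)],
    [(1.0, 0.0), (1.0, 1.0), (11.0, 10.0), (11.0, 11.0)],
]
emitted = [pair for chunk in chunks for pair in map_chunk(chunk, cutoff=9.0)]
final = reduce_centres(emitted, k=2)
```

Because the reducer weights each center by the number of points it contains, the result matches what clustering the raw points would favour, while the reducer only ever sees a handful of centers per mapper.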
12. References
[1] Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements by Kerui Min et al.
[2] Fast and Accurate k-means for large datasets by Shindler et al.
[3] Streaming k-Means Approximation by Jaiswal et al.
[4] Large Scale Single pass k-Means Clustering at Scale by Ted Dunning
[5] Apache Mahout