Paris Data Geeks

Practical Machine Learning
with Mahout

whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
• Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning

Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering

What Works at Scale
• Logging
• Counting
• Session grouping

What Works at Scale
• Logging
• Counting
• Really. Don’t bet on anything much more
complex than these
• These are harder than they look

Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
bought y”
• But soooo much more is possible

Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
(Musicmatch)
• Internet video watchers watching >30 s

Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
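A minimal Python sketch of the three views above, all built from one interaction log. The log contents and the function names are invented for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: (actor, item) pairs, repeats allowed.
log = [("alice", "book"), ("alice", "lamp"), ("bob", "book"),
       ("alice", "book"), ("carol", "lamp")]

def functional_view(log):
    """Functional: actor -> item* (the items each actor touched, in log order)."""
    view = defaultdict(list)
    for actor, item in log:
        view[actor].append(item)
    return dict(view)

def relational_view(log):
    """Relational: the set of (actor, item) pairs, a subset of Actors x Items."""
    return set(log)

def matrix_view(log):
    """Matrix: rows indexed by actor, columns by item, value = interaction count."""
    counts = Counter(log)
    actors = sorted({actor for actor, _ in log})
    items = sorted({item for _, item in log})
    matrix = [[counts[(a, i)] for i in items] for a in actors]
    return actors, items, matrix

actors, items, matrix = matrix_view(log)
```

The zeros in `matrix` are the missing observations that a recommender tries to predict.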

Recommendations Analysis
• R(x,y) = # people who bought x and also bought y
-- one row per distinct (user, item) pair, self-joined on user
select A.item_id as x, B.item_id as y, count(*) as r
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id

$R_{ij} = \sum_u A_{ui} B_{uj} = (A^T B)_{ij}$
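A small sketch, assuming binary user-by-item matrices stored as lists of rows, showing that the matrix product and the join-style count give the same numbers. The function names and the example matrix are illustrative.

```python
from collections import Counter
from itertools import product

def cooccurrence(A, B):
    """Return A^T B for user-by-item matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(a_col, b_col)) for b_col in zip(*B)]
            for a_col in zip(*A)]

def cooccurrence_by_join(A, B):
    """Same counts computed the way a self-join does: pair up items per user."""
    counts = Counter()
    for a_row, b_row in zip(A, B):
        xs = [i for i, v in enumerate(a_row) if v]
        ys = [j for j, v in enumerate(b_row) if v]
        counts.update(product(xs, ys))
    return counts

# Three users, three items; 1 means the user bought the item.
A = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]]
R = cooccurrence(A, A)
joined = cooccurrence_by_join(A, A)
```

Here `R[i][j]` equals `joined[(i, j)]` for every item pair, and the diagonal holds the per-item purchase counts.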

Fundamental Algorithmic Structure
• Cooccurrence
– $K = A^T A$
• Matrix approximation by factoring
– $A \approx U S V^T$
– $K \approx V S^2 V^T$
– $r = V S^2 V^T h$
• LLR
– $r = \mathrm{sparsify}(A^T A)\, h$
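A hedged sketch of the LLR route: score each item pair with the log-likelihood ratio test on its 2x2 contingency table, keep only pairs above a cutoff, and multiply the resulting indicator matrix by a user history h. The cutoff value and the function names `sparsify` and `recommend` are assumptions for illustration, not the Mahout API.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def sparsify(A, threshold):
    """Indicator matrix: 1 where the cooccurrence of items i and j is anomalous.

    LLR is two-sided; this sketch does not filter out negative associations.
    """
    n_users = len(A)
    cols = [list(c) for c in zip(*A)]
    totals = [sum(c) for c in cols]
    n_items = len(cols)
    indicators = [[0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            k11 = sum(a * b for a, b in zip(cols[i], cols[j]))  # both
            k12 = totals[i] - k11                               # i without j
            k21 = totals[j] - k11                               # j without i
            k22 = n_users - totals[i] - totals[j] + k11         # neither
            if llr(k11, k12, k21, k22) > threshold:
                indicators[i][j] = 1
    return indicators

def recommend(indicators, h):
    """r = sparsify(A^T A) h for a user history vector h."""
    return [sum(k * x for k, x in zip(row, h)) for row in indicators]

# Items 0 and 1 are always bought together; item 2 is independent of both.
A = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 1, 0],
     [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
indicators = sparsify(A, threshold=5.0)
r = recommend(indicators, [1, 0, 0])
```

A user whose history contains only item 0 gets item 1 recommended and nothing else, because the pair (0, 2) cooccurs exactly as often as chance predicts.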

But Wait!
• Cooccurrence
– $K = A^T A$
• Cross occurrence
– $K = B^T A$

For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”

The punch-line
• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
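A toy sketch of the cross-occurrence idea, assuming A is a binary user-by-query matrix and B a binary user-by-video matrix over the same users. The query names, video names and helper functions are made up for the example.

```python
def cross_occurrence(B, A):
    """Return B^T A: rows are videos (items of B), columns are queries (items of A)."""
    return [[sum(b * a for b, a in zip(b_col, a_col)) for a_col in zip(*A)]
            for b_col in zip(*B)]

def videos_for_query(K, queries, videos, query):
    """Rank videos by how many users both issued the query and viewed the video."""
    q = queries.index(query)
    scored = [(K[v][q], videos[v]) for v in range(len(videos)) if K[v][q] > 0]
    return [name for score, name in sorted(scored, key=lambda s: -s[0])]

queries = ["flamenco", "metal"]
videos = ["guitar", "drums", "dance"]
A = [[1, 0],        # user 0 searched "flamenco"
     [1, 0],        # user 1 searched "flamenco"
     [0, 1]]        # user 2 searched "metal"
B = [[1, 0, 1],     # user 0 watched guitar and dance
     [1, 0, 0],     # user 1 watched guitar
     [0, 1, 0]]     # user 2 watched drums
K = cross_occurrence(B, A)
ranked = videos_for_query(K, queries, videos, "flamenco")
```

No video title or description is consulted anywhere: the ranking comes purely from user behavior. In practice the raw counts would be sparsified (for example with LLR) before use.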

Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever
users think they should be

What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue

Diagonalized Cluster Proximity

Clusters as Distribution Surrogate

For Example
Grouping these two clusters seriously hurts squared distance
$D_4^2(X) > \frac{1}{\sigma^2} D_5^2(X)$

Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd's algorithm
Result is that these two clusters get glued together

Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own
    centroid than to the closest other cluster
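The pseudocode above can be sketched in plain Python as follows. This is an illustrative reading, not the Mahout implementation: the k-means++ style seeding stands in for "distance maximizing tendency", and the `trim` fraction that defines "much closer" is an assumed parameter.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def seed_centroids(points, k, rng):
    """Pick seeds at random, favoring points far from the seeds chosen so far.

    Assumes at least k distinct points.
    """
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, centroids, iterations=3, trim=0.4):
    """A few Lloyd-style passes that average only the core of each cluster."""
    centroids = list(centroids)
    for _ in range(iterations):
        balls = [[] for _ in centroids]
        for p in points:
            dists = [sq_dist(p, c) for c in centroids]
            nearest = min(range(len(centroids)), key=dists.__getitem__)
            # Squared distance from the nearest centroid to its closest neighbor.
            gap = min((sq_dist(centroids[nearest], c)
                       for j, c in enumerate(centroids) if j != nearest),
                      default=float("inf"))
            # Keep the point only if it lies well inside its own cluster.
            if dists[nearest] <= trim ** 2 * gap:
                balls[nearest].append(p)
        # An empty ball leaves its centroid where it was.
        centroids = [mean(b) if b else c for b, c in zip(balls, centroids)]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]  # the last point is an outlier between the two clusters
result = ball_kmeans(points, [(0, 0), (11, 11)])
```

With the outlier excluded from both balls, the centroids settle on the true cluster centers; an ordinary mean over the first cluster plus the outlier would land at (1.4, 1.4) instead of (0.5, 0.5).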

Still Not a Win
• Ball k-means seeding is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(n k d + k^3 d) time
• But for big data, k gets large

Surrogate Method
• Start with sloppy clustering into lots of
clusters
κ = k log n clusters
• Use this sketch as a weighted surrogate for the
data
• Results are provably good for highly
clusterable data

Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
  O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
  O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000 (logs base 2)
– k d log n ≈ 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) ≈ 10 (11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
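The arithmetic can be checked with a few lines of Python. This sketch assumes base-2 logarithms and n = 10^8, the value for which log n is about 26 as on the slide; the function name is illustrative.

```python
import math

def sketch_speedup(k, d, n):
    """Per-point cost of brute-force search over k log n centroids versus
    the fast approximate search, and the ratio between the two."""
    brute = k * d * math.log2(n)
    fast = d * (math.log2(k) + math.log2(math.log2(n)))
    return brute, fast, brute / fast

brute, fast, ratio = sketch_speedup(k=2000, d=10, n=10**8)
print(f"brute = {brute:,.0f}, fast = {fast:.0f}, speedup = {ratio:,.0f}x")
```

The exact figures differ slightly from the rounded ones on the slide, but the ratio stays in the low thousands.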

How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new centroid, with u drawn uniformly from [0, 1)
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
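A compact sketch of this single-pass loop, under stated simplifications: the nearest centroid is found by exact linear scan rather than approximate search, a far point starts a new centroid with probability d/threshold, and the threshold grows by an assumed factor `growth` whenever the sketch is re-clustered. All names are illustrative, not the Mahout API.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def add_point(centroids, point, weight, threshold, rng):
    """Either start a new centroid at `point` or fold it into the nearest one."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    nearest = min(centroids, key=lambda c: distance(c[0], point))
    d = distance(nearest[0], point)
    # Far points always start a centroid; closer ones do so with probability d/threshold.
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        center, w = nearest
        total = w + weight
        nearest[0] = [(c * w + x * weight) / total for c, x in zip(center, point)]
        nearest[1] = total

def streaming_sketch(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over the data, producing weighted centroids. Requires threshold > 0."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        # Too many centroids: raise the threshold and cluster the centroids themselves.
        while len(centroids) > max_centroids:
            threshold *= growth
            previous, centroids = centroids, []
            for center, w in previous:
                add_point(centroids, center, w, threshold, rng)
    return centroids

points = [(float(i % 5), float(i // 5)) for i in range(100)]
sketch = streaming_sketch(points, max_centroids=10, threshold=0.5)
```

Merging keeps both the total weight and the weighted mean of the data intact, which is what lets the sketch stand in for the original points in a later, higher-quality clustering step.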

But Wait, …
• Finding nearest centroid is inner loop
• This could take O( d κ ) per point and κ can be
big
• Happily, approximate nearest centroid works
fine

Projection Search
total ordering!
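One way to read this slide, stated here as an assumption: projecting every vector onto a random direction yields a one-dimensional total ordering, so candidates near a query can be found by binary search in the sorted projections and then checked with the true distance. The class below is a hypothetical sketch of that idea; the number of projections and the window size are made-up parameters.

```python
import bisect
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

class ProjectionSearch:
    def __init__(self, vectors, n_projections=4, window=4, seed=0):
        rng = random.Random(seed)
        self.vectors = list(vectors)
        self.window = window
        self.indexes = []
        dim = len(self.vectors[0])
        for _ in range(n_projections):
            direction = [rng.gauss(0, 1) for _ in range(dim)]
            # Sorting by the projected value gives a total ordering of the vectors.
            ordered = sorted((dot(direction, v), i)
                             for i, v in enumerate(self.vectors))
            self.indexes.append((direction, ordered))

    def nearest(self, query):
        """Index of the approximately nearest vector to `query`."""
        candidates = set()
        for direction, ordered in self.indexes:
            pos = bisect.bisect_left(ordered, (dot(direction, query),))
            lo = max(0, pos - self.window)
            candidates.update(i for _, i in ordered[lo:pos + self.window])
        # Only the candidates are compared with the true distance.
        return min(candidates, key=lambda i: sq_dist(self.vectors[i], query))

vectors = [(0, 0), (1, 0), (0, 1), (5, 5), (9, 9)]
search = ProjectionSearch(vectors, window=8)
found = search.nearest((4.6, 5.2))
```

With a window at least as large as the data set the search is exact; smaller windows trade accuracy for speed, since each projection costs one binary search plus a handful of distance checks.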

LSH Bit-match Versus Cosine
[Figure: LSH bit-match count (0 to 64) plotted against cosine (-1 to 1)]
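A sketch of how such a bit-match count could be produced, assuming the standard random-hyperplane form of LSH with 64-bit signatures. For that scheme the expected number of matching bits is 64 (1 - θ/π), where θ is the angle between the two vectors, so the count tracks the cosine. The helper names are illustrative.

```python
import random

def make_hasher(dim, n_bits=64, seed=0):
    """Return a function mapping a vector to an n_bits-bit signature."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(v):
        bits = 0
        for plane in planes:
            # One bit per hyperplane: which side of the plane the vector falls on.
            side = sum(p * x for p, x in zip(plane, v)) > 0
            bits = (bits << 1) | int(side)
        return bits

    return signature

def bit_match(a, b, n_bits=64):
    """Number of signature bits on which a and b agree."""
    return n_bits - bin(a ^ b).count("1")

signature = make_hasher(dim=3)
v = (1.0, 2.0, -0.5)
same = bit_match(signature(v), signature(v))
opposite = bit_match(signature(v), signature(tuple(-x for x in v)))
```

Identical directions agree on every bit and opposite directions on none; everything in between falls roughly along the line relating angle to match count.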

Parallel Speedup?
[Figure: time per point (μs) against number of threads for the threaded version, the non-threaded version, and perfect scaling]
✓

Quality
• Ball k-means implementation appears significantly
better than simple k-means
• Streaming k-means + ball k-means appears to be about
as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance
to nearest cluster
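The figure of merit is simple to compute. A minimal sketch, assuming held-out points and centroids are plain numeric tuples; the function name and the sample data are illustrative, not taken from the evaluation.

```python
import statistics

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from each held-out point
    to its nearest centroid (lower is better)."""
    nearest = [min(sq_dist(p, c) for c in centroids) for p in held_out]
    return statistics.mean(nearest), statistics.median(nearest)

held_out = [(0, 0), (1, 0), (3, 0)]
centroids = [(0, 0), (4, 0)]
mean_sq, median_sq = figure_of_merit(held_out, centroids)
```

Reporting the median next to the mean helps show whether a few distant points dominate the average.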

Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout
