4. BigML, Inc 4
What is Clustering?
• An unsupervised learning technique
• No labels necessary
• Useful for finding similar instances
• Smart sampling/labelling
• Finds “self-similar” groups of instances
• Customer: groups with similar behavior
• Medical: patients with similar diagnostic measurements
• Defines each group by a “centroid”
• Geometric center of the group
• Represents the “average” member
• Number of centroids (k) can be specified or determined
5. BigML, Inc 5
Cluster Centroids
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
6. BigML, Inc 6
Cluster Centroids
(Same transaction table as on the previous slide.)
Similar rows (the group):
• auth = pin
• amount ≈ $100
Different within the group:
• date: Mon != Wed
• customer: Sally != Bob
• account: 6788 != 3421
• class: clothes != gas
• zip: 26339 != 46140
Centroid of the group:
• date = Wed (2 out of 3)
• customer = Bob
• account = 3421
• auth = pin
• class = gas
• zip = 46140
• amount = $104 (the mean of $135, $94, and $83)
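As a concrete illustration, the centroid above can be reproduced with a few lines of Python. The `centroid` helper below (mode for categorical fields, mean for numeric fields) is illustrative only, not BigML's implementation.

```python
from collections import Counter
from statistics import mean

# The three "similar" rows from the table above.
group = [
    {"date": "Mon", "customer": "Bob",   "account": "3421", "auth": "pin", "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Wed", "customer": "Sally", "account": "6788", "auth": "pin", "class": "gas",     "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob",   "account": "3421", "auth": "pin", "class": "gas",     "zip": "46140", "amount": 83},
]

def centroid(instances):
    result = {}
    for field in instances[0]:
        values = [inst[field] for inst in instances]
        if all(isinstance(v, (int, float)) for v in values):
            result[field] = mean(values)                       # numeric field -> mean
        else:
            result[field] = Counter(values).most_common(1)[0][0]  # categorical field -> most common value
    return result

print(centroid(group))
# {'date': 'Wed', 'customer': 'Bob', 'account': '3421', 'auth': 'pin',
#  'class': 'gas', 'zip': '46140', 'amount': 104}
```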
7. BigML, Inc 7
Use Cases
• Customer segmentation
• Which customers are similar?
• Active learning
• Labelling unlabelled data efficiently
• Item discovery
• What other items are similar to this one?
8. BigML, Inc 8
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV based on in-game purchases.
• Assumption: usage correlates to LTV.
[Figure: example clusters containing 0%, 3%, and 1% high-LTV users]
9. BigML, Inc 9
Active Learning
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.
10. BigML, Inc 10
Active Learning
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud, or a million images which need to be labelled as cat/not-cat.
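A rough sketch of this cluster-then-sample idea, using scikit-learn as a stand-in for the clustering step (an assumption; the slides use BigML's clustering): fit k clusters, then send only the instance nearest each centroid for the expensive test.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Placeholder for the 768 patients' diagnostic measurements (random stand-in data).
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))

k = 10  # number of groups we can afford to test
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Index of the instance nearest each cluster center: label these patients first.
to_label = pairwise_distances_argmin(kmeans.cluster_centers_, X)
print("Test these patients first:", sorted(to_label))
```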
11. BigML, Inc 11
Item Discovery
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
• Dataset of 86 whiskies.
• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
[Figure: whiskies plotted along flavor dimensions such as Smoky and Fruity]
15. BigML, Inc 15
Human Expert
• Jesa used prior knowledge to select possible features that
separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc.
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
• The K=3 criterion was met
• Groups were sufficiently “distant”
• No crossover
16. BigML, Inc 16
Human Expert
• Aspect Ratio (Length / Width)
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert the ratio (so the longer dimension is always on top)
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences
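A minimal sketch of turning raw measurements into these two features; the field names and measurements below are made up for illustration.

```python
# Aspect ratio as described above: "less than 1 => invert" means we always
# report the longer side over the shorter one.
def aspect_ratio(length, width):
    ratio = length / width
    return ratio if ratio >= 1 else 1 / ratio

objects = {
    # name: (length, width, num_surfaces) -- illustrative numbers, not real data
    "penny": (1.9, 1.9, 3),
    "screw": (3.0, 0.5, 3),
    "box":   (20.0, 15.0, 6),
}

features = {name: (aspect_ratio(l, w), surfaces)
            for name, (l, w, surfaces) in objects.items()}
print(features)  # e.g. {'penny': (1.0, 3), 'screw': (6.0, 3), 'box': (1.33, 6)}
```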
18. BigML, Inc 18
Plot by Features
[Scatter plot: the objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) plotted by Number of Surfaces vs. Length / Width; K=3]
K-Means key insight: we can find clusters using distances in n-dimensional feature space.
19. BigML, Inc 19
Plot by Features
[Scatter plot: the same objects plotted by Number of Surfaces vs. Length / Width, with circles drawn around the groups]
K-Means: find the “best” (minimum distance) circles that include all points.
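The K-means procedure sketched on these two slides (pick centroids, assign points by distance, recompute, repeat) can be written compactly. The following is a minimal NumPy sketch with made-up feature values, not BigML's implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # naive random start
    for _ in range(n_iter):
        # Distance from every point to every centroid, then nearest-centroid assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        new_centroids = []
        for j in range(k):
            members = X[labels == j]
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.array(new_centroids)
    return centroids, labels

# Feature space from the slide: (length/width ratio, number of surfaces) -- toy values.
X = np.array([[1.0, 3], [1.1, 3], [1.0, 2],   # roughly round objects
              [6.0, 3], [4.0, 5], [5.0, 4],   # skinny objects
              [1.3, 6], [1.5, 6], [1.2, 6]])  # boxy objects
centroids, labels = kmeans(X, k=3)
print(labels)
```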
23. BigML, Inc 23
Starting Points
• Random points or instances in n-dimensional space
• Might start "too close"
• Risk of sub-optimal convergence
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• k++ (k-means++)
• the first point is chosen randomly from the instances
• each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center (sketched in code below)
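The k++ seeding rule above translates almost directly into code. A small NumPy sketch, for illustration only:

```python
import numpy as np

def kpp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: a uniformly random instance
    while len(centers) < k:
        # Squared distance from each instance to its closest already-chosen center.
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()            # far-away instances get high selection probability
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(100, 2))
print(kpp_init(X, k=3))
```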
24. BigML, Inc 24
K++ Initial Centers
[Figure: instances annotated with their probability of being chosen as the next center (low, high, highest); K=3]
25. BigML, Inc 25
K++ Initial Centers
[Figure: instances near the already-chosen centers now have low probability of being chosen; K=3]
28. BigML, Inc 28
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• How far is “red” from “green”?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown ideal number of clusters, “K”?
29. BigML, Inc 29
Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
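As a concrete illustration of these options, here is a minimal pandas sketch (pandas is an assumption; the slide does not prescribe a tool):

```python
import pandas as pd

# A numeric field with missing values.
df = pd.DataFrame({"amount": [135, None, 94, 83, None, 401]})

# The replacement strategies listed above.
strategies = {
    "maximum": df["amount"].max(),
    "mean":    df["amount"].mean(),
    "median":  df["amount"].median(),
    "minimum": df["amount"].min(),
    "zero":    0,
}
filled = {name: df["amount"].fillna(value) for name, value in strategies.items()}

# Or simply ignore instances with missing values.
complete_only = df.dropna()
```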
30. BigML, Inc 30
Distance to Categorical?
• Define a special distance function. For two instances x and y and a categorical field a:
• if x_a = y_a, then distance(x, y) = 0 (or the field scaling value)
• else distance(x, y) = 1
Approach: similar to “k-prototypes”
31. BigML, Inc 31
Distance to Categorical?
animal  favorite toy  toy color
cat     ball          red
cat     ball          green
per-field distances: d = 0, 0, 1

cat     laser         red
dog     squeaky       red
per-field distances: d = 1, 1, 0

Then compute the Euclidean distance over the per-field distances:
D = √(0² + 0² + 1²) = 1 for the first pair, D = √(1² + 1² + 0²) = √2 for the second.
Note: the centroid is assigned the most common category of the member instances.
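A few lines of Python make the example concrete (illustrative only, not BigML's code):

```python
from math import sqrt

# Per-field distance is 0 when the categories match and 1 when they differ;
# the overall distance is the Euclidean norm of the per-field distances.
def categorical_distance(x, y):
    per_field = [0 if xv == yv else 1 for xv, yv in zip(x, y)]
    return sqrt(sum(d * d for d in per_field))

a = ("cat", "ball", "red")
b = ("cat", "ball", "green")
c = ("cat", "laser", "red")
d = ("dog", "squeaky", "red")

print(categorical_distance(a, b))  # 1.0
print(categorical_distance(c, d))  # 1.414... (sqrt(2))
```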
32. BigML, Inc 32
Text Vectors
               "hippo"  "safari"  "zebra"  …
Text Field #1    1         0        1      …
Text Field #2    1         1        0      …
…                0         1        1      …
(features: thousands of terms)

• Cosine Similarity
• cos() between two vectors; ranges from -1 to 1
• 1 if collinear, 0 if orthogonal
• only positive vectors: 0 ≤ CS ≤ 1
• Cosine Distance = 1 - Cosine Similarity
• CD(Text Field #1, Text Field #2) = 0.5
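A minimal sketch of the cosine distance computation for the two term vectors above:

```python
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

text_field_1 = [1, 0, 1]  # counts for "hippo", "safari", "zebra"
text_field_2 = [1, 1, 0]

print(cosine_distance(text_field_1, text_field_2))  # 0.5
```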
38. BigML, Inc 38
Summary
• Cluster Purpose
• Unsupervised technique for finding self-similar groups of instances
• Number of centroids (k) can be given as input or computed
• Outputs list of centroids
• Configuration:
• Algorithm: K-means / G-means
• Cluster Parameter: k or critical value
• Default missing / Summary fields / Scales / Weights
• Model Clusters
• Centroid / Batch centroid
40. BigML, Inc 40
What is Topic Modeling?
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
41. BigML, Inc 41
What is Topic Modeling?
• Finds topics in your text fields
• A topic is a distribution over terms
• Terms with high probability in the same topic often occur together in the same document
• Topics often correspond to real-world things that the document may be “about” (e.g., sports, cooking, technology)
• Each document is “about” one or more topics
• Usually each document is only about one or two topics
• But in practice we assign a probability to every topic for every document
42. BigML, Inc 42
Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

"great": appears 4 times

1. Stem words -> tokens
2. Remove tokens that occur too often
3. Remove tokens that do not occur often enough
4. Count occurrences of remaining “interesting” tokens
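A rough sketch of these four steps applied to the quote above; the stemming rule and stop-word list are toy assumptions, not BigML's text analysis.

```python
import re
from collections import Counter

text = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")

tokens = re.findall(r"[a-z']+", text.lower())

def stem(token):
    # 1. Stem words -> tokens (toy rule: strip a common suffix)
    return token[:-4] if token.endswith("ness") else token

counts = Counter(stem(t) for t in tokens)

# 2 & 3. Remove tokens that occur too often (stop-word-like) or not often enough.
stopwords = {"be", "not", "of", "some", "are", "and", "have", "upon", "'em"}
interesting = {t: c for t, c in counts.items() if t not in stopwords and c >= 1}

# 4. Count occurrences of the remaining "interesting" tokens.
print(interesting["great"])  # 4
```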
43. BigML, Inc 43
Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

…  great  afraid  born  achieve  …
…    4      1      1       1     …

These token counts feed a model, which can use conditions such as:
• The token “great” occurs more than 3 times
• The token “afraid” occurs no more than once
46. BigML, Inc 46
Text Analysis vs. Topic Modeling
Text Analysis:
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• Co-occurrence limited to consecutive n-grams

Topic Model:
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Topics indicate broader co-occurrences
47. BigML, Inc 47
Generative Modeling
• Decision trees are discriminative models
• Aggressively model the classification boundary
• Parsimonious: Don’t consider anything you don’t have to
• Topic Models are generative models
• Come up with a theory of how the data is generated
• Tweak the theory to fit your data
Topic Modeling builds a model of how the text is generated
48. BigML, Inc 48
Generating Documents
[Figure: a word "machine" holding the full vocabulary (cat, shoe, zebra, ball, tree, jump, pen, asteroid, cable, box, step, cabinet, yellow, plate, flashlight, …) emits example documents such as "shoe asteroid flashlight pizza…", "plate giraffe purple jump…", and, very rarely, real text such as "Be not afraid of greatness: some are born great, some achieve greatness…"]
• "Machine" that generates a random word with equal
probability with each pull.
• Pull random number of times to generate a document.
• All documents can be generated, but most are nonsense.
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
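A tiny sketch of this word machine; the vocabulary below is illustrative.

```python
import random

vocabulary = ["cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
              "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight"]

def generate_document(rng, min_len=4, max_len=12):
    # Pull the machine a random number of times; every word is equally likely.
    length = rng.randint(min_len, max_len)
    return " ".join(rng.choice(vocabulary) for _ in range(length))

rng = random.Random(0)
print(generate_document(rng))  # mostly nonsense, e.g. "shoe asteroid flashlight plate ..."
```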
49. BigML, Inc 49
Topic Model
Intuition:
• Written documents have meaning; one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
Topic: travel
[Figure: the same word machine, now with travel-related words boosted: airplane, passport, pizza, …]
word       probability
travel     23.55%
airplane   2.33%
mars       0.003%
mantle     ϵ
…          ϵ

Topic: space
[Figure: the same word machine, now with space-related words boosted: mars, quasar, lightyear, soda, …]
word       probability
space      38.94%
airplane   ϵ
mars       13.43%
mantle     0.05%
…          ϵ
50. BigML, Inc 50
Topic Model
[Figure: the same word machine with k different weightings; each weighting generates documents such as "plate giraffe purple jump…" and "airplane passport pizza…"]

Topic "1"
word       probability
travel     23.55%
airplane   2.33%
mars       0.003%
mantle     ϵ
…          ϵ

Topic "2"
word       probability
space      38.94%
airplane   ϵ
mars       13.43%
mantle     0.05%
…          ϵ

…

Topic "k"
word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
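A hedged sketch of this generative picture, with made-up topics and probabilities; it illustrates the intuition only, not BigML's topic-model algorithm.

```python
import random

# Each topic is a distribution over terms (toy values).
topics = {
    "travel": {"travel": 0.4, "airplane": 0.3, "passport": 0.2, "mars": 0.1},
    "space":  {"space": 0.4, "mars": 0.3, "quasar": 0.2, "lightyear": 0.1},
}

def generate(doc_topic_mix, n_words, rng):
    # For each word: pick a topic from the document's topic mixture,
    # then pick a word from that topic's term distribution.
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic_mix), weights=doc_topic_mix.values())[0]
        term_dist = topics[topic]
        words.append(rng.choices(list(term_dist), weights=term_dist.values())[0])
    return " ".join(words)

rng = random.Random(0)
# A document that is 89% about space and 11% about travel.
print(generate({"space": 0.89, "travel": 0.11}, n_words=10, rng=rng))
```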
52. BigML, Inc 52
Topic Distribution
• Any given document is likely a mixture of the modeled topics
• This can be represented as a distribution of topic probabilities

Intuition: for the document "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?" the model assigns:
Topic: travel → 11%
Topic: space → 89%
(The per-topic term probabilities are the same as on the previous slides.)
54. BigML, Inc 54
Prediction?
Clustering: Unlabelled Data → Batch Centroid → a Centroid label for each row
Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → a probability for each topic (topic 1 … topic k) for each row
55. BigML, Inc 55
Topic Model Use Cases
• As a preprocessor for other techniques
• Building better models
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets
56. BigML, Inc 56
Topic Model Tips
• Setting k
• Much like k-means, the best value is data specific
• Too few will agglomerate unrelated topics; too many will partition highly related topics
• I tend to find the latter more annoying than the former
• Tuning the Model
• Remove common, useless terms (use term filters)
• Set term limit higher, use n-grams
• Mess with stop word removal, turn off stemming