VSSML16 L4. Association Discovery and Latent Dirichlet Allocation
Valencian Summer School in Machine Learning 2016
Day 1 VSSML16
Lecture 4
Association Discovery and Latent Dirichlet Allocation
Geoff Webb (Monash University) & Charles Parker (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
2. BigML, Inc 2
Association Discovery
Geoff Webb
Professor of Information Technology Research
Monash University, Melbourne, Australia
Finding interesting correlations
3. Association Discovery
• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection.
• Learning Task: Find “interesting” relations between variables.
4. Unsupervised Learning
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Clustering: find instances that are similar
Anomaly Detection: find instances that are unusual
5. Association Rules
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Rules:
Antecedent → Consequent
{customer = Bob, account = 3421} → zip = 46140
{class = gas} → amount < 100
6. Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors
8. Magnum Opus
What's wrong with frequent pattern mining?
• Feast or famine
  - often results in too few or too many patterns
• The vodka and caviar problem
  - some high value patterns are infrequent
• Cannot handle dense data
• Minimum support may not be relevant
  - cannot be low enough to capture all valid rules
  - cannot be high enough to exclude all spurious rules
10. Magnum Opus
Very high support patterns can be spurious
Data file: covtype.data 581012 cases / 125 values
ST15=0 → ST07=0
[Coverage=581009; Support=580904; Confidence=1.000]
ST07=0 → ST15=0
[Coverage=580907; Support=580904; Confidence=1.000]
ST15=0 → ST36=0
[Coverage=581009; Support=580890; Confidence=1.000]
ST36=0 → ST15=0
[Coverage=580893; Support=580890; Confidence=1.000]
ST15=0 → ST08=0
[Coverage=581009; Support=580830; Confidence=1.000]
ST08=0 → ST15=0
[Coverage=580833; Support=580830; Confidence=1.000]
… 197,183,686 such rules have highest support
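The counts on this slide are enough to check how informative the top rule actually is. A minimal sketch, assuming the usual definition of lift (confidence divided by the overall frequency of the consequent); the count of rows with ST07=0 is read from the coverage of the reverse rule, and the variable names are only illustrative.

```python
# Counts read off the slide for the rule ST15=0 -> ST07=0.
n_cases = 581012        # rows in covtype.data
n_antecedent = 581009   # coverage of ST15=0 -> ST07=0, i.e. rows with ST15=0
n_both = 580904         # support: rows with ST15=0 and ST07=0
n_consequent = 580907   # coverage of ST07=0 -> ST15=0, i.e. rows with ST07=0

confidence = n_both / n_antecedent
baseline = n_consequent / n_cases   # how often ST07=0 holds regardless of ST15
lift = confidence / baseline

print(f"confidence = {confidence:.6f}")
print(f"baseline   = {baseline:.6f}")
print(f"lift       = {lift:.9f}")
```

Confidence is close to 1 only because ST07=0 holds for almost every row anyway; the lift is essentially 1, so the antecedent tells us almost nothing new.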
11. Magnum Opus
• User selects measure of interest
• System finds the top-k associations on that measure within constraints
  - Must be statistically significant interaction between antecedent and consequent
  - Every item in the antecedent must increase the strength of association
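The last constraint and the top-k selection can be sketched in a few lines. This is a simplified stand-in, not Magnum Opus itself: the candidate rules and their lift values are made up, the helper names `is_productive` and `top_k` are hypothetical, and the statistical significance test is left out entirely.

```python
import heapq
from itertools import combinations

# Hypothetical candidate rules for a single consequent: antecedent -> lift.
candidates = {
    frozenset({"bread"}): 1.8,
    frozenset({"butter"}): 1.5,
    frozenset({"bread", "butter"}): 2.4,   # both items add strength
    frozenset({"bread", "napkins"}): 1.7,  # "napkins" adds nothing over "bread"
}


def is_productive(antecedent, scores):
    """True if the rule beats every candidate built from a proper, non-empty subset."""
    for size in range(1, len(antecedent)):
        for subset in combinations(antecedent, size):
            parent = scores.get(frozenset(subset))
            if parent is not None and scores[antecedent] <= parent:
                return False
    return True


def top_k(scores, k):
    """Return the k highest-scoring rules that pass the productivity filter."""
    kept = [(score, sorted(a)) for a, score in scores.items()
            if is_productive(a, scores)]
    return heapq.nlargest(k, kept)


for score, antecedent in top_k(candidates, k=3):
    print(antecedent, "-> consequent, lift", score)
```

The rule with "napkins" is dropped even though its lift is well above 1, because the extra item does not strengthen the simpler rule it extends.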
12. Association Metrics
Coverage
Percentage of instances which match antecedent “A”
[Figure: Venn diagram of sets A and C within all Instances]
13. Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”
[Figure: Venn diagram of sets A and C within all Instances]
14. Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Figure: Venn diagram of sets A and C within all Instances]
15. Association Metrics
Confidence
[Figure: Venn diagrams of A and C within Instances, arranged on a scale from 0% to 100%]
0%: A never implies C
Between 0% and 100%: A sometimes implies C
100%: A always implies C
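Coverage, support, and confidence can be checked by hand on the seven example transactions. A minimal sketch: the rows are transcribed from the transaction table on the earlier slides (only the columns the rules use), the two rules are the ones from the Association Rules slide, and the function name `metrics` is just illustrative.

```python
# Only the columns needed by the two example rules.
rows = [
    {"customer": "Bob",   "account": 3421, "class": "clothes", "zip": 46140, "amount": 135},
    {"customer": "Bob",   "account": 3421, "class": "food",    "zip": 46140, "amount": 401},
    {"customer": "Alice", "account": 2456, "class": "food",    "zip": 12222, "amount": 234},
    {"customer": "Sally", "account": 6788, "class": "gas",     "zip": 26339, "amount": 94},
    {"customer": "Bob",   "account": 3421, "class": "tech",    "zip": 21350, "amount": 2459},
    {"customer": "Bob",   "account": 3421, "class": "gas",     "zip": 46140, "amount": 83},
    {"customer": "Sally", "account": 6788, "class": "food",    "zip": 26339, "amount": 51},
]


def metrics(rows, antecedent, consequent):
    """Return (coverage, support, confidence) for the rule antecedent -> consequent."""
    matches_a = [r for r in rows if antecedent(r)]
    matches_both = [r for r in matches_a if consequent(r)]
    coverage = len(matches_a) / len(rows)
    support = len(matches_both) / len(rows)
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return coverage, support, confidence


# {class = gas} -> amount < 100
gas_rule = metrics(rows, lambda r: r["class"] == "gas", lambda r: r["amount"] < 100)

# {customer = Bob, account = 3421} -> zip = 46140
bob_rule = metrics(
    rows,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140,
)

print("gas rule (coverage, support, confidence):", gas_rule)
print("Bob rule (coverage, support, confidence):", bob_rule)
```

The gas rule holds for every row it covers, while the Bob rule misses the one purchase made in a different zip code.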
16. Association Metrics
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / (p(A) * p(C)) = Confidence / p(C)
[Figure: Venn diagrams comparing the Independent and Observed overlap of A and C]
17. Association Metrics
Lift
[Figure: Independent vs. Observed overlap of A and C for three cases]
Lift < 1: Negative Correlation
Lift = 1: No Association
Lift > 1: Positive Correlation
18. Association Metrics
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Figure: Venn diagrams comparing the Independent and Observed overlap of A and C]
19. Association Metrics
Leverage
[Figure: Independent vs. Observed overlap of A and C for three cases, on a scale starting at -1…]
Leverage < 0: Negative Correlation
Leverage = 0: No Association
Leverage > 0: Positive Correlation
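Lift and leverage compare the same two quantities, once as a ratio and once as a difference. A small sketch using counts for the rule {class = gas} → amount < 100 from the seven example transactions (2 gas rows, 3 rows under 100, 2 rows with both); the function name is only illustrative.

```python
def lift_and_leverage(n_total, n_a, n_c, n_both):
    """Compare observed support with the support expected under independence."""
    p_a = n_a / n_total
    p_c = n_c / n_total
    support = n_both / n_total
    expected = p_a * p_c          # support if A and C were independent
    return support / expected, support - expected


lift, leverage = lift_and_leverage(n_total=7, n_a=2, n_c=3, n_both=2)
print(f"lift = {lift:.3f}, leverage = {leverage:.3f}")
```

Lift above 1 and leverage above 0 both indicate a positive association. Leverage also reflects how many instances are involved, so a rare but strong rule can have high lift and small leverage.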
20. Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at
checkout
22. Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768
patients.
• Each patient labelled True/False for
diabetes.
24. Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes = TRUE
26. Outline
1 Understanding the Limits of Simple Text Analysis
2 Aside: Generative Processes
3 Latent Dirichlet Allocation
4 A Couple of Instructive Examples
5 Applications
#VSSML16 Latent Dirichlet Allocation September 2016 2 / 24
28. Bag of Words Analysis
• Easiest way of analyzing a text
field is just to treat it as a “bag
of words”
• Each word is a separate
feature (usually an occurrence
count)
• When modeling, the features
are treated in isolation from
one another, essentially “one
at a time”
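A bag of words is easy to build by hand. A minimal sketch using only the standard library; the tokenizer (lowercase, letters and apostrophes only) and the example sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each distinct word to its occurrence count, ignoring order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


doc = "The lead engineer will lead the review of the lead samples."
bag = bag_of_words(doc)
print(bag)
```

All three occurrences of "lead" land in the same count, although the sentence uses the word in more than one sense. Word order and context are gone.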
29. Limitations
• Words are sometimes
ambiguous
• Both because of multiple
definitions and difference in
tone
• How do we usually
disambiguate words? Context
30. An Instructive Example
• One way of looking at the usefulness of a machine learning
feature is to think about how well it isolates unique and coherent
subsets of the data
• Suppose I have a collection of documents where some of them
are about two different topics (via Ted Underwood’s Blog):
  - Leadership (CEOs, organization, management)
  - Chemistry (Elements, compounds, reactions)
• If I do a keyword search for “lead” (or try to classify documents
based on that word alone), I’ll get documents from either category
and documents that are a mix of both
• Can we build a feature that better isolates which set of documents
we’re looking for?
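The problem with the single keyword is easy to reproduce. A minimal sketch with four made-up documents; the document names, texts, and the `keyword_search` helper are all hypothetical.

```python
import re

# Hypothetical mini-collection: leadership, chemistry, a mix, and neither.
docs = {
    "memo": "The CEO will lead the management reorganization.",
    "lab_report": "Lead reacts with the compound to form lead oxide.",
    "safety_brief": "Management must lead the removal of lead paint.",
    "recipe": "Whisk the eggs and fold in the flour.",
}


def keyword_search(docs, keyword):
    """Return the names of documents containing the keyword as a whole word."""
    keyword = keyword.lower()
    return [name for name, text in docs.items()
            if keyword in re.findall(r"[a-z]+", text.lower())]


hits = keyword_search(docs, "lead")
print(hits)
```

The search returns the leadership memo, the chemistry report, and the mixed document alike, so the word alone does not isolate either topic.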
32. Generative Modeling
• Posit a parameterized structure that is responsible for generating
the data
• Use the data to fit the parameters
• A notion of causality is important for these models
33. Example of a Generative model
• Consider a patient with some
disease
• Class: Disease present /
absent, Features: Test results
• Arrows indicate cause in this
diagram; the symptoms
(features) are caused by the
disease
• This generative process
implies a structure; in this case
the so-called “Naive Bayes”
model
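Written out, the structure implied by this diagram is the standard Naive Bayes factorization. The symbols are introduced here for illustration: y is the class (disease present or absent) and x_1, ..., x_n are the test results, each depending only on y.

```latex
p(y, x_1, \dots, x_n) = p(y) \prod_{i=1}^{n} p(x_i \mid y)
```

Each arrow in the diagram corresponds to one factor p(x_i | y); the absence of arrows between features is what makes the model "naive".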
34. Generative vs. Discriminative
• This is an important distinction in machine learning generally
• Generative models try to model / assume a structure for the
process generating the data
• More mathematically, generative classifiers explicitly model the
joint distribution p(x, y) of the data
• Discriminative models don’t care; they “solve the prediction problem
directly”, and model only the conditional p(y|x) (Vapnik)
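The two views are linked: a model of the joint p(x, y) always yields the conditional p(y|x) by normalizing, but not the other way round. A small sketch with a made-up joint distribution over one test result and one disease label; the numbers and the `conditional` helper are hypothetical.

```python
# Hypothetical joint distribution p(x, y); the four entries sum to 1.
joint = {
    ("positive", "disease"): 0.08,
    ("positive", "healthy"): 0.10,
    ("negative", "disease"): 0.02,
    ("negative", "healthy"): 0.80,
}


def conditional(joint, x):
    """p(y | x), obtained from the joint by dividing by p(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


posterior = conditional(joint, "positive")
print(posterior)
```

A generative classifier stores something like `joint` and derives `posterior` when asked; a discriminative classifier fits `posterior` directly and never represents p(x).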
35. Which is Better?
• No general answer to this question (not that we haven’t tried):
Paper: On Discriminative vs. Generative Classifiers [1]
• Discriminative models tend to be faster to fit, quicker to predict,
and in the case of non-parametrics are often guaranteed to
converge to the correct answer given enough data
• Generative models tend to be more probabilistically sound and
able to do more than just classify
[1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
37. A New Way of Thinking About Documents
• Three entities: Documents,
Terms, and Topics
• A term is a single lexical token
(usually one or more words,
but can be any arbitrary string)
• A document has many terms
• A topic is a distribution over
terms
38. A Generative Model for Documents
• A document can be thought of as a distribution over topics, drawn
from a distribution over possible distributions
• To create a document, repeatedly draw a topic at random from the
document’s topic distribution, then draw a term from that topic (which,
remember, is a distribution over terms)
• The main thing we want to infer is the topic distribution
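The generative story above can be run forwards directly. A minimal sketch using only the standard library: the two topics, their term probabilities, and the value of `alpha` are made up, and the Dirichlet draw is done by normalizing gamma variates.

```python
import random

# Hypothetical topics: each is a distribution over terms.
topics = {
    "leadership": {"lead": 0.3, "ceo": 0.3, "management": 0.4},
    "chemistry": {"lead": 0.3, "compound": 0.3, "reaction": 0.4},
}


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_terms, alpha, rng):
    """Draw a topic mix for the document, then a topic and a term for each position."""
    names = list(topics)
    topic_mix = sample_dirichlet([alpha] * len(names), rng)
    terms = []
    for _ in range(n_terms):
        topic = rng.choices(names, weights=topic_mix)[0]
        vocab = topics[topic]
        terms.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return dict(zip(names, topic_mix)), terms


rng = random.Random(0)
mix, terms = generate_document(topics, n_terms=10, alpha=0.5, rng=rng)
print(mix)
print(terms)
```

Fitting the model means going the other way: given only the terms, recover a plausible topic mix for each document and a plausible term distribution for each topic.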
39. Dirichlet Process Intuition: Rich Get Richer
• We use a Dirichlet process to model the relationship between
documents, topics, and terms
• We’re more likely to think a word came from a topic if we’ve
already seen a bunch of words from that topic
• We’re more likely to think the topic was responsible for generating
the document if we’ve already seen a bunch of words in the
document from that topic.
• Here lies the disambiguation: If a word could have come from two
different topics, we use the rest of the words in the document to
decide which meaning it has
• Note that there’s a little bit of self-fulfilling prophecy going on here
(by design)
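One common way to make this intuition concrete is the per-word update used in collapsed Gibbs sampling for LDA, where the probability of assigning a word to a topic is proportional to (words in this document already in the topic + α) × (times this word was seen in the topic + β) / (all words in the topic + V·β). The sketch below assumes that formulation, which is not necessarily the inference method used in the lecture's software; all counts are made up.

```python
def topic_probabilities(word, doc_topic_counts, topic_word_counts, topic_totals,
                        vocab_size, alpha=0.1, beta=0.01):
    """Normalized topic probabilities for one word in one document."""
    weights = {}
    for topic, n_dk in doc_topic_counts.items():
        n_kw = topic_word_counts[topic].get(word, 0)
        weights[topic] = ((n_dk + alpha) * (n_kw + beta)
                          / (topic_totals[topic] + vocab_size * beta))
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Hypothetical corpus-wide counts: "lead" is equally common in both topics.
topic_word_counts = {
    "leadership": {"lead": 50, "ceo": 40, "management": 60},
    "chemistry": {"lead": 50, "compound": 45, "reaction": 55},
}
topic_totals = {"leadership": 150, "chemistry": 150}

# Two documents whose other words are already assigned mostly to one topic.
business_doc = {"leadership": 8, "chemistry": 1}
lab_doc = {"leadership": 1, "chemistry": 8}

p_business = topic_probabilities("lead", business_doc, topic_word_counts,
                                 topic_totals, vocab_size=5)
p_lab = topic_probabilities("lead", lab_doc, topic_word_counts,
                            topic_totals, vocab_size=5)
print(p_business)
print(p_lab)
```

The word itself is uninformative here, so the rest of the document decides: the same "lead" is attributed to leadership in one document and to chemistry in the other.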
41. Usenet Movie Reviews
Library of over 26,000 movie reviews
A solid noir melodrama from Vincent Sherman, who takes a standard
story and dresses it up with moving characterizations and beautifully
expressionistic B&W photography from cinematographer James Wong Howe.
The director took a songwriter Paul Webster's short magazine story
called "The Man Who Died Twice" and improved the story by rounding out
the characters to give them both strong and weak points, so that they
would not be one-note characters as was the case in the original
story. The film was made by Warner Brothers, who needed a film for
their contract star Ann Sheridan and asked Sherman to change the story
around so that her part as Nora Prentiss, a nightclub singer, is
expanded
42. Supreme Court Cases
Library of about 7500 Supreme Court Cases
NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F.
SUPP. 819, REVERSED.
FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE
APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE
SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND
UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE
FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT
PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA,
ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
44. Visualizing Changes in Topic Over Time
• Plot changes in topic distribution over time
• Especially nice for dated historical collections (e.g., novels,
newspapers)
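Once every document has a topic distribution, the data behind such a plot is just a per-period average. A minimal sketch, assuming each document comes with a year and a three-topic distribution; the values and the function name are made up.

```python
from collections import defaultdict

# Hypothetical (year, topic distribution) pairs, one per document.
documents = [
    (1850, [0.7, 0.2, 0.1]),
    (1850, [0.5, 0.3, 0.2]),
    (1900, [0.2, 0.5, 0.3]),
    (1900, [0.4, 0.3, 0.3]),
]


def mean_topics_by_year(documents):
    """Average the topic distributions of all documents sharing a year."""
    grouped = defaultdict(list)
    for year, dist in documents:
        grouped[year].append(dist)
    return {
        year: [sum(column) / len(dists) for column in zip(*dists)]
        for year, dists in sorted(grouped.items())
    }


trend = mean_topics_by_year(documents)
for year, dist in trend.items():
    print(year, [round(p, 2) for p in dist])
```

Plotting one line per topic against the year then shows which topics gain or lose share over time.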
45. Search Without Keywords
• Keyword search is great, if you
know the keywords
• Good for finding search terms
• Great for, e.g., legal discovery
• Nice for finding “outliers”
• Surprise topics (From the
recycle bin)
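Searching without keywords can be as simple as ranking documents by how close their topic distributions are to that of a document you already have. A sketch using cosine similarity, which is one reasonable choice among several; the document names and topic vectors are hypothetical.

```python
import math

# Hypothetical topic distributions for three documents.
doc_topics = {
    "contract_dispute": [0.8, 0.1, 0.1],
    "merger_memo": [0.7, 0.2, 0.1],
    "lunch_menu": [0.05, 0.05, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_name, doc_topics, k):
    """Names of the k documents closest in topic space to the query document."""
    query = doc_topics[query_name]
    others = [(cosine(query, vec), name)
              for name, vec in doc_topics.items() if name != query_name]
    return [name for _, name in sorted(others, reverse=True)[:k]]


print(most_similar("contract_dispute", doc_topics, k=2))
```

Sorting the same list in ascending order surfaces the documents least like the query, which is one way to look for outliers.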
46. Feature Spaces for Classification
• Just classify the documents in “topic space” rather than “bag
space”
• The topics that come out of LDA have some nice benefits as
features
  - Can reduce a feature space of thousands to a few dozen (faster to fit)
  - Nicely interpretable
  - Automatically tailored to the documents you’ve provided
• Foreshadowing Alert: When using LDA in this way, we’re doing a
form of feature engineering which we’ll hear more about tomorrow.
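To make "topic space" concrete, here is a deliberately tiny LDA fit by collapsed Gibbs sampling that turns each document into a short vector of topic proportions. It is a teaching sketch under stated assumptions (toy documents, fixed `alpha` and `beta`, a fixed number of sweeps, standard library only) and not the implementation used in the lecture; in practice one would use an established library.

```python
import random


def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler; returns one topic-proportion vector per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]        # tokens per (document, topic)
    topic_word = [dict() for _ in range(n_topics)]    # tokens per (topic, word)
    topic_total = [0] * n_topics                      # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Smoothed topic proportions: the features in "topic space".
    return [
        [(n + alpha) / (len(doc) + n_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


docs = [
    "ceo management lead organization ceo".split(),
    "management lead ceo organization".split(),
    "compound reaction lead element compound".split(),
    "element reaction lead compound".split(),
]
features = fit_lda(docs, n_topics=2)
for doc, vec in zip(docs, features):
    print([round(p, 2) for p in vec], " ".join(doc))
```

Each document is now described by `n_topics` numbers instead of one count per vocabulary word, and those vectors can be passed to any classifier in place of the bag-of-words features.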
47. Some Caveats
• You need to choose the number of topics beforehand
• Takes forever, both to fit and to do inference
• Takes a lot of text to make it meaningful
• Tends to focus on “meaningless minutiae”
• While it sometimes makes a nice classification space, it’s a rare
case that provides dramatic improvement over bag-of-words
• I find it nice just for exploration
48. Thus Ends The Lesson
Questions?