VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

BigML, Inc
Association Discovery
Geoff Webb
Professor of Information Technology Research
Monash University, Melbourne, Australia
Finding interesting correlations

Association Discovery
• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection.
• Learning Task: Find “interesting” relations between variables.

Unsupervised Learning
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Clustering: similar instances
Anomaly Detection: unusual instances

Association Rules
Rules: Antecedent → Consequent
{class = gas} → amount < 100
{customer = Bob, account = 3421} → zip = 46140
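As a minimal sketch of what "antecedent" and "consequent" mean operationally, the two rules can be checked against the toy transactions from the table. The function name `count_rule` and the dictionary layout are illustrative assumptions, not part of Magnum Opus or any BigML API; only the columns the two rules use are kept.

```python
# Toy transactions from the table in the slides (only the columns the rules need).
COLUMNS = ["customer", "account", "class", "zip", "amount"]
ROWS = [
    ("Bob", 3421, "clothes", 46140, 135),
    ("Bob", 3421, "food", 46140, 401),
    ("Alice", 2456, "food", 12222, 234),
    ("Sally", 6788, "gas", 26339, 94),
    ("Bob", 3421, "tech", 21350, 2459),
    ("Bob", 3421, "gas", 46140, 83),
    ("Sally", 6788, "food", 26339, 51),
]
transactions = [dict(zip(COLUMNS, row)) for row in ROWS]


def count_rule(rows, antecedent, consequent):
    """Return (rows matching the antecedent, rows matching antecedent AND consequent).

    `antecedent` and `consequent` are hypothetical predicate functions on a row.
    """
    matched = [r for r in rows if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    return len(matched), len(both)


# {class = gas} -> amount < 100
gas_rule = count_rule(
    transactions,
    lambda r: r["class"] == "gas",
    lambda r: r["amount"] < 100,
)

# {customer = Bob, account = 3421} -> zip = 46140
bob_rule = count_rule(
    transactions,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140,
)

print("gas rule:", gas_rule)
print("bob rule:", bob_rule)
```

On this table the gas rule holds for every row that matches its antecedent, while the Bob rule has one exception (the tech purchase in zip 21350), so a rule can be interesting without being exceptionless.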

Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors

Magnum Opus
What's wrong with frequent pattern mining?
• Feast or famine
  • often results in too few or too many patterns
• The vodka and caviar problem
  • some high value patterns are infrequent
• Cannot handle dense data
• Minimum support may not be relevant
  • cannot be low enough to capture all valid rules
  • cannot be high enough to exclude all spurious rules

Magnum Opus
Very infrequent patterns can be significant
Data file: Brijs retail.itl, 88162 cases / 16470 items
237 → 1 [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007]
237 & 4685 → 1
1159 → 1
4685 → 1
168 → 1
4382 → 1
168 & 4685 → 1

Magnum Opus
Very high support patterns can be spurious
Data file: covtype.data 581012 cases / 125 values
ST15=0 → ST07=0 [Coverage=581009; Support=580904; Confidence=1.000]
ST07=0 → ST15=0
ST15=0 → ST36=0
ST36=0 → ST15=0
ST15=0 → ST08=0
ST08=0 → ST15=0
… 197,183,686 such rules have highest support

Magnum Opus
• User selects measure of interest
• System finds the top-k associations on that measure within constraints
• There must be a statistically significant interaction between antecedent and consequent
• Every item in the antecedent must increase the strength of association
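To make "statistically significant interaction" concrete, here is a minimal sketch of a one-sided Fisher exact test on the 2x2 table of antecedent versus consequent. The slides do not say which test Magnum Opus applies, so the choice of Fisher's test is an assumption, and the function name and arguments are hypothetical.

```python
from math import comb


def fisher_right_tail(n, n_a, n_c, n_ac):
    """One-sided Fisher exact test p-value (assumed test, not confirmed by the slides).

    n    : total number of instances
    n_a  : instances matching the antecedent A
    n_c  : instances matching the consequent C
    n_ac : instances matching both A and C

    Returns the probability of seeing an overlap of at least n_ac
    if A and C were independent (hypergeometric upper tail).
    """
    upper = min(n_a, n_c)
    total = comb(n, n_a)
    tail = sum(
        comb(n_c, k) * comb(n - n_c, n_a - k)
        for k in range(n_ac, upper + 1)
    )
    return tail / total


# Hypothetical counts: 7 instances, A matches 2, C matches 3, both match 2.
p_value = fisher_right_tail(n=7, n_a=2, n_c=3, n_ac=2)
print(round(p_value, 4))
```

With only seven instances, even a rule with no exceptions gets a p-value of about 0.14, which is why a significance filter discards many rules that merely look strong on small counts.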

Association Metrics
Coverage
Percentage of instances which match antecedent “A”
[Figure: Instances, with regions A and C]

Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”
[Figure: Instances, with regions A and C]

Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Figure: Instances, with regions A and C]

Association Metrics
Confidence
0%: A never implies C
Between 0% and 100%: A sometimes implies C
100%: A always implies C

Association Metrics
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
[Figure: overlap of A and C, Independent vs. Observed]

Association Metrics
Lift < 1: Negative Correlation
Lift = 1: No Association
Lift > 1: Positive Correlation
[Figure: overlap of A and C, Independent vs. Observed, for each case]

Association Metrics
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Figure: overlap of A and C, Independent vs. Observed]

Association Metrics
Leverage < 0: Negative Correlation
Leverage = 0: No Association
Leverage > 0: Positive Correlation
[Figure: overlap of A and C, Independent vs. Observed, for each case; scale starts at -1…]
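The five metrics can all be derived from four counts. The sketch below is a direct transcription of the slide definitions; the function name `association_metrics` and its arguments are illustrative assumptions, and the example counts are those of a rule that matches 2 of 7 instances on the antecedent, 3 of 7 on the consequent, and 2 of 7 on both.

```python
def association_metrics(n, n_a, n_c, n_ac):
    """Compute the slide metrics from raw counts (assumes n_a > 0 and n_c > 0).

    n    : total number of instances
    n_a  : instances matching the antecedent A
    n_c  : instances matching the consequent C
    n_ac : instances matching both A and C
    """
    coverage = n_a / n                      # p(A)
    p_c = n_c / n                           # p(C)
    support = n_ac / n                      # p(A and C)
    confidence = n_ac / n_a                 # support / coverage
    lift = confidence / p_c                 # support / (p(A) * p(C))
    leverage = support - coverage * p_c     # support - p(A) * p(C)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
    }


metrics = association_metrics(n=7, n_a=2, n_c=3, n_ac=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lift and leverage compare the same two quantities, observed support and the support expected under independence; lift takes their ratio and leverage their difference, so lift above 1 and leverage above 0 always agree on the direction of the association.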

Use Cases
GOAL: Discover “interesting” rules about what store items are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout

Association Discovery 
Demo #1

Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.

Association Discovery 
Demo #2

Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes = TRUE

Latent Dirichlet Allocation
#VSSML16
September 2016

Outline
1 Understanding the Limits of Simple Text Analysis
2 Aside: Generative Processes
3 Latent Dirichlet Allocation
4 A Couple of Instructive Examples
5 Applications

Bag of Words Analysis
• Easiest way of analyzing a text field is just to treat it as a “bag of words”
• Each word is a separate feature (usually an occurrence count)
• When modeling, the features are treated in isolation from one another, essentially “one at a time”
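A minimal sketch of the bag-of-words representation, using only the Python standard library. The tokenizer (lowercase, letters and apostrophes only) and the two example sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Turn a text field into occurrence counts, ignoring word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two made-up documents that share a word used in different senses.
docs = [
    "The CEO will lead the organization",
    "Lead reacts with the compound",
]
bags = [bag_of_words(d) for d in docs]

for bag in bags:
    print(dict(bag))
```

Both documents end up with the same count for "lead", so a model that looks at that feature in isolation cannot tell the two senses apart.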

Limitations
• Words are sometimes ambiguous
  • Both because of multiple definitions and difference in tone
• How do we usually disambiguate words? Context

An Instructive Example
• One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data
• Suppose I have a collection of documents where some of them are about two different topics (via Ted Underwood’s Blog):
  • Leadership (CEOs, organization, management)
  • Chemistry (Elements, compounds, reactions)
• If I do a keyword search for “lead” (or try to classify documents based on that word alone), I’ll get documents from either category and documents that are a mix of both
• Can we build a feature that better isolates which set of documents we’re looking for?

Generative Modeling
• Posit a parameterized structure that is responsible for generating
the data
• Use the data to fit the parameters
• A notion of causality is important for these models

Example of a Generative Model
• Consider a patient with some disease
• Class: Disease present / absent, Features: Test results
• Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease
• This generative process implies a structure; in this case the so-called “Naive Bayes” model
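The structure implied by this diagram can be written out as a factorization. This is the standard Naive Bayes form, stated here as a sketch; the symbols $y$ (disease present or absent) and $x_1, \dots, x_n$ (test results) are notation introduced for illustration.

```latex
p(x_1, \dots, x_n, y) = p(y) \prod_{i=1}^{n} p(x_i \mid y)
\qquad\Longrightarrow\qquad
p(y \mid x_1, \dots, x_n) \propto p(y) \prod_{i=1}^{n} p(x_i \mid y)
```

Each feature depends only on the class, which is what the arrows from the disease to each symptom express.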

Generative vs. Discriminative
• This is an important distinction in machine learning generally
• Generative models try to model / assume a structure for the process generating the data
• More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data
• Discriminative models don’t care; they “solve the prediction problem directly”, and model only the conditional p(y|x) (Vapnik)

Which is Better?
• No general answer to this question (not that we haven’t tried): Paper: On Discriminative vs. Generative Classifiers [1]
• Discriminative models tend to be faster to fit, quicker to predict, and in the case of non-parametrics are often guaranteed to converge to the correct answer given enough data
• Generative models tend to be more probabilistically sound and able to do more than just classify
[1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

A New Way of Thinking About Documents
• Three entities: Documents, Terms, and Topics
• A term is a single lexical token (usually one or more words, but can be any arbitrary string)
• A document has many terms
• A topic is a distribution over terms

A Generative Model for Documents
• A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions
• To create a document, repeatedly draw a topic at random from the document’s topic distribution, then draw a term from that topic (which, remember, is a distribution over terms)
• The main thing we want to infer is the topic distribution
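The generative story can be sketched in a few lines of standard-library Python. The two topics, their term probabilities, and the Dirichlet parameter `alpha` are all made-up values for illustration; the Dirichlet draw is implemented by normalizing gamma variates, a standard construction.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document.

    topics : list of dicts mapping term -> probability (one dict per topic)
    alpha  : Dirichlet parameters, one per topic (hypothetical values)
    """
    # The document's own distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Draw a topic from the document's topic distribution...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # ...then draw a term from that topic's distribution over terms.
        terms = list(topics[k])
        weights = [topics[k][t] for t in terms]
        words.append(rng.choices(terms, weights=weights)[0])
    return theta, words


# Hypothetical topics; "lead" appears in both.
TOPICS = [
    {"ceo": 0.4, "management": 0.4, "lead": 0.2},
    {"element": 0.4, "reaction": 0.4, "lead": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
print([round(t, 3) for t in theta])
print(words)
```

Fitting the model runs this story in reverse: given only the words, infer a plausible `theta` for each document and plausible term distributions for each topic.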

Dirichlet Process Intuition: Rich Get Richer
• We use a Dirichlet process to model the relationship between documents, topics, and terms
• We’re more likely to think a word came from a topic if we’ve already seen a bunch of words from that topic
• We’re more likely to think the topic was responsible for generating the document if we’ve already seen a bunch of words in the document from that topic.
• Here lies the disambiguation: If a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has
• Note that there’s a little bit of self-fulfilling prophecy going on here (by design)
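One way to see the "rich get richer" effect numerically is the per-word update used in collapsed Gibbs sampling for LDA. The slides do not name an inference algorithm, so treating it as Gibbs sampling is an assumption, and the function name, counts, and hyperparameters below are hypothetical.

```python
def topic_weights(doc_topic_counts, topic_word_counts, topic_totals,
                  vocab_size, alpha, beta):
    """Probability that one word occurrence belongs to each topic.

    doc_topic_counts  : words in this document currently assigned to each topic
    topic_word_counts : times this particular word is assigned to each topic
    topic_totals      : total words assigned to each topic across the corpus
    All counts exclude the occurrence being resampled.
    alpha, beta       : smoothing hyperparameters (hypothetical values)
    """
    weights = [
        (doc_topic_counts[k] + alpha)
        * (topic_word_counts[k] + beta)
        / (topic_totals[k] + vocab_size * beta)
        for k in range(len(doc_topic_counts))
    ]
    total = sum(weights)
    return [w / total for w in weights]


# An ambiguous word seen equally often in both topics,
# in a document where 8 other words sit in topic 0 and 1 in topic 1.
probs = topic_weights(
    doc_topic_counts=[8, 1],
    topic_word_counts=[20, 20],
    topic_totals=[1000, 1000],
    vocab_size=500,
    alpha=0.1,
    beta=0.01,
)
print([round(p, 3) for p in probs])
```

Because the word itself is equally common in both topics, the decision is made entirely by the rest of the document, which is the disambiguation described in the bullets.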

Usenet Movie Reviews
Library of over 26,000 movie reviews
A solid noir melodrama from Vincent Sherman, who takes a standard
story and dresses it up with moving characterizations and beautifully
expressionistic B&W photography from cinematographer James Wong Howe.
The director took a songwriter Paul Webster's short magazine story
called "The Man Who Died Twice" and improved the story by rounding out
the characters to give them both strong and weak points, so that they
would not be one-note characters as was the case in the original
story. The film was made by Warner Brothers, who needed a film for
their contract star Ann Sheridan and asked Sherman to change the story
around so that her part as Nora Prentiss, a nightclub singer, is
expanded

Supreme Court Cases
Library of about 7500 Supreme Court Cases
NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F.
SUPP. 819, REVERSED.
FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE
APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE
SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND
UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE
FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT
PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA,
ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH

Visualizing Changes in Topic Over Time
• Plot changes in topic distribution over time
• Especially nice for dated historical collections (e.g., novels,
newspapers)

Search Without Keywords
• Keyword search is great, if you know the keywords
• Good for finding search terms
• Great for, e.g., legal discovery
• Nice for finding “outliers”
• Surprise topics (From the recycle bin)

Feature Spaces for Classification
• Just classify the documents in “topic space” rather than “bag space”
• The topics that come out of LDA have some nice benefits as features
  • Can reduce a feature space of thousands to a few dozen (faster to fit)
  • Nicely interpretable
  • Automatically tailored to the documents you’ve provided
• Foreshadowing Alert: When using LDA in this way, we’re doing a form of feature engineering which we’ll hear more about tomorrow.

Some Caveats
• You need to choose the number of topics beforehand
• Takes forever, both to fit and to do inference
• Takes a lot of text to make it meaningful
• Tends to focus on “meaningless minutiae”
• While it sometimes makes a nice classification space, it rarely provides a dramatic improvement over bag-of-words
• I find it nice just for exploration

Thus Ends The Lesson
Questions?
