September 8-9, 2016
BigML, Inc 2
Association Discovery
Geoff Webb
Professor of Information Technology Research
Monash University, Melbourne, Australia
Finding interesting correlations
BigML, Inc 3Unsupervised Learning
Association Discovery
• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: Works with unlabelled
data, like clustering and anomaly detection.
• Learning Task: Find “interesting” relations
between variables.
Unsupervised Learning
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Clustering: finds similar instances
Anomaly Detection: finds unusual instances
Association Rules
Rules: Antecedent → Consequent
{class = gas} → amount < 100
{customer = Bob, account = 3421} → zip = 46140
Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors
Magnum Opus
What's wrong with frequent pattern mining?
• Feast or famine
   • often results in too few or too many patterns
• The vodka and caviar problem
   • some high value patterns are infrequent
• Cannot handle dense data
• Minimum support may not be relevant
   • cannot be low enough to capture all valid rules
   • cannot be high enough to exclude all spurious rules
Magnum Opus
Very infrequent patterns can be significant
Data file: Brijs retail.itl, 88162 cases / 16470 items

237 → 1  [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007]
237 & 4685 → 1  [Coverage=19; Support=9; Lift=157.00; p=5.03E-012]
1159 → 1  [Coverage=197; Support=9; Lift=15.14; p=1.13E-008]
4685 → 1  [Coverage=270; Support=9; Lift=11.05; p=1.68E-007]
168 → 1  [Coverage=293; Support=9; Lift=10.18; p=3.33E-007]
4382 → 1  [Coverage=72; Support=8; Lift=36.83; p=6.26E-011]
168 & 4685 → 1  [Coverage=9; Support=7; Lift=257.78; p=6.66E-011]
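The reported lift values can be sanity-checked from the counts in the listing. The sketch below assumes that Coverage and Support here are instance counts (as the numbers suggest) and uses the identity Lift = Confidence / p(C) to back out the implied base rate of item 1 from each rule. The names `RULES` and `implied_consequent_rate` are illustrative and are not Magnum Opus output fields.

```python
# (antecedent, coverage count, support count, reported lift), copied from the listing.
RULES = [
    ("237", 3032, 28, 3.06),
    ("237 & 4685", 19, 9, 157.00),
    ("1159", 197, 9, 15.14),
    ("4685", 270, 9, 11.05),
    ("168", 293, 9, 10.18),
    ("4382", 72, 8, 36.83),
    ("168 & 4685", 9, 7, 257.78),
]


def implied_consequent_rate(coverage, support, lift):
    """Lift = confidence / p(C), so p(C) = (support / coverage) / lift."""
    return (support / coverage) / lift


rates = [implied_consequent_rate(c, s, l) for _, c, s, l in RULES]
for (antecedent, *_), rate in zip(RULES, rates):
    print(f"{antecedent:>12} -> 1: implied p(item 1) = {rate:.5f}")
```

Every rule implies roughly the same base rate for item 1 (about 0.3% of baskets), which is why a rule with a support of only 7 can still carry a lift above 250.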
Magnum Opus
Very high support patterns can be spurious
Data file: covtype.data 581012 cases / 125 values

ST15=0 → ST07=0  [Coverage=581009; Support=580904; Confidence=1.000]
ST07=0 → ST15=0  [Coverage=580907; Support=580904; Confidence=1.000]
ST15=0 → ST36=0  [Coverage=581009; Support=580890; Confidence=1.000]
ST36=0 → ST15=0  [Coverage=580893; Support=580890; Confidence=1.000]
ST15=0 → ST08=0  [Coverage=581009; Support=580830; Confidence=1.000]
ST08=0 → ST15=0  [Coverage=580833; Support=580830; Confidence=1.000]
… 197,183,686 such rules have highest support
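To see why these rules are spurious despite their support, it helps to compute lift for one of them. This is a small sketch, assuming Coverage and Support are instance counts and taking the count of ST07=0 from the coverage of the reversed rule; the function names are illustrative.

```python
N_CASES = 581012  # from the "Data file" line


def confidence(coverage, support):
    """Share of antecedent matches that also match the consequent."""
    return support / coverage


def lift(coverage, support, consequent_count, n=N_CASES):
    """Observed support divided by the support expected under independence."""
    expected = (coverage / n) * (consequent_count / n)
    return (support / n) / expected


# ST15=0 -> ST07=0: coverage 581009, support 580904.
# The count of ST07=0 is the coverage of the reversed rule, 580907.
conf = confidence(581009, 580904)
lft = lift(581009, 580904, 580907)
print(f"confidence = {conf:.3f}, lift = {lft:.6f}")
```

Confidence rounds to 1.000, yet lift is indistinguishable from 1: both values are true in almost every row, so they co-occur about as often as independence predicts.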
Magnum Opus
• User selects measure of interest
• System finds the top-k associations on that measure within constraints
   • Must be statistically significant interaction between antecedent and consequent
   • Every item in the antecedent must increase the strength of association
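The top-k idea can be sketched on a toy dataset. This is not Magnum Opus itself: it is a brute-force illustration that ranks rules by lift and enforces only the second constraint (each antecedent item must raise confidence), with the statistical significance test left out. The baskets and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy baskets; not data from the talk.
BASKETS = [
    {"vodka", "caviar", "bread"},
    {"vodka", "caviar"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"vodka", "bread"},
    {"milk"},
]


def frac(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= basket for basket in BASKETS) / len(BASKETS)


def confidence(antecedent, consequent):
    cov = frac(antecedent)
    return frac(antecedent | {consequent}) / cov if cov else 0.0


def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / frac({consequent})


def is_productive(antecedent, consequent):
    """Every antecedent item must raise confidence over the rule without it."""
    full = confidence(antecedent, consequent)
    for item in antecedent:
        rest = antecedent - {item}
        baseline = confidence(rest, consequent) if rest else frac({consequent})
        if full <= baseline:
            return False
    return True


def top_k_rules(k, max_antecedent=2):
    """Return the k productive rules with the highest lift."""
    items = sorted(set().union(*BASKETS))
    rules = []
    for size in range(1, max_antecedent + 1):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            if frac(antecedent) == 0:
                continue
            for consequent in items:
                if consequent in antecedent:
                    continue
                if is_productive(antecedent, consequent):
                    rules.append((lift(antecedent, consequent), sorted(antecedent), consequent))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules[:k]


for rule_lift, antecedent, consequent in top_k_rules(2):
    print(f"{' & '.join(antecedent)} -> {consequent}  lift={rule_lift:.2f}")
```

Note that no minimum support is involved: the vodka and caviar rules rank first even though caviar appears in only two baskets, and longer rules such as bread & caviar → vodka are dropped because the extra item adds nothing.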
Association Metrics
Coverage
Percentage of instances
which match antecedent “A”
Association Metrics
Support
Percentage of instances
which match antecedent
“A” and Consequent “C”
Association Metrics
Confidence
Percentage of instances in
the antecedent which also
contain the consequent.
Confidence = Support / Coverage
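As a worked example, the three metrics so far can be computed for the rule {class = gas} → amount < 100 on the sample transaction table from earlier in the talk. This is a minimal sketch: the row dictionaries keep only the columns the rule needs, and the helper names are illustrative.

```python
# Rows from the sample transaction table (customer, class and amount columns only).
ROWS = [
    {"customer": "Bob", "class": "clothes", "amount": 135},
    {"customer": "Bob", "class": "food", "amount": 401},
    {"customer": "Alice", "class": "food", "amount": 234},
    {"customer": "Sally", "class": "gas", "amount": 94},
    {"customer": "Bob", "class": "tech", "amount": 2459},
    {"customer": "Bob", "class": "gas", "amount": 83},
    {"customer": "Sally", "class": "food", "amount": 51},
]


def is_gas(row):
    return row["class"] == "gas"


def under_100(row):
    return row["amount"] < 100


def coverage(rows, antecedent):
    """Fraction of instances matching the antecedent."""
    return sum(1 for row in rows if antecedent(row)) / len(rows)


def support(rows, antecedent, consequent):
    """Fraction of instances matching both antecedent and consequent."""
    return sum(1 for row in rows if antecedent(row) and consequent(row)) / len(rows)


def confidence(rows, antecedent, consequent):
    """Support divided by coverage (0.0 if the antecedent never matches)."""
    cov = coverage(rows, antecedent)
    return support(rows, antecedent, consequent) / cov if cov else 0.0


cov = coverage(ROWS, is_gas)
sup = support(ROWS, is_gas, under_100)
conf = confidence(ROWS, is_gas, under_100)
print(f"coverage={cov:.3f} support={sup:.3f} confidence={conf:.3f}")
```

Two of the seven rows are gas purchases and both are under 100, so coverage and support are both 2/7 and confidence is 1.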
Association Metrics
Confidence
0%: A never implies C
Between 0% and 100%: A sometimes implies C
100%: A always implies C
Association Metrics
Lift
Ratio of observed support
to support if A and C were
statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
Association Metrics
Lift
Lift < 1: Negative Correlation
Lift = 1: No Association
Lift > 1: Positive Correlation
Association Metrics
Leverage
Difference of observed
support and support if A
and C were statistically
independent. 

Support - [ p(A) * p(C) ]
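Lift and leverage compare the same two quantities, observed support and the support expected under independence, as a ratio and as a difference. The sketch below computes both for the rule {class = gas} → amount < 100 on the sample transaction table from earlier in the talk; the helper names are illustrative.

```python
# (class, amount) pairs from the sample transaction table.
ROWS = [
    ("clothes", 135), ("food", 401), ("food", 234), ("gas", 94),
    ("tech", 2459), ("gas", 83), ("food", 51),
]


def rate(rows, predicate):
    """Fraction of rows for which predicate is true."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


def lift_and_leverage(rows, antecedent, consequent):
    p_a = rate(rows, antecedent)
    p_c = rate(rows, consequent)
    observed = rate(rows, lambda row: antecedent(row) and consequent(row))
    expected = p_a * p_c  # support if A and C were independent
    return observed / expected, observed - expected


def is_gas(row):
    return row[0] == "gas"


def under_100(row):
    return row[1] < 100


rule_lift, rule_leverage = lift_and_leverage(ROWS, is_gas, under_100)
print(f"lift={rule_lift:.3f} leverage={rule_leverage:.3f}")
```

Here p(A) = 2/7, p(C) = 3/7 and the observed support is 2/7, so lift is 7/3 and leverage is 8/49: both indicate a positive association.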
Association Metrics
Leverage
Leverage < 0: Negative Correlation
Leverage = 0: No Association
Leverage > 0: Positive Correlation
-1…
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at
checkout
Association Discovery

Demo #1
Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768
patients.
• Each patient labelled True/False for
diabetes.
Association Discovery

Demo #2
Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes = TRUE
Latent Dirichlet Allocation
#VSSML16
September 2016
#VSSML16 Latent Dirichlet Allocation September 2016 1 / 24
Outline
1 Understanding the Limits of Simple Text Analysis
2 Aside: Generative Processes
3 Latent Dirichlet Allocation
4 A Couple of Instructive Examples
5 Applications
1 Understanding the Limits of Simple Text Analysis
Bag of Words Analysis
• Easiest way of analyzing a text
field is just to treat it as a “bag
of words”
• Each word is a separate
feature (usually an occurrence
count)
• When modeling, the features
are treated in isolation from
one another, essentially “one
at a time”
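A bag of words is easy to build by hand. The following is a minimal sketch, assuming a simple lowercase-and-split tokenizer (real pipelines usually add stop-word removal and stemming); the two example sentences are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on non-letters, and count each word independently."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


docs = [
    "The CEO will lead the organization.",
    "Lead reacts with other elements to form compounds.",
]
bags = [bag_of_words(doc) for doc in docs]
for bag in bags:
    print(dict(bag))
```

Word order and context are discarded: the feature "lead" has the same count in both documents even though the sentences use it in different senses.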
Limitations
• Words are sometimes
ambiguous
• Both because of multiple
definitions and difference in
tone
• How do we usually
disambiguate words? Context
An Instructive Example
• One way of looking at the usefulness of a machine learning
feature is to think about how well it isolates unique and coherent
subsets of the data
• Suppose I have a collection of documents where some of them
are about two different topics (via Ted Underwood’s Blog):
   • Leadership (CEOs, organization, management)
   • Chemistry (Elements, compounds, reactions)
• If I do a keyword search for “lead” (or try to classify documents
based on that word alone), I’ll get documents from either category
and documents that are a mix of both
• Can we build a feature that better isolates which set of documents
we’re looking for?
2 Aside: Generative Processes
Generative Modeling
• Posit a parameterized structure that is responsible for generating
the data
• Use the data to fit the parameters
• A notion of causality is important for these models
Example of a Generative model
• Consider a patient with some
disease
• Class: Disease present /
absent, Features: Test results
• Arrows indicate cause in this
diagram; the symptoms
(features) are caused by the
disease
• This generative process
implies a structure; in this case
the so-called “Naive Bayes”
model
Generative vs. Discriminative
• This is an important distinction in machine learning generally
• Generative models try to model / assume a structure for the
process generating the data
• More mathematically, generative classifiers explicitly model the
joint distribution p(x, y) of the data
• Discriminative models don’t care; they “solve the prediction problem
directly”, and model only the conditional p(y|x) (Vapnik)
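The distinction can be made concrete with the disease example. The sketch below writes down a Naive Bayes joint distribution p(x, y) for two test results and then derives the conditional p(y|x) from it by normalization, which is the quantity a discriminative model would estimate directly. All probabilities are hypothetical values chosen for illustration, not figures from the talk.

```python
# Hypothetical parameters for illustration only (not from the talk).
P_DISEASE = 0.1
# P(test_i is positive | disease state) for two tests.
P_POSITIVE = {True: [0.9, 0.8], False: [0.2, 0.1]}


def joint(disease, tests):
    """p(x, y) under Naive Bayes: p(y) times the product of p(x_i | y)."""
    p = P_DISEASE if disease else 1 - P_DISEASE
    for p_pos, positive in zip(P_POSITIVE[disease], tests):
        p *= p_pos if positive else 1 - p_pos
    return p


def posterior(tests):
    """p(disease | x), obtained by normalizing the joint over both disease states."""
    with_disease = joint(True, tests)
    return with_disease / (with_disease + joint(False, tests))


print(f"p(disease | both tests positive) = {posterior((True, True)):.2f}")
```

Because the generative model holds the full joint, it can also answer other questions, such as how likely a given pair of test results is overall, by summing `joint` over both disease states.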
Which is Better?
• No general answer to this question (not that we haven’t tried):
Paper: On Discriminative vs. Generative Classifiers [1]
• Discriminative models tend to be faster to fit, quicker to predict,
and in the case of non-parametrics are often guaranteed to
converge to the correct answer given enough data
• Generative models tend to be more probabilistically sound and
able to do more than just classify
[1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
3 Latent Dirichlet Allocation
A New Way of Thinking About Documents
• Three entities: Documents,
Terms, and Topics
• A term is a single lexical token
(usually one or more words,
but can be any arbitrary string)
• A document has many terms
• A topic is a distribution over
terms
A Generative Model for Documents
• A document can be thought of as a distribution over topics, drawn
from a distribution over possible distributions
• To create a document, repeatedly draw a topic at random from the
document’s topic distribution, then draw a term from that topic
(which, remember, is a distribution over terms)
• The main thing we want to infer is the topic distribution
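The generative story above can be simulated directly. This is a minimal sketch using only the standard library: the two topics, their term probabilities and the `alpha` value are hypothetical, and the Dirichlet draw is done by normalizing independent Gamma draws.

```python
import random

# Hypothetical topics: each is a distribution over terms (illustrative values).
TOPICS = {
    "leadership": {"lead": 0.3, "ceo": 0.3, "management": 0.4},
    "chemistry": {"lead": 0.3, "compound": 0.4, "reaction": 0.3},
}


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_terms, alpha=0.5, seed=None):
    """Draw a topic mixture for the document, then a topic and a term for each position."""
    rng = random.Random(seed)
    names = list(TOPICS)
    topic_mix = sample_dirichlet([alpha] * len(names), rng)
    terms = []
    for _ in range(n_terms):
        topic = rng.choices(names, weights=topic_mix)[0]
        vocab = TOPICS[topic]
        terms.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return topic_mix, terms


mix, terms = generate_document(20, seed=0)
print([round(p, 2) for p in mix], terms)
```

Fitting LDA runs this story in reverse: given only the terms, infer the topic mixture of each document and the term distribution of each topic.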
Dirichlet Process Intuition: Rich Get Richer
• We use a Dirichlet process to model the relationship between
documents, topics, and terms
• We’re more likely to think a word came from a topic if we’ve
already seen a bunch of words from that topic
• We’re more likely to think the topic was responsible for generating
the document if we’ve already seen a bunch of words in the
document from that topic.
• Here lies the disambiguation: If a word could have come from two
different topics, we use the rest of the words in the document to
decide which meaning it has
• Note that there’s a little bit of self-fulfilling prophecy going on here
(by design)
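One common way to make the rich-get-richer intuition concrete is the per-word update used in collapsed Gibbs sampling for LDA, where the weight of a topic for a word is proportional to (count of that topic in the document + alpha) times (count of the word in the topic + beta) / (total words in the topic + V * beta). The sketch below is an assumption about how inference might be done, not necessarily the method behind the talk's software, and all counts are hypothetical.

```python
def topic_weights(word, doc_topic_counts, topic_word_counts, vocab_size,
                  alpha=0.1, beta=0.01):
    """Normalized probability of each topic for one word occurrence in one document."""
    weights = {}
    for topic, word_counts in topic_word_counts.items():
        # How much of this document is already assigned to the topic.
        in_doc = doc_topic_counts.get(topic, 0) + alpha
        # How strongly the topic is associated with this word.
        in_topic = (word_counts.get(word, 0) + beta) / (
            sum(word_counts.values()) + vocab_size * beta
        )
        weights[topic] = in_doc * in_topic
    total = sum(weights.values())
    return {topic: weight / total for topic, weight in weights.items()}


# "lead" is equally common in both hypothetical topics...
topic_word_counts = {
    "leadership": {"lead": 10, "ceo": 10, "management": 10},
    "chemistry": {"lead": 10, "compound": 10, "reaction": 10},
}
# ...but the rest of this document is mostly chemistry words.
doc_topic_counts = {"chemistry": 8, "leadership": 1}

probs = topic_weights("lead", doc_topic_counts, topic_word_counts, vocab_size=5)
print({topic: round(p, 3) for topic, p in probs.items()})
```

Since the word alone cannot separate the two topics, the document's existing assignments decide: "lead" is attributed to chemistry with high probability.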
4 A Couple of Instructive Examples
Usenet Movie Reviews
Library of over 26,000 movie reviews
A solid noir melodrama from Vincent Sherman, who takes a standard
story and dresses it up with moving characterizations and beautifully
expressionistic B&W photography from cinematographer James Wong Howe.
The director took a songwriter Paul Webster's short magazine story
called "The Man Who Died Twice" and improved the story by rounding out
the characters to give them both strong and weak points, so that they
would not be one-note characters as was the case in the original
story. The film was made by Warner Brothers, who needed a film for
their contract star Ann Sheridan and asked Sherman to change the story
around so that her part as Nora Prentiss, a nightclub singer, is
expanded
Supreme Court Cases
Library of about 7500 Supreme Court Cases
NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F.
SUPP. 819, REVERSED.
FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE
APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE
SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND
UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE
FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT
PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA,
ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
5 Applications
Visualizing Changes in Topic Over Time
• Plot changes in topic distribution over time
• Especially nice for dated historical collections (e.g., novels,
newspapers)
Search Without Keywords
• Keyword search is great, if you
know the keywords
• Good for finding search terms
• Great for, e.g., legal discovery
• Nice for finding “outliers”
• Surprise topics (From the
recycle bin)
Feature Spaces for Classification
• Just classify the documents in “topic space” rather than “bag
space”
• The topics that come out of LDA have some nice benefits as
features
   • Can reduce a feature space of thousands to a few dozen (faster to fit)
   • Nicely interpretable
   • Automatically tailored to the documents you’ve provided
• Foreshadowing Alert: When using LDA in this way, we’re doing a
form of feature engineering which we’ll hear more about tomorrow.
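Moving a document into topic space amounts to replacing its word counts with topic proportions. The sketch below assumes a fitted topic model has already assigned a topic index to every token of the document (the `token_topics` input is hypothetical) and shows only the final conversion step.

```python
from collections import Counter


def topic_features(token_topics, n_topics):
    """Turn per-token topic assignments into a fixed-length vector of topic proportions."""
    counts = Counter(token_topics)
    total = len(token_topics)
    return [counts[k] / total if total else 0.0 for k in range(n_topics)]


# A four-token document whose tokens were assigned to topics 0, 0, 2 and 0.
features = topic_features([0, 0, 2, 0], n_topics=3)
print(features)
```

The resulting vector has one entry per topic regardless of vocabulary size, so any classifier can be trained on it in place of the bag-of-words counts.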
Some Caveats
• You need to choose the number of topics beforehand
• Takes forever, both to fit and to do inference
• Takes a lot of text to make it meaningful
• Tends to focus on “meaningless minutiae”
• While it sometimes makes a nice classification space, it’s a rare
case that provides dramatic improvement over bag-of-words
• I find it nice just for exploration
Thus Ends The Lesson
Questions?

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

  • 2. Association Discovery: Finding interesting correlations. Geoff Webb, Professor of Information Technology Research, Monash University, Melbourne, Australia. BigML, Inc
  • 3. Association Discovery (Unsupervised Learning) • Algorithm: “Magnum Opus” from Geoff Webb • Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection. • Learning Task: Find “interesting” relations between variables.
  • 4. Unsupervised Learning: Clustering (finds similar instances) and Anomaly Detection (finds unusual instances), both illustrated on the same table:
      date  customer  account  auth  class    zip    amount
      Mon   Bob       3421     pin   clothes  46140  135
      Tue   Bob       3421     sign  food     46140  401
      Tue   Alice     2456     pin   food     12222  234
      Wed   Sally     6788     pin   gas      26339  94
      Wed   Bob       3421     pin   tech     21350  2459
      Wed   Bob       3421     pin   gas      46140  83
      Thu   Sally     6788     sign  food     26339  51
  • 5. Association Rules, found on the table from slide 4. Each rule has the form Antecedent → Consequent:
      {class = gas} → amount < 100
      {customer = Bob, account = 3421} → zip = 46140
  • 6. Use Cases • Market Basket Analysis • Web usage patterns • Intrusion detection • Fraud detection • Bioinformatics • Medical risk factors
  • 7. Magnum Opus: What's wrong with frequent pattern mining?
  • 8. Magnum Opus: What's wrong with frequent pattern mining? • Feast or famine: often results in too few or too many patterns • The vodka and caviar problem: some high-value patterns are infrequent • Cannot handle dense data • Minimum support may not be relevant: it cannot be set low enough to capture all valid rules, nor high enough to exclude all spurious rules
  • 9. Magnum Opus: Very infrequent patterns can be significant. Data file: Brijs retail.itl, 88162 cases / 16470 items
      237 → 1 [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007]
      237 & 4685 → 1 [Coverage=19; Support=9; Lift=157.00; p=5.03E-012]
      1159 → 1 [Coverage=197; Support=9; Lift=15.14; p=1.13E-008]
      4685 → 1 [Coverage=270; Support=9; Lift=11.05; p=1.68E-007]
      168 → 1 [Coverage=293; Support=9; Lift=10.18; p=3.33E-007]
      4382 → 1 [Coverage=72; Support=8; Lift=36.83; p=6.26E-011]
      168 & 4685 → 1 [Coverage=9; Support=7; Lift=257.78; p=6.66E-011]
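The figures on slide 9 can be cross-checked against each other. The sketch below assumes the standard definitions given later in the deck (Confidence = Support / Coverage and Lift = Confidence / p(C)), with Coverage and Support read as raw counts; under those assumptions every rule should imply roughly the same number of transactions containing the consequent item 1. The variable names are illustrative only.

```python
# Consistency check on the slide 9 rules, assuming Lift = Confidence / p(C).
N_CASES = 88162  # number of transactions reported on the slide

# (antecedent, coverage count, support count, reported lift), copied from the slide.
rules = [
    ("237", 3032, 28, 3.06),
    ("237 & 4685", 19, 9, 157.00),
    ("1159", 197, 9, 15.14),
    ("4685", 270, 9, 11.05),
    ("168", 293, 9, 10.18),
    ("4382", 72, 8, 36.83),
    ("168 & 4685", 9, 7, 257.78),
]

implied_counts = []
for antecedent, coverage, support, lift in rules:
    confidence = support / coverage      # fraction of antecedent matches that contain item 1
    implied_p_c = confidence / lift      # p(C) implied by the reported lift
    implied_counts.append(implied_p_c * N_CASES)
    print(f"{antecedent:>12} -> 1: confidence={confidence:.3f}, "
          f"implied count of item 1 = {implied_counts[-1]:.1f}")
```

If the reported numbers are mutually consistent, the implied counts agree up to rounding of the lift values, and they show how rare the consequent is relative to the 88162 cases.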
  • 10. Magnum Opus: Very high support patterns can be spurious. Data file: covtype.data, 581012 cases / 125 values
      ST15=0 → ST07=0 [Coverage=581009; Support=580904; Confidence=1.000]
      ST07=0 → ST15=0 [Coverage=580907; Support=580904; Confidence=1.000]
      ST15=0 → ST36=0 [Coverage=581009; Support=580890; Confidence=1.000]
      ST36=0 → ST15=0 [Coverage=580893; Support=580890; Confidence=1.000]
      ST15=0 → ST08=0 [Coverage=581009; Support=580830; Confidence=1.000]
      ST08=0 → ST15=0 [Coverage=580833; Support=580830; Confidence=1.000]
      … 197,183,686 such rules have highest support
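One way to see why the first rule on slide 10 is spurious is to compute its lift from the counts on the slide. This is a sketch under the standard definition Lift = Support / (p(A) p(C)); it takes the count of ST07=0 from the coverage of the reverse rule (ST07=0 → ST15=0), which is an inference from the slide rather than something the slide states directly.

```python
from fractions import Fraction

N = 581012            # cases in covtype.data, from the slide
count_a = 581009      # ST15=0: coverage of ST15=0 -> ST07=0
count_c = 580907      # ST07=0: coverage of the reverse rule (assumed to be the item count)
count_ac = 580904     # both: support of either rule

support = Fraction(count_ac, N)
confidence = Fraction(count_ac, count_a)
p_a = Fraction(count_a, N)
p_c = Fraction(count_c, N)

lift = support / (p_a * p_c)         # ratio to the independent case
leverage = support - p_a * p_c       # difference from the independent case

print(f"confidence = {float(confidence):.6f}")
print(f"lift       = {float(lift):.9f}")
print(f"leverage   = {float(leverage):.3e}")
```

Confidence is close to 1 only because both values are almost always 0; lift and leverage are indistinguishable from the independent case, which is the sense in which the rule carries no information.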
  • 11. Magnum Opus • User selects measure of interest • System finds the top-k associations on that measure within constraints • There must be a statistically significant interaction between antecedent and consequent • Every item in the antecedent must increase the strength of association
  • 12. Association Metrics: Coverage. Percentage of instances which match the antecedent “A”.
  • 13. Association Metrics: Support. Percentage of instances which match both the antecedent “A” and the consequent “C”.
  • 14. Association Metrics: Confidence. Percentage of instances matching the antecedent which also match the consequent: Confidence = Support / Coverage.
  • 15. Association Metrics: Confidence ranges from 0% (A never implies C), through intermediate values (A sometimes implies C), to 100% (A always implies C).
  • 16. Association Metrics: Lift. Ratio of observed support to the support expected if A and C were statistically independent: $\mathrm{Lift} = \frac{\mathrm{Support}}{p(A)\,p(C)} = \frac{\mathrm{Confidence}}{p(C)}$
  • 17. Association Metrics: Lift < 1 indicates negative correlation, Lift = 1 no association (observed overlap equals the independent case), Lift > 1 positive correlation.
  • 18. Association Metrics: Leverage. Difference between observed support and the support expected if A and C were statistically independent: $\mathrm{Leverage} = \mathrm{Support} - p(A)\,p(C)$
  • 19. Association Metrics: Leverage < 0 indicates negative correlation, Leverage = 0 no association (observed overlap equals the independent case), Leverage > 0 positive correlation. -1…
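The five metrics from slides 12 to 19 can be computed directly on the seven-row table from slide 4. The sketch below evaluates the rule {class = gas} → amount < 100; the function name `rule_metrics` and the tuple layout are illustrative assumptions, not BigML's API, and the code assumes the antecedent and the consequent each match at least one row.

```python
from fractions import Fraction

# Rows from the example table: (customer, account, class, zip, amount).
rows = [
    ("Bob", 3421, "clothes", 46140, 135),
    ("Bob", 3421, "food", 46140, 401),
    ("Alice", 2456, "food", 12222, 234),
    ("Sally", 6788, "gas", 26339, 94),
    ("Bob", 3421, "tech", 21350, 2459),
    ("Bob", 3421, "gas", 46140, 83),
    ("Sally", 6788, "food", 26339, 51),
]

def rule_metrics(rows, antecedent, consequent):
    """Return coverage, support, confidence, lift and leverage for A -> C."""
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent(r))
    n_c = sum(1 for r in rows if consequent(r))
    n_ac = sum(1 for r in rows if antecedent(r) and consequent(r))
    coverage = Fraction(n_a, n)   # p(A)
    support = Fraction(n_ac, n)   # p(A and C)
    p_c = Fraction(n_c, n)        # p(C)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": support / coverage,
        "lift": support / (coverage * p_c),
        "leverage": support - coverage * p_c,
    }

gas_rule = rule_metrics(rows, lambda r: r[2] == "gas", lambda r: r[4] < 100)
for name, value in gas_rule.items():
    print(f"{name:>10}: {value} ({float(value):.3f})")
```

Exact fractions are used so that the independent-case comparison (lift against 1, leverage against 0) is not blurred by floating-point rounding on such a small table.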
  • 20. Use Cases. GOAL: Discover “interesting” rules about what store items are typically purchased together. • Dataset of 9,834 grocery cart transactions • Each row is a list of all items in a cart at checkout
  • 21. Association Discovery: Demo #1
  • 22. Use Cases. GOAL: Find general rules that indicate diabetes. • Dataset of diagnostic measurements of 768 patients. • Each patient labelled True/False for diabetes.
  • 23. Association Discovery: Demo #2
  • 24. Medical Risks • Decision Tree: If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE • Association Rule: If plasma glucose > 146 then diabetes = TRUE
  • 25. Latent Dirichlet Allocation. #VSSML16, September 2016
  • 26. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 27. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 28. Bag of Words Analysis • The easiest way of analyzing a text field is to treat it as a “bag of words” • Each word is a separate feature (usually an occurrence count) • When modeling, the features are treated in isolation from one another, essentially “one at a time”
  • 29. Limitations • Words are sometimes ambiguous • Both because of multiple definitions and differences in tone • How do we usually disambiguate words? By context.
  • 30. An Instructive Example • One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data • Suppose I have a collection of documents, some of which are about one of two different topics (via Ted Underwood’s Blog): Leadership (CEOs, organization, management) and Chemistry (elements, compounds, reactions) • If I do a keyword search for “lead” (or try to classify documents based on that word alone), I’ll get documents from either category and documents that are a mix of both • Can we build a feature that better isolates which set of documents we’re looking for?
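The “lead” ambiguity from slide 30 is easy to reproduce with a bag-of-words count. The two documents below are invented for illustration (they are not from the talk), and `bag_of_words` is a hypothetical helper; the point is only that a single-word feature matches both topics.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lower-case the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical documents, one per topic from the slide.
docs = {
    "leadership": "The CEO will lead the organization and lead the management team",
    "chemistry": "Lead is an element that forms compounds in many reactions",
}

bags = {name: bag_of_words(text) for name, text in docs.items()}

# A keyword search on "lead" alone cannot tell the two topics apart.
hits = [name for name, bag in bags.items() if bag["lead"] > 0]
print(hits)

# The surrounding words (the context) do separate them.
print(bags["leadership"]["ceo"], bags["chemistry"]["ceo"])
```

Each count is a feature considered in isolation, so nothing in the representation itself links “lead” to “ceo” in one document and to “element” in the other.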
  • 31. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 32. Generative Modeling • Posit a parameterized structure that is responsible for generating the data • Use the data to fit the parameters • A notion of causality is important for these models
  • 33. Example of a Generative Model • Consider a patient with some disease • Class: Disease present / absent, Features: Test results • Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease • This generative process implies a structure; in this case the so-called “Naive Bayes” model
  • 34. Generative vs. Discriminative • This is an important distinction in machine learning generally • Generative models try to model / assume a structure for the process generating the data • More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data • Discriminative models don’t care; they “solve the prediction problem directly”, and model only the conditional p(y|x) (Vapnik)
  • 35. Which is Better? • No general answer to this question (not that we haven’t tried): Paper: On Discriminative vs. Generative Classifiers [1] • Discriminative models tend to be faster to fit, quicker to predict, and in the non-parametric case are often guaranteed to converge to the correct answer given enough data • Generative models tend to be more probabilistically sound and able to do more than just classify. [1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
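The disease example from slide 33 can be sketched as a small Naive Bayes model that estimates the joint p(x, y) and only then derives p(y|x) from it. The records, the function names and the use of Laplace smoothing are all assumptions made for illustration; none of them come from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical records: (disease_present, (test_a_positive, test_b_positive)).
records = [
    (True, (1, 1)), (True, (1, 0)), (True, (1, 1)),
    (False, (0, 0)), (False, (0, 1)), (False, (0, 0)), (False, (1, 0)),
]

def fit_naive_bayes(records, smoothing=1.0):
    """Count classes and per-class feature values (the generative parameters)."""
    class_counts = Counter(y for y, _ in records)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for y, x in records:
        for i, value in enumerate(x):
            feature_counts[(y, i)][value] += 1
    return class_counts, feature_counts, smoothing

def joint(model, x, y):
    """p(x, y) = p(y) * product of p(x_i | y): the disease 'causes' each test result."""
    class_counts, feature_counts, s = model
    p = class_counts[y] / sum(class_counts.values())
    for i, value in enumerate(x):
        # Laplace smoothing over the two possible values of a binary test.
        p *= (feature_counts[(y, i)][value] + s) / (class_counts[y] + 2 * s)
    return p

def posterior(model, x):
    """p(y | x), obtained by normalizing the joint over the classes."""
    scores = {y: joint(model, x, y) for y in model[0]}
    total = sum(scores.values())
    return {y: score / total for y, score in scores.items()}

model = fit_naive_bayes(records)
post = posterior(model, (1, 1))
print(post)
```

A discriminative model would fit p(y|x) directly and skip `joint` entirely; the generative version can additionally score how likely a set of test results is under each class.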
  • 36. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 37. A New Way of Thinking About Documents • Three entities: Documents, Terms, and Topics • A term is a single lexical token (usually one or more words, but can be any arbitrary string) • A document has many terms • A topic is a distribution over terms
  • 38. A Generative Model for Documents • A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions • To create a document, repeatedly draw a topic at random from the document's distribution, then draw a term from that topic (which, remember, is a distribution over terms) • The main thing we want to infer is the topic distribution
  • 39. Dirichlet Process Intuition: Rich Get Richer • We use a Dirichlet process to model the relationship between documents, topics, and terms • We’re more likely to think a word came from a topic if we’ve already seen a bunch of words from that topic • We’re more likely to think the topic was responsible for generating the document if we’ve already seen a bunch of words in the document from that topic • Here lies the disambiguation: If a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has • Note that there’s a little bit of self-fulfilling prophecy going on here (by design)
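The generative story on slide 38 can be written out as a short simulation. This is a sketch only: the two topics are fixed by hand (in full LDA the topics are themselves drawn from a Dirichlet prior), the vocabulary and the `alpha` value are made up, and the Dirichlet draw uses the usual trick of normalizing independent Gamma samples.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point from a Dirichlet by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Draw a topic mixture for the document, then draw each term via a topic."""
    theta = sample_dirichlet([alpha] * len(topics), rng)  # document's topic distribution
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        terms = list(topics[k])
        weights = list(topics[k].values())
        words.append(rng.choices(terms, weights=weights)[0])   # pick a term from it
    return theta, words

# Hypothetical topics: each is a distribution over terms, and both contain "lead".
topics = [
    {"lead": 0.3, "ceo": 0.3, "management": 0.4},
    {"lead": 0.3, "element": 0.3, "reaction": 0.4},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=0.5, length=12, rng=rng)
print([round(t, 3) for t in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it infers the per-document `theta` (and the topics), which is why the other words in a document determine which topic an ambiguous term such as “lead” is assigned to.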
  • 40. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 41. Usenet Movie Reviews. Library of over 26,000 movie reviews. Sample: A solid noir melodrama from Vincent Sherman, who takes a standard story and dresses it up with moving characterizations and beautifully expressionistic B&W photography from cinematographer James Wong Howe. The director took a songwriter Paul Webster's short magazine story called "The Man Who Died Twice" and improved the story by rounding out the characters to give them both strong and weak points, so that they would not be one-note characters as was the case in the original story. The film was made by Warner Brothers, who needed a film for their contract star Ann Sheridan and asked Sherman to change the story around so that her part as Nora Prentiss, a nightclub singer, is expanded
  • 42. Supreme Court Cases. Library of about 7500 Supreme Court Cases. Sample: NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F. SUPP. 819, REVERSED. FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA, ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
  • 43. Outline: 1. Understanding the Limits of Simple Text Analysis; 2. Aside: Generative Processes; 3. Latent Dirichlet Allocation; 4. A Couple of Instructive Examples; 5. Applications
  • 44. Visualizing Changes in Topic Over Time • Plot changes in topic distribution over time • Especially nice for dated historical collections (e.g., novels, newspapers)
  • 45. Search Without Keywords • Keyword search is great, if you know the keywords • Good for finding search terms • Great for, e.g., legal discovery • Nice for finding “outliers” • Surprise topics (From the recycle bin)
  • 46. Feature Spaces for Classification • Just classify the documents in “topic space” rather than “bag space” • The topics that come out of LDA have some nice benefits as features: they can reduce a feature space of thousands to a few dozen (faster to fit), are nicely interpretable, and are automatically tailored to the documents you’ve provided • Foreshadowing Alert: When using LDA in this way, we’re doing a form of feature engineering, which we’ll hear more about tomorrow.
  • 47. Some Caveats • You need to choose the number of topics beforehand • Takes forever, both to fit and to do inference • Takes a lot of text to make it meaningful • Tends to focus on “meaningless minutiae” • While it sometimes makes a nice classification space, it rarely provides a dramatic improvement over bag-of-words • I find it nice just for exploration
  • 48. Thus Ends The Lesson. Questions?