Valencian Summer School in Machine Learning
4th edition
September 13-14, 2018
BigML, Inc 2
Clustering and Topic Models
“One of these things is not like the other things . . .”
Charles Parker
VP Algorithms, BigML, Inc
BigML, Inc 3
Clustering
BigML, Inc 4
What is Clustering?
• An unsupervised learning technique
• No labels necessary
• Useful for finding similar instances
• Smart sampling/labelling
• Finds “self-similar” groups of instances
• Customer: groups with similar behavior
• Medical: patients with similar diagnostic measurements
• Defines each group by a “centroid”
• Geometric center of the group
• Represents the “average” member
• Number of centroids (k) can be specified or determined
BigML, Inc 5
Cluster Centroids
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
BigML, Inc 6
Cluster Centroids
[Table: same transactions as on the previous slide, with a group of rows marked as similar]
Same:
auth = pin
amount ~ $100
Different:
date: Mon != Wed
customer: Sally != Bob
account: 6788 != 3421
class: clothes != gas
zip: 26339 != 46140
Centroid:
date = Wed (2 out of 3)
customer = Bob
account = 3421
auth = pin
class = gas
zip = 46140
amount = $104
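The centroid values on this slide can be reproduced with a short sketch: take the mean of each numeric field and the most common value of each categorical field. Which three rows form the cluster is an assumption, inferred from the slide's centroid (amount = $104, date = Wed for 2 out of 3); the function name `centroid` is hypothetical.

```python
from collections import Counter
from statistics import mean

# Assumed cluster members: the three rows consistent with the slide's centroid.
members = [
    {"date": "Mon", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Wed", "customer": "Sally", "account": "6788", "auth": "pin",
     "class": "gas", "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "gas", "zip": "46140", "amount": 83},
]

def centroid(rows, numeric_fields):
    """Mean for numeric fields, most common value for every other field."""
    result = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if field in numeric_fields:
            result[field] = mean(values)
        else:
            result[field] = Counter(values).most_common(1)[0][0]
    return result

center = centroid(members, numeric_fields={"amount"})
print(center)
```

Account and zip are treated as categories here, since averaging identifiers would not be meaningful.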
BigML, Inc 7
Use Cases
• Customer segmentation
• Which customers are similar?
• Active learning
• Labelling unlabelled data efficiently
• Item discovery
• What other items are similar to this one?
BigML, Inc 8
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
• Dataset of mobile game users
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases
• Assumption: usage correlates with LTV
[Figure: clusters annotated with 0%, 3%, and 1%]
BigML, Inc 9
Active Learning
GOAL:
Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each
cluster to label the data.
• Dataset of diagnostic measurements
of 768 patients.
• Want to test each patient for
diabetes and label the dataset to
build a model but the test is
expensive*.
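One way to picture the sampling step is the sketch below: once every patient has a cluster assignment, pick a few patients from each cluster to test. The cluster labels are hypothetical, and `sample_per_cluster` is an illustrative helper, not part of any BigML API.

```python
import random
from collections import defaultdict

def sample_per_cluster(cluster_labels, n_per_cluster, seed=0):
    """Return indices of instances to test: up to n_per_cluster from each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)
    chosen = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        chosen.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return chosen

# Hypothetical cluster assignments for ten patients (three clusters).
labels = [0, 0, 1, 2, 1, 0, 2, 2, 1, 0]
to_test = sample_per_cluster(labels, n_per_cluster=1)
print(to_test)
```

The test results for the sampled patients could then be propagated to the other members of the same cluster, which is an approximation that depends on how homogeneous the clusters are.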
BigML, Inc 10
Active Learning
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labelled as cat/not-cat.
BigML, Inc 11
Item Discovery
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
[Figure labels: Smoky, Fruity]
BigML, Inc 12
Clusters Demo #1
BigML, Inc 13
Human Expert
Cluster into 3 groups…
BigML, Inc 14
Human Expert
BigML, Inc 15
Human Expert
• Jesa used prior knowledge to select possible features that
separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
• Met criteria of K=3
• Groups were sufficiently “distant”
• No crossover
BigML, Inc 16
Human Expert
• Aspect Ratio (Length / Width)
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences
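The aspect-ratio rule above can be written as a small function. This is a sketch: the measurements passed in are hypothetical, and the names `aspect_ratio` and `shape_label` are illustrative.

```python
def aspect_ratio(length, width):
    """Length / width, inverted when below 1 so the result is always >= 1."""
    ratio = length / width
    return 1 / ratio if ratio < 1 else ratio

def shape_label(length, width):
    """Map the aspect ratio onto the two labels used on the slide."""
    return "round" if aspect_ratio(length, width) == 1 else "skinny"

print(aspect_ratio(8, 2), aspect_ratio(2, 8), shape_label(1, 1))
```

Inverting ratios below 1 makes the feature independent of which side was measured as the length.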
BigML, Inc 17
Clustering Features
Object   Length / Width  Num Surfaces
penny    1               3
dime     1               3
knob     1               4
eraser   2.75            6
box      1               6
block    1.6             6
screw    8               3
battery  5               3
key      4.25            3
bead     1               2
BigML, Inc 18
Plot by Features
[Figure: scatter plot of Num Surfaces against Length / Width, with points labeled box, block, eraser, knob, penny, dime, bead, key, battery, screw]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
K=3
BigML, Inc 19
Plot by Features
[Figure: scatter plot of Num Surfaces against Length / Width, with points labeled box, block, eraser, knob, penny, dime, bead, key, battery, screw]
K-Means: Find “best” (minimum distance) circles that include all points
BigML, Inc 20
K-Means Algorithm
K=3
BigML, Inc 21
K-Means Algorithm
K=3
Repeat until centroids stop moving
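A minimal sketch of the loop, using the feature table from the earlier slide. Assumptions: the features are used unscaled, the three starting centroids (penny, box, screw) are picked by hand, and the helper names are illustrative.

```python
import math

# Features from the slide's table: (length / width, number of surfaces).
objects = {
    "penny": (1, 3), "dime": (1, 3), "knob": (1, 4), "eraser": (2.75, 6),
    "box": (1, 6), "block": (1.6, 6), "screw": (8, 3), "battery": (5, 3),
    "key": (4.25, 3), "bead": (1, 2),
}

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids):
    """Alternate assignment and update steps until the centroids stop moving."""
    while True:
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            return centroids
        centroids = updated

start = [objects["penny"], objects["box"], objects["screw"]]
final = kmeans(list(objects.values()), start)

groups = {}
for name, features in objects.items():
    groups.setdefault(nearest(features, final), []).append(name)
print(groups)
```

With these starting points the loop settles on the coin-like objects, the six-surface objects, and the long thin objects as the three groups. Other starting points may converge elsewhere.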
BigML, Inc 22
There is no Right Answer!
[Figure: objects grouped as Metal, Wood, and Other]
BigML, Inc 23
Starting Points
• Random points or instances in n-dimensional space
• Might start "too close"
• Risk of sub-optimal convergence
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• k++ (k-means++)
• the first point is chosen randomly from instances
• each subsequent point is chosen from the remaining
instances with a probability proportional to the squared
distance from the point's closest existing cluster center
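The k++ selection rule described above can be sketched as follows. The function name `kpp_init` and the fixed seed are assumptions for illustration; the points are the (length / width, surfaces) features from the earlier table.

```python
import math
import random

def kpp_init(points, k, seed=0):
    """Pick k initial centers: the first uniformly at random, the rest with
    probability proportional to squared distance from the closest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its closest existing center.
        # Already-chosen points get weight 0, so they cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(1, 3), (1, 3), (1, 4), (2.75, 6), (1, 6),
          (1.6, 6), (8, 3), (5, 3), (4.25, 3), (1, 2)]
centers = kpp_init(points, k=3)
print(centers)
```

Because far-away points are only more likely, not certain, to be picked, a single outlier has less influence than under the "farthest point" rule.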
BigML, Inc 24
K++ Initial Centers
[Figure: candidate points annotated Low Probability, High Probability, and Highest Probability]
K=3
BigML, Inc 25
K++ Initial Centers
[Figure: remaining candidate points annotated Low Probability]
K=3
BigML, Inc 26
K++ Initial Centers
K=3
BigML, Inc 27
Scaling Matters
[Figure: axes price and number of bedrooms, annotated d = 160,000 and d = 1]
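To see why scaling matters, the sketch below compares a raw Euclidean distance with standardized features. The house data is hypothetical, and z-score standardization is only one common choice; the slide does not say which scaling is applied.

```python
import math
from statistics import mean, pstdev

# Hypothetical houses: (price, number of bedrooms).
houses = [(200_000, 3), (360_000, 4), (280_000, 2), (240_000, 3)]

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Unscaled: the price difference (160,000) swamps the bedroom difference (1).
raw = math.dist(houses[0], houses[1])

scaled = standardize(houses)
rescaled = math.dist(scaled[0], scaled[1])
print(raw, rescaled)
```

After standardization both fields contribute on a comparable scale, so the number of bedrooms is no longer ignored by the distance.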
BigML, Inc 28
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• How far is “red” from “green”?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown ideal number of clusters, “K”?
BigML, Inc 29
Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
BigML, Inc 30
Distance to Categorical?
• Define special distance function: For two instances $x$ and $y$ and the categorical field $a$:
• $d_a(x, y) = 0$ if $x_a = y_a$, otherwise $d_a(x, y) = 1$ (or the field scaling value)
Approach: similar to “k-prototypes”
BigML, Inc 31
Distance to Categorical?
animal  favorite toy  toy color
cat     ball          red
cat     ball          green
d=0     d=0           d=1        D = 1
cat     laser         red
dog     squeaky       red
d=1     d=1           d=0        D = √2
Then compute Euclidean distance between vectors
Note: the centroid is assigned the most common category of the member instances
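The two distances on this slide can be checked with a short sketch: compare each categorical field (0 if equal, 1 otherwise), then take the Euclidean norm of the per-field distances. The function name is illustrative and no field scaling is applied.

```python
import math

def categorical_distance(x, y):
    """Euclidean norm of per-field 0/1 mismatch distances."""
    per_field = [0 if a == b else 1 for a, b in zip(x, y)]
    return math.sqrt(sum(d * d for d in per_field))

d_first = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d_second = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
print(d_first, d_second)
```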
BigML, Inc 32
Text Vectors
[Figure: cosine similarity scale running from 1 through 0 to -1]
               "hippo"  "safari"  "zebra"  …
Text Field #1  1        0         1        …
Text Field #2  1        1         0        …
…              0        1         1        …
Features (thousands)
• Cosine Similarity
• cos() between two vectors
• 1 if collinear, 0 if orthogonal
• only positive vectors: 0 ≤ CS ≤ 1
• Cosine Distance = 1 - Cosine Similarity
• CD(TF1, TF2) = 0.5
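The value CD(TF1, TF2) = 0.5 follows from the two term vectors shown for "hippo", "safari", and "zebra". A minimal sketch, with illustrative function names:

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    return 1 - cosine_similarity(u, v)

tf1 = [1, 0, 1]  # Text Field #1: hippo, zebra
tf2 = [1, 1, 0]  # Text Field #2: hippo, safari
cd = cosine_distance(tf1, tf2)
print(cd)
```

The two fields share one of their two terms, so the similarity is 1 / (√2 · √2) = 0.5 and the distance is also 0.5.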
BigML, Inc 33
Finding K: G-Means
BigML, Inc 34
Finding K: G-Means
BigML, Inc 35
Finding K: G-Means
Let K=2
Keep 1, Split 1
New K=3
BigML, Inc 36
Finding K: G-Means
Let K=3
Keep 1, Split 2
New K=5
BigML, Inc 37
Finding K: G-Means
Let K=5
K=5
BigML, Inc 38
Summary
• Cluster Purpose
• Unsupervised technique for finding self-similar groups
of instances
• Number of centroids (k) can be specified or computed
• Outputs list of centroids
• Configuration:
• Algorithm: K-means / G-means
• Cluster Parameter: k or critical value
• Default missing / Summary fields / Scales / Weights
• Model Clusters
• Centroid / Batch centroids
BigML, Inc 39
Topic Modeling
BigML, Inc 40
What is Topic Modeling?
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
BigML, Inc 41
What is Topic Modeling?
• Finds topics in your text fields
• A topic is a distribution over terms
• Terms with high probability in the same topic often
occur together in the same document
• Topics often correspond to real-world things that the
document may be “about” (e.g., sports, cooking,
technology)
• Each document is “about” one or more topics
• Usually each document is only about one or two
topics
• But in practice we assign a probability to every
topic for every document
BigML, Inc 42
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
1. Stem Words -> Tokens
2. Remove tokens that
occur too often
3. Remove tokens that do
not occur often enough
4. Count occurrences of
remaining “interesting”
tokens
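Steps 1 and 4 can be sketched on the quote itself. The stemmer here is a toy stand-in (an assumption for illustration) that only strips a trailing "ness"; steps 2 and 3 need frequencies across a whole corpus and are not shown.

```python
import re
from collections import Counter

def stem(token):
    """Toy stemmer (illustrative only): strip a trailing "ness"."""
    return token[:-4] if token.endswith("ness") and len(token) > 4 else token

def token_counts(text):
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens)

quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = token_counts(quote)
print(counts["great"], counts["afraid"])
```

Stemming is what lets "great" and the three occurrences of "greatness" count as the same token, giving the total of 4 shown on the slide.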
BigML, Inc 43
Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em.
…  great  afraid  born  achieve  …  …
…  4      1       1     1        …  …
…  …      …       …     …        …  …
Model
The token “great” occurs more than 3 times
The token “afraid” occurs no more than once
BigML, Inc 44
Text Analysis
BigML, Inc 45
Hodor!
BigML, Inc 46
Text Analysis vs. Topic Modeling
Text                                          Topic Model
Creates thousands of hidden token counts      Creates tens of topics that model the text
Token counts are independently uninteresting  Topics are independently interesting
No semantic importance                        Semantic meaning extracted
Co-occurrence limited to consecutive n-grams  Topics indicate broader co-occurrences
BigML, Inc 47
Generative Modeling
• Decision trees are discriminative models
• Aggressively model the classification boundary
• Parsimonious: Don’t consider anything you don’t have to
• Topic Models are generative models
• Come up with a theory of how the data is generated
• Tweak the theory to fit your data
Topic Modeling builds a model of how the text is generated
BigML, Inc 48
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull random number of times to generate a document.
• All documents can be generated, but most are nonsense.
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
[Figure: a machine holding words (cat, shoe, zebra, ball, tree, jump, pen, asteroid, cable, box, step, cabinet, yellow, plate, flashlight, …) emits example documents: "shoe asteroid flashlight pizza…", "plate giraffe purple jump…", and "Be not afraid of greatness: some are born great, some achieve greatness…"]
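The random word machine on this slide can be sketched in a few lines. The vocabulary is a small sample of the words shown, and the seed and the `max_words` limit are assumptions made for the example.

```python
import random

def generate_document(vocabulary, rng, max_words=10):
    """Pull the machine a random number of times; every word is equally likely."""
    n_words = rng.randint(1, max_words)
    return [rng.choice(vocabulary) for _ in range(n_words)]

vocabulary = ["cat", "shoe", "zebra", "ball", "tree", "jump",
              "pen", "asteroid", "flashlight", "pizza"]
rng = random.Random(0)
document = generate_document(vocabulary, rng)
print(" ".join(document))
```

Any sequence over the vocabulary can come out of this machine, which is why nearly all of its output is nonsense.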
BigML, Inc 49
Topic Model
Intuition:
• Written documents have meaning - one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
[Figure: one word machine per topic, each holding the same words (cat, shoe, zebra, ball, tree, jump, pen, asteroid, …)]
Topic: travel
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Example output: airplane passport pizza …
Topic: space
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
Example output: mars quasar lightyear soda
BigML, Inc 50
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
Topic: "1"
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: "2"
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
…
Topic: "k"
word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ
[Figure: one word machine per topic; example documents: "airplane passport pizza …" and "plate giraffe purple jump…"]
BigML, Inc 51
Training Topic Models
BigML, Inc 52
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Document: "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?"
Topic: travel (word probabilities as on the Topic Model slide): 11%
Topic: space (word probabilities as on the Topic Model slide): 89%
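The mixture idea can be sketched in the generative direction: for each word, first pick a topic according to the document's topic distribution, then pick a word from that topic. The term probabilities below are hypothetical (the slide's tables are partial), and this shows generation only, not how a topic model is trained.

```python
import random

# Hypothetical, simplified term distributions for two topics.
topics = {
    "travel": {"travel": 0.6, "airplane": 0.3, "mars": 0.1},
    "space": {"space": 0.6, "mars": 0.3, "airplane": 0.1},
}

def generate(topic_mix, topics, n_words, rng):
    """Draw each word by sampling a topic, then a term from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()), k=1)[0]
        terms = topics[topic]
        words.append(rng.choices(list(terms), weights=list(terms.values()), k=1)[0])
    return words

# Topic distribution from the slide: 11% travel, 89% space.
doc = generate({"travel": 0.11, "space": 0.89}, topics, 20, random.Random(0))
print(" ".join(doc))
```

Fitting a topic model runs this story in reverse: given the documents, it estimates the per-topic term probabilities and the per-document topic distribution.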
BigML, Inc 53
Topic Distributions
BigML, Inc 54
Prediction?
Clustering: Unlabelled Data → Batch Centroid → Centroid Label
Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
BigML, Inc 55
Topic Model Use Cases
• As a preprocessor for other techniques
• Building better models
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets
BigML, Inc 56
Topic Model Tips
• Setting k
• Much like k-means, the best value is data specific
• Too few will agglomerate unrelated topics, too many will
partition highly related topics
• I tend to find the latter more annoying than the former
• Tuning the Model
• Remove common, useless terms (use term filters)
• Set term limit higher, use n-grams
• Mess with stop word removal, turn off stemming
VSSML18. Clustering and Latent Dirichlet Allocation

Mais conteúdo relacionado

Semelhante a VSSML18. Clustering and Latent Dirichlet Allocation

Semelhante a VSSML18. Clustering and Latent Dirichlet Allocation (20)

VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly Detection
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
 
BigML Education - Clusters
BigML Education - ClustersBigML Education - Clusters
BigML Education - Clusters
 
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformDutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
 
DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision Making
 
DutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesDutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time Series
 
L13. Cluster Analysis
L13. Cluster AnalysisL13. Cluster Analysis
L13. Cluster Analysis
 
DutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical PerspectiveDutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical Perspective
 
DutchMLSchool. Supervised vs Unsupervised Learning
DutchMLSchool. Supervised vs Unsupervised LearningDutchMLSchool. Supervised vs Unsupervised Learning
DutchMLSchool. Supervised vs Unsupervised Learning
 
BSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, EvaluationsBSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, Evaluations
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
 
07 learning
07 learning07 learning
07 learning
 
DutchMLSchool. Models, Evaluations, and Ensembles
DutchMLSchool. Models, Evaluations, and EnsemblesDutchMLSchool. Models, Evaluations, and Ensembles
DutchMLSchool. Models, Evaluations, and Ensembles
 
Future of AI-powered automation in business
Future of AI-powered automation in businessFuture of AI-powered automation in business
Future of AI-powered automation in business
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End ML
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object Detection
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
 

Mais de BigML, Inc

Mais de BigML, Inc (20)

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in Manufacturing
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML Compliance
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective Anomalies
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly Detection
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven Company
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal Sector
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe Stadiums
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at Scale
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AI
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image Processing
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail Sector
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
 
ML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance
ML in GRC: Cybersecurity versus Governance, Risk Management, and ComplianceML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance
ML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance
 
Intelligent Mobility: Machine Learning in the Mobility Industry
Intelligent Mobility: Machine Learning in the Mobility IndustryIntelligent Mobility: Machine Learning in the Mobility Industry
Intelligent Mobility: Machine Learning in the Mobility Industry
 

Último

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 

Último (20)

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE

VSSML18. Clustering and Latent Dirichlet Allocation

  • 1. Valencian Summer School in Machine Learning 4th edition September 13-14, 2018
  • 2. BigML, Inc 2 Clustering and Topic Models “One of these things is not like the other things...” Charles Parker VP Algorithms, BigML, Inc
  • 4. BigML, Inc 4 What is Clustering? • An unsupervised learning technique • No labels necessary • Useful for finding similar instances • Smart sampling/labelling • Finds “self-similar" groups of instances • Customer: groups with similar behavior • Medical: patients with similar diagnostic measurements • Defines each group by a “centroid” • Geometric center of the group • Represents the “average” member • Number of centroids (k) can be specified or determined
  • 5. BigML, Inc 5 Cluster Centroids
    date  customer  account  auth  class    zip    amount
    Mon   Bob       3421     pin   clothes  46140  135
    Tue   Bob       3421     sign  food     46140  401
    Tue   Alice     2456     pin   food     12222  234
    Wed   Sally     6788     pin   gas      26339  94
    Wed   Bob       3421     pin   tech     21350  2459
    Wed   Bob       3421     pin   gas      46140  83
    Thr   Sally     6788     sign  food     26339  51
  • 6. BigML, Inc 6 Cluster Centroids
    date  customer  account  auth  class    zip    amount
    Mon   Bob       3421     pin   clothes  46140  135
    Tue   Bob       3421     sign  food     46140  401
    Tue   Alice     2456     pin   food     12222  234
    Wed   Sally     6788     pin   gas      26339  94
    Wed   Bob       3421     pin   tech     21350  2459
    Wed   Bob       3421     pin   gas      46140  83
    Thr   Sally     6788     sign  food     26339  51
    Same: auth = pin; amount ~ $100
    Different: date: Mon != Wed; customer: Sally != Bob; account: 6788 != 3421; class: clothes != gas; zip: 26339 != 46140
    Centroid: date = Wed (2 out of 3); customer = Bob; account = 3421; auth = pin; class = gas; zip = 46140; amount = $104
  • 7. BigML, Inc 7 Use Cases • Customer segmentation • Which customers are similar? • Active learning • Labelling unlabelled data efficiently • Item discovery • What other items are similar to this one?
  • 8. BigML, Inc 8 Customer Segmentation GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and an LTV based on in-game purchases • Assumption: Usage correlates to LTV (Figure labels: 0%, 3%, 1%)
  • 9. BigML, Inc 9 Active Learning GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
  • 10. BigML, Inc 10 Active Learning *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labelled as cat/not-cat.
  • 11. BigML, Inc 11 Item Discovery GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste. • Dataset of 86 whiskies • Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics. (Figure labels: Smoky, Fruity)
  • 13. BigML, Inc 13 Human Expert Cluster into 3 groups…
  • 15. BigML, Inc 15 Human Expert • Jesa used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “edges”, “hard”, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • Met criteria of K=3 • Groups were sufficiently “distant” • No crossover
  • 16. BigML, Inc 16 Human Expert • Aspect Ratio (Length / Width) • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count Create features that capture these object differences
  • 17. BigML, Inc 17 Clustering Features
    Object   Length / Width  Num Surfaces
    penny    1               3
    dime     1               3
    knob     1               4
    eraser   2.75            6
    box      1               6
    block    1.6             6
    screw    8               3
    battery  5               3
    key      4.25            3
    bead     1               2
  • 18. BigML, Inc 18 Plot by Features (scatter plot of Num Surfaces vs. Length / Width; points: box, block, eraser, knob, penny, dime, bead, key, battery, screw; K=3) K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
  • 19. BigML, Inc 19 Plot by Features (scatter plot of Num Surfaces vs. Length / Width; points: box, block, eraser, knob, penny, dime, bead, key, battery, screw) K-Means: Find “best” (minimum distance) circles that include all points
  • 20. BigML, Inc 20 K-Means Algorithm K=3
  • 21. BigML, Inc 21 K-Means Algorithm K=3 Repeat until centroids stop moving
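The loop on the K-Means Algorithm slides (assign each point to its nearest centroid, move each centroid to the mean of its members, repeat until the centroids stop moving) can be sketched in a few lines. This is a minimal illustration rather than BigML's implementation: the function name `kmeans`, the choice of starting points (penny, box, screw) and the lack of feature scaling are all assumptions made for the example. The data is the feature table from the Clustering Features slide.

```python
import math

# Features from the "Clustering Features" slide:
# (length / width, number of surfaces)
objects = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign, update, repeat until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels


points = list(objects.values())
# Assumption: start from three of the instances (hypothetical choice).
start = [objects["penny"], objects["box"], objects["screw"]]
centroids, labels = kmeans(points, start)

for name, label in zip(objects, labels):
    print(label, name)
```

With these starting points the run settles after a few passes. A different start can settle on a different grouping, which is the point of the "There is no Right Answer!" slide.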
  • 22. BigML, Inc 22 There is no Right Answer! Metal Other Wood
  • 23. BigML, Inc 23 Starting Points • Random points or instances in n-dimensional space • Might start "too close" • Risk of sub-optimal convergence • Choose points “farthest” away from each other • but this is sensitive to outliers • k++ (k-means++) • the first point is chosen randomly from instances • each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center
  • 24. BigML, Inc 24 K++ Initial Centers (figure labels: Low Probability, High Probability, Highest Probability) K=3
  • 25. BigML, Inc 25 K++ Initial Centers (figure labels: Low Probability, Low Probability) K=3
  • 26. BigML, Inc 26 K++ Initial Centers K=3
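The k++ seeding rule from the Starting Points slide (first center chosen at random, each later center drawn with probability proportional to the squared distance to its closest existing center) might look like the sketch below. The function name `kmeanspp_init`, the sample points and the seed are illustrative assumptions. The sketch also assumes there are at least k distinct points.

```python
import math
import random


def kmeanspp_init(points, k, rng):
    """Pick k starting centers using the k++ weighting described above."""
    centers = [rng.choice(points)]  # first center: uniform random instance
    while len(centers) < k:
        # Squared distance from each point to its closest existing center.
        # Points already chosen get weight 0, so they cannot be drawn again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical 2-D instances: three loose groups.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 9), (2, 9)]
rng = random.Random(0)  # fixed seed only to make the run repeatable
initial_centers = kmeanspp_init(sample, 3, rng)
print(initial_centers)
```

Far-away points are likely to be picked but not guaranteed to be. That makes the rule less sensitive to a single outlier than always taking the farthest point.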
  • 27. BigML, Inc 27 Scaling Matters (plot of price vs. number of bedrooms; figure labels: d = 160,000 and d = 1)
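To see why scaling matters, compare the distance between two houses before and after putting both fields on a comparable scale. The house data below is hypothetical, chosen so that two houses differ by 160,000 in price and 1 in bedrooms as on the slide. Standardizing to zero mean and unit variance is one common option, used here as an assumption; the slide does not say which scaling is applied.

```python
import math
import statistics

# Hypothetical houses: (price, number of bedrooms)
houses = [(240_000, 2), (400_000, 3), (310_000, 4), (275_000, 1)]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Unscaled: the price gap swamps the bedroom gap completely.
raw_distance = math.dist(houses[0], houses[1])

# Scaled: both fields now contribute to the distance.
scaled = standardize(houses)
price_gap = abs(scaled[0][0] - scaled[1][0])
bedroom_gap = abs(scaled[0][1] - scaled[1][1])
scaled_distance = math.dist(scaled[0], scaled[1])

print(raw_distance, price_gap, bedroom_gap, scaled_distance)
```

Without scaling, a clustering on these fields is effectively a clustering on price alone.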
  • 28. BigML, Inc 28 Other Tricks • What is the distance to a “missing value”? • What is the distance between categorical values? • How far is “red” from “green”? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown ideal number of clusters, “K”?
  • 29. BigML, Inc 29 Distance to Missing? • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
  • 30. BigML, Inc 30 Distance to Categorical? • Define special distance function: For two instances x and y and the categorical field a: • if x_a = y_a then distance(x, y) = 0, else distance(x, y) = 1 (or field scaling value) Approach: similar to “k-prototypes”
  • 31. BigML, Inc 31 Distance to Categorical? Fields: animal, favorite toy, toy color. Compute a per-field distance d, then compute Euclidean distance D between vectors.
    (cat, ball, red) vs. (cat, ball, green): d=0, d=0, d=1, so D = 1
    (cat, laser, red) vs. (dog, squeaky, red): d=1, d=1, d=0, so D = √2
    Note: the centroid is assigned the most common category of the member instances
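The two worked examples on this slide can be reproduced with a small sketch: a 0/1 mismatch per categorical field, combined with a Euclidean norm, plus a centroid that takes the most common category per field. The function names and the optional `scales` argument are assumptions for illustration. The third member used for the centroid is made up from values that appear on the slide.

```python
import math
from collections import Counter


def categorical_distance(x, y, scales=None):
    """Euclidean norm of per-field mismatches (0 if equal, else the scale)."""
    if scales is None:
        scales = [1.0] * len(x)  # assumption: every field has scale 1
    per_field = [0.0 if a == b else s for a, b, s in zip(x, y, scales)]
    return math.sqrt(sum(d * d for d in per_field))


def categorical_centroid(members):
    """Most common category in each field across the member instances."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


# Fields: (animal, favorite toy, toy color)
d_one = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d_two = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))

# Hypothetical cluster members, built from values on the slide.
members = [("cat", "ball", "red"), ("cat", "ball", "green"), ("cat", "laser", "red")]
centroid = categorical_centroid(members)

print(d_one, d_two, centroid)
```

One field differs in the first pair and two differ in the second, which gives the slide's D = 1 and D = √2.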
  • 32. BigML, Inc 32 Text Vectors
    Features (thousands):  "hippo"  "safari"  "zebra"  …
    Text Field #1:          1        0         1        …
    Text Field #2:          1        1         0        …
    (unlabelled row):       0        1         1        …
    • Cosine Similarity (figure scale: 1, 0, -1) • cos() between two vectors • 1 if collinear, 0 if orthogonal • only positive vectors: 0 ≤ CS ≤ 1 • Cosine Distance = 1 - Cosine Similarity • CD(TF1, TF2) = 0.5
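The CD(TF1, TF2) = 0.5 figure can be checked by hand or with a short sketch. The vectors below keep only the three terms visible on the slide ("hippo", "safari", "zebra") and ignore the elided columns, which is an assumption. The function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """cos() of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)


# Term presence for ("hippo", "safari", "zebra"), elided columns ignored.
tf1 = [1, 0, 1]
tf2 = [1, 1, 0]

cd = cosine_distance(tf1, tf2)
print(cd)  # approximately 0.5, up to floating-point rounding
```

The two fields share one term out of two each, so the dot product is 1 and both norms are √2. That gives a similarity of 1/2 and a distance of 1/2.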
  • 33. BigML, Inc 33 Finding K: G-Means
  • 34. BigML, Inc 34 Finding K: G-Means
  • 35. BigML, Inc 35 Finding K: G-Means Let K=2 Keep 1, Split 1 New K=3
  • 36. BigML, Inc 36 Finding K: G-Means Let K=3 Keep 1, Split 2 New K=5
  • 37. BigML, Inc 37 Finding K: G-Means Let K=5 K=5
  • 38. BigML, Inc 38 Summary • Cluster Purpose • Unsupervised technique for finding self-similar groups of instances • Number of centroids (k) can be input or computed • Outputs list of centroids • Configuration: • Algorithm: K-means / G-means • Cluster Parameter: k or critical value • Default missing / Summary fields / Scales / Weights • Model Clusters • Centroid / Batch centroids
  • 40. BigML, Inc 40 What is Topic Modeling? • Unsupervised algorithm • Learns only from text fields • Finds hidden topics that model the text (Figure label: Text Fields) Questions: • How is this different from the Text Analysis that BigML already offers? • What does it output and how do we use it?
  • 41. BigML, Inc 41 What is Topic Modeling? • Finds topics in your text fields • A topic is a distribution over terms • Terms with high probability in the same topic often occur together in the same document • Topics often correspond to real-world things that the document may be “about” (e.g., sports, cooking, technology) • Each document is “about” one or more topics • Usually each document is only about one or two topics • But in practice we assign a probability to every topic for every document
  • 42. BigML, Inc 42 Text Analysis Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. great: appears 4 times 1. Stem Words -> Tokens 2. Remove tokens that occur too often 3. Remove tokens that do not occur often enough 4. Count occurrences of remaining “interesting” tokens
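The counting steps on this slide can be sketched as below. Two stand-ins are assumptions rather than how BigML does it: `crude_stem` only strips a trailing "ness" (a real stemmer handles far more), and a small hand-written stop list plays the role of "tokens that occur too often". The step that removes rare tokens needs a whole corpus, so it is left out.

```python
import re
from collections import Counter

TEXT = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")

# Hypothetical stop list standing in for "tokens that occur too often".
STOP_WORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "'em"}


def crude_stem(word):
    """Toy stemmer: strip a trailing 'ness' (illustration only)."""
    if word.endswith("ness") and len(word) > 6:
        return word[:-4]
    return word


def token_counts(text):
    """Lowercase, split into words, stem, drop stop words, count."""
    words = re.findall(r"[a-z']+", text.lower())
    tokens = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(tokens)


counts = token_counts(TEXT)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])
```

Stemming folds the three occurrences of "greatness" into "great", which is how the slide arrives at four occurrences.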
  • 43. BigML, Inc 43 Text Analysis Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
    …  great  afraid  born  achieve  …
    …  4      1       1     1        …
    …  …      …       …     …        …
    Model: The token “great” occurs more than 3 times. The token “afraid” occurs no more than once.
  • 44. BigML, Inc 44 Text Analysis
  • 46. BigML, Inc 46 Text Analysis vs. Topic Modeling
    Text: • Creates thousands of hidden token counts • Token counts are independently uninteresting • No semantic importance • Co-occurrence limited to consecutive n-grams
    Topic Model: • Creates tens of topics that model the text • Topics are independently interesting • Semantic meaning extracted • Topics indicate broader co-occurrences
  • 47. BigML, Inc 47 Generative Modeling • Decision trees are discriminative models • Aggressively model the classification boundary • Parsimonious: Don’t consider anything you don’t have to • Topic Models are generative models • Come up with a theory of how the data is generated • Tweak the theory to fit your data Topic Modeling builds a model of how the text is generated
  • 48. BigML, Inc 48 Generating Documents • "Machine" that generates a random word with equal probability with each pull. • Pull random number of times to generate a document. • All documents can be generated, but most are nonsense.
    Words in the machine: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
    Word probabilities: shoe ϵ, asteroid ϵ, flashlight ϵ, pizza ϵ, … ϵ
    Example documents: “shoe asteroid flashlight pizza…”; “plate giraffe purple jump…”; “Be not afraid of greatness: some are born great, some achieve greatness…”
  • 49. BigML, Inc 49 Topic Model Intuition: • Written documents have meaning - one way to describe meaning is to assign a topic. • For our random machine, the topic can be thought of as increasing the probability of certain words.
    Words in each machine: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
    Topic: travel. Word probabilities: travel 23.55%, airplane 2.33%, mars 0.003%, mantle ϵ, … ϵ. Example output: airplane passport pizza …
    Topic: space. Word probabilities: space 38.94%, airplane ϵ, mars 13.43%, mantle 0.05%, … ϵ. Example output: mars quasar lightyear soda
  • 50. BigML, Inc 50 Topic Model • Each text field in a row is concatenated into a document • The documents are analyzed to generate "k" related topics • Each topic is represented by a distribution of term probabilities
    Words in each machine: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
    Topic "1". Word probabilities: travel 23.55%, airplane 2.33%, mars 0.003%, mantle ϵ, … ϵ
    Topic "2". Word probabilities: space 38.94%, airplane ϵ, mars 13.43%, mantle 0.05%, … ϵ
    …
    Topic "k". Word probabilities: shoe 12.12%, coffee 3.39%, telephone 13.43%, paper 4.11%, … ϵ
    Example documents: “airplane passport pizza …”; “plate giraffe purple jump…”
  • 51. BigML, Inc 51 Training Topic Models
  • 52. BigML, Inc 52 Topic Distribution Intuition: • Any given document is likely a mixture of the modeled topics… • This can be represented as a distribution of topic probabilities
    Document: “Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?”
    Topic: travel (11%). Word probabilities: travel 23.55%, airplane 2.33%, mars 0.003%, mantle ϵ, … ϵ
    Topic: space (89%). Word probabilities: space 38.94%, airplane ϵ, mars 13.43%, mantle 0.05%, … ϵ
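One way to read the 11% / 89% split is as mixture weights. Under the generative story, the probability of seeing a word in this document is the topic-weighted sum of that word's probability in each topic. The sketch below uses only the numbers shown on the slide and treats every ϵ as zero, which is a simplifying assumption. The function name is illustrative.

```python
# Word probabilities per topic, taken from the slide (ϵ treated as 0).
topics = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

# Topic distribution for the example document about travelling to Mars.
doc_topics = {"travel": 0.11, "space": 0.89}


def word_probability(word, doc_topics, topics, epsilon=0.0):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topics[name].get(word, epsilon)
        for name, weight in doc_topics.items()
    )


p_mars = word_probability("mars", doc_topics, topics)
p_airplane = word_probability("airplane", doc_topics, topics)
print(p_mars, p_airplane)
```

Almost all of the probability of "mars" comes from the space topic, while "airplane" is carried only by the small travel share.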
  • 53. BigML, Inc 53 Topic Distributions
  • 54. BigML, Inc 54 Prediction?
    Clustering: Unlabelled Data → Batch Centroid → Centroid Label
    Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
  • 55. BigML, Inc 55 Topic Model Use Cases • As a preprocessor for other techniques • Building better models • Bootstrapping categories for classification • Recommendation • Discovery in large, heterogeneous text datasets
  • 56. BigML, Inc 56 Topic Model Tips • Setting k • Much like k-means, the best value is data specific • Too few will agglomerate unrelated topics, too many will partition highly related topics • I tend to find the latter more annoying than the former • Tuning the Model • Remove common, useless terms (use term filters) • Set term limit higher, use n-grams • Mess with stop word removal, turn off stemming