Latent Dirichlet Allocation and Topic Modeling
Topic Modeling
• Every document we read can be thought of as consisting of many
topics all stacked upon one another.
• Today, we’re going to focus on how we can unpack these topics
using NLP techniques.
Amazon Buys Whole Foods
[Example article annotated with topics such as food types and price changes]
Topic Modeling
● Goal: break text documents down into “topics” by word.
Topics: Business, Prices, Food
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
In the first document, we see that all of our “words-of-interest” show up. So
maybe they’re all just one big topic? We need to investigate more.
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
If we look at Doc 2, we see that ‘takeover’ and ‘close’ appear together again,
but ‘prices’ and ‘discounts’ do not appear. Maybe we have two topics in Doc 1,
and only one of those topics also shows up in Doc 2.
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
The inverse happens in Doc 3. So it seems likely that we have 2 topics in
document 1, since we see that these don’t always have to appear together.
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
Doc 4 adds a bit of confusion, but we see that ‘takeover’ and ‘close’ appear
together again, even if there’s an extra appearance by ‘prices’ that doesn’t
quite fit our original plan. So maybe ‘prices’ is a cross-topic word.
        Takeover   Close   Prices   Discounts
Doc 1       1        1        1         1
Doc 2       1        1        0         0
Doc 3       0        0        1         1
Doc 4       1        1        1         0
Doc 5       1        0        0         0
Topic Modeling
Mathematically, we want to find “topics”: collections of words that tend to
appear in the same documents. More generally, a hidden feature that drives a
collection of correlated features in a dataset is called a “latent feature”. Let’s look at an example.
Latent Features: Example
If we had data on people’s heights and weights, we would likely observe a high
correlation. That is because there is a “latent feature” of “Size” affecting both.
[Diagram: the latent feature “Size” drives both Height and Weight]
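As an illustration of this idea (a sketch that is not part of the original slides; the numbers and variable names are invented), we can simulate height and weight driven by a hidden “size” factor and let PCA recover that single latent axis:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
size = rng.normal(0.0, 1.0, 500)                       # hidden "size" factor
height = 170 + 10 * size + rng.normal(0.0, 2.0, 500)   # cm, driven by size plus noise
weight = 70 + 12 * size + rng.normal(0.0, 3.0, 500)    # kg, driven by size plus noise

people = np.column_stack([height, weight])
pca = PCA(n_components=1).fit(people)
print(pca.explained_variance_ratio_)   # nearly all variance lies along one latent axis
print(pca.components_)                 # that axis mixes height and weight together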
Latent Spaces for NLP
● In NLP, we use topic modeling techniques to find latent features that are
collections of individual words.
● Under the hood, what we’re doing when creating topics is shifting vectorizer
axes in a clever way.
● This creates a new space, built from combinations of our previous columns,
that is “hidden” or latent to us.
● We can use this new space to look at each document and understand more
about it.
● Let’s look at a simple example.
Latent Spaces
“I love my pet rabbit.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I gave leftovers of that dish to my pet, Mr. Rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“I had this weird dish with fried rabbit.”
“That’s my pet rabbit’s favorite dish.”
…
“I love my pet rabbit.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I gave leftovers of that dish to my pet, Mr. Rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“I had this weird dish with fried rabbit.”
“That’s my pet rabbit’s favorite dish.”
…
Remove stop words, only keep nouns, end up with 3 features:
“rabbit”, “pet”, “dish”
“I love my pet rabbit.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I gave leftovers of that dish to my pet, Mr. Rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“I had this weird dish with fried rabbit.”
“That’s my pet rabbit’s favorite dish.”
…
Remove stop words, only keep nouns, end up with 3 features:
“rabbit”, “pet”, “dish”
[Figures: the documents plotted in a 3-D space whose axes are “Rabbit”, “Pet”, and “Dish”]
Clustering is easier in this latent space
What are our clusters?
“I love my pet rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I had this weird dish with fried rabbit.”
“I gave leftovers of that dish to my pet, Mr. Rabbit”
“That’s my pet rabbit’s favorite dish.”
“I love my pet rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I had this weird dish with fried rabbit.”
“I gave leftovers of that dish to my pet, Mr. Rabbit”
“That’s my pet rabbit’s favorite dish.”
Axis 1: 1.5(Rabbit) + 1.1(Pet) + 0.1(Dish)
Axis 2: 0.9(Rabbit) + 0.02(Pet) + 1.6(Dish)
[Figure: the three clusters score Axis 1 high / Axis 2 low, Axis 1 low / Axis 2 high, and Axis 1 high / Axis 2 high]
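A minimal sketch of how axes like these can be found in practice (not from the original slides; the counts below are invented, and NMF is just one of several decompositions that could play this role):

import numpy as np
from sklearn.decomposition import NMF

# Toy document-term counts; columns are ["rabbit", "pet", "dish"].
counts = np.array([
    [1, 1, 0],   # "I love my pet rabbit."
    [0, 0, 1],   # "That dish yesterday was amazing."
    [1, 0, 1],   # "She cooked the best rabbit dish ever."
    [1, 1, 1],   # "I gave leftovers of that dish to my pet, Mr. Rabbit."
    [1, 1, 0],   # "Rabbits make messy pets."
])

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
doc_axes = nmf.fit_transform(counts)  # how strongly each document scores on each axis
print(nmf.components_.round(2))       # each row: an axis as weights on rabbit/pet/dish
print(doc_axes.round(2))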
“I love my pet rabbit.”
“Rabbits make messy pets.”
“My rabbit growls when I pet her.”
“He has five rabbits.”
“That dish yesterday was amazing.”
“She cooked the best rabbit dish ever.”
“I had this weird dish with fried rabbit.”
“I gave leftovers of that dish to my pet, Mr. Rabbit”
“That’s my pet rabbit’s favorite dish.”
TOPIC 1: 1.5(Rabbit) + 1.1(Pet) + 0.1(Dish) → “Pet Rabbits”
TOPIC 2: 0.9(Rabbit) + 0.02(Pet) + 1.6(Dish) → “Eating Rabbit”
[Figure: the three clusters score Topic 1 high / Topic 2 low, Topic 1 low / Topic 2 high, and Topic 1 high / Topic 2 high]
                                       Pet Rabbits   Eating Rabbit
“I love my pet rabbit.”                    87%            13%
“That’s my rabbit’s favorite dish.”        42%            58%
“She cooks the best rabbit dish.”          14%            84%
Topic Modeling
● Documents are not required to be one topic or another. They will be an overlap
of many topics.
● For each document, we will have a ‘percentage breakdown’ of how much of
each topic is involved.
● The topics will not be named by the machine; we must interpret the word
groupings ourselves.
What is a topic?
● When writing about a topic, certain words are more likely to come up.
For example, if I used the words - hoop, backboard, ball, sneakers,
and coach – you’d be able to assume I’m talking about basketball.
● A topic is just a conglomeration of words/ideas that tend to appear
together.
● Mathematically, a topic is a probability distribution over all possible
words.
What is a topic?
Word      Probability in “Pet Rabbit”   Probability in “Eating Rabbit”
pet                2.3E-7                        1.2E-10
rabbit             7.9E-7                        3.4E-8
dish               6.8E-11                       4.5E-7
car                3.1E-12                       1.8E-12
plate              8.3E-14                       1.4E-9
affair             3.0E-13                       3.1E-13
love               5.4E-8                        3.9E-10
the                3.1E-5                        3.2E-5
What is a topic?
Word      Probability in “Pet Rabbit”   Probability in “Eating Rabbit”
pet                2.3E-7                        1.2E-10
rabbit             7.9E-7                        3.4E-8
dish               6.8E-11                       4.5E-7
car                3.1E-12                       1.8E-12
plate              8.3E-14                       1.4E-8
affair             3.0E-13                       3.1E-13
love               5.4E-8                        3.9E-10
the                3.1E-5                        3.2E-5
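To make “a topic is a probability distribution over all possible words” concrete, here is a tiny sketch (not from the original slides; the vocabulary and probabilities are invented, and real topics assign some small probability to every word in the corpus):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["pet", "rabbit", "dish", "car", "plate", "affair", "love", "the"]

# An invented "Pet Rabbit" topic: one probability per vocabulary word.
pet_rabbit = np.array([0.25, 0.35, 0.01, 0.01, 0.01, 0.01, 0.16, 0.20])
print(pet_rabbit.sum())                        # a topic's probabilities sum to 1

# "Writing about" the topic amounts to sampling words from this distribution.
print(rng.choice(vocab, size=8, p=pet_rabbit))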
Topic Modeling Summary
● Choose some algorithm to figure out how words are related
● Create a latent space that makes use of those in-document correlations to
transition from a “word space” to a latent “topic space”
● Each axis in the latent space represents a topic
● Construct a probabilistic understanding of how much each topic contributes to
each document.
● We, as humans, must interpret the meaning of the topics. The machine
doesn’t understand the meanings, just the correlations.
Let’s look at some example topics from the 20 Newsgroups (fetch_20newsgroups) dataset
Topic 0:
jesus matthew said people den col prophecy int away war men messiah den
den radius prophet row isaiah psalm row col sea
Topic 1:
year game good team think don just games like players better runs hit
won league time baseball season win pitching
Topic 2:
graphics image edu data mail software ftp pub available send images
package computer information use files thanks program processing code
It’s still on us to interpret these topics.
Let’s look at some example topics from the 20 Newsgroups (fetch_20newsgroups) dataset
Atheism
jesus matthew said people den col prophecy int away war men messiah den
den radius prophet row isaiah psalm row col sea
Baseball
year game good team think don just games like players better runs hit
won league time baseball season win pitching
Graphics
graphics image edu data mail software ftp pub available send images
package computer information use files thanks program processing code
LDA is one of the algorithms we can choose to convert
between word and topic spaces. It works by simulating the
process of writing.
• Choose a topic to write about
• Choose words that are about that topic
• Add those words to the document
Latent Dirichlet Allocation (LDA)
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Technology
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Technology
Now we roll a new die within the technology topic to see which word we get: Computer
Now we repeat the process for the next word.
Computer
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Technology
Now we roll a new die within the technology topic to see which word we get: screen
Computer screen
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Technology
Now we roll a new die within the technology topic to see which word we get: the
Computer screen the
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Music
Now we roll a new die within the music topic to see which word we get: guitar
Computer screen the guitar
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Now we roll a die to decide which topic we start with: Movies
Now we roll a new die within the movies topic to see which word we get: theatre
Computer screen the guitar theatre
Start by deciding which topics will make up the document and in what percentage:
Technology: 50% Music: 25% Movies: 25%
Computer screen the guitar theatre headphones. Actress sings best mouse performance.
We can continue this process over and over until we’ve generated essentially an entire document. This document will be made up of the topics we selected, and the words associated with those topics, all based on the probabilities.
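The dice-rolling story above is easy to simulate directly. The sketch below is illustrative only: the topic proportions and per-topic word probabilities are invented, not learned from data.

import numpy as np

rng = np.random.default_rng(42)

doc_topics = {"technology": 0.50, "music": 0.25, "movies": 0.25}
topic_words = {
    "technology": {"computer": 0.4, "screen": 0.3, "mouse": 0.2, "the": 0.1},
    "music":      {"guitar": 0.5, "sings": 0.3, "headphones": 0.2},
    "movies":     {"theatre": 0.4, "actress": 0.4, "performance": 0.2},
}

words = []
for _ in range(10):
    # Roll one die over topics, then another die over that topic's words.
    topic = rng.choice(list(doc_topics), p=list(doc_topics.values()))
    word = rng.choice(list(topic_words[topic]), p=list(topic_words[topic].values()))
    words.append(word)

print(" ".join(words))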
Latent Dirichlet Allocation in Practice
We typically do this in reverse on the modeling side, relying on Bayesian
methodology.
● Assume some topic distribution in the corpus
● Assume some word distribution for each topic
● Look at the corpus and try to find what topic and word distributions
would be most likely to generate that data.
● For this to converge, we need to tell it how many topics to look for.
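As a rough aid for that last point, scikit-learn’s LatentDirichletAllocation exposes a perplexity() method, so one common (if imperfect) approach is to compare a few candidate topic counts. A sketch, assuming X is the document-term count matrix built in the code slides later in this deck:

from sklearn.decomposition import LatentDirichletAllocation

# Assumes X is a document-term count matrix (e.g. from CountVectorizer).
for k in (2, 4, 8, 16):
    lda = LatentDirichletAllocation(n_components=k, learning_method="online",
                                    random_state=42)
    lda.fit(X)
    print(k, lda.perplexity(X))  # lower is roughly better; ideally score held-out documents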
Why Dirichlet?
● Any probability distribution could be used as the prior for assigning words to
topics; this is just a Bayesian process where we must select a prior. So
why Dirichlet?
● The sparse Dirichlet distribution assumes that each topic will only be
made from a small subset of the total available space.
● In our case, the vast majority of words are not related to technology (or
any other topic), so assuming that only a small subset of the total word
probabilities are large makes sense.
● In practice, the sparse Dirichlet prior tends to produce sparser, more interpretable topics than other common choices of Bayesian prior (see the sketch below).
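To see what “sparse” means here, compare draws from a Dirichlet with a small concentration parameter against one with a large concentration parameter (a minimal sketch; the dimension and alpha values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Each draw is itself a probability distribution over a 10-word vocabulary.
sparse = rng.dirichlet(alpha=[0.1] * 10)    # small alpha: a few words get most of the mass
flat = rng.dirichlet(alpha=[10.0] * 10)     # large alpha: mass spread fairly evenly

print(sparse.round(3))
print(flat.round(3))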
LDA in Python
● There are several libraries for LDA: two popular ones are Scikit-learn and
Gensim
● Scikit-learn is nice because it quickly integrates with the other, more familiar
Scikit-learn/NLTK tools and has a consistent API.
● Gensim is another Python text-processing package we’ll use for Word2Vec
and a few other tools. It also has LDA built in.
● We can either prepare our corpus with Gensim tools or use
corpus = matutils.Sparse2Corpus(counts) to convert CountVectorizer output into
a format Gensim can use (see the sketch after this list).
● For the exercises and following slides, you’ll see the Scikit-Learn API; you can
see a conceptual Gensim example in the appendix.
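For reference, a hedged sketch of that conversion route (variable names such as X and count_vectorizer are assumed from the scikit-learn slides that follow; an id2word mapping is also needed so Gensim can report words rather than column indices):

from gensim import matutils, models

# Assumes X = count_vectorizer.fit_transform(docs) from scikit-learn (documents as rows).
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
id2word = dict(enumerate(count_vectorizer.get_feature_names_out()))

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=5)
print(lda.print_topics())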
Code: LDA (Scikit-learn)
Input:
from sklearn.datasets import fetch_20newsgroups

ng_train = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes'))
print("Data has {0:d} documents".format(len(ng_train.data)))
print(ng_train.data[0][:100])  # print() requires Python 3

Output:
Data has 11314 documents
'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-d'
Code: LDA (Scikit-learn)
Input:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                   stop_words='english',
                                   token_pattern=r"\b[a-z][a-z]+\b",
                                   lowercase=True,
                                   max_features=1000)
X = count_vectorizer.fit_transform(ng_train.data)
# "X" is now our transformed document-term matrix
Code: LDA (Scikit-learn)
Input:
import pandas as pd

print(count_vectorizer.get_feature_names()[92:97])  # use get_feature_names_out() on newer scikit-learn
df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())  # create data frame
print(df.iloc[10:15, 92:97])  # values of these features for documents 10-14

Output:
['bhj', 'bible', 'big', 'bike', 'bios']

df: [screenshot of the resulting DataFrame slice]
Code: LDA (Scikit-learn)
Input:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=4,  # called n_topics in older scikit-learn versions
                                random_state=42,
                                learning_method='online')
data = lda.fit_transform(X)
print(data[0])

Output:
[ 0.00246896, 0.00251041, 0.99253159, 0.00248904]
Code: LDA (Scikit-learn)
Input:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=4,  # called n_topics in older scikit-learn versions
                                random_state=42,
                                learning_method='online')
data = lda.fit_transform(X)
print(data[0])

Output:
[ 0.00246896, 0.00251041, 0.99253159, 0.00248904]

This document is 99% the topic at index 2!
Code: LDA (Scikit-learn)
Input:
# Map column indices back to words to print the top 10 words for the topic at index 2.
words = count_vectorizer.get_feature_names()  # use get_feature_names_out() on newer scikit-learn
for idx in lda.components_[2].argsort()[:-11:-1]:
    print(words[idx])

Output:
year game good team think don just games like players

The top 10 words for the topic at index 2!
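A small helper (not from the original slides) that prints the top words for every topic at once, assuming the fitted lda and count_vectorizer from the previous slides:

def show_topics(lda_model, vectorizer, n_words=10):
    vocab = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
    for topic_idx, weights in enumerate(lda_model.components_):
        top = weights.argsort()[::-1][:n_words]
        print(topic_idx, " ".join(vocab[i] for i in top))

show_topics(lda, count_vectorizer)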
Appendix
Code: LDA (gensim)
Input:
from gensim import corpora, models, similarities, matutils

lda = models.LdaModel(corpus, num_topics=5)  # train model on a Gensim-style corpus
print(lda[doc_bow])  # get topic probability distribution for a document
lda.update(corpus2)  # update the LDA model with additional documents
print(lda[doc_bow])

Output:
[(2, 0.95), (4, 0.21)]
[(0, 0.75), (1, 0.15), (5, 0.11)]
Code: LDA (gensim)
Input:
lda.print_topics()

Output:
[(0, u'0.002*"image" + 0.002*"don" + 0.002*"jpeg" + 0.002*"good" + 0.001*"think" + 0.001*"people" + 0.001*"file" + 0.001*"year" + 0.001*"just" + 0.001*"like"'),
 (1, u'0.002*"graphics" + 0.002*"edu" + 0.002*"god" + 0.001*"just" + 0.001*"like" + 0.001*"does" + 0.001*"people" + 0.001*"know" + 0.001*"data" + 0.001*"jesus"'),
 (2, u'0.002*"think" + 0.002*"don" + 0.002*"just" + 0.002*"like" + 0.002*"does" + 0.002*"god" + 0.001*"know" + 0.001*"people" + 0.001*"time" + 0.001*"good"'),
 … ]
Editor's Notes
  1. Every text document - a news article, a comment on a forum - can be thought of as made up of a set of topics. It might be 60% about politics, 30% about economics, etc. Today, we’re going to show you techniques that can automatically determine the topics that are present in a series of documents.
  2. Here’s an example document. We can see that it is about different topics: “Business” “Food” “Prices”
  3. The goal of topic modeling in particular is to break down documents at the word level.
  4. Topic modeling starts with a representation of the data. This is an example representation: a “bag-of-words” representation. The documents are along the rows and the words are along the columns, and each cell either has the value “1” if that word is present in that document, and “0” otherwise. The next few slides walk through how topic modeling would further break this down to determine the topics - combinations of the presence of different words - present in each document.
  5. Mathematically, topics are latent features, meaning linear combinations of the original features in the data. Here’s a classic illustration of a latent feature: “size” is a latent feature combining the underlying features of “height” and “weight”. We can describe size as a latent feature because height and weight are both highly correlated with each other and we can interpret them as both being “driven” by the latent feature of size.
  6. Let’s suppose each of these sentences is a document. We’re going to illustrate what a latent space of topics would look like for these documents.
  7. First, we’d clean out some words that are not relevant to any topic, such as stop words.
  8. Then, after that preprocessing, we end up with three word features: “rabbit”, “pet”, and “dish”.
  9. Then, we could take each document and plot it in three-dimensional space, where the dimensions are the extent to which the words “Rabbit”, “Dish”, and “Pet” are present in each document.
  10. Then, we can find the principal components of this new three dimensional data. These principal components represent combinations of the individual words, with each topic having different weight on each word - exactly what we want. Note: Principal component analysis is a technique for feature extraction  that combines our input variables in a specific way, then we can drop the “least important” variables while still retaining the most valuable parts of all of the variables.
  11. Furthermore, this new space can be lower dimensional than the space we started with - for example, if we started with 1000 words, we could group them into 10 topics. If we then project our points down into this latent space, it will be easier to tell which documents are similar, for example.
  12. If we wanted to cluster these documents, for example, we are much more likely to get a good result in the latent space than in the original space due to the lower dimensionality.
  13. The next few slides show the clusters that would result. Observe how the documents in each cluster are similar, but the clusters are different from each other.
  14. In particular, the clusters are different on the different topic axes. These axes are what allowed us to separate the documents into clusters.
  15. The probabilities that each word, when used, is describing “pet rabbit” or “eating rabbit”.
  16. The words with the highest probabilities in “pet rabbit” are “pet”, “rabbit” & “love”, while “rabbit”, “dish” & “plate” have the highest probabilities in “eating rabbit”.
  17. These three word lists come with no topic labels; we assign topics to them based on our judgement.
  18. One way to do this is to assign “Atheism”, “baseball” & “graphics”, based on our judgement that the collections of words that make up these lists have a high probability of describing those topics.
  19. Note that “ng_train” here is the same as “ng_train” from the prior slide.
  20. The “1” on index “11” for the word “bible” indicates that “bible” appeared in the document with index 11.
  21. Note: this is just illustrative code. Gensim requires much more extensive data preprocessing than SciKit Learn, which is why we don’t cover it in as much detail in this course. To learn more about Gensim, see here: https://radimrehurek.com/gensim/tut1.html
  22. Output of LDA from Gensim once preprocessing has been done correctly.