1/20
Topic Modeling
Karol Grzegorczyk
June 3, 2014
2/20
Generative probabilistic models of textual corpora
● In order to introduce automatic processing of natural language, a language model is
needed.
● One of the main goals of language modeling is to assign a probability to a document:
P(D) = P(w1, w2, w3, ..., wm)
● It is assumed that documents in a corpus were randomly generated (it is, of course, only
a theoretical assumption; in reality, in most cases, they are created by humans)
● There are two types of generative language models:
● those that generate each word on the basis of some number of preceding words
● those that generate words based on latent topics
3/20
n-gram language models
● N-gram – a contiguous sequence of n items from a given document
– When n == m, where m is the total number of words in a document:
P(D) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wm|w1,...,wm-1)
● Unigram – words are independent; each word in a document is assumed to be
generated independently of the others
– P(D) = P(w1)P(w2)P(w3)...P(wm)
● Bigram – each word is generated with probability conditioned on the previous word
● In reality, language has long-distance dependencies
– Skip-gram is one of the solutions to this problem
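The n-gram estimates above can be sketched in a few lines of Python; the corpus below is a toy example (real models would add smoothing for unseen n-grams):

```python
from collections import Counter

# Toy corpus of pre-tokenized documents (illustrative only).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

unigrams = Counter(w for doc in corpus for w in doc)
bigrams = Counter((a, b) for doc in corpus for a, b in zip(doc, doc[1:]))

def p_bigram(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # 2 of 3 occurrences of "the" precede "cat" -> 0.6666666666666666
```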
4/20
Document Representations in Vector Space
● Each document is represented by an array of features
● Representation types:
– Bag-of-words (a.k.a. unigram count, term frequencies)
● A document is represented by a sparse vector whose size equals the dictionary size
– TF-IDF (term frequency–inverse document frequency)
● Similar to BoW but term frequencies are weighted, to penalize frequent words
– Topic Models (a.k.a. concept models, latent variable models)
● A document is represented by a low-rank dense vector
● Similarity between documents (or between a document and a query) can be expressed as
the cosine similarity of their vectors
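A minimal TF-IDF and cosine-similarity sketch over a toy corpus (the idf variant and tokenization are illustrative assumptions; libraries differ in details such as smoothing):

```python
import math
from collections import Counter

docs = [["topic", "model", "text"], ["topic", "model", "image"], ["image", "pixel"]]

# Inverse document frequency: terms occurring in fewer documents get higher weight.
df = Counter(term for doc in docs for term in set(doc))
idf = {t: math.log(len(docs) / df[t]) for t in df}

def tfidf(doc):
    """Sparse TF-IDF vector as a dict: term -> weight."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in docs]
```

Documents sharing weighted terms (docs 0 and 1) score higher than documents with no terms in common (docs 0 and 2).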
5/20
Topic Modeling
● Topic Modeling is a set of techniques that aim to discover and annotate large archives of
documents with thematic information.
● TM is a set of methods that analyze the words (or other fine-grained features) of the original
documents to discover the themes that run through them, how those themes are connected
to each other, and how they change over time.
● Often, the number of topics to be discovered is predefined.
● Topic modeling can be seen as a dimensionality reduction technique
● Topic modeling, like clustering, does not require any prior annotations or labeling but, in
contrast to clustering, can assign a document to multiple topics.
● Semantic information can be derived from a word-document co-occurrence matrix
● Topic Model types:
– Linear algebra based (e.g. LSA)
– Probabilistic modeling based (e.g. pLSA, LDA, Random Projections)
6/20
Latent semantic analysis
C – normalized co-occurrence matrix
D – a diagonal matrix; all cells except the main diagonal are zeros, and the elements of the
main diagonal are called 'singular values'
● a.k.a. Latent Semantic Indexing
● A low-rank approximation of document-term matrix (typical rank 100-300)
● For comparison, the British National Corpus (BNC) contains about 100 million words
● LSA downsizes the co-occurrence tables via Singular Value Decomposition
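The truncated SVD behind LSA can be sketched with NumPy on a toy term-document matrix (the matrix values and rank k below are illustrative):

```python
import numpy as np

# Toy term-document co-occurrence matrix (rows = terms, columns = documents).
C = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 0., 2., 1.]])

# Full SVD: C = U @ diag(s) @ Vt, with singular values in s sorted descending.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # keep only the k largest singular values (rank-k approximation)
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

By the Eckart–Young theorem, C_k is the best rank-k approximation of C; documents are then compared via their k-dimensional rows of diag(s[:k]) @ Vt[:k, :].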
7/20
Probabilistic Latent Semantic Analysis
pLSA models the probability of each co-occurrence as a mixture of conditionally
independent multinomial distributions:
P(d|c) relates to the V matrix from the previous slide
P(w|c) relates to the U matrix from the previous slide
In contrast to classical LSA, the word representations in topics and the topic representations
in documents are non-negative and sum to one.
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
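The pLSA mixture P(d, w) = Σc P(c) P(d|c) P(w|c) can be written directly in NumPy; the parameters below are randomly initialized stand-ins (EM would re-estimate them from data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 4, 6, 2

# Randomly initialized pLSA parameters (illustrative; EM fits these in practice).
Pc = rng.dirichlet(np.ones(n_topics))                  # P(c)
Pd_c = rng.dirichlet(np.ones(n_docs), size=n_topics)   # P(d|c), one row per topic
Pw_c = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|c), one row per topic

# P(d, w) = sum_c P(c) P(d|c) P(w|c), as a docs x words matrix.
Pdw = np.einsum("c,cd,cw->dw", Pc, Pd_c, Pw_c)
```

Because each factor is a proper distribution, the resulting matrix is a joint distribution over all (document, word) pairs.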
8/20
pLSA – Graphical Model
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
● A graphical model is a probabilistic
model for which a graph denotes the
conditional dependence structure
between random variables.
● Only a shaded variable is observed.
All the others have to be inferred.
● We can infer the hidden variables using
maximum likelihood estimation.
● D – total number of documents
● N – total number of words in a
document (in fact, it should be Nd)
● T – total number of topics
9/20
Latent Dirichlet allocation
● LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a
Dirichlet prior
● The Dirichlet distribution is a family of continuous multivariate probability distributions
● This model assumes that documents are generated randomly
● A topic is a distribution over a fixed vocabulary; each topic assigns a probability to every
word in the dictionary, but most words have very low probability
● Each word in a document is randomly drawn from a topic that is itself randomly drawn
from the document's distribution over topics.
● Each document exhibits multiple topics in different proportions.
– In fact, all the documents in the collection share the same set of topics, but each
document exhibits those topics in different proportions
● In reality, the topic structure, the per-document topic distributions, and the per-document
per-word topic assignments are latent and have to be inferred from the observed documents.
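The generative process described above can be simulated in a few lines; the vocabulary, K, and the Dirichlet parameters below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["gene", "cell", "data", "model", "ball", "team"]
K, alpha, eta = 2, 0.5, 0.5

# Topics: each row beta[k] is a Dirichlet-distributed distribution over the vocabulary.
beta = rng.dirichlet(eta * np.ones(len(vocab)), size=K)

def generate_document(n_words):
    theta = rng.dirichlet(alpha * np.ones(K))  # per-document topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # per-word topic assignment
        w = rng.choice(len(vocab), p=beta[z])  # word drawn from the chosen topic
        words.append(vocab[w])
    return words

doc = generate_document(8)
```

Inference runs this process in reverse: given only documents like `doc`, it recovers plausible beta, theta, and z.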
10/20
The intuitions behind latent Dirichlet allocation
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
11/20
Real inference with LDA
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
A 100-topic LDA model was fitted to 17,000 articles from the Science journal.
On the right are the top 15 most frequent words from the most frequent topics.
On the left are the inferred topic proportions for the example article from the previous slide.
12/20
The graphical model for latent Dirichlet allocation.
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
K – total number of topics
βk – topic, a distribution over the vocabulary
D – total number of documents
Θd – per-document topic proportions
N – total number of words in a document (in fact, it should be Nd)
Zd,n – per-word topic assignment
Wd,n – observed word
α, η – Dirichlet parameters
● Several inference algorithms are
available (e.g. sampling based)
● A few extensions to LDA were created:
● Bigram Topic Model
13/20
Matrix Factorization Interpretation of LDA
14/20
Tooling
MAchine Learning for LanguagE Toolkit
(MALLET) is a Java-based package for:
● statistical natural language processing
● document classification
● clustering
● topic modeling
● information extraction
● and other machine learning applications to text.
http://mallet.cs.umass.edu
gensim: topic modeling for humans
● Free python library
● Memory independent
● Distributed computing
http://radimrehurek.com/gensim
Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt
15/20
Topic modeling applications
● Topic-based text classification
● Classical text classification algorithms (e.g. perceptron, naïve Bayes, k-nearest neighbors,
SVM, AdaBoost, etc.) often assume a bag-of-words representation of the input data.
● Topic modeling can be seen as a pre-processing step before applying supervised
learning methods.
● Using a topic-based representation it is possible to gain ~0.039 in precision and ~0.046 in
F1 score [Cai, Hofmann, 2003]
● Collaborative filtering [Hofmann, 2004]
● Finding patterns in genetic data, images, and social networks
[L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, SIGIR, 2003]
[T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS, 2004]
16/20
Word Representations in Vector Space
● Notion of similarity between words
● Continuous-valued vector representation of words
● Neural network language model
● Predicting the semantics of a word based on its context
● Ability to perform simple algebraic operations on the word vectors:
vector(“King”) - vector(“Man”) + vector(“Woman”) yields a vector close to vector(“Queen”)
● word2vec: https://code.google.com/p/word2vec
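The vector-arithmetic idea can be illustrated with tiny hand-made vectors; the three "features" below ([royalty, male, female]) are a deliberately contrived toy, whereas real word2vec embeddings are learned and typically 100-300 dimensional:

```python
import numpy as np

# Toy hand-made embeddings over illustrative features [royalty, male, female].
vec = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "crown": np.array([1.0, 0.0, 0.0]),
}

def nearest(v, exclude):
    """Word whose vector has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vec if w not in exclude), key=lambda w: cos(v, vec[w]))

result = nearest(vec["king"] - vec["man"] + vec["woman"],
                 exclude={"king", "man", "woman"})  # query words excluded, as is customary
print(result)  # queen
```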
17/20
Topic Modeling in Computer Vision
● Bag-of-words model in computer vision (a.k.a. bag-of-features)
– Codewords (“visual words”) instead of just words
– Codebook instead of dictionary
● It is assumed that documents exhibit multiple topics and that the collection of documents
shares the same set of topics
● TM in CV has been used to:
– Classify images
– Build image hierarchies
– Connect images and captions
● The main advantage of this approach is its unsupervised training nature.
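Building a codebook by k-means, as on the next slide, can be sketched as follows; the random 8-dimensional "descriptors" stand in for real local features such as 128-dimensional SIFT descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for local image descriptors: three clusters of 8-dim points.
descriptors = np.vstack([rng.normal(loc=m, scale=0.1, size=(50, 8))
                         for m in (0.0, 1.0, 2.0)])

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means; cluster centers become the codewords."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute centers; keep the old center if a cluster goes empty.
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return centers, labels

codebook, assignment = kmeans(descriptors, k=3)

# An image is then represented by its histogram of codeword counts.
histogram = np.bincount(assignment, minlength=3)
```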
18/20
Codebook example
[L. Fei-Fei, P. Perona, A bayesian hierarchical model for learning natural scene categories, Computer Vision and Pattern Recognition, 2005]
Obtained from 650
training examples
from 13 categories
of natural scenes
(e.g. highway, inside
of cities, tall
buildings, forest, etc)
using k-means
clustering algorithm.
19/20
Bibliography
● D. Blei, Probabilistic topic models, Communications of the ACM, 2012
● M. Steyvers, T. Griffiths, Probabilistic Topic Models, Handbook of latent semantic analysis, 2007
● S. Deerwester, et al. Indexing by latent semantic analysis, JASIS, 1990
● D. Blei, A. Ng, M. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003
● H. Wallach, Topic modeling: beyond bag-of-words, International Conference on Machine
Learning, 2006
● T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector
space, ICLR, 2013
● L. Fei-Fei, P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories,
Computer Vision and Pattern Recognition, 2005
● X. Wang, E. Grimson, Spatial Latent Dirichlet Allocation, Conference on Neural Information
Processing Systems, 2007
20/20
Thank you!

Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 

5/20
Topic Modeling
● Topic Modeling is a set of techniques that aim to discover and annotate large archives of
documents with thematic information.
● TM is a set of methods that analyze the words (or other fine-grained features) of the
original documents to discover the themes that run through them, how those themes are
connected to each other, and how they change over time.
● Often, the number of topics to be discovered is predefined.
● Topic modeling can be seen as a dimensionality reduction technique.
● Topic modeling, like clustering, does not require any prior annotations or labeling; but in
contrast to clustering, it can assign a document to multiple topics.
● Semantic information can be derived from a word-document co-occurrence matrix.
● Topic model types:
– Linear algebra based (e.g. LSA, random projections)
– Probabilistic modeling based (e.g. pLSA, LDA)
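As an illustration of the word-document co-occurrence matrix mentioned above, here is a minimal Python sketch; the three toy documents are made up for the example:

```python
import numpy as np

# Toy corpus; in practice the documents come from a large archive.
docs = [
    "topic models discover latent topics",
    "latent topics describe documents",
    "neural models learn word vectors",
]

# Build the vocabulary and the word-document co-occurrence (count) matrix.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in d.split():
        C[index[w], j] += 1

# Each column is the sparse bag-of-words vector of one document.
print(C.shape)
```

Every topic model discussed in the following slides starts, in one form or another, from such a matrix.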
6/20
Latent semantic analysis
● a.k.a. Latent Semantic Indexing
● A low-rank approximation of the document-term matrix (typical rank: 100-300)
– For scale: the British National Corpus (BNC) alone has 100 million words
● LSA downsizes the co-occurrence tables via Singular Value Decomposition: C ≈ U D V^T
– C – normalized co-occurrence matrix
– D – a diagonal matrix; all cells except the main diagonal are zeros, and the elements
of the main diagonal are called 'singular values'
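The truncated SVD behind LSA can be sketched in a few lines of numpy; the random Poisson matrix below is a stand-in for a real normalized co-occurrence matrix:

```python
import numpy as np

# A small random word-document count matrix stands in for a real corpus.
rng = np.random.default_rng(0)
C = rng.poisson(1.0, size=(50, 20)).astype(float)

# Full SVD: C = U @ np.diag(s) @ Vt, singular values s sorted descending.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the top-k singular values/vectors: the rank-k LSA approximation.
k = 5
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The truncated SVD is the best rank-k approximation in the Frobenius norm.
err = np.linalg.norm(C - C_k)
print(round(err, 2))
```

In a real setting k would be the 100-300 mentioned above, and the rows of U[:, :k] (scaled by the singular values) serve as dense word representations.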
7/20
Probabilistic Latent Semantic Analysis
● pLSA models the probability of each co-occurrence as a mixture of conditionally
independent multinomial distributions: P(w,d) = Σc P(c) P(d|c) P(w|c)
– P(d|c) relates to the V matrix from the previous slide
– P(w|c) relates to the U matrix from the previous slide
● In contrast to classical LSA, the word representations in topics and the topic
representations in documents are non-negative and sum up to one.
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
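The mixture above is typically fitted with expectation-maximization. A minimal numpy sketch on a random toy count matrix (the update equations follow the standard pLSA EM derivation; all data here is synthetic):

```python
import numpy as np

# n[d, w]: word-document counts; a tiny random corpus stands in for real data.
rng = np.random.default_rng(1)
n = rng.poisson(1.0, size=(8, 30)).astype(float)  # 8 documents, 30 words
D, W = n.shape
T = 3  # number of topics, fixed in advance

# Random initialization of P(c), P(d|c), P(w|c).
Pc = np.full(T, 1.0 / T)
Pd_c = rng.random((T, D)); Pd_c /= Pd_c.sum(axis=1, keepdims=True)
Pw_c = rng.random((T, W)); Pw_c /= Pw_c.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: posterior P(c | d, w) for every topic/document/word triple.
    post = Pc[:, None, None] * Pd_c[:, :, None] * Pw_c[:, None, :]  # (T, D, W)
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    # M-step: re-estimate the multinomials from expected counts.
    expected = post * n[None, :, :]
    Pc = expected.sum(axis=(1, 2)); Pc /= Pc.sum()
    Pd_c = expected.sum(axis=2); Pd_c /= Pd_c.sum(axis=1, keepdims=True) + 1e-12
    Pw_c = expected.sum(axis=1); Pw_c /= Pw_c.sum(axis=1, keepdims=True) + 1e-12

# As the slide notes: non-negative representations that sum up to one.
print(np.allclose(Pw_c.sum(axis=1), 1.0))
```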
8/20
pLSA – Graphical Model
● A graphical model is a probabilistic model for which a graph denotes the conditional
dependence structure between random variables.
● Only the shaded variable is observed; all the others have to be inferred.
● We can infer the hidden variables using maximum likelihood estimation.
● D – total number of documents
● N – total number of words in a document (in fact, it should be Nd)
● T – total number of topics
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
9/20
Latent Dirichlet allocation
● LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a
Dirichlet prior
● The Dirichlet distribution is a family of continuous multivariate probability distributions
● The model assumes that documents are generated randomly
● A topic is a distribution over a fixed vocabulary; each topic contains each word from the
dictionary, but some of them have very low probability
● Each word in a document is randomly drawn from a topic that was itself randomly
selected from the document's distribution of topics
● Each document exhibits multiple topics in different proportions
– In fact, all the documents in the collection share the same set of topics, but each
document exhibits those topics in different proportions
● In reality, the topic structure, per-document topic distributions, and per-document
per-word topic assignments are latent and have to be inferred from observed documents
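The generative story above can be sketched directly in numpy; all sizes and hyperparameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
K, V, n_docs, doc_len = 3, 20, 4, 15  # topics, vocabulary, documents, words/doc
alpha, eta = 0.5, 0.1                 # Dirichlet hyperparameters

# Each topic beta_k is a distribution over the whole vocabulary: every word has
# some probability under every topic, most of them very small.
beta = rng.dirichlet(np.full(V, eta), size=K)          # (K, V)

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(K, alpha))           # per-document topic proportions
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                     # per-word topic assignment
        w = rng.choice(V, p=beta[z])                   # word drawn from that topic
        words.append(int(w))
    docs.append(words)

print(len(docs), len(docs[0]))
```

All documents share the same K topics (beta), but each draws its own proportions (theta), exactly as the slide states.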
10/20
The intuitions behind latent Dirichlet allocation
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
11/20
Real inference with LDA
● A 100-topic LDA model was fitted to 17,000 articles from the Science journal.
● On the right are the top 15 most frequent words from the most frequent topics; on the left
are the inferred topic proportions for the example article from the previous slide.
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
12/20
The graphical model for latent Dirichlet allocation
● K – total number of topics
● βk – topic, a distribution over the vocabulary
● D – total number of documents
● Θd – per-document topic proportions
● N – total number of words in a document (in fact, it should be Nd)
● Zd,n – per-word topic assignment
● Wd,n – observed word
● α, η – Dirichlet parameters
● Several inference algorithms are available (e.g. sampling based)
● A few extensions to LDA were created, e.g. the Bigram Topic Model
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
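One of the sampling-based inference algorithms mentioned above is collapsed Gibbs sampling, which resamples each Zd,n in turn from its conditional given all other assignments. A minimal numpy sketch on a synthetic corpus (toy sizes, not a production sampler):

```python
import numpy as np

# A toy corpus of word-id lists; in practice these come from a tokenized archive.
rng = np.random.default_rng(3)
V, K, alpha, eta = 10, 2, 0.5, 0.1
docs = [[int(w) for w in rng.integers(0, V, size=20)] for _ in range(5)]

# Count tables for collapsed Gibbs sampling, plus a random initial assignment.
ndk = np.zeros((len(docs), K))   # topic counts per document
nkw = np.zeros((K, V))           # word counts per topic
nk = np.zeros(K)                 # total words per topic
z = []
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        t = int(rng.integers(K))
        zs.append(t); ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    z.append(zs)

for _ in range(50):              # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]          # remove the word's current assignment
            ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # Sample a new topic from the collapsed conditional distribution.
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            t = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

# Estimated per-document topic proportions Θd from the final counts.
theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
print(np.allclose(theta.sum(axis=1), 1.0))
```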
14/20
Tooling
● MAchine Learning for LanguagE Toolkit (MALLET) – a Java-based package for:
– statistical natural language processing
– document classification
– clustering
– topic modeling
– information extraction
– and other machine learning applications to text
http://mallet.cs.umass.edu
● gensim: topic modeling for humans
– free Python library
– memory independent
– distributed computing
http://radimrehurek.com/gensim
● Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt
15/20
Topic modeling applications
● Topic-based text classification
– Classical text classification algorithms (e.g. perceptron, naïve Bayes, k-nearest
neighbors, SVM, AdaBoost, etc.) often assume a bag-of-words representation of the
input data.
– Topic modeling can be seen as a pre-processing step before applying supervised
learning methods.
– Using a topic-based representation it is possible to gain ~0.039 in precision and
~0.046 in F1 score [Cai, Hofmann, 2003]
● Collaborative filtering [Hofmann, 2004]
● Finding patterns in genetic data, images, and social networks
[L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, SIGIR, 2003]
[T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS, 2004]
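To make the pre-processing idea concrete: once documents are mapped to low-dimensional topic proportions, any classifier can run on top of them. A minimal sketch with a nearest-centroid classifier over hand-made topic vectors (both the vectors and the class labels are invented for illustration):

```python
import numpy as np

# Hypothetical 2-topic proportions (e.g. inferred by LDA) as document features.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([0, 0, 1, 1])   # two made-up classes

# One centroid per class in topic space.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def classify(x):
    # Cosine similarity against each class centroid; pick the closest.
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x))
    return int(np.argmax(sims))

print(classify(np.array([0.7, 0.3])))  # closest to class 0
```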
16/20
Word Representations in Vector Space
● A notion of similarity between words
● Continuous-valued vector representations of words
● Neural network language model
● Predicts the semantics of a word based on its context
● Simple algebraic operations can be performed on the word vectors:
vector("King") - vector("Man") + vector("Woman") yields a vector close to vector("Queen")
● word2vec: https://code.google.com/p/word2vec
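The King/Queen analogy can be demonstrated mechanically with toy vectors; the four 3-dimensional vectors below are hand-crafted stand-ins for trained word2vec embeddings, chosen only so that the relative geometry works out:

```python
import numpy as np

# Toy embeddings standing in for real word2vec vectors.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

target = vec["king"] - vec["man"] + vec["woman"]

def nearest(v, exclude):
    # Cosine similarity against every other word in the toy vocabulary.
    return max((w for w in vec if w not in exclude),
               key=lambda w: vec[w] @ v / (np.linalg.norm(vec[w]) * np.linalg.norm(v)))

print(nearest(target, {"king", "man", "woman"}))  # queen
```

Real systems do exactly this search, only over a vocabulary of hundreds of thousands of learned vectors.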
17/20
Topic Modeling in Computer Vision
● Bag-of-words model in computer vision (a.k.a. bag-of-features)
– codewords ("visual words") instead of just words
– a codebook instead of a dictionary
● It is assumed that documents (images) exhibit multiple topics and that the collection of
documents exhibits the same set of topics
● TM in CV has been used to:
– classify images
– build image hierarchies
– connect images and captions
● The main advantage of this approach is its unsupervised training nature.
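A codebook is usually built by clustering local image descriptors; each image then becomes a histogram over the resulting visual words. A minimal numpy sketch, with random 2-D points standing in for real local descriptors (e.g. SIFT vectors):

```python
import numpy as np

# Toy "descriptors" drawn around two centers; k-means over them gives the codebook.
rng = np.random.default_rng(4)
descs = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])

def kmeans(X, k, iters=20):
    # Lloyd's algorithm, initialized from random data points.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
    return centers, labels

codebook, labels = kmeans(descs, k=2)
# The bag-of-features representation: a histogram over codewords.
hist = np.bincount(labels, minlength=2)
print(hist.sum())
```

In the Fei-Fei and Perona paper cited on the next slide the same idea is applied with a much larger codebook learned from real scene images.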
18/20
Codebook example
● Obtained from 650 training examples from 13 categories of natural scenes (e.g. highway,
inside of cities, tall buildings, forest, etc.) using the k-means clustering algorithm.
[L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, Computer Vision and Pattern Recognition, 2005]
19/20
Bibliography
● D. Blei, Probabilistic topic models, Communications of the ACM, 2012
● M. Steyvers, T. Griffiths, Probabilistic Topic Models, Handbook of Latent Semantic Analysis, 2007
● S. Deerwester et al., Indexing by latent semantic analysis, JASIS, 1990
● D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 2003
● H. Wallach, Topic modeling: beyond bag-of-words, International Conference on Machine Learning, 2006
● T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, ICLR, 2013
● L. Fei-Fei, P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, Computer Vision and Pattern Recognition, 2005
● X. Wang, E. Grimson, Spatial Latent Dirichlet Allocation, Conference on Neural Information Processing Systems, 2007

Editor's Notes

  1. Explain the difference between cosine and Euclidean distances.
  2. Everything within a plate is replicated; the replication number is in the bottom-right corner of the plate. The hidden nodes are unshaded, the observed nodes are shaded. Θ – theta, Φ – phi.
  3. A special case of the Dirichlet distribution is the beta distribution, a family of continuous probability distributions defined on the interval [0, 1]. A continuous probability distribution is one that has a probability density function.
  4. For each document we draw a topic distribution θd from the distribution of topic distributions. We know how many words we want in each document, so we draw a per-word topic assignment Zd,n that many times. The probability of word Wd,n thus depends on its topic assignment (Zd,n) and the distribution of words for that topic (βk).