Topic Modeling
Karol Grzegorczyk
June 3, 2014
2/20
Generative probabilistic models of textual corpora
● In order to process natural language automatically, a language model is
needed.
● One of the main goals of language modeling is to assign a probability to a document:
P(D) = P(w1, w2, w3, ..., wm)
● It is assumed that documents in a corpus were randomly generated (it is, of course, only
a theoretical assumption; in reality, in most cases, they are created by humans)
● There are two types of generative language models:
● those that generate each word on the basis of some number of preceding words
● those that generate words based on latent topics
3/20
n-gram language models
● N-gram – a contiguous sequence of n items from a given document
– When n == m, where m is the total number of words in a document:
P(D) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wm|w1,...,wm-1)
● Unigram – words are independent: each word in a document is assumed to be
generated independently of the others
– P(D) = P(w1)P(w2)P(w3)...P(wm)
● Bigram – words are generated with probability conditioned on the previous word (see the sketch below)
● In reality, language has long-distance dependencies
– Skip-gram is one of the solutions to this problem
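To make the bigram case concrete, here is a minimal Python sketch of maximum-likelihood bigram estimates on a toy corpus (the corpus and numbers are purely illustrative; real systems add smoothing):

from collections import Counter

# Toy corpus; in practice this would be a large collection of documents.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

unigrams = Counter(w for doc in corpus for w in doc)
bigrams = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); real systems add smoothing.
    return bigrams[(prev, word)] / unigrams[prev]

def document_prob(doc):
    # P(D) = P(w1) * product of P(wi | wi-1), with a unigram estimate for w1.
    p = unigrams[doc[0]] / sum(unigrams.values())
    for prev, word in zip(doc, doc[1:]):
        p *= bigram_prob(prev, word)
    return p

print(document_prob(["the", "cat", "sat"]))  # ~0.111 on this toy corpus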
4/20
Document Representations in Vector Space
● Each document is represented by an array of features
● Representation types:
– Bag-of-words (a.k.a. unigram count, term frequencies)
● A document is represented by a sparse vector whose size equals the dictionary size
– TF-IDF (term frequency–inverse document frequency)
● Similar to BoW but term frequencies are weighted, to penalize frequent words
– Topic Models (a.k.a. concept models, latent variable models)
● A document is represented by a low-rank dense vector
● Similarity between documents (or between a document and a query) can be expressed as a
cosine distance (see the sketch below)
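As an illustration of these representations, a short sketch using scikit-learn (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "topic models discover themes"]

# Sparse TF-IDF vectors: one row per document, one column per dictionary term.
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities between the document vectors.
print(cosine_similarity(X))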
5/20
Topic Modeling
● Topic Modeling is a set of techniques that aim to discover and annotate large archives of
documents with thematic information.
● TM is a set of methods that analyze the words (or other fine-grained features) of the original
documents to discover the themes that run through them, how those themes are connected
to each other, and how they change over time.
● Often, the number of topics to be discovered is predefined.
● Topic modeling can be seen as a dimensionality reduction technique
● Topic modeling, like clustering, does not require any prior annotations or labeling, but in
contrast to clustering, it can assign a document to multiple topics.
● Semantic information can be derived from a word-document co-occurrence matrix
● Topic Model types:
– Linear algebra based (e.g. LSA)
– Probabilistic modeling based (e.g. pLSA, LDA, Random Projections)
6/20
Latent semantic analysis
(Diagram: singular value decomposition of the normalized co-occurrence matrix, C = U D VT,
where D is a diagonal matrix, all cells except the main diagonal are zeros, and the elements
of the main diagonal are called 'singular values')
● a.k.a. Latent Semantic Indexing
● A low-rank approximation of the document-term matrix (typical rank: 100-300; a sketch follows below)
● For scale, the British National Corpus (BNC) has 100 million words
● LSA downsizes the co-occurrence tables via Singular Value Decomposition
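A minimal LSA sketch, assuming scikit-learn's truncated SVD (rank 2 here purely for the toy example; 100-300 in practice):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "topic models discover themes"]

X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix

# Low-rank approximation via truncated SVD.
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(X)         # dense low-rank document vectors
print(doc_vectors.shape)                   # (3, 2)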
7/20
Probabilistic Latent Semantic Analysis
pLSA models the probability of each word-document co-occurrence as a mixture of conditionally
independent multinomial distributions:
P(w, d) = Σc P(c) P(d|c) P(w|c)
P(d|c) relates to the V matrix from the previous slide
P(w|c) relates to the U matrix from the previous slide
In contrast to classical LSA, the word representations in topics and the topic representations in
documents are non-negative and sum up to one.
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
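A compact EM sketch for fitting this mixture on a toy count matrix (a numerical illustration only, not Hofmann's implementation; all sizes are made up):

import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(8, 12)).astype(float)  # toy document-word counts
D, W, C = 8, 12, 3                                     # documents, words, topics

# Random initialization of P(c), P(d|c), P(w|c), each properly normalized.
p_c = np.full(C, 1.0 / C)
p_d_c = rng.random((C, D)); p_d_c /= p_d_c.sum(axis=1, keepdims=True)
p_w_c = rng.random((C, W)); p_w_c /= p_w_c.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(c | d, w), shape (C, D, W).
    joint = p_c[:, None, None] * p_d_c[:, :, None] * p_w_c[:, None, :]
    resp = joint / joint.sum(axis=0, keepdims=True)
    # M-step: re-estimate the parameters from expected counts.
    expected = n_dw[None, :, :] * resp
    p_w_c = expected.sum(axis=1); p_w_c /= p_w_c.sum(axis=1, keepdims=True)
    p_d_c = expected.sum(axis=2); p_d_c /= p_d_c.sum(axis=1, keepdims=True)
    p_c = expected.sum(axis=(1, 2)); p_c /= p_c.sum()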
8/20
pLSA – Graphical Model
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
● A graphical model is a probabilistic
model for which a graph denotes the
conditional dependence structure
between random variables.
● Only the shaded variable is observed;
all the others have to be inferred.
● We can infer the hidden variables using
maximum likelihood estimation.
● D – total number of documents
● N – total number of words in a document (in fact, it should be Nd)
● T – total number of topics
9/20
Latent Dirichlet allocation
● LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a
Dirichlet prior
● The Dirichlet distribution is a family of continuous multivariate probability distributions
● This model assumes that documents are generated randomly
● A topic is a distribution over a fixed vocabulary: each topic assigns a probability to every
word in the dictionary, but most words have very low probability
● Each word in a document is drawn from a topic that is itself drawn from the document's
distribution over topics (see the sketch below).
● Each document exhibits multiple topics in different proportions.
– In fact, all the documents in the collection share the same set of topics, but each
document exhibits those topics in different proportions
● In reality, the topic structure, the per-document topic distributions and the per-document
per-word topic assignments are latent, and have to be inferred from observed documents.
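The generative story above can be written down directly; a toy sketch of sampling a corpus under these assumptions (all sizes and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 50, 4, 10, 20   # vocabulary size, topics, documents, words per document
alpha, eta = 0.1, 0.01       # Dirichlet hyperparameters

# Topics: K distributions over the vocabulary, drawn once for the whole corpus.
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))       # per-document topic proportions
    z = rng.choice(K, size=N, p=theta)             # per-word topic assignments
    words = [rng.choice(V, p=beta[k]) for k in z]  # observed words
    corpus.append(words)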
10/20
The intuitions behind latent Dirichlet allocation
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
11/20
Real inference with LDA
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
A 100-topic LDA model was fitted to 17,000 articles from the Science journal.
At right are the top 15 most frequent words from the most frequent topics.
At left are the inferred topic proportions for the example article from the previous slide.
12/20
The graphical model for latent Dirichlet allocation.
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
K – total number of topics
βk – topic, a distribution over the vocabulary
D – total number of documents
Θd – per-document topic proportions
N – total number of words in a document (in fact, it should be Nd)
Zd,n – per-word topic assignment
Wd,n – observed word
α, η – Dirichlet parameters
● Several inference algorithms are
available (e.g. sampling-based)
● A few extensions to LDA have been created, e.g.:
● Bigram Topic Model
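For reference, the joint distribution encoded by this graphical model, in LaTeX notation (as given in Blei's 2012 article, using the variables defined above):

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) =
  \prod_{k=1}^{K} p(\beta_k \mid \eta)
  \prod_{d=1}^{D} \Big( p(\theta_d \mid \alpha)
  \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big)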
13/20
Matrix Factorization Interpretation of LDA
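The figure on this slide is not reproduced here; the reading it suggests is that the per-document topic proportions Θ (a D × K matrix) and the topics β (a K × V matrix) factorize the matrix of per-document word probabilities, loosely analogous to the low-rank factorization in LSA. A sketch of that interpretation (shapes are illustrative):

import numpy as np

D, K, V = 10, 4, 50
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(K, 0.1), size=D)   # (D, K) per-document proportions
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # (K, V) topics

# Each row is that document's distribution over words under the model.
p_w_given_d = theta @ beta                       # (D, V); rows sum to 1
assert np.allclose(p_w_given_d.sum(axis=1), 1.0)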
14/20
Tooling
MAchine Learning for LanguagE Toolkit
(MALLET) is a Java-based package for:
● statistical natural language processing
● document classification
● clustering
● topic modeling
● information extraction
● and other machine learning applications to text.
http://mallet.cs.umass.edu
gensim: topic modeling for humans
● Free Python library
● Memory independent
● Distributed computing
http://radimrehurek.com/gensim
Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt
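A minimal gensim usage sketch (the toy texts are made up; parameters are illustrative):

from gensim import corpora, models

texts = [["human", "machine", "interface"],
         ["graph", "minors", "survey"],
         ["human", "survey", "graph"]]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())   # top words of each discovered topic
print(lda[corpus[0]])       # topic proportions of the first document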
15/20
Topic modeling applications
● Topic-based text classification
● Classical text classification algorithms (e.g. perceptron, naïve Bayes, k-nearest neighbors,
SVM, AdaBoost, etc.) often assume a bag-of-words representation of the input data.
● Topic modeling can be seen as a pre-processing step before applying supervised
learning methods (a sketch follows below).
● Using a topic-based representation it is possible to gain ~0.039 in precision and ~0.046 in
F1 score [Cai, Hofmann, 2003]
● Collaborative filtering [Hofmann, 2004]
● Finding patterns in genetic data, images, and social networks
[L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, SIGIR, 2003]
[T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS, 2004]
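A sketch of topic modeling as a pre-processing step for classification, assuming scikit-learn (documents and labels are made up; this is not the boosting setup of Cai and Hofmann):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["stocks fell sharply", "the team won the match",
        "markets rallied today", "the striker scored twice"]
labels = ["finance", "sports", "finance", "sports"]

# Bag-of-words -> topic proportions -> linear classifier.
clf = make_pipeline(CountVectorizer(),
                    LatentDirichletAllocation(n_components=2, random_state=0),
                    LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["the match ended in a draw"]))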
16/20
Word Representations in Vector Space
● Notion of similarity between words
● Continuous-valued vector representation of words
● Neural network language model
● Predicting the semantics of a word based on its context
● Ability to perform simple algebraic operations on the word vectors (sketch below):
vector(“King”) - vector(“Man”) + vector(“Woman”) yields a vector close to vector(“Queen”)
● word2vec: https://code.google.com/p/word2vec
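A short sketch of the analogy operation with gensim's word2vec loader (the vector file path is a placeholder for pretrained vectors, e.g. the GoogleNews binaries):

from gensim.models import KeyedVectors

# "vectors.bin" is a placeholder for pretrained word2vec vectors.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman: the nearest remaining vector is typically "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))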
17/20
Topic Modeling in Computer Vision
● Bag-of-words model in computer vision (a.k.a. bag-of-features)
– Codewords (“visual words”) instead of just words
– Codebook instead of dictionary
● It is assumed, that documents exhibit multiple topics and the collection of documents
exhibits the same set of topics
● TM in CV has been used to:
– Classify images
– Build image hierarchies
– Connect images and captions
● The main advantage of this approach is its unsupervised training nature.
18/20
Codebook example
[L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, Computer Vision and Pattern Recognition, 2005]
Obtained from 650 training examples from 13 categories of natural scenes (e.g. highway,
inside of cities, tall buildings, forest, etc.) using the k-means clustering algorithm.
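A sketch of how such a codebook can be built, assuming scikit-learn's k-means; random arrays stand in for real local patch descriptors such as SIFT features:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.random((1000, 128))  # stand-ins for patch descriptors from many images

# The codebook: k-means centroids act as the "visual words".
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)

# Represent one image as a histogram of its patches' nearest codewords.
image_patches = rng.random((60, 128))
words = codebook.predict(image_patches)
histogram = np.bincount(words, minlength=50)  # bag-of-visual-words vector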
19/20
Bibliography
● D. Blei, Probabilistic topic models, Communications of the ACM, 2012
● M. Steyvers, T. Griffiths, Probabilistic Topic Models, Handbook of latent semantic analysis, 2007
● S. Deerwester et al., Indexing by latent semantic analysis, JASIS, 1990
● D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 2003
● H. Wallach, Topic modeling: beyond bag-of-words, International Conference on Machine
Learning, 2006
● T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector
space, ICLR, 2013
● L. Fei-Fei, P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories,
Computer Vision and Pattern Recognition, 2005
● X. Wang, E. Grimson, Spatial Latent Dirichlet Allocation, Conference on Neural Information
Processing Systems, 2007
20/20
Thank you!

Editor's Notes

  1. Explain difference between cosine and euclidean distances.
  2. Everything within a plate is replicated. The replication number is in the bottom-right corner of the plate. The hidden nodes are unshaded; the observed nodes are shaded. Θ – theta, Φ – phi
  3. A special case of the Dirichlet distribution is the beta distribution, which is a family of continuous probability distributions defined on the interval [0, 1]. A continuous probability distribution is a probability distribution that has a probability density function.
  4. For each document we draw a topic distribution θd from the distribution over topic distributions. We know how many words we want in each document, so we draw that many per-word topic assignments Zd,n. The probability of word Wd,n thus depends on its topic assignment (Zd,n) and on the distribution of words for that topic (βk).