Topic Modeling for Information Retrieval
and Word Sense Disambiguation tasks
Università degli Studi di Milano - Bicocca
Di Donato Leonardo
Text Mining Course - Prof. Fabio Stella
Introduction
Di Donato Leonardo, Università degli Studi di Milano - Bicocca
superabundant amount of digital unstructured information
it continues to grow at an astonishing rate (it doubles every two years)
humans cannot manage it: information overload.
problems: crawling, representing, storing, summarizing, clustering, searching ...
(general rule: every problem is an opportunity)
opportunity: automatically extract value from chaos
what value? how to do it?
Goals
the value that we want to extract is: clusters of semantically related
documents
our purpose is [1] the unsupervised clustering of a text dataset
[2] the implementation of information retrieval procedures that exploit the
representation of documents at the topic level
[3] the modeling of the ability to computationally identify the meaning of
words in context (word sense disambiguation)
Dataset
our document collection: a partition of the Associated Press dataset
~2300 English news articles (dating back to the '90s)
characteristic of any text document: it is often messy, has flaws and noise
we need to clean the data
we need a structured representation of the data
Pre-Processing
Google Refine [ link ]
[1] replacement of abbreviations and common entities with expressions that
normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million})
[2] adjustment of flaws and [3] stripping metadata entities through regular
expressions
MALLET [ link ]
[1] make all the characters lowercase
[2] tokenization [3] stop-word removal
[4] vocabulary proportional cut-off, with threshold 0.03
[5] term-frequency representation of each document
the corpus is a single file; every line is a document with this format:
results: |W| = 32349 token types, 241908 words
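The MALLET steps listed above can be illustrated with a minimal plain-Python sketch. It is not the MALLET pipeline itself: the stop-word list is a tiny hypothetical one, and the proportional cut-off is read here as "drop token types that occur in fewer than a given fraction of the documents", which is only one possible interpretation of the threshold on the slide.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "on", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(docs, cutoff=0.03):
    """Return one term-frequency Counter per document.

    Assumption: the proportional cut-off removes every token type that
    occurs in fewer than cutoff * len(docs) documents.
    """
    # Steps [1]-[3]: lowercase, tokenize, remove stop words.
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    # Step [4]: document frequency of each token type, then the cut-off.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    min_docs = cutoff * len(docs)
    vocab = {t for t, n in doc_freq.items() if n >= min_docs}
    # Step [5]: term-frequency representation over the reduced vocabulary.
    return [Counter(t for t in tokens if t in vocab) for tokens in tokenized]


docs = [
    "Helicopters were dispatched to the park",
    "The blaze reached the park on Tuesday",
]
# With two toy documents, a cut-off of 0.6 keeps only types found in both.
bags = preprocess(docs, cutoff=0.6)
```

On the two toy documents only "park" survives the cut-off, because it is the only non-stop-word shared by both.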
Topic Models
probabilistic generative models for uncovering the underlying semantic
structure of a document collection based on a Bayesian analysis of the
original texts [ Blei, 2003 ]
goal: discover patterns of word-use and connect documents that exhibit
similar patterns
idea: documents are mixtures of topics (assignments) and each topic is a
multinomial probability distribution over words
which topics are most likely to have generated the given corpus of documents?
we have to infer 3 latent variables: [1] the word distribution of each topic [2]
the topic distribution of each document [3] the word-topic assignments
[1] Φ^(j) = P(W | Z = j)   [2] Θ^(d) = P(Z | D = d)   [3] P(Z | W)
Topic Models
the Latent Dirichlet Allocation (LDA) model associates with [2] and [1] two
smoothing hyper-parameters, α and β.
α_j can be read as a prior count of the number of times topic j is sampled in a
document, before any word of that document has been observed
(α_1, ..., α_T are the parameters of a Dirichlet prior)
β is the parameter of a Dirichlet prior and can be read as a prior count of the
number of times words are sampled from a topic (before observing any corpus document)
we need to estimate the distributions Φ and Θ; different methods can be used
(e.g., Gibbs sampling)
with Gibbs sampling, Φ and Θ can be computed directly from the matrices of counts
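As a rough illustration of computing Φ and Θ from the count matrices, the sketch below uses the smoothed estimates given in Steyvers and Griffiths (2007) for a Gibbs sampler with symmetric α and β. The matrix layout, the function names and the toy counts are assumptions made for this example, not the code used in the project.

```python
def estimate_phi(word_topic_counts, beta):
    """Return phi[j][w] = P(w | z = j).

    word_topic_counts[w][j] is the number of times word w is assigned
    to topic j; beta is the symmetric Dirichlet smoothing parameter.
    """
    n_words = len(word_topic_counts)
    n_topics = len(word_topic_counts[0])
    phi = []
    for j in range(n_topics):
        total = sum(word_topic_counts[w][j] for w in range(n_words)) + n_words * beta
        phi.append([(word_topic_counts[w][j] + beta) / total for w in range(n_words)])
    return phi


def estimate_theta(doc_topic_counts, alpha):
    """Return theta[d][j] = P(z = j | d).

    doc_topic_counts[d][j] is the number of words of document d assigned
    to topic j; alpha is the symmetric Dirichlet smoothing parameter.
    """
    n_topics = len(doc_topic_counts[0])
    theta = []
    for row in doc_topic_counts:
        total = sum(row) + n_topics * alpha
        theta.append([(c + alpha) / total for c in row])
    return theta


# Toy counts (made up): 3 word types, 2 topics, 2 documents.
C_WT = [[3, 0], [1, 0], [0, 4]]
C_DT = [[4, 0], [0, 4]]
T = 2
phi = estimate_phi(C_WT, beta=0.01)
theta = estimate_theta(C_DT, alpha=50 / T)
```

Each row of `phi` and `theta` sums to 1. With only four words per toy document, α = 50/T dominates the counts and pulls `theta` close to uniform, which shows how strongly the prior smooths short documents.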
Tuning
what are the best values for the hyper-parameters? α = 50/T and β = 0.01 are
the values usually reported to work well [ Steyvers and Griffiths, 2007 ]
what is the optimal number of topics T? and the number of iterations I?
it depends on the specific problem; it is an open problem
we have set T = 35 and T = 40
there are topic evaluation techniques that try to address this problem ...
we have used one of those techniques (i.e., the topic coherence metric, which
evaluates the semantic coherence of a topic) to compare two model
configurations: symmetric α versus asymmetric α
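The slides do not spell out the coherence formula. A common definition, assumed here, is the document co-occurrence score of Mimno et al. (2011): for the top words v_1, ..., v_M of a topic, sum log((D(v_m, v_l) + 1) / D(v_l)) over all pairs with l < m, where D counts the documents containing the given words. The function name and the toy corpus below are illustrative only.

```python
import math


def topic_coherence(top_words, documents):
    """Co-occurrence coherence of one topic (assumed Mimno-style definition).

    top_words: most probable words of the topic, in descending order.
    documents: iterable of token collections, one per document.
    Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


# Toy corpus (made up) to compare a coherent and an incoherent word pair.
documents = [
    {"fire", "forest", "park"},
    {"fire", "forest"},
    {"stock", "market"},
    {"stock", "fire"},
]
coherent = topic_coherence(["fire", "forest"], documents)
mixed = topic_coherence(["fire", "market"], documents)
```

Scores closer to zero indicate words that tend to appear in the same documents; averaging the score over all topics gives one number per model configuration, which is how two configurations can be compared.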
Symmetric α versus Asymmetric α
an asymmetric configuration (AS) of the α hyper-parameters allows the degree
of topic sparseness to be calibrated more flexibly
it has been empirically shown that optimizing the Dirichlet hyper-parameters
(α_1, ..., α_T) of the document-topic distributions makes a substantial
difference: topics are not dominated by very common words and they are
more stable as their number increases [ Wallach, 2009 ]
our experiments did not confirm this: the average topic coherence of the
AS configuration was worse than that of the symmetric (SS) configuration
why? perhaps in our corpus there is no topic that tends to occur
in every document (or the optimal T may be larger, or the answer may
simply be more trivial ...)
Top topics for symmetric α and T = 35
Post-Processing - Information Retrieval
why should we use topic models to improve information retrieval tasks ?
[1] we can cluster queries according to the extracted topics
[2] two documents that share no common words can still be measured as
similar
query likelihood model is a basic approach for information retrieval
in this context (generative model) we can evaluate how well a document
matches a query specifying how the words of the query may have been
generated by a language model
we derive a language model for each document (a mixture of topics)
so, the relevant documents are those whose topic distribution is likely to have
generated the set of words contained in (or associated with) the query
→ documents similarity
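A minimal sketch of the query likelihood idea with a topic-based language model: each query word is scored by summing P(w | z = j) P(z = j | d) over the topics, and documents are ranked by the resulting log-likelihood. The function names and the toy Φ and Θ values are assumptions for illustration, and a uniform document prior is assumed, so the prior plays no role in the ranking.

```python
import math


def query_log_likelihood(query_ids, theta_d, phi):
    """log P(q | d), with P(w | d) = sum_j phi[j][w] * theta_d[j]."""
    total = 0.0
    for w in query_ids:
        p_w = sum(phi[j][w] * theta_d[j] for j in range(len(theta_d)))
        total += math.log(p_w)
    return total


def rank_documents(query_ids, theta, phi):
    """Return document indices sorted from most to least likely."""
    scores = [query_log_likelihood(query_ids, theta_d, phi) for theta_d in theta]
    return sorted(range(len(theta)), key=lambda d: scores[d], reverse=True)


# Toy model (made up): 4 word types, 2 topics, 2 documents.
vocab = ["forest", "fire", "stock", "market"]
phi = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: wildfire vocabulary
    [0.05, 0.05, 0.45, 0.45],  # topic 1: finance vocabulary
]
theta = [
    [0.9, 0.1],  # document 0 is mostly about topic 0
    [0.2, 0.8],  # document 1 is mostly about topic 1
]
query = [vocab.index("forest"), vocab.index("fire")]
ranking = rank_documents(query, theta, phi)
```

Because the score depends only on the topic mixture of a document, a document can rank highly for "forest fire" even if it contains neither word.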
Documents Similarity
two approaches to compute the similarity between documents
[1] probabilistic query approach
[2] comparison of topics distribution of documents
how? through divergence metrics (e.g., symmetrised Kullback-Leibler,
Jensen-Shannon)
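The two divergences named above can be sketched in a few lines. The symmetrised Kullback-Leibler divergence is taken here as the average of the two directed divergences, which is one common convention (others use the plain sum). The topic distributions are made-up values, and both are assumed to be strictly positive, as smoothed LDA estimates are.

```python
import math


def kl(p, q):
    """Directed Kullback-Leibler divergence D(p || q), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy topic distributions (made up) for three documents.
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]
theta_c = [0.1, 0.1, 0.8]

# A smaller divergence means more similar documents at the topic level.
closer = jensen_shannon(theta_a, theta_b)
farther = jensen_shannon(theta_a, theta_c)
```

Jensen-Shannon is bounded by log 2 and stays finite even when one distribution has zero entries, which makes it the more forgiving choice of the two.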
Similar documents for query "forest fire"
AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on
Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
Post-Processing - Word Sense Disambiguation
the ability to identify the meaning of words in context in a computational
manner is usually referred to as Word Sense Disambiguation
four elements: [1] selection of word senses (i.e., the classes) [2] use of
external knowledge sources [3] representation of context [4] selection of an
automatic classification method
input: a user-specified context document d_c that contains the word w_x to be
disambiguated
[1] → given the s words most similar to w_x, for each of them we build a sense
document capturing synsets, glosses, example phrases, and other relevant
relations from WordNet
[2] → WordNet as the external knowledge source used to create the sense documents d_s
[3] → the topical and the semantic features
[4] → comparison of document d_c with each of the s sense documents d_s (with one
of the two approaches presented): the most similar one gives the sense of word w_x
in context d_c
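Step [4] can be sketched as a nearest-neighbour choice over topic distributions. The sketch assumes that the topic distributions of the context document and of each sense document have already been inferred, and it uses the Jensen-Shannon divergence as the comparison. The sense labels, the function names and the numbers are hypothetical, and building the sense documents from WordNet is not shown.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def disambiguate(context_theta, sense_thetas):
    """Return the label of the sense document closest to the context.

    context_theta: topic distribution of the context document.
    sense_thetas: dict mapping a sense label to the topic distribution
    of its sense document.
    """
    return min(
        sense_thetas,
        key=lambda label: js_divergence(context_theta, sense_thetas[label]),
    )


# Hypothetical topic distributions over three topics.
context_theta = [0.7, 0.2, 0.1]
sense_thetas = {
    "bank (financial institution)": [0.8, 0.1, 0.1],
    "bank (river side)": [0.1, 0.1, 0.8],
}
chosen = disambiguate(context_theta, sense_thetas)
```

Any other document comparison, such as the probabilistic query approach, could replace `js_divergence` without changing the selection logic.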
Words similarity
two possible approaches to compute the similarity between words:
[1] associative relation
[2] comparison of the topic distributions P(Z|W) of the words
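Both approaches need P(Z|W), which can be obtained from Φ with Bayes' rule. The sketch below assumes a uniform topic prior and reads the "associative relation" as the word association formula of Steyvers and Griffiths (2007), P(w2 | w1) = Σ_j P(w2 | z = j) P(z = j | w1); whether the project used exactly this formula is an assumption. The vocabulary and the probabilities are made up.

```python
def topics_given_word(phi, topic_prior):
    """Return p_z_w[w][j] = P(z = j | w), obtained with Bayes' rule."""
    n_topics = len(phi)
    n_words = len(phi[0])
    p_z_w = []
    for w in range(n_words):
        joint = [phi[j][w] * topic_prior[j] for j in range(n_topics)]
        norm = sum(joint)
        p_z_w.append([x / norm for x in joint])
    return p_z_w


def association(w1, phi, p_z_w):
    """Return P(w2 | w1) = sum_j P(w2 | z = j) P(z = j | w1) for every w2."""
    n_topics = len(phi)
    n_words = len(phi[0])
    return [
        sum(phi[j][w2] * p_z_w[w1][j] for j in range(n_topics))
        for w2 in range(n_words)
    ]


# Toy model (made up): 4 word types, 2 topics, uniform topic prior.
vocab = ["arab", "israel", "stock", "market"]
phi = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
p_z_w = topics_given_word(phi, topic_prior=[0.5, 0.5])
assoc = association(vocab.index("arab"), phi, p_z_w)

# Rank the other words by their association with "arab".
others = [w for w in range(len(vocab)) if w != vocab.index("arab")]
most_similar = vocab[max(others, key=lambda w: assoc[w])]
```

For approach [2], the rows of `p_z_w` would instead be compared pairwise with a divergence between distributions.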
Words similar to token "arab"
Future Work
topic modeling →
● train an LDA model with asymmetric α for increasing values of T and evaluate the
resulting quality of topics
● train an LDA model with asymmetric α on a vocabulary to which no proportional
cut-off has been applied
● investigate a possible implementation of a multiple-chain model to obtain more
stable topics
● use other topic evaluation metrics
information retrieval →
● assess and fine-tune the prior probability of a document in the query likelihood model
● use other high-frequency metrics (e.g., α-skew) in relation to the comparison of
distributions
word sense disambiguation →
● implement and evaluate other methods to compare the context document and the sense
documents (e.g., compute P(d_c, d_s) under the assumption that they are conditionally
independent, given the topic variable)
● refine the mechanism of sense selection (e.g., choosing each of the s most probable words
within a probability interval, to reduce the risk that all the most similar words
refer to closely related meanings)
Thank you for your attention.
Di Donato Leonardo, Università degli Studi di Milano - Bicocca

Mais conteúdo relacionado

Mais procurados

Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchBhaskar Mitra
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introductionYueshen Xu
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)KU Leuven
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Search Engines
Search EnginesSearch Engines
Search Enginesbutest
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase miningjins0618
 

Mais procurados (20)

Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for Search
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
similarity measure
similarity measure similarity measure
similarity measure
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining
 

Destaque

Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationvini89
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationRubén Izquierdo Beviá
 
Word Sense Disambiguation and Induction
Word Sense Disambiguation and InductionWord Sense Disambiguation and Induction
Word Sense Disambiguation and InductionLeon Derczynski
 
Search Engine Marketing Overview - Greenwich Library SCORE presentation
Search Engine Marketing Overview - Greenwich Library SCORE presentationSearch Engine Marketing Overview - Greenwich Library SCORE presentation
Search Engine Marketing Overview - Greenwich Library SCORE presentationSearch Smart Marketing
 
Draft programme 15 09-2015
Draft programme 15 09-2015Draft programme 15 09-2015
Draft programme 15 09-2015predim
 
Word sense dissambiguation
Word sense dissambiguationWord sense dissambiguation
Word sense dissambiguationAshwin Perti
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationSurabhi Verma
 
BibleTech2011
BibleTech2011BibleTech2011
BibleTech2011Andi Wu
 
A word sense disambiguation technique for sinhala
A word sense disambiguation technique  for sinhalaA word sense disambiguation technique  for sinhala
A word sense disambiguation technique for sinhalaVijayindu Gamage
 
Graph-based Word Sense Disambiguation
Graph-based Word Sense DisambiguationGraph-based Word Sense Disambiguation
Graph-based Word Sense DisambiguationElena-Oana Tabaranu
 
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...Pierpaolo Basile
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Innovation Quotient Pvt Ltd
 
Disambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalDisambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalMadhusudan Daad
 
Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationvini89
 
Amharic WSD using WordNet
Amharic WSD using WordNetAmharic WSD using WordNet
Amharic WSD using WordNetSeid Hassen
 
Word sense disambiguation a survey
Word sense disambiguation a surveyWord sense disambiguation a survey
Word sense disambiguation a surveyunyil96
 
PhD defense Koen Deschacht
PhD defense Koen DeschachtPhD defense Koen Deschacht
PhD defense Koen Deschachtguest1add48f
 
Biomedical Word Sense Disambiguation presentation [Autosaved]
Biomedical Word Sense Disambiguation presentation [Autosaved]Biomedical Word Sense Disambiguation presentation [Autosaved]
Biomedical Word Sense Disambiguation presentation [Autosaved]akm sabbir
 

Destaque (20)

Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguation
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense Disambiguation
 
Word Sense Disambiguation and Induction
Word Sense Disambiguation and InductionWord Sense Disambiguation and Induction
Word Sense Disambiguation and Induction
 
Search Engine Marketing Overview - Greenwich Library SCORE presentation
Search Engine Marketing Overview - Greenwich Library SCORE presentationSearch Engine Marketing Overview - Greenwich Library SCORE presentation
Search Engine Marketing Overview - Greenwich Library SCORE presentation
 
Draft programme 15 09-2015
Draft programme 15 09-2015Draft programme 15 09-2015
Draft programme 15 09-2015
 
Word sense dissambiguation
Word sense dissambiguationWord sense dissambiguation
Word sense dissambiguation
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense Disambiguation
 
BibleTech2011
BibleTech2011BibleTech2011
BibleTech2011
 
A word sense disambiguation technique for sinhala
A word sense disambiguation technique  for sinhalaA word sense disambiguation technique  for sinhala
A word sense disambiguation technique for sinhala
 
Graph-based Word Sense Disambiguation
Graph-based Word Sense DisambiguationGraph-based Word Sense Disambiguation
Graph-based Word Sense Disambiguation
 
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
 
Disambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalDisambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document Retrieval
 
Thesis
ThesisThesis
Thesis
 
Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguation
 
Amharic WSD using WordNet
Amharic WSD using WordNetAmharic WSD using WordNet
Amharic WSD using WordNet
 
Word sense disambiguation a survey
Word sense disambiguation a surveyWord sense disambiguation a survey
Word sense disambiguation a survey
 
PhD defense Koen Deschacht
PhD defense Koen DeschachtPhD defense Koen Deschacht
PhD defense Koen Deschacht
 
Word-sense disambiguation
Word-sense disambiguationWord-sense disambiguation
Word-sense disambiguation
 
Biomedical Word Sense Disambiguation presentation [Autosaved]
Biomedical Word Sense Disambiguation presentation [Autosaved]Biomedical Word Sense Disambiguation presentation [Autosaved]
Biomedical Word Sense Disambiguation presentation [Autosaved]
 

Semelhante a Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...Matthias Trapp
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain OntologyKeerti Bhogaraju
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES cscpconf
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlescsandit
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextFulvio Rotella
 

Semelhante a Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks (20)

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Topic modelling
Topic modellingTopic modelling
Topic modelling
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector Representations
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
L0261075078
L0261075078L0261075078
L0261075078
 
L0261075078
L0261075078L0261075078
L0261075078
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 

Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks

  • 1. Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks. Università degli Studi di Milano - Bicocca. Di Donato Leonardo. Text Mining Course, Prof. Fabio Stella.
  • 2. Introduction. There is a superabundant amount of digital unstructured information, and it continues to grow at an astonishing rate (it doubles every two years). People cannot manage it: information overload. Problems: crawling, representing, storing, summarizing, clustering, searching ... (General rule: every problem is an opportunity.) Opportunity: automatically extract value from chaos. What value? How to do it?
  • 3. Goals. The value that we want to extract is clusters of semantically related documents. Our purpose is [1] the unsupervised clustering of a text dataset; [2] the implementation of information retrieval procedures that exploit the representation of documents at the topic level; [3] the modeling of the ability to computationally identify the meaning of words in context (word sense disambiguation). Dataset. Our document collection is a partition of the Associated Press dataset: ~2300 English news texts (dating back to the '90s). Like any text document, it is often messy and has flaws and noise, so we need to clean the data and build a structured representation of it.
  • 4. Pre-Processing. Google Refine: [1] replacement of abbreviations and common entities with expressions that normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million}); [2] adjustment of flaws; [3] stripping of metadata entities through regular expressions. MALLET: [1] lowercasing of all characters; [2] tokenization; [3] stop-word removal; [4] proportional vocabulary cut-off, with threshold 0.03; [5] term-frequency representation of each document. The corpus is a single file in which every line is a document in a fixed format (shown on the slide). Results: |W| = 32349 token types, 241908 words.
  • 5. Topic Models. Probabilistic generative models for uncovering the underlying semantic structure of a document collection, based on a Bayesian analysis of the original texts [Blei, 2003]. Goal: discover patterns of word use and connect documents that exhibit similar patterns. Idea: documents are mixtures of topics (assignments), and each topic is a multinomial probability distribution over words. Which topics generated the given corpus of documents with maximum likelihood? We have to infer 3 latent variables: [1] the word distribution of each topic, Φ(j) = P(W | Z = j); [2] the topic distribution of each document, Θ(d) = P(Z | D = d); [3] the word-topic assignments, P(Z | W).
  • 6. Topic Models. The Latent Dirichlet Allocation (LDA) model associates two smoothing hyper-parameters, α and β, with [2] and [1] respectively. αj can be read as a prior count of the number of times topic j has been selected for a document (α1, ..., αT are the parameters of a Dirichlet prior). β is the parameter of a Dirichlet prior and can be read as a prior count of the words drawn from a topic (before observing any document of the corpus). We need to estimate the distributions Φ and Θ. Different methods can be used (e.g., Gibbs sampling), and the estimates can be computed directly from the count matrices.
  • 7. Tuning. Which are the best values for the hyper-parameters? Usually α = 50/T and β = 0.01 give the best results [Steyvers and Griffiths, 2007]. What is the optimal number of topics T, and the number of iterations I? It depends on the specific problem and remains an open problem; we set T = 35 and T = 40. There are topic evaluation techniques that try to address this problem. We used one of them (the topic coherence metric, which evaluates the semantic coherence of a topic) to compare two model configurations: symmetric α versus asymmetric α.
  • 8. Symmetric α versus Asymmetric α. An asymmetric configuration (AS) of the α hyper-parameters serves to calibrate the degree of topic sparseness more flexibly. It has been empirically demonstrated that optimizing the Dirichlet hyper-parameters (α1, ..., αT) of the topic-document distribution makes a huge difference: topics are not dominated by very common words and they are more stable as their number increases [Wallach, 2009]. Our experiments did not confirm this: the average topic coherence of the AS configuration was worse than that of the SS configuration. Why? In our corpus there is no topic that tends to occur in every document (or the optimal T may be greater, or simply the answer is more trivial ...).
  • 9. Top topics for symmetric α and T = 35
  • 10. Post-Processing - Information Retrieval. Why should we use topic models to improve information retrieval tasks? [1] We can cluster queries according to the extracted topics; [2] two documents that share no common words can still be measured as similar. The query likelihood model is a basic approach to information retrieval in this context (a generative model): we evaluate how well a document matches a query by specifying how the words of the query may have been generated by a language model. We derive a language model for each document (a mixture of topics), so the relevant documents are those whose topic distribution is likely to have generated the set of words contained in (or associated with) the query → document similarity.
  • 11. Documents Similarity. Two approaches to compute the similarity between documents: [1] the probabilistic query approach; [2] comparison of the topic distributions of the documents. How? Through divergence metrics (e.g., symmetrised Kullback-Leibler, Jensen-Shannon).
  • 12. Similar documents for query "forest fire". AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
  • 13. Post-Processing - Word Sense Disambiguation. The ability to identify the meaning of words in context in a computational manner is usually referred to as Word Sense Disambiguation. Four elements: [1] selection of word senses (i.e., the classes); [2] use of external knowledge sources; [3] representation of context; [4] selection of an automatic classification method. Input: a user-specified context document dc that contains the word wx to be disambiguated. [1] → given the s words most similar to wx, for each of them we build a sense document capturing synsets, glosses, example phrases, and other relevant relations from WordNet; [2] → WordNet as the external knowledge source used to create the sense documents ds; [3] → the topical and the semantic features; [4] → comparison of document dc with each of the s documents ds (using one of the two approaches presented): the most similar one gives the sense of word wx in context dc.
  • 14. Words similarity. Two possible approaches to compute the similarity between words: [1] associative relation; [2] comparison of the (topic-word) P(Z|W) distributions.
  • 15. Words similar to token "arab"
  • 16. Future Work. Topic modeling → ● train an LDA model with asymmetric α for increasing values of T and evaluate the resulting topic quality ● train an LDA model with asymmetric α on a vocabulary on which no proportional cut-off has been performed ● investigate a possible implementation of a multiple-chain model to obtain more stable topics ● use other topic evaluation metrics. Information retrieval → ● assess and fine-tune the prior probability of a document in the query likelihood model ● use other high-frequency metrics (e.g., α-skew) for the comparison of distributions. Word sense disambiguation → ● implement and evaluate other methods to compare the context document and the sense documents (e.g., compute P(dc, ds) under the assumption that they are conditionally independent given the topic variable) ● refine the sense selection mechanism (e.g., choosing each of the s most probable words from a probability interval, to minimize the risk that all the most similar words refer to strictly correlated meanings).
  • 17. Thank you for your attention.
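
Slides 6 and 7 state that Φ and Θ can be computed directly from the count matrices, with α = 50/T and β = 0.01 as usual smoothing values. A minimal sketch of that computation is given here, assuming the smoothed-count estimator commonly used with Gibbs sampling and a symmetric α. The function name `estimate_phi_theta` and the toy counts are illustrative assumptions, not taken from the slides.

```python
def estimate_phi_theta(word_topic, doc_topic, alpha, beta):
    """Point estimates of Phi and Theta from topic-assignment counts.

    word_topic[w][j]: number of times word w is assigned to topic j
    doc_topic[d][j]:  number of tokens of document d assigned to topic j
    alpha, beta:      symmetric Dirichlet smoothing parameters (assumed symmetric)
    """
    n_words = len(word_topic)
    n_topics = len(word_topic[0])

    # Total number of tokens assigned to each topic.
    topic_totals = [sum(word_topic[w][j] for w in range(n_words))
                    for j in range(n_topics)]

    # phi[j][w] approximates P(W = w | Z = j).
    phi = [[(word_topic[w][j] + beta) / (topic_totals[j] + n_words * beta)
            for w in range(n_words)]
           for j in range(n_topics)]

    # theta[d][j] approximates P(Z = j | D = d).
    theta = [[(row[j] + alpha) / (sum(row) + n_topics * alpha)
              for j in range(n_topics)]
             for row in doc_topic]
    return phi, theta


# Toy counts (hypothetical): 3 word types, 2 topics, 2 documents.
word_topic = [[3, 0], [1, 2], [0, 4]]
doc_topic = [[4, 1], [0, 5]]
T = 2
phi, theta = estimate_phi_theta(word_topic, doc_topic, alpha=50 / T, beta=0.01)
print(phi)
print(theta)
```

With so few tokens, α = 50/T dominates the toy counts and pulls every row of Θ toward uniform; on a real corpus the observed counts carry much more weight.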
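
Slides 10 and 11 describe two ways of scoring documents: the probabilistic query approach and the comparison of topic distributions through divergence metrics. The sketch below illustrates both, assuming that a query word is generated as P(w | d) = Σj P(w | Z = j) P(Z = j | d) and that all topic distributions are strictly positive (as they are after Dirichlet smoothing). The function names and the toy Φ and Θ values are hypothetical.

```python
import math


def query_log_likelihood(query_word_ids, theta_d, phi):
    """log P(query | d), with each word drawn from the topic mixture of d."""
    total = 0.0
    for w in query_word_ids:
        p_w = sum(theta_d[j] * phi[j][w] for j in range(len(theta_d)))
        total += math.log(p_w)
    return total


def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q[i] > 0 where p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrised_kl(p, q):
    """Average of the two directed KL divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy model (hypothetical): 2 topics over 3 word types, 3 documents.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
thetas = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]

# Approach [1]: rank documents by the likelihood of the query words.
query = [0, 1]
ranking = sorted(range(len(thetas)),
                 key=lambda d: query_log_likelihood(query, thetas[d], phi),
                 reverse=True)
print(ranking)

# Approach [2]: compare the topic distributions of two documents.
print(jensen_shannon(thetas[0], thetas[2]), jensen_shannon(thetas[0], thetas[1]))
```

Jensen-Shannon is bounded and defined even when one distribution has zeros, which makes it the more forgiving of the two divergences if unsmoothed distributions are ever compared.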
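
Slide 14 lists the comparison of P(Z|W) distributions as one way to measure word similarity. One possible reading is sketched here: P(Z | W = w) is obtained from Φ by Bayes' rule under an assumed topic prior, and words are ranked by Jensen-Shannon divergence from the target word. The uniform prior, the function names, and the four-word vocabulary are assumptions made for illustration only.

```python
import math


def topics_given_word(phi, topic_prior, w):
    """P(Z | W = w) by Bayes' rule: proportional to P(w | Z = j) * P(Z = j)."""
    joint = [phi[j][w] * topic_prior[j] for j in range(len(phi))]
    norm = sum(joint)
    return [x / norm for x in joint]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def most_similar_words(phi, topic_prior, vocab, word, k):
    """Return the k words whose P(Z | W) is closest to that of `word`."""
    w = vocab.index(word)
    target = topics_given_word(phi, topic_prior, w)
    scored = []
    for v, other in enumerate(vocab):
        if v == w:
            continue  # skip the query word itself
        dist = jensen_shannon(target, topics_given_word(phi, topic_prior, v))
        scored.append((dist, other))
    scored.sort()  # smallest divergence first
    return [name for _, name in scored[:k]]


# Toy model (hypothetical): 2 topics over a 4-word vocabulary.
vocab = ["fire", "forest", "dollar", "million"]
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.40, 0.50]]
topic_prior = [0.5, 0.5]  # assumed uniform P(Z)

print(most_similar_words(phi, topic_prior, vocab, "fire", 2))
```

The same ranking could feed the sense selection step of slide 13, where the s words most similar to wx are used to build the sense documents.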