This work presents experiments on the use of Topic Modeling for the implementation and improvement of some common Information Retrieval and Word Sense Disambiguation tasks.
It first describes the scenario, the pre-processing pipeline we built, and the framework used; it then discusses several hyper-parameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of a held-out document (or query) and comparing it with those of the collection's documents to retrieve similar ones, or an approach driven by probabilistic querying. The last part of this work is devoted to the word sense disambiguation task.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
1. Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Università degli Studi di Milano - Bicocca
Di Donato Leonardo
Text Mining Course - Prof. Fabio Stella
2. Introduction
Di Donato Leonardo, Università degli Studi di Milano - Bicocca
a superabundant amount of unstructured digital information
it continues to grow at an astonishing rate (doubling every two years)
humans cannot manage it: information overload.
problems: crawling, representing, storing, summarizing, clustering, searching ...
(general rule: every problem is an opportunity)
opportunity: automatically extract value from chaos
what value? how to do it?
3. Goals
the value that we want to extract is: clusters of semantically related documents
our purpose is [1] the unsupervised clustering of a text dataset [2] the implementation of information retrieval procedures that exploit the representation of documents at the topic level [3] the modeling of the ability to computationally identify the meaning of words in context (word sense disambiguation)
our document collection: a partition of the Associated Press dataset
~ 2300 English news articles (dating back to the '90s)
characteristic of any text document: it is often messy, has flaws and noise
we need to clean the data
we need a structured representation of the data
Dataset
4. Pre-Processing
Google Refine [ link ]
[1] replacement of abbreviations and common entities with expressions that
normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million})
[2] adjustment of flaws and [3] stripping metadata entities through regular
expressions
MALLET [ link ]
[1] make all the characters lowercase
[2] tokenization [3] stop-word removal
[4] vocabulary proportional cut-off, with threshold 0.03
[5] term-frequency representation of each document
the corpus is a single file; every line is a document with this format:
results: |W| = 32349 token types, 241908 words
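As an illustration, the pipeline above can be sketched in plain Python (a minimal sketch, not the actual Google Refine / MALLET configuration: the stop-word list is a placeholder, and dropping token types that occur in more than 3% of documents is only one plausible reading of the proportional cut-off):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "at"}  # placeholder list
CUTOFF = 0.03  # proportional cut-off threshold from the slides

def normalize(text):
    # [1] replace abbreviations with expressions that normalize them
    text = re.sub(r"\bdlrs?\b", "dollar", text, flags=re.IGNORECASE)
    text = re.sub(r"\bmlns?\b", "million", text, flags=re.IGNORECASE)
    return text

def tokenize(doc):
    # lowercase, tokenize, remove stop-words
    return [t for t in re.findall(r"[a-z]+", normalize(doc).lower())
            if t not in STOPWORDS]

def build_corpus(docs):
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))
    # vocabulary proportional cut-off: drop token types occurring in more
    # than CUTOFF * |docs| documents (assumed reading of the threshold)
    keep = {w for w, df in doc_freq.items() if df / len(docs) <= CUTOFF}
    # term-frequency representation of each document
    return [Counter(t for t in toks if t in keep) for toks in tokenized]
```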
5. Topic Models
probabilistic generative models for uncovering the underlying semantic
structure of a document collection based on a Bayesian analysis of the
original texts [ Blei, 2003 ]
goal: discover patterns of word-use and connect documents that exhibit
similar patterns
idea: documents are mixtures of topics (assignments) and each topic is a
multinomial probability distribution over words
which topics have generated the given corpus of documents with
maximum likelihood?
we have to infer 3 latent variables: [1] the word distribution over topics [2]
the topics distribution over documents [3] the word-topic assignments
[1] Φ(j) = P(W | Z = j)   [2] Θ(d) = P(Z | D = d)   [3] P(Z | W)
6. Topic Models
the Latent Dirichlet Allocation (LDA) model associates with [2] and [1] two
smoothing hyper-parameters α and β.
αj indicates the number of times topic j has been selected for a document,
before observing any of its words (α1, ..., αT are the parameters of a
Dirichlet prior)
β is the parameter of a Dirichlet prior which indicates the count of
words extracted from a topic (before observing any corpus document)
to estimate them we can use different methods (e.g., Gibbs Sampling)
we then need to estimate the distributions Φ and Θ: they can be computed
directly from the matrices of counts
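The computation of Φ and Θ from the count matrices can be sketched as follows (an illustrative reconstruction assuming the standard smoothed Gibbs-sampling estimators; the matrix names CWT/CDT are placeholders, not necessarily those of the framework used):

```python
import numpy as np

def estimate_phi_theta(cwt, cdt, alpha, beta):
    """Point estimates of Phi = P(W|Z) and Theta = P(Z|D) from the
    Gibbs-sampling count matrices (standard smoothed estimators).

    cwt   : (W, T) counts of word w assigned to topic j
    cdt   : (D, T) counts of topic j assigned in document d
    alpha : (T,) Dirichlet prior over topics per document
    beta  : scalar Dirichlet prior over words per topic
    """
    n_words = cwt.shape[0]
    # Phi[w, j] = (CWT[w, j] + beta) / (sum_w CWT[w, j] + W * beta)
    phi = (cwt + beta) / (cwt.sum(axis=0, keepdims=True) + n_words * beta)
    # Theta[d, j] = (CDT[d, j] + alpha_j) / (sum_j CDT[d, j] + sum_j alpha_j)
    theta = (cdt + alpha) / (cdt.sum(axis=1, keepdims=True) + alpha.sum())
    return phi, theta
```

Each column of Φ and each row of Θ is a proper probability distribution by construction, which is what the similarity computations later on rely on.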
7. Tuning
which are the best values for the hyper-parameters? usually α = 50/T and β =
0.01 are those that give the best results [ Steyvers and Griffiths, 2007 ]
which is the optimal number of topics T? and the number of iterations I?
it depends on the specific problem; it's an open problem
we have set T = 35 and T = 40
there are topic evaluation techniques that try to address this problem ...
we have used one of these techniques (i.e., the topic coherence metric, which
evaluates the semantic coherence of a topic) to compare two model
configurations: symmetric α versus asymmetric α
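One common instance of such a metric is document co-occurrence coherence (the UMass variant of Mimno et al.); the sketch below is illustrative and not necessarily the exact metric used in these experiments. It assumes every top word occurs in at least one document:

```python
import math

def umass_coherence(top_words, docs):
    """UMass topic coherence: sum over ordered pairs of a topic's top
    words of log((D(w_i, w_j) + 1) / D(w_j)), where D(...) counts the
    documents containing all the given words.  Scores are <= 0 and
    higher (closer to 0) means a more coherent topic.
    docs is the corpus as a list of token sets."""
    def D(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for i, wi in enumerate(top_words):
        for wj in top_words[:i]:  # wj is ranked above wi
            score += math.log((D(wi, wj) + 1) / D(wj))
    return score
```

Averaging this score over all topics gives the per-model figure used to compare the symmetric and asymmetric configurations.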
8. Symmetric α versus Asymmetric α
an asymmetric configuration (AS) of the α hyper-parameters allows the
degree of topic sparseness to be calibrated with more flexibility
it has been empirically demonstrated that optimizing the Dirichlet
hyper-parameters (α1, ..., αT) of the topic-document distribution makes a huge
difference: topics are not dominated by very common words and they are
more stable as their number increases [ Wallach, 2009 ]
this has not been verified by our experimentation: the average topic
coherence of the AS configuration was worse than that of the SS configuration
why? in our corpus there isn't a topic that tends to occur
in each document (or the optimal number of topics T may be greater, or simply
the answer is more trivial ...)
9. Top topics for symmetric α and T = 35
10. Post-Processing - Information Retrieval
why should we use topic models to improve information retrieval tasks ?
[1] we can cluster queries according to the extracted topics
[2] two documents which share no common words can be measured as
similar
the query likelihood model is a basic approach to information retrieval
in this context (a generative model) we can evaluate how well a document
matches a query by specifying how the words of the query may have been
generated by a language model
we derive a language model for each document (a mixture of topics)
so, the relevant documents will be those whose topic distribution is likely
to have generated the set of words contained in (or associated with) the query
→ documents similarity
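Under these assumptions, ranking by query likelihood reduces to scoring each query word against the topic mixture of each document, P(q | d) = ∏_w Σ_j P(w | z = j) P(z = j | d). A minimal sketch over the estimated Φ and Θ (function and variable names are illustrative):

```python
import numpy as np

def query_log_likelihood(query_ids, theta_d, phi):
    """log P(q | d) under the topic-based language model:
    P(q | d) = prod_{w in q} sum_j P(w | z = j) * P(z = j | d).

    query_ids : word indices of the query tokens
    theta_d   : (T,) topic distribution of document d
    phi       : (W, T) word distributions over topics
    """
    word_probs = phi[query_ids] @ theta_d  # P(w | d) for each query word
    return float(np.sum(np.log(word_probs)))

def rank_documents(query_ids, theta, phi):
    """Rank all documents (theta: (D, T)) by decreasing query likelihood."""
    scores = [query_log_likelihood(query_ids, td, phi) for td in theta]
    return sorted(range(len(scores)), key=lambda d: -scores[d])
```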
11. Documents Similarity
two approaches to compute the similarity between documents
[1] probabilistic query approach
[2] comparison of the topic distributions of documents
how? through divergence metrics (e.g., symmetrised Kullback-Leibler,
Jensen-Shannon)
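Both divergence metrics operate directly on topic distribution vectors; an illustrative implementation (the small ε guarding zero probabilities is an assumption of this sketch):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler divergence: KL(p || q) + KL(q || p)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))
```

Lower divergence means more similar documents; Jensen-Shannon is bounded (by ln 2 in nats), which makes it convenient for ranking.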
12. Similar documents for query "forest fire"
AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on
Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
13. Post-Processing - Word Sense Disambiguation
the ability to identify the meaning of words in context in a computational
manner is usually referred to as Word Sense Disambiguation
four elements: [1] selection of word senses (i.e., the classes) [2] use of
external knowledge sources [3] representation of context [4] selection of an
automatic classification method
input: a user-specified context document dc that contains the word wx to be
disambiguated
[1] → given the s words most similar to wx, for each of these we build a sense
document capturing synsets, glosses, example phrases, and other relevant
relations from WordNet
[2] → WordNet as the external knowledge source used to create the sense
documents ds
[3] → the topical and the semantic features
[4] → comparison of the document dc with each of the s sense documents ds (with
one of the two approaches presented): the most similar will be the sense of
word wx in context dc
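A schematic sketch of steps [1]-[4]. The sense entries below stand in for the WordNet lookups (synonyms, gloss, example phrases), and the similarity function is pluggable: the real system would use one of the two topic-based approaches presented, while the test uses plain token overlap just to keep the example self-contained.

```python
def build_sense_document(sense):
    """[1]-[2] Flatten one sense entry (a stand-in for a WordNet synset:
    synonyms, gloss, example phrases) into a bag of tokens."""
    text = " ".join([" ".join(sense["synonyms"]), sense["gloss"]] + sense["examples"])
    return text.lower().split()

def disambiguate(context_tokens, senses, similarity):
    """[4] Return the name of the sense whose sense document is most
    similar to the context document under the given similarity function."""
    return max(
        senses,
        key=lambda s: similarity(context_tokens, build_sense_document(s)),
    )["name"]
```

With even a toy token-overlap similarity, a context about cashing checks selects the financial sense of "bank" over the riverside one.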
14. Words similarity
two possible approaches to compute the similarity between words:
[1] associative relation
[2] comparison of the (topic-word) P(Z|W) distributions
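Approach [2] can be sketched by Bayes-inverting Φ to obtain P(Z|W) and comparing the resulting per-word topic distributions (an illustrative sketch; the uniform topic prior used when none is supplied is an assumption):

```python
import numpy as np

def topic_given_word(phi, pz=None):
    """P(Z | W) by Bayes inversion: P(z = j | w) is proportional to
    P(w | z = j) * P(z = j).
    phi: (W, T) word distributions over topics; pz: (T,) topic prior."""
    n_topics = phi.shape[1]
    pz = np.full(n_topics, 1.0 / n_topics) if pz is None else np.asarray(pz, float)
    joint = phi * pz                                  # P(w, z)
    return joint / joint.sum(axis=1, keepdims=True)   # normalise over topics

def word_similarity(pzw, w1, w2, eps=1e-12):
    """Similarity of two words: negative symmetrised KL of their P(Z | w)."""
    p, q = pzw[w1] + eps, pzw[w2] + eps
    return -float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```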
15. Words similar to token "arab"
16. Future Work
topic modeling →
● train an LDA model with asymmetric α for increasing values of T and evaluate the
resulting quality of topics
● train an LDA model with asymmetric α on a vocabulary on which no
proportional cut-off has been performed
● investigate a possible implementation of a multiple-chain model to obtain
more stable topics
● use other topic evaluation metrics
information retrieval →
● assess and fine-tune the prior probability of a document in the query likelihood model
● use other distributional metrics (e.g., the α-skew divergence) for the
comparison of distributions
word sense disambiguation →
● implement and evaluate other methods to compare the context document and the
sense documents (e.g., compute P(dc, ds) under the assumption that they are
conditionally independent given the topic variable)
● refine the mechanism of sense selection (e.g., choosing each of the s most
probable words within a probability interval in order to minimize the risk that
all the most similar words refer to strictly correlated meanings)
17. Thank you for your attention.