Correlated Topic Models


                   David M. Blei                          John D. Lafferty
           Department of Computer Science             School of Computer Science
                Princeton University                  Carnegie Mellon University



                                          Abstract

         Topic models, such as latent Dirichlet allocation (LDA), can be useful
         tools for the statistical analysis of document collections and other dis-
         crete data. The LDA model assumes that the words of each document
         arise from a mixture of topics, each of which is a distribution over the vo-
         cabulary. A limitation of LDA is the inability to model topic correlation
         even though, for example, a document about genetics is more likely to
         also be about disease than x-ray astronomy. This limitation stems from
         the use of the Dirichlet distribution to model the variability among the
         topic proportions. In this paper we develop the correlated topic model
         (CTM), where the topic proportions exhibit correlation via the logistic
         normal distribution [1]. We derive a mean-field variational inference al-
         gorithm for approximate posterior inference in this model, which is com-
         plicated by the fact that the logistic normal is not conjugate to the multi-
         nomial. The CTM gives a better fit than LDA on a collection of OCRed
         articles from the journal Science. Furthermore, the CTM provides a nat-
         ural way of visualizing and exploring this and other unstructured data
         sets.


1   Introduction

The availability and use of unstructured historical collections of documents is rapidly grow-
ing. As one example, JSTOR (www.jstor.org) is a not-for-profit organization that main-
tains a large online scholarly journal archive obtained by running an optical character recog-
nition engine over the original printed journals. JSTOR indexes the resulting text and pro-
vides online access to the scanned images of the original content through keyword search.
This provides an extremely useful service to the scholarly community, with the collection
comprising nearly three million published articles in a variety of fields.
The sheer size of this unstructured and noisy archive naturally suggests opportunities for
the use of statistical modeling. For instance, a scholar in a narrow subdiscipline, searching
for a particular research article, would certainly be interested to learn that the topic of
that article is highly correlated with another topic that the researcher may not have known
about, and that is not explicitly contained in the article. Alerted to the existence of this new
related topic, the researcher could browse the collection in a topic-guided manner to begin
to investigate connections to a previously unrecognized body of work. Since the archive
comprises millions of articles spanning centuries of scholarly work, automated analysis is
essential.
Several statistical models have recently been developed for automatically extracting the
topical structure of large document collections. In technical terms, a topic model is a
generative probabilistic model that uses a small number of distributions over a vocabulary
to describe a document collection. When fit from data, these distributions often correspond
to intuitive notions of topicality. In this work, we build upon the latent Dirichlet allocation
(LDA) [4] model. LDA assumes that the words of each document arise from a mixture
of topics. The topics are shared by all documents in the collection; the topic proportions
are document-specific and randomly drawn from a Dirichlet distribution. LDA allows each
document to exhibit multiple topics with different proportions, and it can thus capture the
heterogeneity in grouped data that exhibit multiple latent patterns. Recent work has used
LDA in more complicated document models [9, 11, 7], and in a variety of settings such
as image processing [12], collaborative filtering [8], and the modeling of sequential data
and user profiles [6]. Similar models were independently developed for disability survey
data [5] and population genetics [10].
Our goal in this paper is to address a limitation of the topic models proposed to date: they
fail to directly model correlation between topics. In many—indeed most—text corpora, it
is natural to expect that subsets of the underlying latent topics will be highly correlated. In
a corpus of scientific articles, for instance, an article about genetics may be likely to also
be about health and disease, but unlikely to also be about x-ray astronomy. For the LDA
model, this limitation stems from the independence assumptions implicit in the Dirichlet
distribution on the topic proportions. Under a Dirichlet, the components of the proportions
vector are nearly independent; this leads to the strong and unrealistic modeling assumption
that the presence of one topic is not correlated with the presence of another.
In this paper we present the correlated topic model (CTM). The CTM uses an alterna-
tive, more flexible distribution for the topic proportions that allows for covariance structure
among the components. This gives a more realistic model of latent topic structure where
the presence of one latent topic may be correlated with the presence of another. In the
following sections we develop the technical aspects of this model, and then demonstrate its
potential for the applications envisioned above. We fit the model to a portion of the JSTOR
archive of the journal Science. We demonstrate that the model gives a better fit than LDA,
as measured by the accuracy of the predictive distributions over held out documents. Fur-
thermore, we demonstrate qualitatively that the correlated topic model provides a natural
way of visualizing and exploring such an unstructured collection of textual data.

2   The Correlated Topic Model
The key to the correlated topic model we propose is the logistic normal distribution [1]. The
logistic normal is a distribution on the simplex that allows for a general pattern of variability
between the components by transforming a multivariate normal random variable. Consider
the natural parameterization of a K-dimensional multinomial distribution:
                                p(z \mid \eta) = \exp\{\eta^\top z - a(\eta)\}.                                (1)
The random variable Z can take on K values; it can be represented by a K-vector with
exactly one component equal to one, denoting a value in {1, . . . , K}. The cumulant gener-
ating function of the distribution is
                                a(\eta) = \log \textstyle\sum_{i=1}^{K} \exp\{\eta_i\}.                        (2)
The mapping between the mean parameterization (i.e., the simplex) and the natural param-
eterization is given by
                                     \eta_i = \log(\theta_i / \theta_K).                                       (3)
Notice that this is not the minimal exponential family representation of the multinomial
because multiple values of η can yield the same mean parameter.
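As a concrete illustration of this mapping (not taken from the paper), the short Python sketch below checks that the softmax inverts equation (3) and that shifting η by a constant leaves the mean parameter unchanged, which is why the representation is not minimal.

```python
# Sketch (not from the paper): the mapping between the mean parameterization
# (a point theta on the simplex) and the natural parameterization eta, and why
# the representation in equation (3) is not minimal.
import numpy as np

def to_natural(theta):
    """eta_i = log(theta_i / theta_K), following equation (3)."""
    return np.log(theta / theta[-1])

def to_mean(eta):
    """Inverse map (softmax): theta_i = exp(eta_i) / sum_j exp(eta_j)."""
    e = np.exp(eta - eta.max())          # subtract the max for numerical stability
    return e / e.sum()

theta = np.array([0.5, 0.3, 0.2])
eta = to_natural(theta)
print(np.allclose(to_mean(eta), theta))        # True: the two maps invert each other
print(np.allclose(to_mean(eta + 7.0), theta))  # True: eta is identified only up to an
                                               # additive constant (non-minimal)
```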
[Figure 1, top panel: directed graphical model with nodes µ, Σ, β_k, η_d, Z_d,n, and W_d,n, and plates over the N words, D documents, and K topics.]
Figure 1: Top: Graphical model representation of the correlated topic model. The logistic
normal distribution, used to model the latent topic proportions of a document, can represent
correlations between topics that are impossible to capture using a single Dirichlet. Bottom:
Example densities of the logistic normal on the 2-simplex. From left: diagonal covariance
and nonzero-mean, negative correlation between components 1 and 2, positive correlation
between components 1 and 2.


The logistic normal distribution assumes that η is normally distributed and then mapped
to the simplex with the inverse of the mapping given in equation (3); that is, f(η_i) =
exp{η_i} / Σ_j exp{η_j}. The logistic normal models correlations between components of the
simplicial random variable through the covariance matrix of the normal distribution. The
logistic normal was originally studied in the context of analyzing observed compositional
data such as the proportions of minerals in geological samples. In this work, we extend its
use to a hierarchical model where it describes the latent composition of topics associated
with each document.
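The following small simulation (our own illustration, with toy values for µ and Σ) draws points from a logistic normal on the simplex and shows how the covariance of the underlying Gaussian translates into correlation between components of the simplicial random variable, as in the bottom panels of Figure 1.

```python
# Sketch: the covariance of the underlying Gaussian induces correlation between
# components of theta = f(eta) on the simplex; mu and Sigma are toy choices.
import numpy as np

rng = np.random.default_rng(0)

def logistic_normal_samples(mu, Sigma, n):
    eta = rng.multivariate_normal(mu, Sigma, size=n)
    e = np.exp(eta - eta.max(axis=1, keepdims=True))   # row-wise stable softmax
    return e / e.sum(axis=1, keepdims=True)

mu = np.zeros(3)
diag = np.eye(3)
corr = np.array([[1.0, 0.9, 0.0],
                 [0.9, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])

for Sigma in (diag, corr):
    theta = logistic_normal_samples(mu, Sigma, 50_000)
    print(np.corrcoef(theta[:, 0], theta[:, 1])[0, 1])
# The correlated Sigma makes the first two components of theta positively correlated,
# while the diagonal Sigma leaves them weakly negatively correlated, since components
# on the simplex compete with one another.
```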
Let {µ, Σ} be a K-dimensional mean and covariance matrix, and let topics β1:K be K
multinomials over a fixed word vocabulary. The correlated topic model assumes that an
N -word document arises from the following generative process:

      1. Draw η | {µ, Σ} ∼ N (µ, Σ).
      2. For n ∈ {1, . . . , N }:
          (a) Draw topic assignment Zn | η from Mult(f (η)).
          (b) Draw word Wn | {zn , β1:K } from Mult(βzn ).

This process is identical to the generative process of LDA except that the topic proportions
are drawn from a logistic normal rather than a Dirichlet. The model is shown as a directed
graphical model in Figure 1.
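A minimal sketch of this generative process is given below, assuming toy values for the number of topics, vocabulary size, document length, and covariance; it is meant only to make the two steps above concrete.

```python
# Sketch of the CTM generative process with toy parameters.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20                       # topics, vocabulary size, document length

mu = np.zeros(K)                          # mean of the topic-proportion Gaussian
Sigma = np.array([[1.0, 0.8, 0.0],        # positive correlation between topics 0 and 1
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
beta = rng.dirichlet(np.ones(V), size=K)  # K topic distributions over the vocabulary

def f(eta):
    """Logistic (softmax) map from R^K to the simplex."""
    e = np.exp(eta - eta.max())
    return e / e.sum()

# 1. Draw eta ~ N(mu, Sigma) and map it to topic proportions.
eta = rng.multivariate_normal(mu, Sigma)
theta = f(eta)

# 2. For each word, draw a topic assignment z_n, then a word w_n ~ Mult(beta_{z_n}).
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])
print(theta, w)
```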
The CTM is more expressive than LDA. The strong independence assumption imposed
by the Dirichlet in LDA is not realistic when analyzing document collections, where one
may find strong correlations between topics. The covariance matrix of the logistic normal
in the CTM is introduced to model such correlations. In Section 3, we illustrate how the
higher order structure given by the covariance can be used as an exploratory tool for better
understanding and navigating a large corpus of documents. Moreover, modeling correlation
can lead to better predictive distributions. In some settings, such as collaborative filtering,
the goal is to predict unseen items conditional on a set of observations. An LDA model
will predict words based on the latent topics that the observations suggest, but the CTM
has the ability to predict items associated with additional topics that are correlated with the
conditionally probable topics.

2.1 Posterior inference and parameter estimation

Posterior inference is the central challenge to using the CTM. The posterior distribution of
the latent variables conditional on a document, p(η, z1:N | w1:N ), is intractable to compute;
once conditioned on some observations, the topic assignments z1:N and log proportions
η are dependent. We make use of mean-field variational methods to efficiently obtain an
approximation of this posterior distribution.
In brief, the strategy employed by mean-field variational methods is to form a factorized
distribution of the latent variables, parameterized by free variables which are called the vari-
ational parameters. These parameters are fit so that the Kullback-Leibler (KL) divergence
between the approximate and true posterior is small. For many problems this optimization
problem is computationally manageable, while standard methods, such as Markov Chain
Monte Carlo, are impractical. The tradeoff is that variational methods do not come with
the same theoretical guarantees as simulation methods. See [13] for a modern review of
variational methods for statistical inference.
In graphical models composed of conjugate-exponential family pairs and mixtures, the
variational inference algorithm can be automatically derived from general principles [2,
14]. In the CTM, however, the logistic normal is not conjugate to the multinomial. We
will therefore derive a variational inference algorithm by taking into account the special
structure and distributions used by our model.
We begin by using Jensen’s inequality to bound the log probability of a document:
  \log p(w_{1:N} \mid \mu, \Sigma, \beta) \geq                                                       (4)
      E_q[\log p(\eta \mid \mu, \Sigma)] + \textstyle\sum_{n=1}^{N} \big( E_q[\log p(z_n \mid \eta)] + E_q[\log p(w_n \mid z_n, \beta)] \big) + H(q),
where the expectation is taken with respect to a variational distribution of the latent vari-
ables, and H (q) denotes the entropy of that distribution. We use a factorized distribution:
          q(\eta_{1:K}, z_{1:N} \mid \lambda_{1:K}, \nu^2_{1:K}, \phi_{1:N}) = \textstyle\prod_{i=1}^{K} q(\eta_i \mid \lambda_i, \nu_i^2) \, \prod_{n=1}^{N} q(z_n \mid \phi_n).   (5)
The variational distributions of the discrete variables z1:N are specified by the K-
dimensional multinomial parameters φ1:N . The variational distributions of the continuous
variables η1:K are K independent univariate Gaussians with parameters {λi , νi }. Since the variational pa-
rameters are fit using a single observed document w1:N , there is no advantage in introduc-
ing a non-diagonal variational covariance matrix.
The nonconjugacy of the logistic normal leads to difficulty in computing the expected log
probability of a topic assignment:
                 E_q[\log p(z_n \mid \eta)] = E_q[\eta^\top z_n] - E_q\big[\log \textstyle\sum_{i=1}^{K} \exp\{\eta_i\}\big].            (6)

To preserve the lower bound on the log probability, we upper bound the log normalizer
with a Taylor expansion,
           E_q\big[\log \textstyle\sum_{i=1}^{K} \exp\{\eta_i\}\big] \leq \zeta^{-1} \big( \textstyle\sum_{i=1}^{K} E_q[\exp\{\eta_i\}] \big) - 1 + \log\zeta,            (7)

where we have introduced a new variational parameter ζ. The expectation Eq [exp{ηi }] is
the mean of a log normal distribution, with mean and variance obtained from the variational
parameters {λi , νi^2 }; thus, Eq [exp{ηi }] = exp{λi + νi^2 /2} for i ∈ {1, . . . , K}.
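As a sanity check (not part of the paper), the following Monte Carlo sketch verifies the log-normal expectation Eq[exp{ηi}] = exp{λi + νi^2/2} and confirms that the bound in equation (7) holds for an arbitrary ζ > 0; the values of λ, ν^2, and ζ are toy choices.

```python
# Monte Carlo check of E_q[exp(eta_i)] = exp(lambda_i + nu_i^2 / 2) and of the
# Taylor bound in equation (7); lambda, nu^2, and zeta are toy values.
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([0.2, -1.0, 0.5])
nu2 = np.array([0.3, 0.1, 0.2])

samples = rng.normal(lam, np.sqrt(nu2), size=(200_000, 3))
lhs = np.log(np.exp(samples).sum(axis=1)).mean()         # E_q[log sum_i exp(eta_i)]
analytic = np.exp(lam + nu2 / 2)                         # E_q[exp(eta_i)], closed form
print(np.allclose(np.exp(samples).mean(axis=0), analytic, rtol=1e-2))  # True

zeta = 1.7                                               # any zeta > 0 gives a valid bound
rhs = analytic.sum() / zeta - 1 + np.log(zeta)           # right-hand side of equation (7)
print(lhs <= rhs)                                        # True: the bound holds
```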
Figure 2: A portion of the topic graph learned from 16,351 OCR articles from Science.
Each node represents a topic, and is labeled with the five most probable phrases from its
distribution (phrases are found by the “turbo topics” method [3]). The interested reader can
browse the full model at http://www.cs.cmu.edu/˜lemur/science/.


Given a model {β1:K , µ, Σ} and a document w1:N , the variational inference algorithm op-
timizes equation (4) with respect to the variational parameters {λ1:K , ν1:K , φ1:N , ζ}. We
use coordinate ascent, repeatedly optimizing with respect to each parameter while holding
the others fixed. In variational inference for LDA, each coordinate can be optimized ana-
lytically. However, iterative methods are required for the CTM when optimizing for λi and
νi^2 . The details are given in Appendix A.
Given a collection of documents, we carry out parameter estimation in the correlated topic
model by attempting to maximize the likelihood of a corpus of documents as a function
of the topics β1:K and the multivariate Gaussian parameters {µ, Σ}. We use variational
expectation-maximization (EM), where we maximize the bound on the log probability of a
collection given by summing equation (4) over the documents.
In the E-step, we maximize the bound with respect to the variational parameters by per-
forming variational inference for each document. In the M-step, we maximize the bound
with respect to the model parameters. This is maximum likelihood estimation of the top-
ics and multivariate Gaussian using expected sufficient statistics, where the expectation
is taken with respect to the variational distributions computed in the E-step. The E-step
and M-step are repeated until the bound on the likelihood converges. In the experiments
reported below, we run variational inference until the relative change in the probability
bound of equation (4) is less than 10^{-6}, and run variational EM until the relative change in
the likelihood bound is less than 10^{-5}.
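Schematically, the variational EM procedure described above might be organized as follows; `e_step`, `m_step`, and `bound` are hypothetical helper functions standing in for the per-document inference of Appendix A, the expected-sufficient-statistics updates, and the bound of equation (4).

```python
# Schematic variational EM loop (e_step, m_step, and bound are hypothetical helpers;
# the convergence threshold follows the value reported in the text).
def variational_em(docs, model, e_step, m_step, bound, tol=1e-5):
    """Alternate E- and M-steps until the relative change in the corpus bound is below tol."""
    old = None
    while True:
        # E-step: variational inference (Appendix A) for every document.
        var_params = [e_step(doc, model) for doc in docs]
        # M-step: maximum likelihood for beta and {mu, Sigma} from expected
        # sufficient statistics under the variational distributions.
        model = m_step(docs, var_params)
        new = sum(bound(doc, vp, model) for doc, vp in zip(docs, var_params))
        if old is not None and abs((new - old) / old) < tol:
            return model, var_params
        old = new
```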


3          Examples and Empirical Results: Modeling Science

In order to test and illustrate the correlated topic model, we estimated a 100-topic CTM
on 16,351 Science articles spanning 1990 to 1999. We constructed a graph of the la-
tent topics and the connections among them by examining the most probable words from
each topic and the between-topic correlations. Part of this graph is illustrated in Fig-
ure 2. In this subgraph, there are three densely connected collections of topics: material
science, geology, and cell biology. Furthermore, an estimated CTM can be used to ex-
plore otherwise unstructured observed documents. In Figure 4, we list articles that are
assigned to the cognitive science topic and articles that are assigned to both the cognitive
science and visual neuroscience topics.

Figure 3: (L) The average held-out probability; CTM supports more topics than LDA. See
figure at right for the standard error of the difference. (R) The log odds ratio of the held-out
probability. Positive numbers indicate a better fit by the correlated topic model.

The interested reader is invited to visit
http://www.cs.cmu.edu/˜lemur/science/ to interactively explore this model, in-
cluding the topics, their connections, and the articles that exhibit them.
We compared the CTM to LDA by fitting a smaller collection of articles to models of vary-
ing numbers of topics. This collection contains the 1,452 documents from 1960; we used
a vocabulary of 5,612 words after pruning common function words and terms that occur
once in the collection. Using ten-fold cross validation, we computed the log probability of
the held-out data given a model estimated from the remaining data. A better model of the
document collection will assign higher probability to the held out data. To avoid comparing
bounds, we used importance sampling to compute the log probability of a document where
the fitted variational distribution is the proposal.
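A sketch of this importance sampling estimate is given below; `sample_q`, `log_q`, and `log_joint` are hypothetical placeholders for sampling from the fitted variational distribution, evaluating its density, and evaluating the model's joint density of the document and its latent variables.

```python
# Sketch: importance sampling estimate of the held-out log probability of a document,
# with the fitted variational distribution q as the proposal distribution.
import numpy as np

def log_prob_importance(w, sample_q, log_q, log_joint, num_samples=1000):
    """log p(w) ~= log (1/S) sum_s p(w, h_s) / q(h_s), with h_s drawn from q."""
    log_w = np.array([log_joint(w, h) - log_q(h)
                      for h in (sample_q() for _ in range(num_samples))])
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())   # numerically stable log-mean-exp
```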
Figure 3 illustrates the average held out log probability for each model and the average
difference between them. The CTM provides a better fit than LDA and supports more
topics; the likelihood for LDA peaks near 30 topics while the likelihood for the CTM peaks
close to 90 topics. The means and standard errors of the difference in log-likelihood of the
models are shown at right; this indicates that the CTM always gives a better fit.
Another quantitative evaluation of the relative strengths of LDA and the CTM is how well
the models predict the remaining words after observing a portion of the document. Sup-
pose we observe words w1:P from a document and are interested in which model provides
a better predictive distribution p(w | w1:P ) of the remaining words. To compare these dis-
tributions, we use perplexity, which can be thought of as the effective number of equally
likely words according to the model. Mathematically, the perplexity of a word distribu-
tion is defined as the inverse of the per-word geometric average of the probability of the
observations,
                 \mathrm{Perp}(\Phi) = \Big( \textstyle\prod_{d=1}^{D} \prod_{i=P+1}^{N_d} p(w_i \mid \Phi, w_{1:P}) \Big)^{-1 / \sum_{d=1}^{D} (N_d - P)},
where Φ denotes the model parameters of an LDA or CTM model. Note that lower numbers
denote more predictive power.
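In code, the perplexity above reduces to exponentiating the negative average per-word log probability; the sketch below assumes the per-word predictive log probabilities of the unobserved words have already been computed.

```python
# Perplexity from per-word predictive log probabilities of the unobserved words.
import numpy as np

def perplexity(log_probs_per_doc):
    """exp( - total predictive log probability / total number of predicted words )."""
    total_log_prob = sum(np.sum(lp) for lp in log_probs_per_doc)
    total_words = sum(len(lp) for lp in log_probs_per_doc)
    return float(np.exp(-total_log_prob / total_words))

# Two toy documents whose remaining words each have probability 1/2000 give perplexity 2000.
toy = [np.log(np.full(30, 1 / 2000.0)), np.log(np.full(50, 1 / 2000.0))]
print(perplexity(toy))
```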
Top Articles with {brain, memory, human, visual, cognitive}:
   (1) Separate Neural Bases of Two Fundamental Memory Processes in the Human Medial
       Temporal Lobe
   (2) Inattentional Blindness Versus Inattentional Amnesia for Fixated but Ignored Words
   (3) Making Memories: Brain Activity that Predicts How Well Visual Experience Will be
       Remembered
   (4) The Learning of Categories: Parallel Brain Systems for Item Memory and Category
       Knowledge
   (5) Brain Activation Modulated by Sentence Comprehension

Top Articles with {brain, memory, human, visual, cognitive} and
{computer, data, information, problem, systems}:
   (1) A Head for Figures
   (2) Sources of Mathematical Thinking: Behavioral and Brain Imaging Evidence
   (3) Natural Language Processing
   (4) A Romance Blossoms Between Gray Matter and Silicon
   (5) Computer Vision

Figure 4: (Left) Exploring a collection through its topics. (Right) Predictive perplexity for
partially observed held-out documents from the 1960 Science corpus.

The plot in Figure 4 compares the predictive perplexity under LDA and the CTM. When a
small number of words have been observed, there is less uncertainty about the remaining
words under the CTM than under LDA—the perplexity is reduced by nearly 200 words, or
roughly 10%. The reason is that after seeing a few words in one topic, the CTM uses topic
correlation to infer that words in a related topic may also be probable. In contrast, LDA
cannot predict the remaining words as well until a large portion of the document has been
observed so that all of its topics are represented.
Acknowledgments Research supported in part by NSF grants IIS-0312814 and IIS-
0427206 and by the DARPA CALO project.

References
 [1] J. Aitchison. The statistical analysis of compositional data. Journal of the Royal
     Statistical Society, Series B, 44(2):139–177, 1982.
 [2] C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for
     Bayesian networks. In NIPS 15, pages 777–784. Cambridge, MA, 2003.
 [3] D. Blei, J. Lafferty, C. Genovese, and L. Wasserman. Turbo topics. In progress, 2006.
 [4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine
     Learning Research, 3:993–1022, January 2003.
 [5] E. Erosheva. Grade of membership and latent structure models with application to
     disability survey data. PhD thesis, Carnegie Mellon University, Department of Statis-
     tics, 2002.
 [6] M. Girolami and A. Kaban. Simplicial mixtures of Markov chains: Distributed mod-
     elling of dynamic user profiles. In NIPS 16, pages 9–16, 2004.
 [7] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax.
     In Advances in Neural Information Processing Systems 17, 2005.
 [8] B. Marlin. Collaborative filtering: A machine learning perspective. Master’s thesis,
     University of Toronto, 2004.
 [9] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic
     model for topic and role discovery in social networks. 2004.
[10] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using
     multilocus genotype data. Genetics, 155:945–959, June 2000.
[11] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors
     and documents. In UAI ’04: Proceedings of the 20th Conference on Uncertainty in Artificial
     Intelligence, pages 487–494, 2004.
[12] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object
     categories in image collections. Technical report, CSAIL, MIT, 2005.
[13] M. Wainwright and M. Jordan. A variational principle for graphical models. In New
     Directions in Statistical Signal Processing, chapter 11. MIT Press, 2005.
[14] E. Xing, M. Jordan, and S. Russell. A generalized mean field algorithm for variational
     inference in exponential families. In Proceedings of UAI, 2003.

A    Variational Inference
We describe a coordinate ascent optimization algorithm for the likelihood bound in equa-
tion (4) with respect to the variational parameters.
The first term of equation (4) is
      E_q[\log p(\eta \mid \mu, \Sigma)] = \tfrac{1}{2} \log |\Sigma^{-1}| - \tfrac{K}{2} \log 2\pi - \tfrac{1}{2} E_q\big[(\eta - \mu)^\top \Sigma^{-1} (\eta - \mu)\big],    (8)
where
      E_q\big[(\eta - \mu)^\top \Sigma^{-1} (\eta - \mu)\big] = \mathrm{Tr}(\mathrm{diag}(\nu^2)\, \Sigma^{-1}) + (\lambda - \mu)^\top \Sigma^{-1} (\lambda - \mu).    (9)
The second term of equation (4), using the additional bound in equation (7), is
      E_q[\log p(z_n \mid \eta)] = \textstyle\sum_{i=1}^{K} \lambda_i \phi_{n,i} - \zeta^{-1} \textstyle\sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\} + 1 - \log \zeta.    (10)

The third term of equation (4) is
      E_q[\log p(w_n \mid z_n, \beta)] = \textstyle\sum_{i=1}^{K} \phi_{n,i} \log \beta_{i, w_n}.    (11)
Finally, the fourth term is the entropy of the variational distribution:
      \textstyle\sum_{i=1}^{K} \tfrac{1}{2}\big(\log \nu_i^2 + \log 2\pi + 1\big) - \textstyle\sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{n,i} \log \phi_{n,i}.    (12)
We maximize the bound in equation (4) with respect to the variational parameters λ1:K ,
ν1:K , φ1:N , and ζ. We use a coordinate ascent algorithm, iteratively maximizing the bound
with respect to each parameter.
First, we maximize equation (4) with respect to ζ, using the second bound in equation (7).
The derivative with respect to ζ is
      dL/d\zeta = N \big( \zeta^{-2} \textstyle\sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\} - \zeta^{-1} \big),    (13)
which has a maximum at
      \hat{\zeta} = \textstyle\sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\}.    (14)
Second, we maximize with respect to φn . This yields a maximum at
      \hat{\phi}_{n,i} \propto \exp\{\lambda_i\}\, \beta_{i, w_n}, \quad i \in \{1, \ldots, K\}.    (15)

Third, we maximize with respect to λi . Since equation (4) is not amenable to analytic
maximization, we use a conjugate gradient algorithm with derivative
      dL/d\lambda = -\Sigma^{-1}(\lambda - \mu) + \textstyle\sum_{n=1}^{N} \phi_{n,1:K} - (N/\zeta) \exp\{\lambda + \nu^2/2\}.    (16)
Finally, we maximize with respect to νi^2 . Again, there is no analytic solution. We use
Newton’s method for each coordinate, constrained such that νi^2 > 0:
      dL/d\nu_i^2 = -\Sigma^{-1}_{ii}/2 - (N/2\zeta) \exp\{\lambda_i + \nu_i^2/2\} + 1/(2\nu_i^2).    (17)
Iterating between these optimizations defines a coordinate ascent algorithm on equa-
tion (4).
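A compact sketch of these updates for a single document is given below, with toy parameters; for simplicity it substitutes plain gradient steps for the conjugate gradient and Newton updates on λ and ν^2, so it illustrates the structure of the coordinate ascent rather than reproducing the exact optimization.

```python
# Sketch of the coordinate ascent updates (13)-(17) for one document, using toy
# parameters; plain gradient ascent stands in for conjugate gradients (lambda)
# and Newton's method (nu^2).
import numpy as np

rng = np.random.default_rng(2)
K, V = 4, 12
beta = rng.dirichlet(np.ones(V), size=K)     # topics (rows are distributions over words)
mu, Sigma = np.zeros(K), np.eye(K)
Sigma_inv = np.linalg.inv(Sigma)
w = rng.integers(V, size=25)                 # a toy document of word indices
N = len(w)

lam, nu2 = np.zeros(K), np.ones(K)           # variational Gaussian parameters
phi = np.full((N, K), 1.0 / K)               # variational multinomial parameters

for _ in range(200):
    # zeta update, equation (14)
    zeta = np.sum(np.exp(lam + nu2 / 2))
    # phi update, equation (15)
    phi = np.exp(lam) * beta[:, w].T          # shape (N, K), proportional values
    phi /= phi.sum(axis=1, keepdims=True)
    # lambda update: gradient step on equation (16)
    grad_lam = -Sigma_inv @ (lam - mu) + phi.sum(axis=0) - (N / zeta) * np.exp(lam + nu2 / 2)
    lam += 0.01 * grad_lam
    # nu^2 update: gradient step on equation (17), keeping nu^2 positive
    grad_nu2 = -np.diag(Sigma_inv) / 2 - (N / (2 * zeta)) * np.exp(lam + nu2 / 2) + 1 / (2 * nu2)
    nu2 = np.maximum(nu2 + 0.01 * grad_nu2, 1e-6)

print(lam, nu2)
```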

Mais conteúdo relacionado

Mais procurados

TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
Kalpit Desai
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
NYC Predictive Analytics
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
Sihan Chen
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
University of Bari (Italy)
 
Information Retrieval using Semantic Similarity
Information Retrieval using Semantic SimilarityInformation Retrieval using Semantic Similarity
Information Retrieval using Semantic Similarity
Saswat Padhi
 

Mais procurados (19)

Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation
 
G04124041046
G04124041046G04124041046
G04124041046
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr...
A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr...A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr...
A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr...
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
Language independent document
Language independent documentLanguage independent document
Language independent document
 
Information Retrieval using Semantic Similarity
Information Retrieval using Semantic SimilarityInformation Retrieval using Semantic Similarity
Information Retrieval using Semantic Similarity
 

Semelhante a Topic models

Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Editor IJMTER
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 

Semelhante a Topic models (20)

TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
graph_embeddings
graph_embeddingsgraph_embeddings
graph_embeddings
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linkingA scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linking
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly Information
 
Co word analysis
Co word analysisCo word analysis
Co word analysis
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector Representations
 
L0261075078
L0261075078L0261075078
L0261075078
 
L0261075078
L0261075078L0261075078
L0261075078
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 

Mais de Ajay Ohri

Mais de Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 

The topics are shared by all documents in the collection; the topic proportions are document-specific and randomly drawn from a Dirichlet distribution. LDA allows each document to exhibit multiple topics with different proportions, and it can thus capture the heterogeneity in grouped data that exhibit multiple latent patterns. Recent work has used LDA in more complicated document models [9, 11, 7], and in a variety of settings such as image processing [12], collaborative filtering [8], and the modeling of sequential data and user profiles [6]. Similar models were independently developed for disability survey data [5] and population genetics [10].

Our goal in this paper is to address a limitation of the topic models proposed to date: they fail to directly model correlation between topics. In many—indeed most—text corpora, it is natural to expect that subsets of the underlying latent topics will be highly correlated. In a corpus of scientific articles, for instance, an article about genetics may be likely to also be about health and disease, but unlikely to also be about x-ray astronomy. For the LDA model, this limitation stems from the independence assumptions implicit in the Dirichlet distribution on the topic proportions. Under a Dirichlet, the components of the proportions vector are nearly independent; this leads to the strong and unrealistic modeling assumption that the presence of one topic is not correlated with the presence of another.

In this paper we present the correlated topic model (CTM). The CTM uses an alternative, more flexible distribution for the topic proportions that allows for covariance structure among the components. This gives a more realistic model of latent topic structure where the presence of one latent topic may be correlated with the presence of another. In the following sections we develop the technical aspects of this model, and then demonstrate its potential for the applications envisioned above. We fit the model to a portion of the JSTOR archive of the journal Science. We demonstrate that the model gives a better fit than LDA, as measured by the accuracy of the predictive distributions over held out documents. Furthermore, we demonstrate qualitatively that the correlated topic model provides a natural way of visualizing and exploring such an unstructured collection of textual data.

2   The Correlated Topic Model

The key to the correlated topic model we propose is the logistic normal distribution [1]. The logistic normal is a distribution on the simplex that allows for a general pattern of variability between the components by transforming a multivariate normal random variable. Consider the natural parameterization of a K-dimensional multinomial distribution:

p(z | η) = exp{η^T z − a(η)}.   (1)

The random variable Z can take on K values; it can be represented by a K-vector with exactly one component equal to one, denoting a value in {1, . . . , K}.
The cumulant generating function of the distribution is

a(η) = log( Σ_{i=1}^K exp{η_i} ).   (2)

The mapping between the mean parameterization (i.e., the simplex) and the natural parameterization is given by

η_i = log(θ_i / θ_K).   (3)

Notice that this is not the minimal exponential family representation of the multinomial because multiple values of η can yield the same mean parameter.
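To make this concrete, here is a small numerical sketch (not part of the original paper; the helper names and toy proportions are ours) of the mapping in equation (3) and its softmax inverse, showing that shifting η by a constant yields the same mean parameter:

import numpy as np

def to_natural(theta):
    # eta_i = log(theta_i / theta_K); the last component becomes 0
    return np.log(theta / theta[-1])

def to_mean(eta):
    # f(eta)_i = exp(eta_i) / sum_j exp(eta_j)
    e = np.exp(eta - eta.max())          # subtract max for numerical stability
    return e / e.sum()

theta = np.array([0.5, 0.3, 0.2])        # toy point on the simplex
eta = to_natural(theta)
print(np.allclose(to_mean(eta), theta))        # True: f inverts the mapping
print(np.allclose(to_mean(eta + 7.0), theta))  # True: the representation is not minimal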
Figure 1: Top: Graphical model representation of the correlated topic model. The logistic normal distribution, used to model the latent topic proportions of a document, can represent correlations between topics that are impossible to capture using a single Dirichlet. Bottom: Example densities of the logistic normal on the 2-simplex. From left: diagonal covariance and nonzero-mean, negative correlation between components 1 and 2, positive correlation between components 1 and 2.

The logistic normal distribution assumes that η is normally distributed and then mapped to the simplex with the inverse of the mapping given in equation (3); that is, f(η_i) = exp{η_i} / Σ_j exp{η_j}. The logistic normal models correlations between components of the simplicial random variable through the covariance matrix of the normal distribution. The logistic normal was originally studied in the context of analyzing observed compositional data such as the proportions of minerals in geological samples. In this work, we extend its use to a hierarchical model where it describes the latent composition of topics associated with each document.

Let {µ, Σ} be a K-dimensional mean and covariance matrix, and let topics β_{1:K} be K multinomials over a fixed word vocabulary. The correlated topic model assumes that an N-word document arises from the following generative process:

1. Draw η | {µ, Σ} ∼ N(µ, Σ).
2. For n ∈ {1, . . . , N}:
   (a) Draw topic assignment Z_n | η from Mult(f(η)).
   (b) Draw word W_n | {z_n, β_{1:K}} from Mult(β_{z_n}).

This process is identical to the generative process of LDA except that the topic proportions are drawn from a logistic normal rather than a Dirichlet. The model is shown as a directed graphical model in Figure 1.

The CTM is more expressive than LDA. The strong independence assumption imposed by the Dirichlet in LDA is not realistic when analyzing document collections, where one may find strong correlations between topics. The covariance matrix of the logistic normal in the CTM is introduced to model such correlations. In Section 3, we illustrate how the higher order structure given by the covariance can be used as an exploratory tool for better understanding and navigating a large corpus of documents.
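As an illustration, the following sketch simulates the generative process above with toy parameters; the values of µ, Σ, and β are made up for the example and are not estimated from data:

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 5, 10                       # topics, vocabulary size, document length
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])      # toy covariance: topics 1 and 2 positively correlated
beta = rng.dirichlet(np.ones(V), size=K) # K topic-word multinomials (toy values)

eta = rng.multivariate_normal(mu, Sigma)             # step 1: draw eta ~ N(mu, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()              # f(eta): logistic normal topic proportions
z = rng.choice(K, size=N, p=theta)                   # step 2(a): topic assignments
w = np.array([rng.choice(V, p=beta[k]) for k in z])  # step 2(b): words
print(theta, z, w)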
Moreover, modeling correlation can lead to better predictive distributions. In some settings, such as collaborative filtering, the goal is to predict unseen items conditional on a set of observations. An LDA model will predict words based on the latent topics that the observations suggest, but the CTM has the ability to predict items associated with additional topics that are correlated with the conditionally probable topics.

2.1   Posterior inference and parameter estimation

Posterior inference is the central challenge to using the CTM. The posterior distribution of the latent variables conditional on a document, p(η, z_{1:N} | w_{1:N}), is intractable to compute; once conditioned on some observations, the topic assignments z_{1:N} and log proportions η are dependent. We make use of mean-field variational methods to efficiently obtain an approximation of this posterior distribution.

In brief, the strategy employed by mean-field variational methods is to form a factorized distribution of the latent variables, parameterized by free variables which are called the variational parameters. These parameters are fit so that the Kullback-Leibler (KL) divergence between the approximate and true posterior is small. For many problems this optimization problem is computationally manageable, while standard methods, such as Markov Chain Monte Carlo, are impractical. The tradeoff is that variational methods do not come with the same theoretical guarantees as simulation methods. See [13] for a modern review of variational methods for statistical inference.

In graphical models composed of conjugate-exponential family pairs and mixtures, the variational inference algorithm can be automatically derived from general principles [2, 14]. In the CTM, however, the logistic normal is not conjugate to the multinomial. We will therefore derive a variational inference algorithm by taking into account the special structure and distributions used by our model.

We begin by using Jensen's inequality to bound the log probability of a document:

log p(w_{1:N} | µ, Σ, β) ≥ E_q[log p(η | µ, Σ)] + Σ_{n=1}^N ( E_q[log p(z_n | η)] + E_q[log p(w_n | z_n, β)] ) + H(q),   (4)

where the expectation is taken with respect to a variational distribution of the latent variables, and H(q) denotes the entropy of that distribution. We use a factorized distribution:

q(η_{1:K}, z_{1:N} | λ_{1:K}, ν²_{1:K}, φ_{1:N}) = Π_{i=1}^K q(η_i | λ_i, ν_i²) Π_{n=1}^N q(z_n | φ_n).   (5)

The variational distributions of the discrete variables z_{1:N} are specified by the K-dimensional multinomial parameters φ_{1:N}. The variational distribution of the continuous variables η_{1:K} are K independent univariate Gaussians {λ_i, ν_i}. Since the variational parameters are fit using a single observed document w_{1:N}, there is no advantage in introducing a non-diagonal variational covariance matrix.

The nonconjugacy of the logistic normal leads to difficulty in computing the expected log probability of a topic assignment:

E_q[log p(z_n | η)] = E_q[η^T z_n] − E_q[log( Σ_{i=1}^K exp{η_i} )].   (6)

To preserve the lower bound on the log probability, we upper bound the log normalizer with a Taylor expansion,

E_q[log( Σ_{i=1}^K exp{η_i} )] ≤ ζ^{-1} ( Σ_{i=1}^K E_q[exp{η_i}] ) − 1 + log(ζ),   (7)

where we have introduced a new variational parameter ζ. The expectation E_q[exp{η_i}] is the mean of a log normal distribution with mean and variance obtained from the variational parameters {λ_i, ν_i²}; thus, E_q[exp{η_i}] = exp{λ_i + ν_i²/2} for i ∈ {1, . . . , K}.
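The bound in equation (7) can be checked numerically. The following sketch (toy variational parameters, not fitted values) compares a Monte Carlo estimate of E_q[log Σ_i exp{η_i}] with the right-hand side of equation (7), with ζ set to Σ_i exp{λ_i + ν_i²/2}, the value that makes the bound tightest (see Appendix A):

import numpy as np

rng = np.random.default_rng(1)
lam = np.array([0.2, -0.5, 1.0])         # variational means (toy values)
nu2 = np.array([0.3, 0.1, 0.2])          # variational variances (toy values)

samples = rng.normal(lam, np.sqrt(nu2), size=(100000, 3))
lhs = np.log(np.exp(samples).sum(axis=1)).mean()   # Monte Carlo estimate of E_q[log-sum-exp]

expected_exp = np.exp(lam + nu2 / 2.0)             # log normal means E_q[exp(eta_i)]
zeta = expected_exp.sum()                          # the zeta that tightens the bound
rhs = expected_exp.sum() / zeta - 1.0 + np.log(zeta)
print(lhs <= rhs)                                  # True: the upper bound holds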
Figure 2: A portion of the topic graph learned from 16,351 OCR articles from Science. Each node represents a topic, and is labeled with the five most probable phrases from its distribution (phrases are found by the “turbo topics” method [3]). The interested reader can browse the full model at http://www.cs.cmu.edu/˜lemur/science/.

Given a model {β_{1:K}, µ, Σ} and a document w_{1:N}, the variational inference algorithm optimizes equation (4) with respect to the variational parameters {λ_{1:K}, ν²_{1:K}, φ_{1:N}, ζ}. We use coordinate ascent, repeatedly optimizing with respect to each parameter while holding the others fixed. In variational inference for LDA, each coordinate can be optimized analytically. However, iterative methods are required for the CTM when optimizing for λ_i and ν_i². The details are given in Appendix A.

Given a collection of documents, we carry out parameter estimation in the correlated topic model by attempting to maximize the likelihood of a corpus of documents as a function of the topics β_{1:K} and the multivariate Gaussian parameters {µ, Σ}. We use variational expectation-maximization (EM), where we maximize the bound on the log probability of a collection given by summing equation (4) over the documents.

In the E-step, we maximize the bound with respect to the variational parameters by performing variational inference for each document. In the M-step, we maximize the bound with respect to the model parameters. This is maximum likelihood estimation of the topics and multivariate Gaussian using expected sufficient statistics, where the expectation is taken with respect to the variational distributions computed in the E-step. The E-step and M-step are repeated until the bound on the likelihood converges.
In the experiments reported below, we run variational inference until the relative change in the probability bound of equation (4) is less than 10^{-6}, and run variational EM until the relative change in the likelihood bound is less than 10^{-5}.

3   Examples and Empirical Results: Modeling Science

In order to test and illustrate the correlated topic model, we estimated a 100-topic CTM on 16,351 Science articles spanning 1990 to 1999. We constructed a graph of the latent topics and the connections among them by examining the most probable words from each topic and the between-topic correlations. Part of this graph is illustrated in Figure 2. In this subgraph, there are three densely connected collections of topics: material science, geology, and cell biology. Furthermore, an estimated CTM can be used to explore otherwise unstructured observed documents. In Figure 4, we list articles that are assigned to the cognitive science topic and articles that are assigned to both the cognitive science and visual neuroscience topics.
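As a rough illustration of how a topic graph like Figure 2 can be read off a fitted model, the sketch below converts a covariance matrix over topics into correlations and keeps strongly correlated pairs as edges. The covariance values and the edge threshold are invented for the example; the paper does not specify the exact procedure used to build the graph:

import numpy as np

Sigma = np.array([[1.0, 0.6, -0.1],
                  [0.6, 1.0,  0.0],
                  [-0.1, 0.0, 1.0]])     # toy 3-topic covariance (not the fitted model)
sd = np.sqrt(np.diag(Sigma))
corr = Sigma / np.outer(sd, sd)          # covariance -> correlation

threshold = 0.3                          # assumed cutoff for drawing an edge
edges = [(i, j) for i in range(len(corr)) for j in range(i + 1, len(corr))
         if corr[i, j] > threshold]
print(edges)                             # [(0, 1)]: topics 0 and 1 would be connected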
Figure 3: (L) The average held-out probability; CTM supports more topics than LDA. See figure at right for the standard error of the difference. (R) The log odds ratio of the held-out probability. Positive numbers indicate a better fit by the correlated topic model. (Axes: held-out log likelihood and L(CTM) − L(LDA) against the number of topics.)

The interested reader is invited to visit http://www.cs.cmu.edu/˜lemur/science/ to interactively explore this model, including the topics, their connections, and the articles that exhibit them.

We compared the CTM to LDA by fitting a smaller collection of articles to models of varying numbers of topics. This collection contains the 1,452 documents from 1960; we used a vocabulary of 5,612 words after pruning common function words and terms that occur once in the collection. Using ten-fold cross validation, we computed the log probability of the held-out data given a model estimated from the remaining data. A better model of the document collection will assign higher probability to the held out data. To avoid comparing bounds, we used importance sampling to compute the log probability of a document where the fitted variational distribution is the proposal.

Figure 3 illustrates the average held out log probability for each model and the average difference between them. The CTM provides a better fit than LDA and supports more topics; the likelihood for LDA peaks near 30 topics while the likelihood for the CTM peaks close to 90 topics. The means and standard errors of the difference in log-likelihood of the models are shown at right; this indicates that the CTM always gives a better fit.

Another quantitative evaluation of the relative strengths of LDA and the CTM is how well the models predict the remaining words after observing a portion of the document. Suppose we observe words w_{1:P} from a document and are interested in which model provides a better predictive distribution p(w | w_{1:P}) of the remaining words. To compare these distributions, we use perplexity, which can be thought of as the effective number of equally likely words according to the model. Mathematically, the perplexity of a word distribution is defined as the inverse of the per-word geometric average of the probability of the observations,

Perp(Φ) = ( Π_{d=1}^D Π_{i=P+1}^{N_d} p(w_i | Φ, w_{1:P}) )^{ −1 / Σ_{d=1}^D (N_d − P) },

where Φ denotes the model parameters of an LDA or CTM model. Note that lower numbers denote more predictive power. The plot in Figure 4 compares the predictive perplexity under LDA and the CTM.
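For concreteness, the perplexity defined above can be computed from per-word predictive probabilities as in the following sketch; the probabilities are toy numbers, not model output:

import numpy as np

# one array of predictive probabilities per document, for held-out words P+1..N_d
held_out_probs = [np.array([0.01, 0.002, 0.05]),
                  np.array([0.004, 0.02])]

total_log_prob = sum(np.log(p).sum() for p in held_out_probs)
total_words = sum(len(p) for p in held_out_probs)     # sum over d of (N_d - P)
perplexity = np.exp(-total_log_prob / total_words)    # inverse per-word geometric mean
print(perplexity)                                     # lower values mean more predictive power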
Figure 4: (Left) Exploring a collection through its topics. Top articles with {brain, memory, human, visual, cognitive}: (1) Separate Neural Bases of Two Fundamental Memory Processes in the Human Medial Temporal Lobe; (2) Inattentional Blindness Versus Inattentional Amnesia for Fixated but Ignored Words; (3) Making Memories: Brain Activity that Predicts How Well Visual Experience Will be Remembered; (4) The Learning of Categories: Parallel Brain Systems for Item Memory and Category Knowledge; (5) Brain Activation Modulated by Sentence Comprehension. Top articles with {brain, memory, human, visual, cognitive} and {computer, data, information, problem, systems}: (1) A Head for Figures; (2) Sources of Mathematical Thinking: Behavioral and Brain Imaging Evidence; (3) Natural Language Processing; (4) A Romance Blossoms Between Gray Matter and Silicon; (5) Computer Vision. (Right) Predictive perplexity for partially observed held-out documents from the 1960 Science corpus, plotted against the percentage of observed words.

When a small number of words have been observed, there is less uncertainty about the remaining words under the CTM than under LDA—the perplexity is reduced by nearly 200 words, or roughly 10%. The reason is that after seeing a few words in one topic, the CTM uses topic correlation to infer that words in a related topic may also be probable. In contrast, LDA cannot predict the remaining words as well until a large portion of the document has been observed so that all of its topics are represented.

Acknowledgments

Research supported in part by NSF grants IIS-0312814 and IIS-0427206 and by the DARPA CALO project.

References

[1] J. Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society, Series B, 44(2):139–177, 1982.
[2] C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. In NIPS 15, pages 777–784. Cambridge, MA, 2003.
[3] D. Blei, J. Lafferty, C. Genovese, and L. Wasserman. Turbo topics. In progress, 2006.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
[5] E. Erosheva. Grade of membership and latent structure models with application to disability survey data. PhD thesis, Carnegie Mellon University, Department of Statistics, 2002.
[6] M. Girolami and A. Kaban. Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles. In NIPS 16, pages 9–16, 2004.
[7] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, 2005.
[8] B. Marlin. Collaborative filtering: A machine learning perspective. Master’s thesis, University of Toronto, 2004.
[9] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks. 2004.
[10] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, June 2000.
[11] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smith. In UAI ’04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494.
[12] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. Technical report, CSAIL, MIT, 2005.
[13] M. Wainwright and M. Jordan. A variational principle for graphical models. In New Directions in Statistical Signal Processing, chapter 11. MIT Press, 2005.
[14] E. Xing, M. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of UAI, 2003.

A   Variational Inference

We describe a coordinate ascent optimization algorithm for the likelihood bound in equation (4) with respect to the variational parameters.

The first term of equation (4) is

E_q[log p(η | µ, Σ)] = (1/2) log |Σ^{-1}| − (K/2) log 2π − (1/2) E_q[(η − µ)^T Σ^{-1} (η − µ)],   (8)

where

E_q[(η − µ)^T Σ^{-1} (η − µ)] = Tr(diag(ν²) Σ^{-1}) + (λ − µ)^T Σ^{-1} (λ − µ).   (9)

The second term of equation (4), using the additional bound in equation (7), is

E_q[log p(z_n | η)] = Σ_{i=1}^K λ_i φ_{n,i} − ζ^{-1} Σ_{i=1}^K exp{λ_i + ν_i²/2} + 1 − log ζ.   (10)

The third term of equation (4) is

E_q[log p(w_n | z_n, β)] = Σ_{i=1}^K φ_{n,i} log β_{i,w_n}.   (11)

Finally, the fourth term is the entropy of the variational distribution:

Σ_{i=1}^K (1/2)(log ν_i² + log 2π + 1) − Σ_{n=1}^N Σ_{i=1}^K φ_{n,i} log φ_{n,i}.   (12)

We maximize the bound in equation (4) with respect to the variational parameters λ_{1:K}, ν_{1:K}, φ_{1:N}, and ζ. We use a coordinate ascent algorithm, iteratively maximizing the bound with respect to each parameter.

First, we maximize equation (4) with respect to ζ, using the second bound in equation (7). The derivative with respect to ζ is

f(ζ) = N ( ζ^{-2} Σ_{i=1}^K exp{λ_i + ν_i²/2} − ζ^{-1} ),   (13)

which has a maximum at

ζ̂ = Σ_{i=1}^K exp{λ_i + ν_i²/2}.   (14)

Second, we maximize with respect to φ_n. This yields a maximum at

φ̂_{n,i} ∝ exp{λ_i} β_{i,w_n},  i ∈ {1, . . . , K}.   (15)

Third, we maximize with respect to λ_i. Since equation (4) is not amenable to analytic maximization, we use a conjugate gradient algorithm with derivative

dL/dλ = −Σ^{-1}(λ − µ) + Σ_{n=1}^N φ_{n,1:K} − (N/ζ) exp{λ + ν²/2}.   (16)

Finally, we maximize with respect to ν_i². Again, there is no analytic solution. We use Newton’s method for each coordinate, constrained such that ν_i > 0:

dL/dν_i² = −Σ_ii^{-1}/2 − (N/(2ζ)) exp{λ_i + ν_i²/2} + 1/(2ν_i²).   (17)

Iterating between these optimizations defines a coordinate ascent algorithm on equation (4).
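As a rough illustration of the closed-form updates above, the sketch below applies equations (14) and (15) to toy variational parameters; the iterative updates for λ and ν² (equations (16) and (17)) are omitted. This is our own illustration, not the authors' implementation:

import numpy as np

rng = np.random.default_rng(2)
K, V, N = 4, 6, 5
lam = rng.normal(size=K)                     # variational means (toy values)
nu2 = np.abs(rng.normal(size=K))             # variational variances (toy values)
beta = rng.dirichlet(np.ones(V), size=K)     # topic-word multinomials (toy values)
words = rng.integers(V, size=N)              # observed word indices (toy document)

# Equation (14): zeta_hat = sum_i exp(lambda_i + nu_i^2 / 2)
zeta = np.exp(lam + nu2 / 2.0).sum()

# Equation (15): phi_{n,i} proportional to exp(lambda_i) * beta_{i, w_n}
phi = np.exp(lam)[None, :] * beta[:, words].T    # shape (N, K)
phi = phi / phi.sum(axis=1, keepdims=True)       # normalize each word's topic distribution
print(zeta, phi.sum(axis=1))                     # each row of phi sums to 1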