Basic Research on Text Mining at UCSD


                Charles Elkan



      University of California, San Diego




                June 5, 2009




                       1
Text mining




  What is text mining?


  Working answer: Learning to classify documents,
  and learning to organize documents.


  Three central questions:
  (1) how to represent a document?
  (2) how to model a set of closely related documents?
  (3) how to model a set of distantly related documents?




                                  2
Why "basic" research?

  Mindsets:
  Probability versus linear algebra
  Linguistics versus databases
  Single topic per document versus multiple.


  Which issues are important? From most to least interesting :-)
  Sequencing of words
  Burstiness of words
  Titles versus bodies
  Repeated documents
  Included text
  Feature selection


  Applications:
  Recognizing helpful reviews on Amazon
  Finding related topics across books published decades apart.
With thanks to ...




  Gabe Doyle, David Kauchak, Rasmus Madsen.


  Amarnath Gupta, Chaitan Baru.




                                  4
Three central questions:
(1) how to represent a document?
(2) how to model a set of closely related documents?
(3) how to model a set of distantly related documents?


Answers:
(1) "bag of words"
(2) Dirichlet compound multinomial (DCM) distribution
(3) DCM-based topic model (DCMLDA)
The "bag of words" representation




  Let V be a fixed vocabulary. The vocabulary size is m = |V |.

  Each document is represented as a vector x of length m,
  where x_j is the number of appearances of word j in the document.


  The length of the document is n = ∑_j x_j.
  For typical documents, n ≪ m and x_j = 0 for most words j.




                                    6
The multinomial distribution



  The probability of document x according to model θ is

                 p(x | θ) = ( n! / ∏_j x_j! ) ∏_j θ_j^{x_j} .

  Each appearance of the same word j always has the same
  probability θ_j.

  Computing the probability of a document needs O(n) time,
  not O(m) time.




                                  7
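A minimal Python sketch of this computation, added for illustration (it is not part of the original slides); the sparse representation of x as index and count arrays is an assumed convention.

    import numpy as np
    from scipy.special import gammaln

    def multinomial_log_prob(word_idx, word_counts, theta):
        # word_idx: indices of the words that appear in the document
        # word_counts: the corresponding counts x_j (same length as word_idx)
        # theta: length-m vector of word probabilities, summing to 1
        x = np.asarray(word_counts, dtype=float)
        n = x.sum()
        # log p(x | theta) = log n! - sum_j log x_j! + sum_j x_j log theta_j,
        # summed over the nonzero counts only, not over the whole vocabulary.
        return (gammaln(n + 1) - gammaln(x + 1).sum()
                + float(x @ np.log(np.asarray(theta)[word_idx])))

    # Example: a 3-word document over a 5-word vocabulary.
    theta = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
    print(multinomial_log_prob([0, 2], [2, 1], theta))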
The phenomenon of burstiness


  In reality, additional appearances of the same word are less
  surprising, i.e. they have higher probability. Example:


           Toyota Motor Corp. is expected to announce a major
       overhaul. Yoshi Inaba, a former senior Toyota executive, was
       formally asked by Toyota this week to oversee the U.S.
       business. Mr. Inaba is currently head of an international
       airport close to Toyota's headquarters in Japan.
           Toyota's U.S. operations now are suffering from plunging
       sales. Mr. Inaba was credited with laying the groundwork for
       Toyota's fast growth in the U.S. before he left the company.
           Recently, Toyota has had to idle U.S. assembly lines and
       offer a limited number of voluntary buyouts. Toyota now
       employs 36,000 in the U.S.
Empirical evidence of burstiness




  How to interpret the figure: The chance that a given rare word
  occurs 10 times in a document is 10^{-6}. The chance that it occurs
  20 times is 10^{-6.5}.
                                  9
Moreover ...




  A multinomial is appropriate only for modeling common words,
  which are not informative about the topic of a document.


  Burstiness and significance are correlated: more informative words
  are also more bursty.
                                  10
A trained DCM model gives correct probabilities for all counts of all
types of words.




                                 11
The Polya urn



  So what is the Dirichlet compound multinomial (DCM)?


  Consider a bucket with balls of m = |V | different colors.


  After a ball is selected randomly, it is replaced and one more ball of
  the same color is added.


  Each time a ball is drawn, the chance of drawing the same color
  again is increased.


  The initial number of balls with color j is   αj .




                                    12
The bag-of-bag-of-words process


  Let ϕ be the parameter vector of a multinomial,
  i.e. a fixed probability for each word.


  Let Dir(β) be a Dirichlet distribution over ϕ.
  To generate a document:
  (1) draw document-specific probabilities ϕ ∼ Dir(β)
  (2) draw n words w ∼ Mult(ϕ).

  Each document consists of words drawn from a multinomial that is
  fixed for that document but different across documents.


  Remarkably, the Polya urn and the bag-of-bag-of-words process
  yield the same probability distribution over documents.




                                      13
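The equivalence can be checked empirically. The numpy sketch below is an illustration (not code from the slides): it draws document word counts by both processes, with arbitrary parameter values, and compares the estimated probability of a bursty event.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw_polya_urn(alpha, n):
        # Polya urn: after each draw, add one more ball of the drawn color.
        # alpha[j] is the initial number of balls of color j.
        counts = np.zeros_like(alpha)
        for _ in range(n):
            p = (alpha + counts) / (alpha + counts).sum()
            j = rng.choice(len(alpha), p=p)
            counts[j] += 1
        return counts

    def draw_dcm(alpha, n):
        # Bag-of-bag-of-words: phi ~ Dirichlet(alpha), then n words ~ Mult(phi).
        phi = rng.dirichlet(alpha)
        return rng.multinomial(n, phi)

    # Estimate the chance that word 0 appears at least 5 times in a
    # 20-word document; the two estimates should be close.
    alpha = np.array([1.0, 1.0, 2.0])
    urn = np.mean([draw_polya_urn(alpha, 20)[0] >= 5 for _ in range(2000)])
    dcm = np.mean([draw_dcm(alpha, 20)[0] >= 5 for _ in range(2000)])
    print(urn, dcm)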
How is burstiness captured?



  A multinomial parameter vector ϕ has length |V | and is
  constrained: ∑_j ϕ_j = 1.
  A DCM parameter vector β has the same length |V | but is
  unconstrained.


  The one extra degree of freedom allows the DCM to discount
  multiple observations of the same word, in an adjustable way.


  The smaller the sum s = ∑_j β_j, the burstier words are.




                                        14
Moving forward ...



  Three central questions:
  (1) how to represent documents?
  (2) how to model closely related documents?
  (3) how to model distantly related documents?


  A DCM is a good model of documents that all share a single theme.


  β   represents the central theme; for each document   ϕ   represents its
  variation on this theme.


  By combining DCMs with latent Dirichlet allocation (LDA),
  we answer (3).




                                   15
Digression 1: Mixture of DCMs




  Because of the 1:1 mapping between multinomials and documents,
  in a DCM model each document comes entirely from one subtopic.


  We want to allow multiple topics, multiple documents from the
  same topic, and multiple topics within one document.


  In 2006 we extended the DCM model to a mixture of DCM
  distributions. This allows multiple topics, but not multiple topics
  within one document.




                                    16
Digression 2: A non-text application



  Goal: Find companies whose stock prices tend to move together.


  Example: { IBM+, MSFT+, AAPL- } means IBM and Microsoft
  often rise, and Apple falls, on the same days.


  Let each day be a document containing words like IBM+.


  Each word is a stock symbol and a direction (+ or -). Each day has
  one copy of the word for each 1% change in the stock price.


  Let a co-moving group of stocks be a topic. Each day is a
  combination of multiple topics.




                                    17
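A small sketch of this encoding under the stated convention (one token per full 1% move); the function and the example returns are hypothetical illustrations, not the preprocessing actually used for the SP500 dataset.

    def day_to_document(returns, threshold=0.01):
        # returns: dict mapping ticker -> fractional price change for the day,
        # e.g. {"IBM": 0.023, "AAPL": -0.014}.
        # Each full 1% move contributes one copy of a token like "IBM+" or "AAPL-".
        doc = []
        for ticker, r in returns.items():
            token = ticker + ("+" if r > 0 else "-")
            doc.extend([token] * int(abs(r) / threshold))
        return doc

    # day_to_document({"IBM": 0.023, "MSFT": 0.011, "AAPL": -0.014})
    # -> ['IBM+', 'IBM+', 'MSFT+', 'AAPL-']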
Examples of discovered topics



           Computer Related                   Real Estate
         Stock   Company                Stock    Company
        NVDA     Nvidia                  SPG     Simon Properties
        SNDK     SanDisk                 AIV     Apt. Investment
        BRCM     Broadcom                KIM     Kimco Realty
           JBL   Jabil Circuit           AVB     AvalonBay
         KLAC    KLA-Tencor             DDR      Developers
          NSM    Nat'l Semicond.        EQR      Equity Residential



  The dataset contains 501 days of transactions between January
  2007 and September 2008.




                                   18
DCMLDA advantages



  Unlike a mixture model, a topic model allows many topics to occur
  in each document.


  DCMLDA allows the same topic to occur with different words in
  different documents.


  Consider a sports topic. Suppose rugby and hockey are
  equally common. But within each document, seeing rugby makes
  seeing rugby again more likely than seeing hockey.


  A standard topic model cannot represent this burstiness, unless the
  words rugby and hockey are spread across two topics.




                                   19
Hypothesis




  "A DCMLDA model with a few topics can fit a corpus as well as an
  LDA model with many topics."


  Motivation: A single DCMLDA topic can explain related aspects of
  documents more effectively than a single LDA topic.


  The hypothesis is confirmed by the experimental results below.
Latent Dirichlet Allocation (LDA)

  LDA is a generative model:


  For each of K topics, draw a multinomial to describe it.


  For each of D documents:
  (1) Determine the probability of each of K topics in this document.
  (2) For each of N words:
  first draw a topic, then draw a word based on that topic.




                  [Plate diagram of LDA: α → θ → z → w ← ϕ ← β,
                   with plates over K topics, N words, and D documents.]

                                   21
Graphical model


                  [Plate diagram of LDA: α → θ → z → w ← ϕ ← β,
                   with plates over K topics, N words, and D documents.]

  The only fixed parameters of the model are α and β.

                           ϕ ∼ Dirichlet(β)
                           θ ∼ Dirichlet(α)
                      z   ∼ Multinomial(θ)
                     w    ∼ Multinomial(ϕ)

                                   22
Using LDA for text mining



  Training finds maximum-likelihood values for ϕ for each topic,
  and for θ for each document.


  For each topic,    ϕ    is a vector of word probabilities indicating the
  content of that topic.


  The distribution θ of each document is a reduced-dimensionality
  representation. It is useful for:

       learning to classify documents (a sketch follows below)

       measuring similarity between documents

       more?




                                         23
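A minimal sketch of the classification use of θ, using scikit-learn's LDA implementation rather than the training procedure described in these slides; X_train, X_test, y_train, and y_test are placeholder variables for document-term count matrices and class labels.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    # Fit LDA and use the per-document topic distributions theta as features.
    lda = LatentDirichletAllocation(n_components=50, random_state=0)
    theta_train = lda.fit_transform(X_train)   # per-document topic mixtures
    theta_test = lda.transform(X_test)

    clf = LogisticRegression(max_iter=1000).fit(theta_train, y_train)
    print(clf.score(theta_test, y_test))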
Extending LDA to DCMLDA




  Goal: Allow multiple topics in a single document, while making
  subtopics document-specific.


  In DCMLDA, for each topic k and each document d a fresh
  multinomial word distribution       ϕkd     is drawn.


  For each topic k , these multinomials are drawn from the same
  Dirichlet   βk ,   so all versions of the same topic are linked.


  Per-document instances of each topic allow for burstiness.




                                         24
Graphical models compared


                  [Plate diagrams: LDA (top), where each topic's ϕ is drawn once,
                   outside the document plate; DCMLDA (bottom), where ϕ is drawn
                   per topic per document, inside the document plate.]

                            25
DCMLDA generative process



   for document d ∈ {1, . . . , D } do
     draw topic distribution θ_d ∼ Dir(α)
     for topic k ∈ {1, . . . , K } do
       draw topic-word distribution ϕ_{kd} ∼ Dir(β_k)
     end for
     for word n ∈ {1, . . . , N_d } do
       draw topic z_{d,n} ∼ θ_d
       draw word w_{d,n} ∼ ϕ_{z_{d,n}, d}
     end for
   end for




                                       26
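A minimal numpy sketch of the generative process above; the parameter shapes (α of length K, β as a K × V array) follow the slides, but the function itself is an illustration.

    import numpy as np

    def generate_dcmlda_corpus(alpha, beta, doc_lengths, rng=None):
        # alpha: length-K Dirichlet parameter over topics.
        # beta:  K x V array; beta[k] parameterizes topic k's per-document
        #        word distribution.
        # doc_lengths: list of N_d values, one per document.
        rng = rng or np.random.default_rng(0)
        K, V = beta.shape
        corpus = []
        for n_d in doc_lengths:
            theta_d = rng.dirichlet(alpha)                               # topic mix for document d
            phi_d = np.array([rng.dirichlet(beta[k]) for k in range(K)]) # fresh phi_{kd}
            doc = []
            for _ in range(n_d):
                z = rng.choice(K, p=theta_d)      # draw a topic
                w = rng.choice(V, p=phi_d[z])     # draw a word from that topic's phi_{kd}
                doc.append(w)
            corpus.append(doc)
        return corpus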
Meaning of α and β




  When applying LDA, it is not necessary to learn α and β.

  Steyvers and Griffiths recommend fixed uniform values
  α = 50/K and β = 0.01, where K is the number of topics.


  However, the information that LDA carries in its ϕ values is carried
  by the β values in DCMLDA.


  Without an effective method to learn the hyperparameters, the
  DCMLDA model is not useful.




                                      27
Training



  Given a training set of documents, alternate:
  (a) optimize parameters          ϕ, θ,   and z given hyperparameters,
  (b) optimize hyperparameters             α, β   given document parameters.


  For fixed α and β, do collapsed Gibbs sampling to find the
  distribution of z .


  Given a z sample, find α and β by Monte Carlo expectation-
  maximization.


  When desired, compute         ϕ   and    θ   from samples of z .
Gibbs sampling


  Gibbs sampling for DCMLDA is similar to the method for LDA.


  Start by factoring the complete likelihood of the model:


                  p(w, z | α, β) = p(w | z, β) p(z | α).


  DCMLDA and LDA are identical over the α-to-z pathway, so
  p(z | α) in DCMLDA is the same as for LDA:


                  p(z | α) = ∏_d B(n_{··d} + α) / B(α).

  B(·) is the Beta function, and n_{tkd} is how many times word t has
  topic k in document d .




                                            29
To get p(w | z, β), average over all possible ϕ distributions:

      p(w | z, β) = ∫ p(w | z, ϕ) p(ϕ | β) dϕ

                  = ∫ p(ϕ | β) ∏_d ∏_{n=1}^{N_d} ϕ_{w_{d,n}, z_{d,n}, d} dϕ

                  = ∫ p(ϕ | β) ∏_{d,k,t} (ϕ_{tkd})^{n_{tkd}} dϕ.

Expand p(ϕ | β) as a Dirichlet distribution:

      p(w | z, β) = ∫ [ ∏_{d,k} (1 / B(β_{·k})) ∏_t (ϕ_{tkd})^{β_{tk} − 1} ] [ ∏_{d,k,t} (ϕ_{tkd})^{n_{tkd}} ] dϕ

                  = ∏_{d,k} (1 / B(β_{·k})) ∫ ∏_t (ϕ_{tkd})^{β_{tk} − 1 + n_{tkd}} dϕ

                  = ∏_{d,k} B(n_{·kd} + β_{·k}) / B(β_{·k}).

                                                         30
Gibbs sampling cont.




  Combining equations, the complete likelihood is


      p(w, z | α, β) = ∏_d ( B(n_{··d} + α) / B(α) ) · ∏_{d,k} ( B(n_{·kd} + β_{·k}) / B(β_{·k}) ).




                                       31
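For concreteness, the sketch below shows the collapsed Gibbs update that this count-ratio form of the likelihood suggests; it is a reconstruction using the standard Beta-function identities, not code from the paper, and it keeps dense per-document count arrays only for clarity. Because ϕ is per document in DCMLDA, the word-topic counts are within-document counts n_{tkd}.

    import numpy as np

    def gibbs_update(d, i, w, z, n_kd, n_tkd, alpha, beta, rng):
        # Resample the topic of word i (vocabulary index w) in document d.
        # n_kd[d, k]     = words in document d currently assigned to topic k
        # n_tkd[d, k, t] = times word t in document d is assigned to topic k
        # alpha: length-K vector, beta: K x V array of hyperparameters.
        k_old = z[d][i]
        n_kd[d, k_old] -= 1
        n_tkd[d, k_old, w] -= 1

        # p(z = k | rest) ∝ (n_kd + alpha_k) (n_tkd + beta_kw) / (n_kd + sum_t beta_kt)
        p = (n_kd[d] + alpha) * (n_tkd[d, :, w] + beta[:, w]) \
            / (n_kd[d] + beta.sum(axis=1))
        k_new = rng.choice(len(alpha), p=p / p.sum())

        z[d][i] = k_new
        n_kd[d, k_new] += 1
        n_tkd[d, k_new, w] += 1
        return k_new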
Finding optimal α and β


  Optimal   α   and   β   values maximize p (w |α, β). Unfortunately, this
  likelihood is intractable.


  The complete likelihood p (w , z |α, β) is tractable. Based on it, we
  use single-sample Monte Carlo EM.


  Run Gibbs sampling for a burn-in period, with guesses for        α   and   β.

  Then draw a topic assignment z for each word of each document.
  Use this vector in the M-step to estimate new values for        α   and   β.

  Run Gibbs sampling for more iterations, to let topic assignments
  stabilize based on the new       α   and   β    values.


  Then repeat.




                                             32
Training algorithm




    Start with initial    α   and   β
    repeat
      Run Gibbs sampling to approximate steady state
      Choose a topic assignment for each word
      Choose    α   and   β   to maximize complete likelihood p (w , z |α, β)
    until   convergence of     α    and   β




                                              33
α and β to maximize complete likelihood


  The log complete likelihood is

      L(α, β; w, z) = ∑_{d,k} [ log Γ(n_{·kd} + α_k) − log Γ(α_k) ]
                    + ∑_d [ log Γ(∑_k α_k) − log Γ(∑_k (n_{·kd} + α_k)) ]
                    + ∑_{d,k,t} [ log Γ(n_{tkd} + β_{tk}) − log Γ(β_{tk}) ]
                    + ∑_{d,k} [ log Γ(∑_t β_{tk}) − log Γ(∑_t (n_{tkd} + β_{tk})) ].

  The first two lines depend only on α, and the second two only on β.
  Furthermore, the β_{tk} values can be maximized independently for each k .




                                                 34
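The expression above maps directly onto array operations. Here is a sketch using scipy's gammaln, assuming the counts from one Gibbs sample are stored as dense arrays (an illustration, not the authors' implementation).

    import numpy as np
    from scipy.special import gammaln

    def complete_log_likelihood(n_kd, n_tkd, alpha, beta):
        # n_kd:  D x K array,     n_kd[d, k]     = words in doc d with topic k
        # n_tkd: D x K x V array, n_tkd[d, k, t] = count of word t with topic k in doc d
        # alpha: length-K vector, beta: K x V array.
        # Terms that depend only on alpha:
        L = (gammaln(n_kd + alpha) - gammaln(alpha)).sum()
        L += (gammaln(alpha.sum()) - gammaln((n_kd + alpha).sum(axis=1))).sum()
        # Terms that depend only on beta (separable over topics k):
        L += (gammaln(n_tkd + beta) - gammaln(beta)).sum()
        L += (gammaln(beta.sum(axis=1)) - gammaln((n_tkd + beta).sum(axis=2))).sum()
        return float(L)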
We get K + 1 objectives to maximize:

      α = argmax_α ∑_{d,k} [ log Γ(n_{·kd} + α_k) − log Γ(α_k) ]
                 + ∑_d [ log Γ(∑_k α_k) − log Γ(∑_k (n_{·kd} + α_k)) ]

      β_{·k} = argmax_{β_{·k}} ∑_{d,t} [ log Γ(n_{tkd} + β_{tk}) − log Γ(β_{tk}) ]
                 + ∑_d [ log Γ(∑_t β_{tk}) − log Γ(∑_t (n_{tkd} + β_{tk})) ]

Each equation defines a vector, either {α_k}_k or {β_{tk}}_t .

With a carefully coded Matlab implementation of L-BFGS, one
iteration of EM takes about 100 seconds on sample data.




                                         35
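The slides use L-BFGS in Matlab; a rough Python equivalent for the α objective is sketched below with scipy.optimize, reusing the count arrays from the sketch above. The β_{·k} objectives can be fitted the same way, one topic at a time. Optimizing over log α is an assumption of this sketch, made only to keep the parameters positive.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def fit_alpha(n_kd, alpha0):
        # Maximize the alpha part of the complete log-likelihood with L-BFGS-B.
        # n_kd: D x K topic-count matrix from the current Gibbs sample.
        def neg_objective(log_alpha):
            a = np.exp(log_alpha)
            val = (gammaln(n_kd + a) - gammaln(a)).sum()
            val += (gammaln(a.sum()) - gammaln((n_kd + a).sum(axis=1))).sum()
            return -val

        res = minimize(neg_objective, np.log(np.asarray(alpha0, dtype=float)),
                       method="L-BFGS-B")
        return np.exp(res.x)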
Non-uniform α and β




  Implementations of DCMLDA must allow the             α   vector and   β   array
  to be non-uniform.


  In DCMLDA,    β   carries the information that   ϕ   carries in LDA.


  α   could be uniform in DCMLDA, but learning non-uniform values
  allows certain topics to have higher overall probability than others.




                                     36
Experimental design




  Goal: Check whether handling burstiness in the DCMLDA model yields
  a better topic model than LDA.


  Compare DCMLDA only with LDA, for two reasons:
  (1) Comparable conceptual complexity.
  (2) DCMLDA is not in competition with more complex topic
  models, since those can be modified to include DCM topics.




                                   37
Comparison method



  Given a test set of documents not used for training, estimate the
  likelihood p (w |α, β) for LDA and DCMLDA models.


  For DCMLDA, p(w | α, β) uses the trained α and β.

  For LDA, p(w | α, β) uses α = ᾱ and β = β̄, the scalar means of the
  DCMLDA values.


  Also compare to LDA with heuristic values     β = .01   and   α = 50/K ,
  where K is the number of topics.




                                    38
Datasets




  Compare LDA and DCMLDA as models for both text and non-text
  datasets.


  The text dataset is a collection of papers from the 2002 and 2003
  NIPS conferences, with 520955 words in 390 documents, and
  |V | = 6871.

  The SP 500 dataset contains 501 days of stock transactions
  between January 2007 and September 2008.    |V | = 1000.




                                  39
Digression: Computing likelihood




  The incomplete likelihood p (w |α, β) is intractable for topic models.


  The complete likelihood p (w , z |α, β) is tractable, so previous work
  has averaged it over z , but this approach is unreliable.


  Another possibility is to measure classification accuracy.


  But our datasets do not have obvious classification schemes. Also,
  topics may be more accurate than predefined classes.




                                    40
Empirical likelihood




  To calculate empirical likelihood (EL), first train each model.


  Feed obtained parameter values   α     and   β   into a generative model.


  Get a large set of pseudo documents. Use the pseudo documents to
  train a tractable model: a mixture of multinomials.


  Estimate the test set likelihood as its likelihood under the tractable
  model.




                                    41
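A compact sketch of the EL procedure, assuming the pseudo documents generated from the trained model (e.g. with the DCMLDA generator sketched earlier) have been turned into a document-term count matrix; the mixture-of-multinomials EM below is a generic implementation for illustration, not the authors' code.

    import numpy as np

    def fit_multinomial_mixture(X, n_components, n_iter=100, smooth=1e-2, seed=0):
        # EM for a mixture of multinomials on a (docs x vocab) count matrix X.
        rng = np.random.default_rng(seed)
        D, V = X.shape
        resp = rng.dirichlet(np.ones(n_components), size=D)   # soft assignments
        for _ in range(n_iter):
            # M-step: mixing weights pi and smoothed word probabilities phi
            pi = resp.sum(axis=0) / D
            phi = resp.T @ X + smooth
            phi /= phi.sum(axis=1, keepdims=True)
            # E-step: responsibilities, computed in the log domain for stability
            log_post = X @ np.log(phi).T + np.log(pi)
            log_post -= log_post.max(axis=1, keepdims=True)
            resp = np.exp(log_post)
            resp /= resp.sum(axis=1, keepdims=True)
        return pi, phi

    def empirical_log_likelihood(X_test, pi, phi):
        # Test-set log-likelihood under the tractable mixture, up to the
        # multinomial coefficient, which is the same for every model compared.
        ll = X_test @ np.log(phi).T + np.log(pi)
        m = ll.max(axis=1, keepdims=True)
        return float((m[:, 0] + np.log(np.exp(ll - m).sum(axis=1))).sum())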
Digression²: Stability of EL


  Investigate stability by running EL multiple times for the same
  DCMLDA model.


  Train three independent 20-topic DCMLDA models on the SP500
  dataset, and run EL five times for each model.


  Mean absolute difference of EL values for the same model is 0.08%.


  Mean absolute difference between EL values for separately trained
  DCMLDA models is 0.11%.


  Conclusion: Likelihood values are stable over DCMLDA models
  with a constant number of topics.




                                   42
Cross-validation




  Perform five 5-fold cross-validation trials for each number of topics
  and each dataset.


  First train a DCMLDA model, then create two LDA models.
  Fitted LDA uses the means of the DCMLDA hyperparameters.
  Heuristic LDA uses fixed parameter values.


  Results: For both datasets, DCMLDA is better than fitted LDA,
  which is better than heuristic LDA.
Mean log-likelihood on the SP500 dataset. Heuristic model
likelihood is too low to show. Max. standard error is 11.2.




                   [Figure: mean log-likelihood (roughly −6700 to −6200) versus
                    number of topics (0 to 250), with curves for DCMLDA and LDA.]
                                          44
SP500 discussion




  On the SP500 dataset, the best fit is DCMLDA with seven topics.


  A DCMLDA model with few topics is comparable to an LDA model
  with many topics.


  Above seven topics, DCMLDA likelihood drops. Data sparsity may
  prevent the estimation of β values that generalize well (overfitting).


  LDA seems to underfit regardless of how many topics are used.




                                       45
Mean log-likelihood on the NIPS dataset. Max. standard error 21.5.




                 [Figure: mean log-likelihood (roughly −28000 to −10000) versus
                  number of topics (0 to 100), with curves for DCMLDA,
                  trained LDA, and heuristic LDA.]

                                         46
NIPS discussion




  On the NIPS dataset, DCMLDA outperforms the LDA models at every
  number of topics.


  LDA with heuristic hyperparameter values almost equals the fitted
  LDA model at 50 topics.


  The fitted model is better when the number of topics is small.
Alternative heuristic values for hyperparameters




  Learning α and β is beneficial, both for LDA and DCMLDA models.


  The optimal values are significantly different from previously suggested
  heuristic values.


  The best α values, around 0.7, seem independent of the number of
  topics, unlike the suggested value 50/K .




                                         48
Newer topic models




  Variants include the Correlated Topic Model (CTM) and the
  Pachinko Allocation Model (PAM). These outperform LDA on
  many tasks.


  However, DCMLDA competes only with LDA. The LDA core in
  other models can be replaced by DCMLDA to improve their
  performance.


  DCMLDA and complex topic models are complementary.




                                 49
A bursty correlated topic model


                 [Plate diagrams: the correlated topic model (top), where θ is drawn
                  from a logistic normal with parameters µ and Σ, and its bursty
                  variant (bottom), where φ is drawn per topic per document.]


                            50
Conclusion




  The ability of the DCMLDA model to account for burstiness leads
  to a significant improvement in likelihood over LDA.


  The burstiness of words, and of some non-text data, is an
  important phenomenon to capture in topic modeling.




                                  51
