SlideShare uma empresa Scribd logo
1 de 176
Baixar para ler offline
Probabilistic Topic Models

          David M. Blei

    Department of Computer Science
         Princeton University


        August 22, 2011
Information overload



                       As more information becomes
                       available, it becomes more
                       difficult to find and discover
                       what we need.



                       We need new tools to help us
                       organize, search, and
                       understand these vast
                       amounts of information.
Topic modeling




Topic modeling provides methods for automatically organizing,
understanding, searching, and summarizing large electronic archives.
  1   Discover the hidden themes that pervade the collection.
  2   Annotate the documents according to those themes.
  3   Use annotations to organize, summarize, and search the texts.
Discover topics from a corpus

       “Genetics”    “Evolution” “Disease”        “Computers”
           human        evolution     disease       computer
          genome      evolutionary     host           models
             dna         species     bacteria      information
          genetic      organisms     diseases          data
            genes           life    resistance      computers
         sequence         origin     bacterial        system
            gene         biology        new          network
         molecular       groups       strains        systems
        sequencing   phylogenetic     control          model
            map           living    infectious        parallel
       information      diversity     malaria        methods
          genetics        group      parasite        networks
          mapping          new       parasites       software
          project           two       united            new
         sequences      common     tuberculosis    simulations
Model the evolution of topics over time

                       "Theoretical Physics"                                                          "Neuroscience"


  FORCE                                                                                                                         OXYGEN
   o o o                   o                                      LASER                                                         o
           o           o                                                                                                            o
                   o           o                                  o o o                  NERVE                              o
               o                                                        o o                                                                                             o o
                                   o                          o                    o                                                                                o
                   o o                                                                    o o o o o
                       o                                                                            o o
                   o o o                                                                                      o         o                               o       o
     RELATIVITY o          o o
                                 o
                                     o
                                                                                                                o
                                                                                                                    o                   o           o       o
                 o     o       o o o o o                                                                          o
                                                                                                                o o
               o                   o     o o                                                                  o
             o           o                                                                                          o                           o       NEURON
                               o                                                                          o             o
                                                                                                      o                                     o
                   o                          o                                                   o                         o               o
                                                  o
                                                  o                                           o o
               o                                                                          o o                                   o     o
                                              o       o                                                                             o           o
         o                                                                                                                          o o
       o                     o                            o
   o o                   o o                                  o                                                           o o               o
                     o o                                          o                                                   o o                         o
                                                                                                                                                o o
   o o o o o o o o o                                                  o                                           o                                 o o o
                                                                          o                             o o                                           o o o
                                                                              o                   o o o                                                     o o
                                                                                  o o     o o o o                                                         o
                                                                                                                                                            o o


  1880     1900            1920        1940       1960            1980            2000   1880   1900          1920      1940            1960            1980            2000
Model connections between topics
                                                                                                                                                             neurons
                                                                                                                              brain                          stimulus
                                                                                                                                                               motor
                                                                                                                             memory
                                                                                                                                                               visual
                                 activated                                                                                   subjects                                           synapses
                         tyrosine phosphorylation                                                                                                             cortical
                                                                                                                                left                                                ltp
                                 activation
                             phosphorylation                   p53                                                             task            surface                          glutamate
                                  kinase                    cell cycle                 proteins                                                   tip                            synaptic
                                                             activity                   protein
                                                              cyclin                   binding              rna                                 image                            neurons
                                                           regulation                  domain               dna                                sample            materials
                                                                                                                               computer
                                                                                       domains        rna polymerase                                              organic
                                                                                                                                problem        device
                                        receptor                                                         cleavage
                                                                                                                             information
                                                                                                                                                                  polymer
                   science                                               amino acids
  research
                  scientists
                                       receptors                             cdna
                                                                                                            site
                                                                                                                              computers
                                                                                                                                                                 polymers
   funding                                                                                                                                                       molecules      physicists
   support          says                  ligand                          sequence                                             problems
                                                                                                                                             laser                               particles
     nih          research               ligands                           isolated                                                         optical                              physics
  program          people                                                   protein                     sequence                              light
                                       apoptosis                                                       sequences         surface
                                                                                                                                                                                 particle
                                                                                                                                           electrons                           experiment
                                                                                                         genome           liquid           quantum
                                                          wild type                                        dna          surfaces                                                                 stars
                                                           mutant                       enzyme         sequencing          fluid
                                                          mutations                    enzymes                            model                       reaction                               astronomers
   united states                                          mutants
                                                                                          iron
                                                                                       active site
                                                                                                                                                     reactions                                 universe
       women                               cells
                                                          mutation                     reduction                                                     molecule                                  galaxies
    universities
                                            cell                                                                                                     molecules
                                        expression                                                               magnetic
                                                                                                                                                                                                galaxy
                                         cell lines                                        plants
                                                                                                                magnetic field                     transition state
      students                         bone marrow                                          plant
                                                                                                                    spin
                                                                                                              superconductivity
                                                                                           gene
     education                                                                             genes
                                                                                                              superconducting
                                                                                                                                                pressure                    mantle
                                                                                        arabidopsis
                                                     bacteria                                                                                high pressure                   crust                   sun
                                                     bacterial                                                                                 pressures                 upper mantle             solar wind
                                                       host                                                  fossil record                        core                    meteorites                earth
                                                    resistance                         development               birds                         inner core                   ratios                 planets
             mice                                    parasite                            embryos                fossils                                                                             planet
                                                                        gene                                  dinosaurs
           antigen                      virus                                           drosophila                                species
                                                                      disease                                    fossil
            t cells                       hiv                                             genes                                    forest
                                                                     mutations
          antigens                       aids                                           expression                                forests
                                                                      families                                                                                    earthquake                  co2
       immune response                infection                                                                                 populations
                                                                     mutation                                                                                     earthquakes               carbon
                                       viruses                                                                                  ecosystems
                                                                                                                                                                      fault             carbon dioxide
                                                                                                                    ancient                                         images                 methane
                          patients                                                             genetic               found
                          disease                          cells                             population             impact
                                                                                                                                                                      data                   water
                                                                                                                                                                                                              ozone
                         treatment                       proteins                            populations      million years ago        volcanic                                                           atmospheric
                           drugs                                                             differences             africa
                           clinical                    researchers                                                                     deposits                        climate
                                                                                                                                                                                                         measurements
                                                                                              variation                                                                                                   stratosphere
                                                         protein                                                                       magma                           ocean
                                                                                                                                       eruption                           ice                            concentrations
                                                          found                                                                       volcanism                       changes
                                                                                                                                                                  climate change
Find hierarchies of topics

                                                                                                                            online
                                                                                                                          scheduling
                                                                                            quantum                          task
                         Quantum lower bounds by polynomials                                                              competitive
                         On the power of bounded concurrency I: finite automata             automata                                         approximation
                                                                                               nc                           tasks
                         Dense quantum coding and quantum finite automata                                                                          s
                         Classical physics and the Church--Turing Thesis                   automaton                                             points
                                                                                           languages                                           distance
                                                                                                                                                convex
                                                                                                                       n                                                          routing                  Nearly optimal algorithms and bounds for multilayer channel routing
                                                                                       machine                     functions                                                     adaptive                  How bad is selfish routing?
                                                                                        domain                    polynomial                                 networks             network                  Authoritative sources in a hyperlinked environment
                                                                                                                                                                                 networks                  Balanced sequences and optimal routing
                                                                                        degree                        log                                    protocol            protocols
                                                                                       degrees
                                                                                      polynomials                  algorithm                                 network
                                                                                                                                                             packets
                                                                                                                                                               link
                                                                                                     learning
                                                                                                    learnable
                                                                                                    statistical                                                                                     constraint
                                                                                                    examples                                                                                      dependencies                  Module algebra
                                                                                                      classes                                                                                           local                   On XML integrity constraints in the presence of DTDs
 An optimal algorithm for intersecting line segments in the plane           graph                                                                                                                                               Closure properties of constraints
 Recontamination does not help to search a graph                            graphs                                                                                                                 consistency                  Dynamic functional dependencies and database aging
 A new approach to the maximum-flow problem                                  edge                                                                                                                    tractable
 The time complexity of maximum matching by simulated annealing            minimum                                                      the,of                             database
                                                                           vertices                                                                                       constraints
                                                                                                                                         a, is                              algebra
                                                                                                                                         and                                boolean                       logic
                                                                     m                                                                                                     relational                    logics
                                                                merging                            n                                                                                                     query
                                                                networks                      algorithm                                                                                                 theories
                                                                 sorting                                                                                                                              languages
                                                               multiplication                    time
                                                                                                  log
                                                                                               bound
                                                                                                                                                                   system
                                                                                                                                                                                              learning
                                                                      consensus
                                                                                                                                                                  systems                    knowledge
                                                                        objects
                                                                                                                                    logic                       performance                  reasoning
                                                                       messages
                                                                        protocol                                                 programs                         analysis                   verification
                                                                                                                                                                                               circuit
                                                                     asynchronous
                                                                                                                                  systems                        distributed
                                                                                                                                 language
                                                                                            trees
                                                                                           regular                                   sets                                             networks
                                                                                                                                                                                                                Single-class bounds of multi-class queuing networks
                                                                                             tree                                                                                     queuing
                                                                                                                                                                                                                The maximum concurrent flow problem
                                                                                           search                                                                                    asymptotic                 Contention in shared memory algorithms
                                                                                         compression                                                               database         productform                 Linear probing with a nonuniform address distribution
                                                                                                                                                                 transactions          server
                                                                                                                                                                    retrieval
                                                                                                                                                                 concurrency
                                     Magic Functions: In Memoriam: Bernard M. Dwork 1923--1998                 proof                                              restrictions
                                                                                                              property                    formulas
                                     A mechanical proof of the Church-Rosser theorem
                                                                                                              program                     firstorder
                                     Timed regular expressions
                                     On the power and limitations of strictness analysis                     resolution                   decision
                                                                                                              abstract                    temporal
                                                                                                                                           queries
Annotate images
                          Automatic image annotation
 tain people              Automatic predictedwaterpattern textile display predicted caption: tree
                   predicted caption:
                                     image annotation leaves branch
                                         predicted caption:
                                             caption:
                                    people market tree mountain people
                                         sky
                   birds nest leaves branch tree                          birds nest
                                                                                                                                        p
                                                                                                                                        p




utomatic image mountainnest caption: flowers tree coralpredicted caption: image anno
          sky water tree mountain people annotation hills tree
                 predicted caption:
          predicted caption:
 e coral SKY WATER TREE
                sky water buildings people             fish branch Automatic
                                                  predicted caption:
                                           predicted predicted caption:    predicted caption:
                                               SCOTLAND WATER people marketWATER BUILDING
                                           birds scotland water ocean
                                                     leaves water tree
                                                                                                             p
                                                                        SKY patternbuildings people mountain s
                                                                          sky water textile display
         predicted caption:                        predicted caption:HILLS TREEpredicted caption:
                                                      FLOWER
        MOUNTAIN PEOPLE
         sky water tree mountain people            birds nest leaves branch tree
                                                                                            PEOPLE WATER
                                                                                 people market pattern textile display




             predicted caption:                  predicted caption:                      predicted caption:
            fish water ocean tree coral         sky water buildings people mountain scotland water flowers hills tree
            predicted caption:
                   predicted caption:                  predicted caption: Probabilistic modelsof text andpredicted caption:
                                                predicted caption: caption:
                                                            predicted                   predicted caption: p.5/53
                                                                                                          images –
                                                                                                                                        pr
 tain peoplefish water ocean leaves branch treePEOPLE sky water pattern textile display water flowersleavestree
        FISH WATERtree coral
                   birds nest OCEAN                    people market tree mountain scotland BIRDS NEST TREE
                                                            MARKET PATTERN
                                                sky water buildings peoplemountain people                birds nest hills branch tree   pe
               TREE CORAL                               TEXTILE DISPLAY                               BRANCH LEAVES
Discover influential articles



                                                                                                               Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous
                                                                                                               DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003)
                                                                                                               [178 citations]




                     0.030

                                                                          Jared M. Diamond, Distributional Ecology of New Guinea Birds. Science (1973)
                     0.025                                               [296 citations]
 WeightedInfluence




                     0.020

                                                            William K. Gregory, The New Anthropogeny: Twenty-Five Stages of
                     0.015                                  Vertebrate Evolution, from Silurian Chordate to Man, Science (1933)
                                                            [3 citations]

                     0.010


                     0.005


                     0.000

                                              1880                    1900                    1920                     1940                   1960                   1980                    2000
                                                                                                                  Year
                      W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916)
                      [3 citations]
Table 2
Predict links between articles + Regression for two documents
   Top eight link predictions made by RTM (ψ ) and LDA
                                                e
   (italicized) from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual
      documents cited by or citing each document. Over the whole corpus, RTM improves
      precision over LDA + Regression by 80% when evaluated on the first 20 documents
                                             retrieved.
          Markov chain Monte Carlo convergence diagnostics: A comparative review
  Minorization conditions and convergence rates for Markov chain Monte Carlo
                Rates of convergence of the Hastings and Metropolis algorithms




                                                                                                 RTM (ψe )
              Possible biases induced by MCMC convergence diagnostics
       Bounding convergence time of the Gibbs sampler in Bayesian image restoration
                          Self regenerative Markov chain Monte Carlo
         Auxiliary variable methods for Markov chain Monte Carlo with applications
      Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
               Diagnosing convergence of Markov chain Monte Carlo algorithms
                    Exact Bound for the Convergence of Metropolis Chains




                                                                                                 LDA + Regression
                          Self regenerative Markov chain Monte Carlo
  Minorization conditions and convergence rates for Markov chain Monte Carlo
                                      Gibbs-markov models
         Auxiliary variable methods for Markov chain Monte Carlo with applications
  Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
                                Mediating instrumental variables
                      A qualitative framework for probabilistic inference
                           Adaptation for Self Regenerative MCMC


            Competitive environments evolve better solutions for complex tasks
                      Coevolving High Level Representations
                          A Survey of Evolutionary Strategies




                                                                                           R
Characterize political decisions

             tax credit,budget authority,energy,outlays,tax
                  county,eligible,ballot,election,jurisdiction
        bank,transfer,requires,holding company,industrial
                    housing,mortgage,loan,family,recipient
                   energy,fuel,standard,administrator,lamp
                       student,loan,institution,lender,school
                    medicare,medicaid,child,chip,coverage
                     defense,iraq,transfer,expense,chapter
       business,administrator,bills,business concern,loan
  transportation,rail,railroad,passenger,homeland security
                     cover,bills,bridge,transaction,following
                          bills,tax,subparagraph,loss,taxable
                        loss,crop,producer,agriculture,trade
                           head,start,child,technology,award
                          computer,alien,bills,user,collection
            science,director,technology,mathematics,bills
         coast guard,vessel,space,administrator,requires
                              child,center,poison,victim,abuse
                                       land,site,bills,interior,river
                        energy,bills,price,commodity,market
                 surveillance,director,court,electronic,flood
                                 child,fire,attorney,internet,bills
                      drug,pediatric,product,device,medical
                 human,vietnam,united nations,call,people
                              bills,iran,official,company,sudan
              coin,inspector,designee,automobile,lebanon
                 producer,eligible,crop,farm,subparagraph
                    people,woman,american,nation,school
                             veteran,veterans,bills,care,injury
   dod,defense,defense and appropriation,military,subtitle
Organize and browse large corpora
This tutorial




  • What are topic models?
  • What kinds of things can they do?
  • How do I compute with a topic model?
  • What are some unsanswered questions in this field?
  • How can I learn more?
Related subjects

Topic modeling is a case study in modern machine learning with
probabilistic models. It touches on

  • Directed graphical models
  • Conjugate priors and nonconjugate priors
  • Time series modeling
  • Modeling with graphs
  • Hierarchical Bayesian methods
  • Approximate posterior inference (MCMC, variational methods)
  • Exploratory and descriptive data analysis
  • Model selection and Bayesian nonparametric methods
  • Mixed membership models
  • Prediction from sparse and noisy inputs
If you remember one picture...



 Assumptions
                 Inference   Discovered structure
                 algorithm



 Data
Organization

  • Introduction to topic modeling
      •   Latent Dirichlet allocation
      •   Open source implementations and tools

  • Beyond latent Dirichlet allocation
      • Modeling richer assumptions
      • Supervised topic modeling
      • Bayesian nonparametric topic modeling

  • Algorithms
      • Gibbs sampling
      • Variational inference
      • Online variational inference

  • Discussion, open questions, and resources
Introduction to Topic Modeling
Probabilistic modeling



  1   Data are assumed to be observed from a generative probabilistic
      process that includes hidden variables.
        • In text, the hidden variables are the thematic structure.

  2   Infer the hidden structure using posterior inference
         • What are the topics that describe this collection?

  3   Situate new data into the estimated model.
         • How does a new document fit into the topic structure?
Latent Dirichlet allocation (LDA)




       Simple intuition: Documents exhibit multiple topics.
Generative model for LDA

                                              Topic proportions and
       Topics              Documents
                                                  assignments
    gene      0.04
    dna       0.02
    genetic   0.01
    .,,




    life     0.02
    evolve   0.01
    organism 0.01
    .,,




    brain     0.04
    neuron    0.02
    nerve     0.01
    ...



    data     0.02
    number   0.02
    computer 0.01
    .,,




  • Each topic is a distribution over words
  • Each document is a mixture of corpus-wide topics
  • Each word is drawn from one of those topics
The posterior distribution

                                                Topic proportions and
     Topics               Documents
                                                    assignments




  • In reality, we only observe the documents
  • The other structure are hidden variables
The posterior distribution

                                                Topic proportions and
     Topics                 Documents
                                                    assignments




  • Our goal is to infer the hidden variables
  • I.e., compute their distribution conditioned on the documents
              p(topics, proportions, assignments | documents)
LDA as a graphical model

                              Per-word
      Proportions
                          topic assignment
       parameter

                 Per-document         Observed                          Topic
               topic proportions        word             Topics       parameter




           α         θd        Zd,n    Wd,n                βk            η
                                                 N
                                                     D            K

  • Encodes our assumptions about the data
  • Connects to algorithms for computing with data
  • See Pattern Recognition and Machine Learning (Bishop, 2006).
LDA as a graphical model

                               Per-word
       Proportions
                           topic assignment
        parameter

                  Per-document         Observed                          Topic
                topic proportions        word             Topics       parameter




            α         θd        Zd,n    Wd,n                βk            η
                                                  N
                                                      D            K

  • Nodes are random variables; edges indicate dependence.
  • Shaded nodes are observed.
  • Plates indicate replicated variables.
LDA as a graphical model

                                   Per-word
        Proportions
                               topic assignment
         parameter

                   Per-document            Observed                           Topic
                 topic proportions           word              Topics       parameter




             α            θd        Zd,n    Wd,n                 βk            η
                                                      N
                                                          D             K

    K                 D                     N
          p(βi | η)         p(θd | α)            p(zd,n | θd )p(wd,n | β1:K , zd,n )
    i=1               d=1                  n=1
LDA




      α        θd      Zd,n    Wd,n                βk          η
                                        N
                                            D           K

 • This joint defines a posterior.

 • From a collection of documents, infer
      • Per-word topic assignment zd,n
      • Per-document topic proportions θd
      • Per-corpus topic distributions βk

 • Then use posterior expectations to perform the task at hand,
   e.g., information retrieval, document similarity, exploration, ...
LDA




        α        θd     Zd,n     Wd,n               βk        η
                                        N
                                             D           K

Approximate posterior inference algorithms
  • Mean field variational methods (Blei et al., 2001, 2003)
  • Expectation propagation (Minka and Lafferty, 2002)
  • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
  • Collapsed variational inference (Teh et al., 2006)
  • Online variational inference (Hoffman et al., 2010)
Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).
Example inference




       α       θd     Zd,n   Wd,n              βk         η
                                     N
                                         D          K


  • Data: The OCR’ed collection of Science from 1990–2000
      • 17K documents
      • 11M words
      • 20K unique terms (stop words and rare words removed)

  • Model: 100-topic LDA model using variational inference.
Example inference




                                  0.4
                                  0.3
                    Probability

                                  0.2
                                  0.1
                                  0.0
                                        1 8 16 26 36 46 56 66 76 86 96

                                                     Topics
Example inference

      “Genetics”    “Evolution” “Disease”        “Computers”
          human        evolution     disease       computer
         genome      evolutionary     host           models
            dna         species     bacteria      information
         genetic      organisms     diseases          data
           genes           life    resistance      computers
        sequence         origin     bacterial        system
           gene         biology        new          network
        molecular       groups       strains        systems
       sequencing   phylogenetic     control          model
           map           living    infectious        parallel
      information      diversity     malaria        methods
         genetics        group      parasite        networks
         mapping          new       parasites       software
         project           two       united            new
        sequences      common     tuberculosis    simulations
Example inference (II)
Example inference (II)


          problem           model        selection        species
         problems            rate           male           forest
        mathematical      constant          males        ecology
          number        distribution      females           fish
             new             time            sex        ecological
        mathematics       number          species     conservation
         university           size         female        diversity
             two            values       evolution     population
             first           value      populations        natural
          numbers          average      population     ecosystems
            work             rates         sexual     populations
             time            data        behavior      endangered
       mathematicians      density     evolutionary      tropical
            chaos        measured         genetic         forests
           chaotic         models      reproductive     ecosystem
2600




                                                                                           Perplexity
Held out perplexity                                                                                     2400

                                                                                                        2200

                                                                                                        2000

                                                                                                        1800
                                   B LEI , N G , AND J ORDAN
                                                                                                        1600

                                                                                                        1400
                                                                                                            0   10    20   30     40          50   60     70     80     90     100
                                  Nematode abstracts                                                                       Associated Press
                                                                                                                             Number of Topics
                3400                                                                                    7000
                                                                 Smoothed Unigram                                                                        Smoothed Unigram
                                                                 Smoothed Mixt. Unigrams                                                                 Smoothed Mixt. Unigrams
                3200                                                                                    6500
                                                                 LDA                                                                                     LDA
                                                                 Fold in pLSI                                                                            Fold in pLSI
                3000
                                                                                                        6000
                2800
                                                                                                        5500




                                                                                           Perplexity
                2600
   Perplexity




                                                                                                        5000
                2400
                                                                                                        4500
                2200

                                                                                                        4000
                2000

                1800                                                                                    3500

                1600                                                                                    3000

                1400                                                                                    2500
                    0   10   20    30     40      50      60    70     80     90     100                    0   20    40   60     80      100      120   140     160    180    200
                                        Number of Topics
                                                                                                                                Number of Topics
                7000
                                                               Smoothed Unigram
                                                                                Figure
                                                               Smoothed Mixt. Unigrams   9: Perplexity results on the nematode (Top) and AP (Bottom) corpora for LDA, the unigram
                6500                                           LDA                          model, mixture of unigrams, and pLSI.
                                                               Fold in pLSI
                6000

                                                                                                −                    log p(wd )
                                        perplexity = exp                                                        d
                5500
   Perplexity




                                                                                                                      d Nd
                5000                                                                                                                   1010

                4500

                4000

                3500

                3000
Used to explore and browse document collections
Aside: The Dirichlet distribution


  • The Dirichlet distribution is an exponential family distribution over
    the simplex, i.e., positive vectors that sum to one

                                    Γ(   αi )
                       p(θ | α) =        i
                                                    θiαi −1 .
                                     i Γ(αi )   i

  • It is conjugate to the multinomial. Given a multinomial
    observation, the posterior distribution of θ is a Dirichlet.

  • The parameter α controls the mean shape and sparsity of θ.

  • The topic proportions are a K dimensional Dirichlet.
    The topics are a V dimensional Dirichlet.
α=1

                                    1                                       2                                       3                                       4                                       5
          1.0

          0.8

          0.6

          0.4
                    q                                                                                               q                               q
                                                                                    q                                                                                           q
          0.2           q                                                                                               q                                                           q
                                    q                                   q                           q
                                                                q               q       q                                               q       q                                           q               q       q
                                                        q                                                                       q           q                   q   q       q                   q
                                        q                   q               q                               q               q                                           q                                       q
                q           q   q           q   q                                               q                                   q                                                   q               q
                                                    q               q                       q                   q                                       q                                           q
          0.0                                                                                           q                                                   q


                                    6                                       7                                       8                                       9                                   10
          1.0

          0.8

          0.6
  value




          0.4
                                                                                                    q                                                               q           q
                                                q                       q                                   q                                           q
          0.2           q           q                   q
                                                                                                                    q
                                                                                                                                                q
                                                                                                                                                                                                    q
                                                                    q                                                               q                                               q       q
                                q                                                                                                                                                       q                           q
                    q                       q                                   q       q   q                   q           q               q                           q
                            q                       q       q   q           q       q                   q                                                       q           q                                   q
                q                       q                                                       q                       q       q       q           q       q                                               q
          0.0                                                                                                                                                                                   q       q


                                11                                      12                                      13                                      14                                      15
          1.0

          0.8

          0.6
                                                                                                                                                                        q
          0.4
                                                                                                                                                                                        q
                                                                                                                            q
                                q                           q                                                           q
          0.2                               q                                   q                                                                                                                           q
                        q           q                                   q                       q                                                           q
                                                q                                                       q                                                                                                           q
                                                                            q               q                                                   q                           q                       q   q       q
                                        q               q                                           q                           q       q
                                                                    q                                       q       q                       q       q           q                           q   q
                            q                       q           q                   q   q                       q                                       q           q               q
          0.0   q   q                                                                                                               q                                           q


                1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                            item
α = 10

                                     1                                       2                                       3                                       4                                       5
           1.0

           0.8

           0.6

           0.4

                                                 q
           0.2                                                                                   q                   q
                                                                                                                                                                         q
                                             q                   q               q       q   q               q                   q                                                       q                           q
                 q   q                                                                                                                   q   q           q   q               q               q       q       q   q
                             q   q   q               q   q   q       q   q           q                           q       q                       q   q               q           q   q                   q
                         q               q                                   q                       q   q                   q       q                           q
                                                                                                                                                                                                 q
           0.0
                                     6                                       7                                       8                                       9                                   10
           1.0

           0.8

           0.6
   value




           0.4

           0.2                           q                       q                                                                                                                                   q
                                                 q           q                               q                       q           q           q       q               q                   q                       q
                 q       q                                           q   q               q           q           q       q   q                   q               q                                       q           q
                     q               q               q                       q       q           q       q                           q                   q   q           q   q       q       q               q
                             q   q           q           q                       q                           q                           q                                       q               q
           0.0
                                 11                                      12                                      13                                      14                                      15
           1.0

           0.8

           0.6

           0.4

           0.2               q                                   q                       q                                           q
                                                                                                                                         q
                                                                                                                                             q                                                                   q
                     q   q                                                                           q                                                           q               q                                   q
                                     q       q           q           q                                               q       q   q                       q                   q           q               q   q
                 q               q       q       q   q       q           q   q   q           q   q       q   q           q                                           q
                                                                                     q                           q                               q   q       q           q           q       q   q   q
           0.0
                 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                             item
α = 100

                                     1                                       2                                       3                                       4                                       5
           1.0

           0.8

           0.6

           0.4

           0.2
                 q   q           q       q               q       q       q           q   q       q   q   q               q   q   q           q   q   q       q       q   q       q           q       q       q   q   q
                         q   q       q       q   q   q       q       q       q   q           q               q   q   q               q   q               q       q           q       q   q       q       q

           0.0
                                     6                                       7                                       8                                       9                                   10
           1.0

           0.8

           0.6
   value




           0.4

           0.2
                 q   q   q   q   q       q   q   q   q       q   q   q   q   q   q   q   q   q   q   q   q       q       q   q   q   q   q   q   q   q   q   q   q   q       q   q   q   q       q   q   q   q   q   q
                                     q                   q                                                   q       q                                                   q                   q

           0.0
                                 11                                      12                                      13                                      14                                      15
           1.0

           0.8

           0.6

           0.4

           0.2
                 q       q   q               q   q   q   q               q       q   q   q   q   q       q   q       q   q       q           q       q   q       q           q   q       q   q           q   q       q
                     q           q   q   q                   q   q   q       q                       q           q           q       q   q       q           q       q   q           q           q   q           q

           0.0
                 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                             item
α=1

                                    1                                       2                                       3                                       4                                       5
          1.0

          0.8

          0.6

          0.4
                    q                                                                                               q                               q
                                                                                    q                                                                                           q
          0.2           q                                                                                               q                                                           q
                                    q                                   q                           q
                                                                q               q       q                                               q       q                                           q               q       q
                                                        q                                                                       q           q                   q   q       q                   q
                                        q                   q               q                               q               q                                           q                                       q
                q           q   q           q   q                                               q                                   q                                                   q               q
                                                    q               q                       q                   q                                       q                                           q
          0.0                                                                                           q                                                   q


                                    6                                       7                                       8                                       9                                   10
          1.0

          0.8

          0.6
  value




          0.4
                                                                                                    q                                                               q           q
                                                q                       q                                   q                                           q
          0.2           q           q                   q
                                                                                                                    q
                                                                                                                                                q
                                                                                                                                                                                                    q
                                                                    q                                                               q                                               q       q
                                q                                                                                                                                                       q                           q
                    q                       q                                   q       q   q                   q           q               q                           q
                            q                       q       q   q           q       q                   q                                                       q           q                                   q
                q                       q                                                       q                       q       q       q           q       q                                               q
          0.0                                                                                                                                                                                   q       q


                                11                                      12                                      13                                      14                                      15
          1.0

          0.8

          0.6
                                                                                                                                                                        q
          0.4
                                                                                                                                                                                        q
                                                                                                                            q
                                q                           q                                                           q
          0.2                               q                                   q                                                                                                                           q
                        q           q                                   q                       q                                                           q
                                                q                                                       q                                                                                                           q
                                                                            q               q                                                   q                           q                       q   q       q
                                        q               q                                           q                           q       q
                                                                    q                                       q       q                       q       q           q                           q   q
                            q                       q           q                   q   q                       q                                       q           q               q
          0.0   q   q                                                                                                               q                                           q


                1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                            item
α = 0.1

                                     1                                       2                                       3                                       4                                       5
           1.0                           q                                                                                               q


           0.8                                                                           q
                                                                                                                                                                                                     q

           0.6
                                                                                                             q
           0.4
                                                                                                         q

           0.2                                                               q                                               q                                                                               q
                                                                                                                                                                                                 q
                                                                 q   q                               q
                                     q                                   q                                                                   q                                                                       q
           0.0   q   q   q   q   q           q   q   q   q   q                   q   q       q   q               q   q   q       q   q           q   q   q   q   q   q   q   q   q   q   q   q           q       q


                                     6                                       7                                       8                                       9                                   10
           1.0

           0.8                                                                                                                                                                                           q

                                                                                                                                                         q

           0.6
   value




                                                                                                 q
           0.4                   q                                                                               q
                                                                                             q
                                     q                                   q                                                                   q
           0.2                                   q                                                           q
                                                                     q
                                                                                 q
                             q                                                                                                                                                                       q           q
                                                                             q                                                                               q
                 q   q                               q                                                                                                                                   q   q
           0.0           q               q   q           q   q   q                   q   q           q   q           q   q   q   q   q   q       q   q           q   q   q   q   q   q           q           q       q


                                 11                                      12                                      13                                      14                                      15
           1.0                                                                                                                           q

                                                                                         q
           0.8
                                     q
           0.6
                                                                                                                                     q

           0.4
                                                                                                                                                                                                                     q
                         q                                                                                                                                                           q               q
                                                                                                             q
           0.2                                                                                                   q

                                         q               q                                                                                                                                               q
                                                                                                                                                                                         q
                     q                                       q                                   q       q                                                   q           q       q                           q
           0.0   q           q   q           q   q   q           q   q   q   q   q   q       q       q               q   q   q   q           q   q   q   q       q   q       q               q   q               q


                 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                             item
α = 0.01

                                     1                                       2                                       3                                       4                                       5
           1.0       q
                                                                                     q                                                                       q
                                                                                                                                                                                                             q



           0.8

                                                                                                             q
           0.6

           0.4                                                                                       q



           0.2
                                                                         q                                                                                                   q
           0.0   q       q   q   q   q   q   q   q   q   q   q   q   q       q   q       q   q   q       q       q   q   q   q   q   q   q   q   q   q   q       q   q   q       q   q   q   q   q   q   q       q   q


                                     6                                       7                                       8                                       9                                   10
           1.0               q                                                           q                                   q                                               q                           q



           0.8

           0.6
   value




           0.4

           0.2

           0.0   q   q   q       q   q   q   q   q   q   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q   q   q   q       q   q   q   q   q   q       q   q   q


                                 11                                      12                                      13                                      14                                      15
           1.0                                                       q                                                                                                                                       q

                                             q
           0.8
                                                                                                                                                             q

           0.6
                                                                                                                     q
                                                                                                                 q
           0.4
                                                                                                                                                                         q

           0.2
                                                 q
                 q
           0.0       q   q   q   q   q   q           q   q   q   q       q   q   q   q   q   q   q   q   q   q           q   q   q   q   q   q   q   q   q       q   q       q   q   q   q   q   q   q   q       q   q


                 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                             item
α = 0.001

                                     1                                       2                                       3                                       4                                       5
           1.0   q                                                                       q                               q                   q                                                                       q



           0.8

           0.6

           0.4

           0.2

           0.0       q   q   q   q   q   q   q   q   q   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q       q   q   q   q       q   q   q   q   q   q   q   q   q   q   q   q   q   q   q   q   q


                                     6                                       7                                       8                                       9                                   10
           1.0           q                                       q                                               q                                               q                           q



           0.8

           0.6
   value




           0.4

           0.2

           0.0   q   q       q   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q   q   q   q       q   q   q   q   q   q       q   q   q   q   q   q


                                 11                                      12                                      13                                      14                                      15
           1.0                                   q                                       q                                   q                                   q                       q



           0.8

           0.6

           0.4

           0.2

           0.0   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q       q   q   q   q   q   q   q   q       q   q   q   q   q       q   q   q   q   q   q   q


                 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
                                                                                                             item
Why does LDA “work”?


Why does the LDA posterior put “topical” words together?

  • Word probabilities are maximized by dividing the words among
    the topics. (More terms means more mass to be spread around.)

  • In a mixture, this is enough to find clusters of co-occurring words.

  • In LDA, the Dirichlet on the topic proportions can encourage
    sparsity, i.e., a document is penalized for using many topics.

  • Loosely, this can be thought of as softening the strict definition of
    “co-occurrence” in a mixture model.

  • This flexibility leads to sets of terms that more tightly co-occur.
Summary of LDA




       α         θd     Zd,n    Wd,n               βk         η
                                        N
                                            D           K


  • LDA can
      •   visualize the hidden thematic structure in large corpora
      •   generalize new data to fit into that structure

  • Builds on Deerwester et al. (1990) and Hofmann (1999)
    It is a mixed membership model (Erosheva, 2004).
    Relates to multinomial PCA (Jakulin and Buntine, 2002)

  • Was independently invented for genetics (Pritchard et al., 2000)
Implementations of LDA


There are many available implementations of topic modeling—

    LDA-C∗         A C implementation of LDA
    HDP∗           A C implementation of the HDP (“infinite LDA”)
    Online LDA∗    A python package for LDA on massive data
    LDA in R∗      Package in R for many topic models
    LingPipe       Java toolkit for NLP and computational linguistics
    Mallet         Java toolkit for statistical NLP
    TMVE∗          A python package to build browsers from topic models

∗   available at www.cs.princeton.edu/∼blei/
Example: LDA in R (Jonathan Chang)

    perspective identifying tumor suppressor genes in human...
    letters global warming report leslie roberts article global....
    research news a small revolution gets under way the 1990s....
    a continuing series the reign of trial and error draws to a close...
    making deep earthquakes in the laboratory lab experimenters...
    quick fix for freeways thanks to a team of fast working...
    feathers fly in grouse population dispute researchers...
               ....

    245 1897:1 1467:1 1351:1 731:2 800:5 682:1 315:6 3668:1 14:1
    260 4261:2 518:1 271:6 2734:1 2662:1 2432:1 683:2 1631:7
    279 2724:1 107:3 518:1 141:3 3208:1 32:1 2444:1 182:1 250:1
    266 2552:1 1993:1 116:1 539:1 1630:1 855:1 1422:1 182:3 2432:1
    233 1372:1 1351:1 261:1 501:1 1938:1 32:1 14:1 4067:1 98:2
    148 4384:1 1339:1 32:1 4107:1 2300:1 229:1 529:1 521:1 2231:1
    193 569:1 3617:1 3781:2 14:1 98:1 3596:1 3037:1 1482:12 665:2
                      ....

    docs <- read.documents("mult.dat")
    K <- 20
    alpha <- 1/20
    eta <- 0.001
    model <- lda.collapsed.gibbs.sampler(documents, K, vocab, 1000, alpha, eta)
1                2                3                4              5
  dna          protein            water            says           mantle
  gene           cell            climate        researchers         high
sequence         cells         atmospheric          new             earth
 genes         proteins        temperature        university      pressure
sequences       receptor          global             just          seismic
  human             fig          surface           science           crust
 genome          binding          ocean              like         temperature
  genetic        activity         carbon            work             earths
  analysis      activation      atmosphere           first           lower
    two          kinase           changes           years         earthquakes

     6                7                8                9             10
  end             time           materials         dna           disease
  article         data           surface            rna           cancer
   start            two              high       transcription     patients
 science          model           structure        protein        human
 readers             fig         temperature         site           gene
 service           system         molecules        binding         medical
  news             number          chemical       sequence         studies
   card            different       molecular       proteins          drug
   circle           results           fig          specific         normal
  letters            rate          university     sequences          drugs

    11             12                13               14              15
 years        species            protein           cells           space
  million      evolution        structure           cell            solar
   ago         population        proteins          virus         observations
    age        evolutionary        two               hiv            earth
 university     university       amino            infection          stars
   north       populations       binding           immune          university
   early          natural           acid            human            mass
    fig           studies        residues          antigen            sun
  evidence        genetic        molecular         infected       astronomers
   record         biology         structural         viral         telescope

    16             17                18               19              20
  fax           cells           energy           research        neurons
manager          cell           electron          science           brain
science         gene              state            national          cells
  aaas          genes             light            scientific       activity
advertising   expression        quantum            scientists         fig
   sales      development        physics              new          channels
  member          mutant         electrons           states        university
recruitment        mice             high            university       cortex
associate            fig           laser             united         neuronal
washington        biology        magnetic             health         visual
Open source document browser (with Allison Chaney)
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince
kddvince

Mais conteúdo relacionado

Destaque

Ecoforum 2011 Dave Johnson Brownfield Regeneration In Nsw
Ecoforum 2011 Dave Johnson Brownfield Regeneration In NswEcoforum 2011 Dave Johnson Brownfield Regeneration In Nsw
Ecoforum 2011 Dave Johnson Brownfield Regeneration In Nswdavidjrca
 
StartupWeekend Kosice 2012 Final pitches
StartupWeekend Kosice 2012 Final pitchesStartupWeekend Kosice 2012 Final pitches
StartupWeekend Kosice 2012 Final pitchesMichal Maxian
 
Interoperabilitas e-gov
Interoperabilitas e-govInteroperabilitas e-gov
Interoperabilitas e-govagungbudip
 
Ansible @ WebElement 2015
Ansible @ WebElement 2015Ansible @ WebElement 2015
Ansible @ WebElement 2015Michal Maxian
 
StartupWeekend Brno #1 Friday Deck
StartupWeekend Brno #1 Friday DeckStartupWeekend Brno #1 Friday Deck
StartupWeekend Brno #1 Friday DeckMichal Maxian
 

Destaque (9)

Ecoforum 2011 Dave Johnson Brownfield Regeneration In Nsw
Ecoforum 2011 Dave Johnson Brownfield Regeneration In NswEcoforum 2011 Dave Johnson Brownfield Regeneration In Nsw
Ecoforum 2011 Dave Johnson Brownfield Regeneration In Nsw
 
StartupWeekend Kosice 2012 Final pitches
StartupWeekend Kosice 2012 Final pitchesStartupWeekend Kosice 2012 Final pitches
StartupWeekend Kosice 2012 Final pitches
 
Internet
InternetInternet
Internet
 
Presentación1
Presentación1Presentación1
Presentación1
 
Interoperabilitas e-gov
Interoperabilitas e-govInteroperabilitas e-gov
Interoperabilitas e-gov
 
ITexperience 2013
ITexperience 2013ITexperience 2013
ITexperience 2013
 
Ansible @ WebElement 2015
Ansible @ WebElement 2015Ansible @ WebElement 2015
Ansible @ WebElement 2015
 
StartupWeekend Brno #1 Friday Deck
StartupWeekend Brno #1 Friday DeckStartupWeekend Brno #1 Friday Deck
StartupWeekend Brno #1 Friday Deck
 
Compressor
CompressorCompressor
Compressor
 

Último

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Último (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

kddvince

  • 1. Probabilistic Topic Models David M. Blei Department of Computer Science Princeton University August 22, 2011
  • 2. Information overload As more information becomes available, it becomes more difficult to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information.
  • 3. Topic modeling Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. 1 Discover the hidden themes that pervade the collection. 2 Annotate the documents according to those themes. 3 Use annotations to organize, summarize, and search the texts.
  • 4. Discover topics from a corpus “Genetics” “Evolution” “Disease” “Computers” human evolution disease computer genome evolutionary host models dna species bacteria information genetic organisms diseases data genes life resistance computers sequence origin bacterial system gene biology new network molecular groups strains systems sequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics group parasite networks mapping new parasites software project two united new sequences common tuberculosis simulations
  • 5. Model the evolution of topics over time "Theoretical Physics" "Neuroscience" FORCE OXYGEN o o o o LASER o o o o o o o o o NERVE o o o o o o o o o o o o o o o o o o o o o o o o o o o RELATIVITY o o o o o o o o o o o o o o o o o o o o o o o o o o o o o NEURON o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o 1880 1900 1920 1940 1960 1980 2000 1880 1900 1920 1940 1960 1980 2000
  • 6. Model connections between topics neurons brain stimulus motor memory visual activated subjects synapses tyrosine phosphorylation cortical left ltp activation phosphorylation p53 task surface glutamate kinase cell cycle proteins tip synaptic activity protein cyclin binding rna image neurons regulation domain dna sample materials computer domains rna polymerase organic problem device receptor cleavage information polymer science amino acids research scientists receptors cdna site computers polymers funding molecules physicists support says ligand sequence problems laser particles nih research ligands isolated optical physics program people protein sequence light apoptosis sequences surface particle electrons experiment genome liquid quantum wild type dna surfaces stars mutant enzyme sequencing fluid mutations enzymes model reaction astronomers united states mutants iron active site reactions universe women cells mutation reduction molecule galaxies universities cell molecules expression magnetic galaxy cell lines plants magnetic field transition state students bone marrow plant spin superconductivity gene education genes superconducting pressure mantle arabidopsis bacteria high pressure crust sun bacterial pressures upper mantle solar wind host fossil record core meteorites earth resistance development birds inner core ratios planets mice parasite embryos fossils planet gene dinosaurs antigen virus drosophila species disease fossil t cells hiv genes forest mutations antigens aids expression forests families earthquake co2 immune response infection populations mutation earthquakes carbon viruses ecosystems fault carbon dioxide ancient images methane patients genetic found disease cells population impact data water ozone treatment proteins populations million years ago volcanic atmospheric drugs differences africa clinical researchers deposits climate measurements variation stratosphere protein magma ocean eruption ice concentrations found volcanism changes climate change
  • 7. Find hierarchies of topics online scheduling quantum task Quantum lower bounds by polynomials competitive On the power of bounded concurrency I: finite automata automata approximation nc tasks Dense quantum coding and quantum finite automata s Classical physics and the Church--Turing Thesis automaton points languages distance convex n routing Nearly optimal algorithms and bounds for multilayer channel routing machine functions adaptive How bad is selfish routing? domain polynomial networks network Authoritative sources in a hyperlinked environment networks Balanced sequences and optimal routing degree log protocol protocols degrees polynomials algorithm network packets link learning learnable statistical constraint examples dependencies Module algebra classes local On XML integrity constraints in the presence of DTDs An optimal algorithm for intersecting line segments in the plane graph Closure properties of constraints Recontamination does not help to search a graph graphs consistency Dynamic functional dependencies and database aging A new approach to the maximum-flow problem edge tractable The time complexity of maximum matching by simulated annealing minimum the,of database vertices constraints a, is algebra and boolean logic m relational logics merging n query networks algorithm theories sorting languages multiplication time log bound system learning consensus systems knowledge objects logic performance reasoning messages protocol programs analysis verification circuit asynchronous systems distributed language trees regular sets networks Single-class bounds of multi-class queuing networks tree queuing The maximum concurrent flow problem search asymptotic Contention in shared memory algorithms compression database productform Linear probing with a nonuniform address distribution transactions server retrieval concurrency Magic Functions: In Memoriam: Bernard M. Dwork 1923--1998 proof restrictions property formulas A mechanical proof of the Church-Rosser theorem program firstorder Timed regular expressions On the power and limitations of strictness analysis resolution decision abstract temporal queries
  • 8. Annotate images Automatic image annotation tain people Automatic predictedwaterpattern textile display predicted caption: tree predicted caption: image annotation leaves branch predicted caption: caption: people market tree mountain people sky birds nest leaves branch tree birds nest p p utomatic image mountainnest caption: flowers tree coralpredicted caption: image anno sky water tree mountain people annotation hills tree predicted caption: predicted caption: e coral SKY WATER TREE sky water buildings people fish branch Automatic predicted caption: predicted predicted caption: predicted caption: SCOTLAND WATER people marketWATER BUILDING birds scotland water ocean leaves water tree p SKY patternbuildings people mountain s sky water textile display predicted caption: predicted caption:HILLS TREEpredicted caption: FLOWER MOUNTAIN PEOPLE sky water tree mountain people birds nest leaves branch tree PEOPLE WATER people market pattern textile display predicted caption: predicted caption: predicted caption: fish water ocean tree coral sky water buildings people mountain scotland water flowers hills tree predicted caption: predicted caption: predicted caption: Probabilistic modelsof text andpredicted caption: predicted caption: caption: predicted predicted caption: p.5/53 images – pr tain peoplefish water ocean leaves branch treePEOPLE sky water pattern textile display water flowersleavestree FISH WATERtree coral birds nest OCEAN people market tree mountain scotland BIRDS NEST TREE MARKET PATTERN sky water buildings peoplemountain people birds nest hills branch tree pe TREE CORAL TEXTILE DISPLAY BRANCH LEAVES
  • 9. Discover influential articles Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003) [178 citations] 0.030 Jared M. Diamond, Distributional Ecology of New Guinea Birds. Science (1973) 0.025 [296 citations] WeightedInfluence 0.020 William K. Gregory, The New Anthropogeny: Twenty-Five Stages of 0.015 Vertebrate Evolution, from Silurian Chordate to Man, Science (1933) [3 citations] 0.010 0.005 0.000 1880 1900 1920 1940 1960 1980 2000 Year W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916) [3 citations]
  • 10. Table 2 Predict links between articles + Regression for two documents Top eight link predictions made by RTM (ψ ) and LDA e (italicized) from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual documents cited by or citing each document. Over the whole corpus, RTM improves precision over LDA + Regression by 80% when evaluated on the first 20 documents retrieved. Markov chain Monte Carlo convergence diagnostics: A comparative review Minorization conditions and convergence rates for Markov chain Monte Carlo Rates of convergence of the Hastings and Metropolis algorithms RTM (ψe ) Possible biases induced by MCMC convergence diagnostics Bounding convergence time of the Gibbs sampler in Bayesian image restoration Self regenerative Markov chain Monte Carlo Auxiliary variable methods for Markov chain Monte Carlo with applications Rate of Convergence of the Gibbs Sampler by Gaussian Approximation Diagnosing convergence of Markov chain Monte Carlo algorithms Exact Bound for the Convergence of Metropolis Chains LDA + Regression Self regenerative Markov chain Monte Carlo Minorization conditions and convergence rates for Markov chain Monte Carlo Gibbs-markov models Auxiliary variable methods for Markov chain Monte Carlo with applications Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models Mediating instrumental variables A qualitative framework for probabilistic inference Adaptation for Self Regenerative MCMC Competitive environments evolve better solutions for complex tasks Coevolving High Level Representations A Survey of Evolutionary Strategies R
  • 11. Characterize political decisions tax credit,budget authority,energy,outlays,tax county,eligible,ballot,election,jurisdiction bank,transfer,requires,holding company,industrial housing,mortgage,loan,family,recipient energy,fuel,standard,administrator,lamp student,loan,institution,lender,school medicare,medicaid,child,chip,coverage defense,iraq,transfer,expense,chapter business,administrator,bills,business concern,loan transportation,rail,railroad,passenger,homeland security cover,bills,bridge,transaction,following bills,tax,subparagraph,loss,taxable loss,crop,producer,agriculture,trade head,start,child,technology,award computer,alien,bills,user,collection science,director,technology,mathematics,bills coast guard,vessel,space,administrator,requires child,center,poison,victim,abuse land,site,bills,interior,river energy,bills,price,commodity,market surveillance,director,court,electronic,flood child,fire,attorney,internet,bills drug,pediatric,product,device,medical human,vietnam,united nations,call,people bills,iran,official,company,sudan coin,inspector,designee,automobile,lebanon producer,eligible,crop,farm,subparagraph people,woman,american,nation,school veteran,veterans,bills,care,injury dod,defense,defense and appropriation,military,subtitle
  • 12. Organize and browse large corpora
  • 13. This tutorial • What are topic models? • What kinds of things can they do? • How do I compute with a topic model? • What are some unsanswered questions in this field? • How can I learn more?
  • 14. Related subjects Topic modeling is a case study in modern machine learning with probabilistic models. It touches on • Directed graphical models • Conjugate priors and nonconjugate priors • Time series modeling • Modeling with graphs • Hierarchical Bayesian methods • Approximate posterior inference (MCMC, variational methods) • Exploratory and descriptive data analysis • Model selection and Bayesian nonparametric methods • Mixed membership models • Prediction from sparse and noisy inputs
  • 15. If you remember one picture... Assumptions Inference Discovered structure algorithm Data
  • 16. Organization • Introduction to topic modeling • Latent Dirichlet allocation • Open source implementations and tools • Beyond latent Dirichlet allocation • Modeling richer assumptions • Supervised topic modeling • Bayesian nonparametric topic modeling • Algorithms • Gibbs sampling • Variational inference • Online variational inference • Discussion, open questions, and resources
  • 18. Probabilistic modeling 1 Data are assumed to be observed from a generative probabilistic process that includes hidden variables. • In text, the hidden variables are the thematic structure. 2 Infer the hidden structure using posterior inference • What are the topics that describe this collection? 3 Situate new data into the estimated model. • How does a new document fit into the topic structure?
  • 19. Latent Dirichlet allocation (LDA) Simple intuition: Documents exhibit multiple topics.
  • 20. Generative model for LDA Topic proportions and Topics Documents assignments gene 0.04 dna 0.02 genetic 0.01 .,, life 0.02 evolve 0.01 organism 0.01 .,, brain 0.04 neuron 0.02 nerve 0.01 ... data 0.02 number 0.02 computer 0.01 .,, • Each topic is a distribution over words • Each document is a mixture of corpus-wide topics • Each word is drawn from one of those topics
  • 21. The posterior distribution Topic proportions and Topics Documents assignments • In reality, we only observe the documents • The other structure are hidden variables
  • 22. The posterior distribution Topic proportions and Topics Documents assignments • Our goal is to infer the hidden variables • I.e., compute their distribution conditioned on the documents p(topics, proportions, assignments | documents)
  • 23. LDA as a graphical model Per-word Proportions topic assignment parameter Per-document Observed Topic topic proportions word Topics parameter α θd Zd,n Wd,n βk η N D K • Encodes our assumptions about the data • Connects to algorithms for computing with data • See Pattern Recognition and Machine Learning (Bishop, 2006).
  • 24. LDA as a graphical model Per-word Proportions topic assignment parameter Per-document Observed Topic topic proportions word Topics parameter α θd Zd,n Wd,n βk η N D K • Nodes are random variables; edges indicate dependence. • Shaded nodes are observed. • Plates indicate replicated variables.
  • 25. LDA as a graphical model Per-word Proportions topic assignment parameter Per-document Observed Topic topic proportions word Topics parameter α θd Zd,n Wd,n βk η N D K K D N p(βi | η) p(θd | α) p(zd,n | θd )p(wd,n | β1:K , zd,n ) i=1 d=1 n=1
  • 26. LDA α θd Zd,n Wd,n βk η N D K • This joint defines a posterior. • From a collection of documents, infer • Per-word topic assignment zd,n • Per-document topic proportions θd • Per-corpus topic distributions βk • Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...
  • 27. LDA α θd Zd,n Wd,n βk η N D K Approximate posterior inference algorithms • Mean field variational methods (Blei et al., 2001, 2003) • Expectation propagation (Minka and Lafferty, 2002) • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002) • Collapsed variational inference (Teh et al., 2006) • Online variational inference (Hoffman et al., 2010) Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).
  • 28. Example inference α θd Zd,n Wd,n βk η N D K • Data: The OCR’ed collection of Science from 1990–2000 • 17K documents • 11M words • 20K unique terms (stop words and rare words removed) • Model: 100-topic LDA model using variational inference.
  • 29. Example inference 0.4 0.3 Probability 0.2 0.1 0.0 1 8 16 26 36 46 56 66 76 86 96 Topics
  • 30. Example inference “Genetics” “Evolution” “Disease” “Computers” human evolution disease computer genome evolutionary host models dna species bacteria information genetic organisms diseases data genes life resistance computers sequence origin bacterial system gene biology new network molecular groups strains systems sequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics group parasite networks mapping new parasites software project two united new sequences common tuberculosis simulations
  • 32. Example inference (II) problem model selection species problems rate male forest mathematical constant males ecology number distribution females fish new time sex ecological mathematics number species conservation university size female diversity two values evolution population first value populations natural numbers average population ecosystems work rates sexual populations time data behavior endangered mathematicians density evolutionary tropical chaos measured genetic forests chaotic models reproductive ecosystem
  • 33. 2600 Perplexity Held out perplexity 2400 2200 2000 1800 B LEI , N G , AND J ORDAN 1600 1400 0 10 20 30 40 50 60 70 80 90 100 Nematode abstracts Associated Press Number of Topics 3400 7000 Smoothed Unigram Smoothed Unigram Smoothed Mixt. Unigrams Smoothed Mixt. Unigrams 3200 6500 LDA LDA Fold in pLSI Fold in pLSI 3000 6000 2800 5500 Perplexity 2600 Perplexity 5000 2400 4500 2200 4000 2000 1800 3500 1600 3000 1400 2500 0 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 120 140 160 180 200 Number of Topics Number of Topics 7000 Smoothed Unigram Figure Smoothed Mixt. Unigrams 9: Perplexity results on the nematode (Top) and AP (Bottom) corpora for LDA, the unigram 6500 LDA model, mixture of unigrams, and pLSI. Fold in pLSI 6000 − log p(wd ) perplexity = exp d 5500 Perplexity d Nd 5000 1010 4500 4000 3500 3000
  • 34. Used to explore and browse document collections
  • 35. Aside: The Dirichlet distribution • The Dirichlet distribution is an exponential family distribution over the simplex, i.e., positive vectors that sum to one Γ( αi ) p(θ | α) = i θiαi −1 . i Γ(αi ) i • It is conjugate to the multinomial. Given a multinomial observation, the posterior distribution of θ is a Dirichlet. • The parameter α controls the mean shape and sparsity of θ. • The topic proportions are a K dimensional Dirichlet. The topics are a V dimensional Dirichlet.
  • 36. α=1 1 2 3 4 5 1.0 0.8 0.6 0.4 q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q 6 7 8 9 10 1.0 0.8 0.6 value 0.4 q q q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q 11 12 13 14 15 1.0 0.8 0.6 q 0.4 q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q q q 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 37. α = 10 1 2 3 4 5 1.0 0.8 0.6 0.4 q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 6 7 8 9 10 1.0 0.8 0.6 value 0.4 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 11 12 13 14 15 1.0 0.8 0.6 0.4 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 38. α = 100 1 2 3 4 5 1.0 0.8 0.6 0.4 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 6 7 8 9 10 1.0 0.8 0.6 value 0.4 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 11 12 13 14 15 1.0 0.8 0.6 0.4 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 39. α=1 1 2 3 4 5 1.0 0.8 0.6 0.4 q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q 6 7 8 9 10 1.0 0.8 0.6 value 0.4 q q q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q 11 12 13 14 15 1.0 0.8 0.6 q 0.4 q q q q q 0.2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 q q q q 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 40. α = 0.1 1 2 3 4 5 1.0 q q 0.8 q q 0.6 q 0.4 q 0.2 q q q q q q q q q q q 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 6 7 8 9 10 1.0 0.8 q q 0.6 value q 0.4 q q q q q q 0.2 q q q q q q q q q q q q q q 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q 11 12 13 14 15 1.0 q q 0.8 q 0.6 q 0.4 q q q q q 0.2 q q q q q q q q q q q q q 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 41. α = 0.01 1 2 3 4 5 1.0 q q q q 0.8 q 0.6 0.4 q 0.2 q q 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 6 7 8 9 10 1.0 q q q q q 0.8 0.6 value 0.4 0.2 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 11 12 13 14 15 1.0 q q q 0.8 q 0.6 q q 0.4 q 0.2 q q 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 42. α = 0.001 1 2 3 4 5 1.0 q q q q q 0.8 0.6 0.4 0.2 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 6 7 8 9 10 1.0 q q q q q 0.8 0.6 value 0.4 0.2 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 11 12 13 14 15 1.0 q q q q q 0.8 0.6 0.4 0.2 0.0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 item
  • 43. Why does LDA “work”? Why does the LDA posterior put “topical” words together? • Word probabilities are maximized by dividing the words among the topics. (More terms means more mass to be spread around.) • In a mixture, this is enough to find clusters of co-occurring words. • In LDA, the Dirichlet on the topic proportions can encourage sparsity, i.e., a document is penalized for using many topics. • Loosely, this can be thought of as softening the strict definition of “co-occurrence” in a mixture model. • This flexibility leads to sets of terms that more tightly co-occur.
  • 44. Summary of LDA α θd Zd,n Wd,n βk η N D K • LDA can • visualize the hidden thematic structure in large corpora • generalize new data to fit into that structure • Builds on Deerwester et al. (1990) and Hofmann (1999) It is a mixed membership model (Erosheva, 2004). Relates to multinomial PCA (Jakulin and Buntine, 2002) • Was independently invented for genetics (Pritchard et al., 2000)
  • 45. Implementations of LDA There are many available implementations of topic modeling— LDA-C∗ A C implementation of LDA HDP∗ A C implementation of the HDP (“infinite LDA”) Online LDA∗ A python package for LDA on massive data LDA in R∗ Package in R for many topic models LingPipe Java toolkit for NLP and computational linguistics Mallet Java toolkit for statistical NLP TMVE∗ A python package to build browsers from topic models ∗ available at www.cs.princeton.edu/∼blei/
  • 46. Example: LDA in R (Jonathan Chang) perspective identifying tumor suppressor genes in human... letters global warming report leslie roberts article global.... research news a small revolution gets under way the 1990s.... a continuing series the reign of trial and error draws to a close... making deep earthquakes in the laboratory lab experimenters... quick fix for freeways thanks to a team of fast working... feathers fly in grouse population dispute researchers... .... 245 1897:1 1467:1 1351:1 731:2 800:5 682:1 315:6 3668:1 14:1 260 4261:2 518:1 271:6 2734:1 2662:1 2432:1 683:2 1631:7 279 2724:1 107:3 518:1 141:3 3208:1 32:1 2444:1 182:1 250:1 266 2552:1 1993:1 116:1 539:1 1630:1 855:1 1422:1 182:3 2432:1 233 1372:1 1351:1 261:1 501:1 1938:1 32:1 14:1 4067:1 98:2 148 4384:1 1339:1 32:1 4107:1 2300:1 229:1 529:1 521:1 2231:1 193 569:1 3617:1 3781:2 14:1 98:1 3596:1 3037:1 1482:12 665:2 .... docs <- read.documents("mult.dat") K <- 20 alpha <- 1/20 eta <- 0.001 model <- lda.collapsed.gibbs.sampler(documents, K, vocab, 1000, alpha, eta)
  • 47. 1 2 3 4 5 dna protein water says mantle gene cell climate researchers high sequence cells atmospheric new earth genes proteins temperature university pressure sequences receptor global just seismic human fig surface science crust genome binding ocean like temperature genetic activity carbon work earths analysis activation atmosphere first lower two kinase changes years earthquakes 6 7 8 9 10 end time materials dna disease article data surface rna cancer start two high transcription patients science model structure protein human readers fig temperature site gene service system molecules binding medical news number chemical sequence studies card different molecular proteins drug circle results fig specific normal letters rate university sequences drugs 11 12 13 14 15 years species protein cells space million evolution structure cell solar ago population proteins virus observations age evolutionary two hiv earth university university amino infection stars north populations binding immune university early natural acid human mass fig studies residues antigen sun evidence genetic molecular infected astronomers record biology structural viral telescope 16 17 18 19 20 fax cells energy research neurons manager cell electron science brain science gene state national cells aaas genes light scientific activity advertising expression quantum scientists fig sales development physics new channels member mutant electrons states university recruitment mice high university cortex associate fig laser united neuronal washington biology magnetic health visual
  • 48. Open source document browser (with Allison Chaney)