This document discusses topic modelling and APIs. It proposes representing algorithms like topic modelling as "mills" that encapsulate work without owning data. Mills for topic modelling are described, including resources for creating a topic model, classifying text with a trained model, and getting the classification results. Finally, it reviews the current state of machine learning APIs and acknowledges some references.
17. Latent Dirichlet Allocation (LDA)
✤ Mainly a “clustering” algorithm
✤ Defines topics as latent variables within the documents
✤ Implementations are available in most programming languages
✤ Python => Gensim; Java => Mallet
18. Topic Modelling concepts in LDA
✤ Document: “Bag of words” vs. “Markov chain”
✤ Word: merely an id (“library”=>123, “librarian”=>789)
✤ Dictionary: set of all words
✤ Corpus: set of all documents
✤ Topic: Distribution over words (LDA)
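The dictionary/corpus/bag-of-words concepts above can be sketched in plain Python (the helper names `build_dictionary` and `to_bow` are invented for this illustration):

```python
def build_dictionary(documents):
    # Dictionary: every distinct word across all documents gets an integer id
    dictionary = {}
    for doc in documents:
        for word in doc:
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

def to_bow(doc, dictionary):
    # Bag of words: word order is discarded, only (word_id, count) pairs remain
    bow = {}
    for word in doc:
        word_id = dictionary[word]
        bow[word_id] = bow.get(word_id, 0) + 1
    return sorted(bow.items())

docs = [["library", "book", "library"], ["librarian", "book"]]
dictionary = build_dictionary(docs)   # e.g. "library" -> 0, "book" -> 1, ...
corpus = [to_bow(d, dictionary) for d in docs]
```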
19. Using Latent Dirichlet Allocation
✤ Document as a vector of topic weights {0: 0.01, 12: 0.19, 42: 0.23}
✤ Cosine similarity for document similarity
✤ Document similarity works really well
✤ Not great in some domains [to fix => hierarchical LDA]
✤ Boosting
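Cosine similarity over the sparse topic-weight vectors above can be computed directly; this is a stdlib-only sketch (the example weights for `doc_b` are made up):

```python
import math

def cosine(a, b):
    # Dot product over topic ids shared by both documents
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Documents as {topic_id: weight}, as on the slide
doc_a = {0: 0.01, 12: 0.19, 42: 0.23}
doc_b = {12: 0.25, 42: 0.20}  # hypothetical second document
```

A document is always maximally similar to itself (cosine 1.0), and overlap on topics 12 and 42 gives `doc_a` and `doc_b` a high but sub-unit similarity.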
33. Mills
✤ A single piece of work/specialty (& verb)
✤ Encapsulating an “algorithm”
✤ Do not own data (they do own config, though):
raw data in, processed result out
✤ All calls are safe and idempotent
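One way to read the mill properties above is as a stateless object that holds only configuration; this sketch (the `WordCountMill` name and its config are invented for illustration) shows how "raw data in, processed result out" makes every call safe and idempotent:

```python
class WordCountMill:
    """Hypothetical mill: encapsulates one algorithm, owns config but no data."""

    def __init__(self, lowercase=True):
        # Configuration only; no documents are stored on the mill
        self.lowercase = lowercase

    def process(self, raw_text):
        # Raw data in, processed result out; no state kept between calls,
        # so repeating a call with the same input yields the same output
        words = raw_text.lower().split() if self.lowercase else raw_text.split()
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return counts

mill = WordCountMill()
result = mill.process("Library library book")
```

Because the mill keeps no per-call state, callers can retry freely, which is what makes the calls idempotent in the API sense.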