2. BigML Education Program 2Ensembles
In This Video
• Introduction to Topic Models
• Exploration of Topic Models in the BigML Interface
• Inference of topic distributions using a trained topic
model
• Parameterization of topic models
Data For Topic Models
• Unstructured text data
  • Short stories, novels, newspaper articles
  • Web pages
  • Customer reviews or surveys
  • E-mail messages
• Text data is unlike most machine learning data
  • Often no fields in each row (i.e., no “columns”)
  • Each instance is just the text of the document
Categorizing Instances
• Often, many instances will have words indicating they
are about the same thing (the same topic)
• It may be useful to identify instances corresponding to a
certain topic
• Topic modeling automatically discovers common topics
in the data
• Can assign a score to each instance indicating how
much that instance is “about” a given topic
Generative Modeling
• Decision trees and logistic regression are discriminative
models
  • Aggressively model the classification boundary
  • Parsimonious: don’t consider anything you don’t
  have to
• Topic models are generative models
  • Posit a theory of how the data was generated
  • Tweak the theory to fit the data
[Figure: a document, shown as the text "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.", with the labels "Document" and "Term".]
Topics
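[Figure: a topic repeatedly draws terms from a shared vocabulary ("cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…"). Most generated documents are nonsense, such as "shoe asteroid flashlight pizza…" and "plate giraffe purple jump…", but one matches a real document: "Be not afraid of greatness: some are born great, some achieve greatness…"]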
term probability
shoe ϵ
asteroid ϵ
flashlight ϵ
pizza ϵ
… ϵ
• A topic is a term generator
• Invoke it a bunch of times to get a document
• Most will be nonsense, but eventually you’ll generate
your dataset
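The "term generator" idea above can be sketched in a few lines of Python. This is only an illustration: the `TOPIC` dictionary, its probability values, and the `generate_document` helper are hypothetical names and numbers chosen for the example (the slide itself gives every term the same tiny probability ϵ).

```python
import random

# Hypothetical topic: a mapping from term to probability.
# The values are illustrative and are not taken from the slides.
TOPIC = {"shoe": 0.4, "asteroid": 0.3, "flashlight": 0.2, "pizza": 0.1}


def generate_document(topic, length, rng):
    """Invoke the term generator `length` times and collect the draws."""
    terms = list(topic)
    weights = [topic[term] for term in terms]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(terms, weights=weights, k=length)


rng = random.Random(0)  # seeded so that repeated runs give the same draws
doc = generate_document(TOPIC, 8, rng)
print(" ".join(doc))
```

Calling `generate_document` many times produces many candidate documents. Nearly all of them are word salad, which is the point of the bullet above: the generator is judged by how probable it makes the real documents, not by whether a single draw reads well.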
Topic Models
Topic: travel
word        probability
travel      23.55%
airplane    2.33%
mars        0.003%
mantle      ϵ
…           ϵ

Topic: space
word        probability
space       38.94%
airplane    ϵ
mars        13.43%
mantle      0.05%
…           ϵ

[Figure: each topic draws words from the shared vocabulary ("cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…"). The step labeled "Generate Document" combines the draws into a document, with example words "airplane passport pizza …" and "mars quasar lightyear soda".]
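The following sketch shows one way a document could be generated from the two topics above: for every word, first pick a topic, then pick a word from that topic. The mixing proportions in `topic_mix`, the `EPS` stand-in for ϵ, and the function name are assumptions made for illustration. Only the listed word probabilities come from the slide, and because the unlisted words are omitted, the numbers act as relative weights.

```python
import random

EPS = 1e-6  # assumed stand-in for the slide's "ϵ" (a tiny probability)

# Word weights in percent, as listed on the slide. Unlisted words are left
# out, so random.choices treats these values as relative weights.
TOPICS = {
    "travel": {"travel": 23.55, "airplane": 2.33, "mars": 0.003, "mantle": EPS},
    "space": {"space": 38.94, "airplane": EPS, "mars": 13.43, "mantle": 0.05},
}


def generate_document(topic_mix, length, rng):
    """Draw `length` words: choose a topic per word, then a word from it."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        dist = TOPICS[topic]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(word)
    return words


rng = random.Random(42)
# Hypothetical document that is mostly about travel and partly about space.
doc = generate_document({"travel": 0.7, "space": 0.3}, 10, rng)
print(doc)
```

Setting one topic's proportion to 1.0 reduces this to the single-topic generator, so the mixture is what lets one document be "about" several topics at once.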
Review
• Topic models are generative models for unstructured
text data
• The BigML interface provides an intuitive way to explore
your topic model
• You can get the topic distribution for an instance by
using the “topic distribution” or “batch topic distribution”
options in the model resource view
• Changing the “number of topics” and specifying
“excluded terms” may give you a very different and
possibly better topic model
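To make the idea of a topic distribution concrete, here is a deliberately simplified sketch of scoring one instance against a set of topics. It is not BigML's inference procedure: it assumes the whole document comes from a single topic, a uniform prior over topics, and a small floor probability `EPS` for unlisted words. The topic tables reuse the illustrative travel and space probabilities from the earlier slide, and all names are hypothetical.

```python
import math

EPS = 1e-6  # assumed floor for words a topic (almost) never generates

# Word probabilities per topic, written as fractions rather than percent.
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}


def topic_distribution(words, topics=TOPICS):
    """Return a normalized score per topic for a list of words."""
    # Sum log-probabilities to avoid underflow on long documents.
    log_scores = {
        name: sum(math.log(dist.get(word, EPS)) for word in words)
        for name, dist in topics.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


scores = topic_distribution(["airplane", "travel", "mars"])
print(scores)
```

The scores sum to 1, and a document containing "airplane" and "travel" scores higher on the travel topic than on the space topic. A real topic model lets each document mix several topics, so its distributions are usually less extreme than this single-topic approximation.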