User-written movie reviews carry a substantial number of movie-related features, such as descriptions of location, time period, genre, and characters. Using natural language processing and topic-modeling techniques, it is possible to extract these features from movie reviews and find movies with similar features.
2. Outline
● Motivation and Why movie reviews
● Problem statement
● How? or the overall system
● Text preprocessing approaches
● Postprocessing: movie topics from a reviews corpus
● Similarity
● Experimental setup and results
3. Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html
Motivation
4. Motivation
● movie genres are not enough.
● classify movies
○ keywords
○ moods
○ imdb ratings
○ micro genres
7. Problem statement
● Feature extraction from user reviews of movies
● Use extracted features to find similar movies.
8. The overall system
Movie reviews corpus
● preprocessing
○ tokenization, stopword removal, lemmatization.
● post processing
○ topic modeling: Movie topics from a reviews corpus
● similarity measure
○ return movies with similar topic distributions
12. Post processing: LDA
For each document in the collection, the words are assumed to be generated in a two-stage process:
1) Randomly choose a distribution over topics.
2) For each word in the document:
a) Randomly choose a topic from the distribution over topics in step 1.
b) Randomly choose a word from the corresponding distribution over the vocabulary.
Documents exhibit multiple topics
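The two-stage process can be sketched in a few lines of NumPy. This is only an illustration, not the model used in the talk: the toy vocabulary, the two hand-written topics, and the Dirichlet parameter `alpha` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy vocabulary and two hand-written topics (each row sums to 1).
VOCAB = ["broccoli", "bananas", "breakfast", "kittens", "hamster", "cute"]
TOPICS = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # a "food"-like topic
    [0.02, 0.02, 0.01, 0.35, 0.25, 0.35],  # a "cute animals"-like topic
])

def generate_document(n_words, topics, vocab, alpha, rng):
    """Generate one document with the two-stage process described above."""
    n_topics = topics.shape[0]
    # Step 1: randomly choose a distribution over topics for this document.
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    assignments, words = [], []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        z = int(rng.choice(n_topics, p=theta))
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        w = int(rng.choice(len(vocab), p=topics[z]))
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

rng = np.random.default_rng(0)  # fixed seed (assumption) so the run is repeatable
theta, assignments, words = generate_document(8, TOPICS, VOCAB, alpha=0.5, rng=rng)
print(theta, words)
```

Training LDA runs this story in reverse: given only the words, it infers the per-document topic distributions and the per-topic word distributions.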
17. The overall system: implementation
Movie reviews corpus
● preprocessing
○ nltk and gensim’s simple preprocessing.
● post processing
○ gensim python wrapper to MALLET
○ index the topic distributions of the query movies, q, and the 1k-movie corpus, C.
● similarity measure
○ python numpy implementation
○ apply distance metric on indexed q and C.
○ sort and pick top 5 movies.
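The similarity step can be sketched as follows, assuming Hellinger distance as the metric and a topic index stored as a NumPy array with one row per movie. The array shapes, the 20-topic setting, the random data, and the function names are hypothetical; the prototype's actual code is not shown in the slides.

```python
import numpy as np

def hellinger_to_corpus(q, C):
    """Hellinger distance from one topic distribution q to every row of C."""
    return np.sqrt(0.5 * np.sum((np.sqrt(C) - np.sqrt(q)) ** 2, axis=1))

def top_k_similar(q, C, k=5):
    """Return indices and distances of the k corpus movies closest to q."""
    distances = hellinger_to_corpus(q, C)
    order = np.argsort(distances)[:k]  # smallest distance first
    return order, distances[order]

# Hypothetical index: 1,000 movies with random distributions over 20 topics.
rng = np.random.default_rng(42)
C = rng.dirichlet(np.ones(20), size=1000)
q = C[7]  # query with a corpus movie, so it should rank itself first

indices, dists = top_k_similar(q, C, k=5)
print(indices, dists)
```

In practice the query movie itself would be dropped from the result list before showing the top 5.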
20. Conclusion
● Movie topics as efficient features for RS
○ represents movies by underlying semantic patterns
○ useful for capturing movie genre and mood.
○ but not so well with plot.
○ user written movie reviews are useful movie meta-data.
● The developed prototype
○ easy to add more movie meta-data
○ python allows scalability.
○ Topics as an explanation need further tuning.
21. Future directions
● Movie review preprocessing
○ bigram, trigrams.
○ create multi-word movie keywords or language construction
● Building complex topic models
○ Hierarchical LDA
○ author-topic model
■ include authorship information.
■ similarity between authors
22. Thank You
Questions ?
Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html
23. Extra slides
List of extra slides and notes
● Original LDA paper
● introduction to probabilistic topic modeling
● A. Huang’s Similarity measures for text document clustering
● Another good LDA description
● Integrating out multinomial parameters in LDA
● language construction in micro genres
movie similarity
and then we finish with Conclusion and future directions.
made for a popular streaming service
user reviews are read not only to know how good or bad the movie is, but also to know what the movie is about.
more than sentiment analysis.
gives audience point of view.
The extracted features are used here to find similar movies, but they could be used for other purposes as well.
It is a system because we can implement each part as a module.
finish one complete cycle, then repeat the cycle if time permits.
preprocessing is general to data processing, here it is text
text processing using NLTK toolkit
start with small examples.
chunking, named entity extraction.
tokenization, stopword removal, lemmatization.
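A small example of the first two steps, written as a dependency-free stand-in rather than the NLTK and gensim calls the prototype uses. The stopword list, the sample review, and the `preprocess` helper are hypothetical. Lemmatization is left out because it needs a lexical resource such as WordNet (for instance through NLTK's `WordNetLemmatizer`).

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK ships a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "in", "of", "to", "this", "was", "with"}

def preprocess(review, stopwords=STOPWORDS, min_len=2):
    """Lowercase, tokenize on alphabetic runs, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

tokens = preprocess("This is a dark, slow-burning thriller set in 1970s Berlin.")
print(tokens)  # ['dark', 'slow', 'burning', 'thriller', 'set', 'berlin']
```

Note that this crude tokenizer splits "slow-burning" into two tokens and throws away "1970s", which is one reason multi-word keywords are worth handling explicitly.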
before starting, let's understand document representation.
Document representation (DR) is an important part of information retrieval.
The main intuition is that documents exhibit multiple topics.
Simple intuition: documents exhibit multiple topics, and so do the reviews of a movie, provided preprocessing removes irrelevant words.
Not all the topics of a review are important.
Example
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
discover the hidden themes from the collection.
annotate the documents according to those themes.
use annotations to organize, summarize, search, form predictions
LDA is a statistical model of document collections that tries to capture this intuition
after training the LDA model, we can look at the generated topics. notice each detail here
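the Hellinger distance between two discrete distributions P = (p_1, ..., p_k) and Q = (q_1, ..., q_k) is defined as
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
It is important to note that for cosine similarity a higher value means more similarity, whereas for Hellinger distance a smaller value means more similarity.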
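The opposite directions of the two measures can be checked on a small example. The sketch below assumes the standard Hellinger distance with the 1/sqrt(2) normalization, which keeps the value between 0 and 1; the three topic distributions are made up for illustration.

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; higher means more similar."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions; lower means more similar."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical topic distributions over three topics.
movie_a = [0.7, 0.2, 0.1]
movie_b = [0.6, 0.3, 0.1]  # close to movie_a
movie_c = [0.0, 0.1, 0.9]  # far from movie_a

print(cosine_similarity(movie_a, movie_b), cosine_similarity(movie_a, movie_c))
print(hellinger_distance(movie_a, movie_b), hellinger_distance(movie_a, movie_c))
```

When ranking, this means sorting in descending order for cosine similarity and in ascending order for Hellinger distance.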
start with: we developed the prototyping system
Useful for capturing movie genre and mood information.
the system could be scaled to 10k movies with some effort.
user-written movie reviews contain information about movies.