Efficient Features for 
Movie Recommendation 
Systems 
Project presentation 
Suvir Bhargav
Outline 
● Motivation and why movie reviews 
● Problem statement 
● How? The overall system 
● Text preprocessing approaches 
● Postprocessing: movie topics from a reviews corpus 
● Similarity 
● Experimental setup and results
Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html 
Motivation
Motivation 
● Movie genres are not enough. 
● Classify movies by 
○ keywords 
○ moods 
○ IMDb ratings 
○ micro genres
micro genres 
source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/
Why movie reviews? 
Source: a sample user-written movie review from IMDb
Problem statement 
● Feature extraction from user reviews of 
movies 
● Use extracted features to find similar 
movies.
The overall system 
Movie reviews corpus 
● preprocessing 
○ tokenization, stopword removal, lemmatization. 
● post processing 
○ topic modeling: movie topics from a reviews corpus 
● similarity measure 
○ return movies with similar topic distributions
Text preprocessing 
Tokenization, stopword removal, lemmatization. 
Simple information extraction 
Figure credit: NLTK book.
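The three preprocessing steps named on this slide can be sketched in a few lines. This is a minimal standard-library illustration, not the NLTK pipeline used in the project: the stopword list is a toy one, and `crude_lemma` is a hypothetical suffix stripper standing in for a real lemmatizer.

```python
import re

# Toy stopword list (an assumption); a real system would use a full list.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "in", "it", "this", "to"}


def crude_lemma(token):
    """Hypothetical stand-in for a lemmatizer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Tokenize, drop stopwords, then reduce each token to a rough base form."""
    tokens = re.findall(r"[a-z]+", review.lower())      # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [crude_lemma(t) for t in tokens]             # crude lemmatization


print(preprocess("The robots attacked the detectives."))
```

A dictionary-based lemmatizer would handle irregular forms that this suffix rule gets wrong, which is one reason to prefer a library implementation in practice.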
Post processing 
Document representation: Vector Space Model (VSM) 
Picture credit: pyevolve
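As a rough illustration of the Vector Space Model, each document can be mapped to a vector of term counts over a shared vocabulary. The sketch below assumes documents are already tokenized; the function names and the two toy documents are made up for the example.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect every distinct term across all documents, in sorted order."""
    return sorted({token for doc in docs for token in doc})


def to_vector(doc, vocab):
    """Represent one document as term counts aligned with the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


# Two toy tokenized "reviews" (illustrative only).
docs = [["robot", "war", "robot"], ["love", "war"]]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Every document ends up with the same dimensionality, so any vector-based similarity measure can be applied to a pair of them.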
Post processing: generative model 
Source: David Blei’s slides
Post processing: LDA 
For each document in the collection, the words can be generated 
in a two-stage process: 
1) Randomly choose a distribution over topics. 
2) For each word in the document 
a) Randomly choose a topic from the distribution over 
topics in step 1. 
b) Randomly choose a word from the corresponding 
distribution over the vocabulary. 
Documents exhibit multiple topics.
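The two-stage process above can be simulated directly. The sketch below is only an illustration of the generative story: the two topics and their word probabilities are invented and held fixed, whereas full LDA also draws each topic's word distribution from a Dirichlet prior. The Dirichlet draw is built from gamma samples because the standard library has no direct Dirichlet sampler.

```python
import random

# Hypothetical topics: each maps words to probabilities (illustrative values).
TOPICS = [
    {"alien": 0.5, "spaceship": 0.3, "laser": 0.2},
    {"romance": 0.6, "wedding": 0.4},
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the two-stage process."""
    theta = sample_dirichlet(alpha, rng)  # step 1: distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # step 2a: pick a topic
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])      # step 2b: pick a word
    return theta, words


theta, words = generate_document(TOPICS, [0.5, 0.5], 20, random.Random(0))
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it infers the topic mixture of each document and the word distribution of each topic.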
Movie topics from a reviews corpus
Similarity Measure 
● Cosine Similarity 
● KL divergence 
● Hellinger distance
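Of the three measures listed, KL divergence is the only one that is not symmetric. A minimal sketch for two discrete topic distributions follows; the `eps` floor that guards against zero probabilities in `q` is an assumption of this example, not something the slides specify.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

Because the two directions give different values, a symmetric measure is often easier to use when ranking movies by similarity.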
Similarity Measure 
Cosine Similarity
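Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A small standard-library sketch, where returning 0.0 for an all-zero vector is a convention chosen for this example:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
```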
Similarity Measure 
Hellinger Distance
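For two discrete distributions P and Q, the Hellinger distance in its standard form is the Euclidean distance between the element-wise square roots, scaled by 1/sqrt(2) so that it lies in [0, 1]. A minimal sketch:

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between discrete distributions; smaller means more similar."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


print(hellinger_distance([0.7, 0.3], [0.7, 0.3]))  # identical distributions
print(hellinger_distance([1.0, 0.0], [0.0, 1.0]))  # disjoint support
```

Unlike KL divergence, this measure is symmetric and bounded, which makes distances comparable across movie pairs.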
The overall system: implementation 
Movie reviews corpus 
● preprocessing 
○ NLTK and gensim’s simple preprocessing. 
● post processing 
○ gensim’s Python wrapper for MALLET 
○ index the topic distributions of the query movie, q, and the 
1k-movie corpus, C. 
● similarity measure 
○ Python numpy implementation 
○ apply the distance metric to the indexed q and C. 
○ sort and pick the top 5 movies.
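The retrieval step above (apply a distance to q and every movie in C, sort, keep the top 5) can be sketched as follows. This is an illustration in plain Python rather than the numpy implementation the slide mentions; the movie names and topic distributions are hypothetical, and Hellinger distance is used as the example metric.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete topic distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def most_similar(query, corpus, k=5):
    """Return the k titles whose topic distributions are closest to the query."""
    ranked = sorted(corpus.items(), key=lambda item: hellinger(query, item[1]))
    return [title for title, _ in ranked[:k]]


# Hypothetical indexed corpus C: title -> topic distribution.
corpus = {
    "movie_a": [0.7, 0.2, 0.1],
    "movie_b": [0.6, 0.3, 0.1],
    "movie_c": [0.1, 0.1, 0.8],
}
query = [0.7, 0.2, 0.1]  # topic distribution of the query movie q

print(most_similar(query, corpus, k=2))
```

With a similarity such as cosine, the sort order would be reversed, since larger values then mean closer movies.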
Experimental setup 
Movie reviews corpus of 1k movies 
Review data source: IMDb
Experimental setup 
Evaluation criteria
Conclusion 
● Movie topics as efficient features for recommender systems (RS) 
○ represent movies by underlying semantic patterns. 
○ useful for capturing movie genre and mood. 
○ less effective at capturing plot. 
○ user-written movie reviews are useful movie meta-data. 
● The developed prototype 
○ easy to add more movie meta-data. 
○ Python allows scalability. 
○ using topics as explanations needs further tuning.
Future directions 
● Movie review preprocessing 
○ bigrams, trigrams. 
○ create multi-word movie keywords or language constructions 
● Building complex topic models 
○ hierarchical LDA 
○ author-topic model 
■ include authorship information. 
■ similarity between authors
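The bigram and trigram preprocessing listed under future directions amounts to sliding a window over the token list. A minimal sketch, assuming reviews are already tokenized; the example tokens are invented:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["feel", "good", "road", "movie"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Multi-word keywords would then be chosen from these candidates, for instance by keeping only phrases that occur frequently across the corpus.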
Thank You 
Questions?
Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html
Extra slides 
List of extra slides and notes 
● Original LDA paper 
● Introduction to probabilistic topic modeling 
● A. Huang’s Similarity measures for text document clustering 
● Another good LDA description 
● Integrating out multinomial parameters in LDA 
● Language construction in micro genres
LDA


Editor's notes

  1. Then movie similarity, and we finish with the conclusion and future directions.
  2. Made for a popular streaming service.
  3. User reviews are read not only to learn how good or bad a movie is, but also to learn what the movie is about. This goes beyond sentiment analysis and gives the audience's point of view.
  4. The extracted features are used here to find similar movies, but they could be used for other purposes as well.
  5. A modular system: we can implement each part as a module, finish one complete cycle, and then repeat the cycle if time permits.
  6. Preprocessing is common to all data processing; here it is text processing using the NLTK toolkit. Start with small examples: chunking, named entity extraction. Then tokenization, stopword removal, lemmatization.
  7. Before starting, let's understand document representation. Document representation (DR) is an important part of information retrieval.
  8. Main intuition: documents exhibit multiple topics, and so do movie reviews, provided preprocessing removes irrelevant words. Not all the topics of a review are important.
     Example:
     Sentences 1 and 2: 100% Topic A
     Sentences 3 and 4: 100% Topic B
     Sentence 5: 60% Topic A, 40% Topic B
     Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
     Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
     Goals: discover the hidden themes in the collection; annotate the documents according to those themes; use the annotations to organize, summarize, search, and form predictions.
  9. LDA is a statistical model of document collections that tries to capture this intuition.
  10. After training the LDA model, we can look at the generated topics. Notice each detail here.
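  11. The Hellinger distance between P and Q is defined, in its standard discrete form, as $H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. It is important to note that for cosine similarity a higher value means more similar, whereas for Hellinger distance a smaller value means more similar.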
  12. Start with: we developed the prototype system. It is useful for capturing movie genre and mood information. The system could be scaled to 10k movies with some effort. User-written movie reviews contain information about movies.