Anime Recommendation
Executive Summary
• Problem Statement
• Business Values
• Project Requirements
Problem Statement
• What rating would a user give a particular anime?
• How popular is each anime, based on its followers?
• How many groups of anime are there, based on genre?
• Which anime should be recommended to a user, based on their
preferences?
Business Values
• Able to choose anime that match current viewers
• Able to push advertisements to potential viewers
• Able to upsell similar products for each anime
• Able to accurately predict anime rating and popularity for license
acquisition
Requirements
• Anime data and user ratings
• Recommendation algorithm using ALS
• Clustering algorithm using K-Means
• Model evaluation using RMSE
Data
• From https://www.kaggle.com/CooperUnion/anime-recommendations-database
• Contains user preference data from 73,516 users on 12,294 anime
• Each user can add anime to their completed list and give it a
rating; this data set is a compilation of those ratings
Data
• 2 files, anime.csv and rating.csv
• Data volume
  • 12,294 rows for anime.csv
  • 7,813,737 rows for rating.csv
Schema
• anime.csv
  • anime_id: myanimelist.net's unique id identifying an anime
  • name: full name of the anime
  • genre: comma-separated list of genres for this anime
  • type: movie, TV, OVA, etc.
  • episodes: number of episodes in this show (1 if movie)
  • rating: average rating out of 10 for this anime
  • members: number of community members that are in this anime's "group"
Schema
• rating.csv
  • user_id: non-identifiable, randomly generated user id
  • anime_id: the anime that this user has rated
  • rating: rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating)
Feature
• Use the original dataset to build the recommendation model
• Extract the unique genres from the genre column in anime.csv to build
the clustering model
Feature
• Recommendation
  • anime_id
  • rating, also used as target
  • user_id
Feature
• Clustering
  • anime_id
  • Pivoted genres (Action, Adventure, Comedy, Drama, …)
  • type
  • episodes
  • rating
  • members
• Use predicted cluster as target
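A minimal sketch of what "pivoted genres" could look like: each unique genre becomes its own 0/1 column. The sample rows and the helper name `pivot_genres` are hypothetical; only the column names come from the anime.csv schema, and the slides do not show how the pivot was actually done.

```python
import csv
import io

# Hypothetical sample rows using the anime.csv columns from the Schema slide.
SAMPLE = """anime_id,name,genre,type,episodes,rating,members
1,Sample A,"Action, Adventure",TV,24,8.1,1000
2,Sample B,"Comedy, Drama",Movie,1,7.4,500
3,Sample C,Action,OVA,2,6.9,200
"""


def pivot_genres(rows):
    """Return (sorted unique genres, rows with one 0/1 column per genre)."""
    # Split the comma-separated genre string of every row into a clean list.
    split = [[g.strip() for g in row["genre"].split(",") if g.strip()] for row in rows]
    genres = sorted({g for gs in split for g in gs})
    pivoted = []
    for row, gs in zip(rows, split):
        # Copy every column except the original genre string.
        out = {k: v for k, v in row.items() if k != "genre"}
        for g in genres:
            out[g] = 1 if g in gs else 0
        pivoted.append(out)
    return genres, pivoted


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
genres, pivoted = pivot_genres(rows)
print(genres)
print(pivoted[0])
```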
Running Prototyping Experiment
• Get data
• Data pre-processing
• Feature engineering
• Train the model
• Model evaluation
Get Data
• The dataset was downloaded from
https://www.kaggle.com/CooperUnion/anime-recommendations-database
• Data is in comma-separated values (CSV) file format
• See data information in the “Data” section
Data Pre-Processing
• The retrieved data are well-formed
• Some NULL values were found in rating
• Unknown episode counts are represented as “Unknown”
• Rows with NULL and/or Unknown values were filtered out
• About 500 rows were filtered out in total
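The filtering step could be sketched as follows. This assumes a NULL shows up as an empty CSV field and that "Unknown" appears as that literal string; the sample rows and the helper name `is_clean` are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; row 2 has an Unknown episode count, row 3 an empty rating.
SAMPLE = """anime_id,name,genre,type,episodes,rating,members
1,Sample A,Action,TV,24,8.1,1000
2,Sample B,Comedy,TV,Unknown,7.4,500
3,Sample C,Drama,Movie,1,,200
"""


def is_clean(row):
    """Keep a row only if no field is missing, empty, or the literal 'Unknown'."""
    return all(v is not None and v.strip() not in ("", "Unknown") for v in row.values())


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
clean = [r for r in rows if is_clean(r)]
print(len(rows) - len(clean), "rows filtered out")
```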
Feature Engineering
• Use the original data schema
• Process only the data in rating.csv
• Use anime_id, user_id and rating as features
• rating is also used as the target
Train the Model
• Process only the data in rating.csv
• Ratio of train-to-test data is 80:20
• Use the ALS algorithm to build a rating prediction model
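To illustrate the idea, here is a deliberately small sketch of an 80:20 split followed by alternating least squares with a single latent factor per user and per anime. The slides do not say which ALS implementation or hyperparameters were used, so the function names, the rank of 1, the regularisation value and the sample ratings are all assumptions. The `predict` helper returns NaN for a user or anime that is absent from the training split, which is one common reason a test row cannot be scored.

```python
import math
import random


def train_test_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle (user_id, anime_id, rating) triples and split them, 80:20 by default."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def als_rank1(train, reg=0.1, iterations=20):
    """Alternating least squares with one latent factor per user and per anime."""
    users = {u for u, _, _ in train}
    items = {i for _, i, _ in train}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Hold the anime factors fixed and solve each user's regularised least squares.
        for u in users:
            num = sum(r * item_f[i] for uu, i, r in train if uu == u)
            den = reg + sum(item_f[i] ** 2 for uu, i, _ in train if uu == u)
            user_f[u] = num / den
        # Hold the user factors fixed and solve for each anime.
        for i in items:
            num = sum(r * user_f[u] for u, ii, r in train if ii == i)
            den = reg + sum(user_f[u] ** 2 for u, ii, _ in train if ii == i)
            item_f[i] = num / den
    return user_f, item_f


def predict(user_f, item_f, user_id, anime_id):
    """Predicted rating, or NaN when the user or anime was not seen in training."""
    if user_id not in user_f or anime_id not in item_f:
        return math.nan
    return user_f[user_id] * item_f[anime_id]


# Hypothetical ratings that follow an exact rank-1 pattern, for illustration only.
user_taste = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5, 5: 3.0}
anime_quality = {10: 1.0, 20: 2.0, 30: 3.0, 40: 3.2}
ratings = [(u, a, user_taste[u] * anime_quality[a]) for u in user_taste for a in anime_quality]

train, test = train_test_split(ratings)
user_f, item_f = als_rank1(train)
for u, a, r in test:
    print(u, a, r, predict(user_f, item_f, u, a))
```

A production model would use more than one latent factor, which turns each scalar division above into a small linear solve.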
Model Evaluation
• Data in anime.csv is used to map anime_id to a human-readable
name
• Predicted ratings are floating-point values
• RMSE is used as the evaluation metric
• Some rows of test data cannot be predicted; the result is “NaN”
• NaN (Not-a-Number) results were filtered out
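The evaluation step can be sketched as an RMSE that skips NaN predictions. The function name `rmse` and the sample pairs are hypothetical; the slides only state that NaN rows were filtered out before scoring.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs, skipping NaN predictions."""
    valid = [(a, p) for a, p in pairs if not math.isnan(p)]
    if not valid:
        raise ValueError("no valid predictions to evaluate")
    return math.sqrt(sum((a - p) ** 2 for a, p in valid) / len(valid))


# Two usable predictions, each off by 1, and one NaN that is ignored.
pairs = [(8.0, 7.0), (6.0, 7.0), (9.0, float("nan"))]
print(rmse(pairs))  # 1.0
```

Dropping NaN rows means the reported RMSE covers only the test rows the model could score.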
Anime Recommendation
Part 2
Contents
• Clustering model with K-Means
• Real-time data processing
• Visualization
Clustering with K-Means
Environment
• CRAN R 3.4.2
• Anime data file (anime.csv)
• Genres distance file (distance.csv)
Build a Clustering Model
• Try building with 5 to 10 clusters
• Use the distance.csv file to determine the distance
• Visualize the clusters
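The slides built the model in R; the sketch below uses Python only to show the mechanics of trying 5 to 10 clusters. It is a plain Lloyd's-algorithm K-Means over hypothetical 0/1 genre vectors with squared Euclidean distance. It does not read distance.csv, whose format the slides do not describe, and the function names are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def within_cluster_ss(points, centroids, labels):
    """Total squared distance from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


# Hypothetical data: 30 anime described by four 0/1 genre flags.
rng = random.Random(1)
points = [tuple(rng.randint(0, 1) for _ in range(4)) for _ in range(30)]

for k in range(5, 11):
    centroids, labels = k_means(points, k)
    print(k, within_cluster_ss(points, centroids, labels))
```

Comparing the within-cluster sum of squares across the values of k is one common way to choose among them.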
Discussion
• Distance values can be determined as described in the discussion “How to
produce a pretty plot of the results of k-means cluster analysis?”
(https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis)
• Distance values in anime clustering should be normalized
• They can be the percentage of running time devoted to each genre
• Example: if action scenes run for 12 of 24 minutes, the
distance for action is 50%
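The normalization in the example can be written as a one-line ratio. The helper name `genre_shares` and the Comedy figure are hypothetical; only the 12-of-24-minutes action case comes from the slide.

```python
def genre_shares(minutes_by_genre, total_minutes):
    """Convert minutes of screen time per genre into fractions of the episode length."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return {genre: minutes / total_minutes for genre, minutes in minutes_by_genre.items()}


shares = genre_shares({"Action": 12, "Comedy": 6}, total_minutes=24)
print(shares["Action"])  # 0.5, i.e. the 50% from the slide's example
```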
Real-time Data Processing
Environment
• Web API
• Kafka
• Spark Streaming
Environment
[Diagram labels: Client (×3), Request, Response, Producer, Consumer]
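One plausible reading of the request/response and producer/consumer flow is sketched below. It uses an in-memory `queue.Queue` as a stand-in for a Kafka topic and a thread as a stand-in for the Spark Streaming job. None of this is the deck's actual code: the function names, message fields and wiring are assumptions made for illustration.

```python
import json
import queue
import threading

topic = queue.Queue()  # in-memory stand-in for a Kafka topic
results = []           # events seen by the consumer


def handle_request(user_id, anime_id, rating):
    """What a Web API handler might do: publish the event, then respond at once."""
    topic.put(json.dumps({"user_id": user_id, "anime_id": anime_id, "rating": rating}))
    return {"status": "accepted"}


def consume():
    """Stand-in for the streaming job: read events until a None sentinel arrives."""
    while True:
        message = topic.get()
        if message is None:
            break
        results.append(json.loads(message))


worker = threading.Thread(target=consume)
worker.start()
response = handle_request(1, 20, 8)
handle_request(2, 20, 6)
topic.put(None)  # tell the consumer to stop
worker.join()
print(response, results)
```

Decoupling the API from the processing job through a queue lets the client get a response without waiting for the downstream computation.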
Demonstration
Visualization
Demonstration
