Data cleaning on applications’ rating and review from Google Play Store
Towards Data Cleaning
Xiaotong Hu
Shang Li
Yi Chun Liu
Yu-Chen Su
Contents
1. Datasets and Use Cases
2. Data Cleaning Method and Process
3. Building Database
4. Text Data Cleaning
5. Workflow Model
6. Future Work
Datasets and Use Cases
Datasets
● Web crawler data
● Google Play Store Rating
● Google Play Store Users Review
Use Cases
● Product managers can review applications through the rating scores and comments on the Google Play Store
● Continuously optimize application products
Data Cleaning Method and Process

| Dataset | Column | Method |
|---|---|---|
| Google Play Store | Rating | 1. Replace “NaN” with “null” 2. Transform to number |
| Google Play Store | Reviews | 1. Transform to number |
| Google Play Store | Size | 1. Convert “M” (megabytes) to kilobytes 2. Remove “k” 3. Replace “Varies with device” with “00000” 4. Transform to number |
| Google Play Store | Installs | 1. Remove “+” and “,” 2. Transform to number |
| Google Play Store | Type | 1. Replace “NaN” with “null” 2. Create dummy variable column 3. Transform dummy variable column to number |
| Google Play Store | Price | 1. Remove “$” 2. Transform to number |
| Google Play Store | Genres | 1. Split into two columns by “;” |
| Google Play Store User Reviews | Sentiment Polarity | 1. Replace empty cells with “Null” 2. Transform to number |
| Google Play Store User Reviews | Sentiment Subjectivity | 1. Replace empty cells with “Null” 2. Transform to number |
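The slides apply these column rules in OpenRefine. As a rough illustration only, the same rules can be sketched in plain Python. The function names are hypothetical, the factor of 1024 between "M" and "k" is an assumption (the slides do not state it), and the 0/1 coding of the Type dummy variable is also an assumption.

```python
def clean_rating(value):
    """Return the rating as a float, or None for "NaN" or empty cells."""
    text = value.strip()
    if text == "" or text.lower() == "nan":
        return None
    return float(text)


def clean_size(value):
    """Return the app size in kilobytes as an integer."""
    text = value.strip()
    if text == "Varies with device":
        return 0  # the slide's "00000" placeholder, read as the number 0
    if text.endswith("M"):
        # Assumption: 1 M = 1024 k; the slides do not give the factor.
        return round(float(text[:-1]) * 1024)
    if text.endswith("k"):
        return round(float(text[:-1]))
    raise ValueError(f"unexpected size value: {value!r}")


def clean_installs(value):
    """Strip "+" and "," and return the install count as an integer."""
    return int(value.replace("+", "").replace(",", ""))


def clean_price(value):
    """Strip "$" and return the price as a float."""
    return float(value.replace("$", ""))


def type_dummy(value):
    """Encode Type as a number; the Free=0 / Paid=1 coding is an assumption."""
    text = value.strip()
    if text == "" or text.lower() == "nan":
        return None
    return {"Free": 0, "Paid": 1}[text]


def split_genres(value):
    """Split "A;B" into two columns; the second is None when absent."""
    first, _, second = value.partition(";")
    return first, (second or None)
```

For example, `clean_installs("10,000+")` gives `10000` and `split_genres("Tools")` gives `("Tools", None)`.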
Data Cleaning - Special Cases
● Remove rows whose values do not match their columns.
● The Translated_Review column is cleaned with Python, since OpenRefine is not efficient at removing punctuation.
# Example comment
I like eat delicious food. That's I'm cooking food myself, case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"
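The first special case, dropping rows that do not line up with the columns, could be sketched as follows. This assumes that "mismatched" means a row has a different number of fields than the header, which the slides do not spell out. The function name and the three-column sample are hypothetical.

```python
import csv
import io


def split_aligned_rows(csv_text):
    """Separate rows that match the header width from rows that do not."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    good, bad = [], []
    for row in reader:
        (good if len(row) == len(header) else bad).append(row)
    return header, good, bad


# Hypothetical sample: the second data row is missing its Category value,
# so every later field is shifted one column to the left.
sample = (
    "App,Category,Rating\n"
    "Sketch Pad,ART_AND_DESIGN,4.1\n"
    "Photo Frame,1.9\n"
)
header, good, bad = split_aligned_rows(sample)
print(len(good), len(bad))
```

A row with the right field count but wrong value types would pass this check, so a type check per column may still be needed.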
Import Data to MySQL
After using OpenRefine to clean up the data, we are able to import the data into a MySQL database.
Schema & Datatype:

GooglePlayStore
| Column | Datatype |
|---|---|
| App | TEXT |
| Category | TEXT |
| Rating | DOUBLE |
| Reviews | INT |
| Size | INT |
| Installs | INT |
| Type | TEXT |
| Typedummy | INT |
| Price | INT |
| ContentRating | TEXT |
| Genres | TEXT |
| Genres1 | TEXT |
| Genres2 | TEXT |
| LastUpdated | DATETIME |
| CurrentVer | TEXT |
| AndroidVer | TEXT |

Reviews
| Column | Datatype |
|---|---|
| App | TEXT |
| Translated_Review | TEXT |
| Sentiment | TEXT |
| Sentiment_Polarity | DOUBLE |
| Sentiment_Subjectivity | DOUBLE |
Integrity Constraints Violation Check
Rules:
● if Sentiment_Polarity > 0 => Sentiment is POSITIVE
● if Sentiment_Polarity < 0 => Sentiment is NEGATIVE
● if Sentiment_Polarity = 0 => Sentiment is NEUTRAL
● if Sentiment_Polarity IS NULL => Sentiment IS NULL
No violation found
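The four rules can be expressed as a small check. This is a minimal Python sketch, not the query the authors ran against MySQL. The function names and sample rows are hypothetical, and labels are compared case-insensitively on the assumption that the stored values may not be upper case.

```python
def expected_sentiment(polarity):
    """Map a polarity score to the label the rules require."""
    if polarity is None:
        return None
    if polarity > 0:
        return "POSITIVE"
    if polarity < 0:
        return "NEGATIVE"
    return "NEUTRAL"


def find_violations(rows):
    """Return the rows whose Sentiment disagrees with Sentiment_Polarity."""
    violations = []
    for row in rows:
        label = row["Sentiment"]
        actual = label.upper() if label is not None else None
        if actual != expected_sentiment(row["Sentiment_Polarity"]):
            violations.append(row)
    return violations


# Hypothetical rows, for illustration only.
rows = [
    {"Sentiment": "Positive", "Sentiment_Polarity": 0.5},
    {"Sentiment": "Neutral", "Sentiment_Polarity": 0.0},
    {"Sentiment": None, "Sentiment_Polarity": None},
    {"Sentiment": "Negative", "Sentiment_Polarity": 0.25},  # breaks the rule
]
print(find_violations(rows))
```

An empty result from `find_violations` corresponds to the "no violation found" outcome reported on the slide.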
Join Two Tables into One
SQL Syntax: Joined Table
● 70,471 observations
● 20 variables
● 17.7 MB in CSV
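The SQL statement itself is not reproduced in the slide text. The sketch below shows one plausible form of the join, under two assumptions: the tables are joined on App, and an inner join is used. It runs on Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows are hypothetical. The 16 columns of GooglePlayStore plus the 5 columns of Reviews, with the shared App column counted once, give the 20 variables reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL database
conn.executescript("""
CREATE TABLE GooglePlayStore (
    App TEXT, Category TEXT, Rating DOUBLE, Reviews INT, Size INT,
    Installs INT, Type TEXT, Typedummy INT, Price INT, ContentRating TEXT,
    Genres TEXT, Genres1 TEXT, Genres2 TEXT, LastUpdated DATETIME,
    CurrentVer TEXT, AndroidVer TEXT
);
CREATE TABLE Reviews (
    App TEXT, Translated_Review TEXT, Sentiment TEXT,
    Sentiment_Polarity DOUBLE, Sentiment_Subjectivity DOUBLE
);
""")

# Hypothetical sample rows, for illustration only.
conn.execute(
    "INSERT INTO GooglePlayStore (App, Category, Rating) VALUES (?, ?, ?)",
    ("Sketch Pad", "ART_AND_DESIGN", 4.1),
)
conn.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?, ?)",
    [
        ("Sketch Pad", "Great app", "Positive", 0.8, 0.75),
        ("Sketch Pad", "Crashes often", "Negative", -0.4, 0.6),
        ("Unlisted App", "Fine", "Positive", 0.4, 0.5),  # no matching app
    ],
)

# Assumed join: one output row per review whose App exists in both tables.
query = """
SELECT g.*, r.Translated_Review, r.Sentiment,
       r.Sentiment_Polarity, r.Sentiment_Subjectivity
FROM GooglePlayStore AS g
INNER JOIN Reviews AS r ON g.App = r.App
"""
cursor = conn.execute(query)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(len(columns), len(rows))
```

With an inner join, reviews for apps missing from GooglePlayStore are dropped; a left join from Reviews would keep them with empty app attributes.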
Text Review Data Cleaning
● Find the keyword frequency within each sentiment category
● Python - Natural Language Toolkit (NLTK)
Step 1: Remove punctuation
Remove string punctuation, including !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example Comment
Before: [‘I like eat delicious food. That’s I’m cooking food myself, case “10 Best Foods” helps lot, also “Best Before (Shelf Life)”’]
After: [‘I like eat delicious food Thats Im cooking food myself case 10 Best Foods helps lot also Best Before Shelf Life’]
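Step 1 can be sketched with the standard library alone. This assumes the punctuation set is Python's `string.punctuation`, which matches the characters listed above. The function name is hypothetical.

```python
import string


def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


comment = (
    "I like eat delicious food. That's I'm cooking food myself, "
    'case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"'
)
print(remove_punctuation(comment))
# I like eat delicious food Thats Im cooking food myself case 10 Best Foods helps lot also Best Before Shelf Life
```

`string.punctuation` covers ASCII characters only, so typographic quotes such as ’ or “ would need to be normalized or added to the deletion set separately.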
Text Review Data Cleaning
Step 2: Tokenizer
Splits a string into substrings using a regular expression (the tokens are also lowercased)
['i', 'like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']
Step 3: Remove stop words
Remove common words that carry too little meaning to be useful in search queries (here 'i', 'myself', and 'before')
['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'shelf', 'life']
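Steps 2 and 3 can be illustrated without NLTK. The sketch below uses the `re` module for tokenizing and a small hand-picked stop word set. Both are stand-ins: the slides use NLTK, and its English stop word list is much longer than the hypothetical subset shown here.

```python
import re

# Hand-picked subset for illustration; a hypothetical stand-in for NLTK's
# much longer English stop word list.
STOP_WORDS = {
    "i", "me", "my", "myself", "the", "a", "an", "and",
    "before", "after", "to", "of",
}


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = (
    "I like eat delicious food Thats Im cooking food myself case "
    "10 Best Foods helps lot also Best Before Shelf Life"
)
tokens = tokenize(cleaned)
filtered = remove_stop_words(tokens)
print(filtered)
```

On the example comment this reproduces the two token lists shown above, with 'i', 'myself', and 'before' dropped in the second step.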
Text Review Data Cleaning
Step 4: Stemming & Lemmatization
Stemming
- Stemming reduces words to their stems, such as treating "cats" as "cat" and "effective" as "effect"
- After stemming, a word may no longer express its complete meaning
vs.
Lemmatization
- Lemmatization transforms a word into its dictionary form (lemma), such as treating “drove” as “drive” and “driving” as “drive”
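The contrast between the two methods can be shown with a toy example. The code below is a deliberately simplified sketch, not NLTK's stemmer or lemmatizer: the suffix rules and the lemma table are hypothetical and cover only the words discussed here. It illustrates that stemming strips suffixes by rule, while lemmatization looks words up in a vocabulary.

```python
# Hypothetical minimal rule set; real stemmers use many more rules.
SUFFIXES = ("ive", "ing", "s")

# Hypothetical lookup table; real lemmatizers consult a full dictionary.
LEMMAS = {"drove": "drive", "driving": "drive", "driven": "drive"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word."""
    return LEMMAS.get(word, word)


for word in ("cats", "effective", "driving", "drove"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

Here `toy_stem("driving")` yields "driv", which is not a real word, while `toy_lemmatize("driving")` yields "drive". This is the loss of meaning that the stemming bullet warns about.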
Workflow Model
YesWorkflow Model (OpenRefine) – Google Play Store Rating
YesWorkflow Model (OpenRefine) – Google Play Store Users Review
Future Work
● Everyone is responsible for a different part and uses different tools
● Because each tool has its own constraints, it is difficult to cooperate with each other during the data cleaning process
● Study how to improve cooperation efficiency when everyone uses different tools
Thank you for listening
Any Questions?
