Data cleaning on the rating and review from Google Play Store
1. Data cleaning on applications’ rating and review from Google Play Store
Towards Data Cleaning
Xiaotong Hu
Shang Li
Yi Chun Liu
Yu-Chen Su
2. Contents
● Datasets and Use Cases
● Data Cleaning Method and Process
● Building Database
● Text Data Cleaning
● Workflow Model
● Future Work
3. Datasets and Use Cases
Datasets
● Web crawler data
● Google Play Store Rating
● Google Play Store Users Review
Use Cases
● Product managers can review their applications through the rating scores and comments on Google Play Store
● Continuously optimize application products
4. Data Cleaning Method and Process
Dataset | Column | Method
Google Play Store | Rating | 1. Replace “NaN” with “null” 2. Transform to number
Google Play Store | Reviews | 1. Transform to number
Google Play Store | Size | 1. Convert “M” (megabyte) values to kilobytes 2. Remove “k” 3. Replace “Varies with device” with “00000” 4. Transform to number
Google Play Store | Installs | 1. Remove “+” and “,” 2. Transform to number
Google Play Store | Type | 1. Replace “NaN” with “null” 2. Create dummy variable column 3. Transform dummy variable column to number
Google Play Store | Price | 1. Remove “$” 2. Transform to number
Google Play Store | Genres | 1. Split into two columns by “;”
Google Play Store User Reviews | Sentiment Polarity | 1. Replace empty cells with “null” 2. Transform to number
Google Play Store User Reviews | Sentiment Subjectivity | 1. Replace empty cells with “null” 2. Transform to number
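The column rules above were applied in OpenRefine. Purely as an illustration, the same rules might look like this in plain Python. The helper names, the 1 M = 1024 k conversion factor, and the sample values in the usage note are assumptions and do not come from the slides.

```python
# Sketch of the per-column cleaning rules (hypothetical helper names).
KB_PER_MB = 1024  # assumption: the slides do not state the conversion factor


def clean_rating(raw):
    """Return the rating as a float, or None for "NaN" or empty cells."""
    if raw is None or raw.strip() in ("", "NaN"):
        return None
    return float(raw)


def clean_size(raw):
    """Return the size in kilobytes as an integer."""
    if raw == "Varies with device":
        return int("00000")  # the slides' placeholder, which becomes 0
    if raw.endswith("M"):
        return round(float(raw[:-1]) * KB_PER_MB)
    if raw.endswith("k"):
        return round(float(raw[:-1]))
    return round(float(raw))


def clean_installs(raw):
    """Remove "+" and "," and return the install count as an integer."""
    return int(raw.replace("+", "").replace(",", ""))


def clean_price(raw):
    """Remove "$" and return the price as a float."""
    return float(raw.replace("$", ""))


def split_genres(raw):
    """Split a genre string on ";" into two columns (second may be None)."""
    first, _, second = raw.partition(";")
    return first, (second or None)
```

For example, `clean_size("19M")` gives 19456 under the assumed conversion factor, and `split_genres("Art & Design;Pretend Play")` gives the pair `("Art & Design", "Pretend Play")`.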
5. Data Cleaning - Special Cases
● Remove rows whose values do not line up with the columns.
● The Translated_Review column is cleaned with Python, since OpenRefine is not efficient at removing punctuation
# Example comment
I like eat delicious food. That's I'm cooking
food myself, case "10 Best Foods" helps lot,
also "Best Before (Shelf Life)"
6. Import Data to MySQL
After using OpenRefine to clean up the data, we are able to import the data into a MySQL database
Schema & Datatype:
GooglePlayStore
● App TEXT
● Category TEXT
● Rating DOUBLE
● Reviews INT
● Size INT
● Installs INT
● Type TEXT
● Typedummy INT
● Price INT
● ContentRating TEXT
● Genres TEXT
● Genres1 TEXT
● Genres2 TEXT
● LastUpdated DATETIME
● CurrentVer TEXT
● AndroidVer TEXT
Reviews
● App TEXT
● Translated_Review TEXT
● Sentiment TEXT
● Sentiment_Polarity DOUBLE
● Sentiment_Subjectivity DOUBLE
7. Integrity Constraints Violation Check
Rules:
● if Sentiment_Polarity > 0 => Sentiment is POSITIVE
● if Sentiment_Polarity < 0 => Sentiment is NEGATIVE
● if Sentiment_Polarity = 0 => Sentiment is NEUTRAL
● if Sentiment_Polarity IS NULL => Sentiment IS NULL
Result: no violation found
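The slides do not show how the check was run. A minimal sketch of the four rules in Python, assuming each review row is a dictionary with `Sentiment` and `Sentiment_Polarity` keys and that labels are compared case-insensitively (the function names and the sample rows are made up for illustration):

```python
def expected_sentiment(polarity):
    """Derive the sentiment label that the rules require for a polarity."""
    if polarity is None:
        return None
    if polarity > 0:
        return "POSITIVE"
    if polarity < 0:
        return "NEGATIVE"
    return "NEUTRAL"


def find_violations(rows):
    """Return the rows whose stored label disagrees with the rules."""
    violations = []
    for row in rows:
        label = row["Sentiment"]
        actual = label.upper() if label is not None else None
        if actual != expected_sentiment(row["Sentiment_Polarity"]):
            violations.append(row)
    return violations


sample_rows = [  # made-up rows, not taken from the dataset
    {"Sentiment": "Positive", "Sentiment_Polarity": 0.5},
    {"Sentiment": "Negative", "Sentiment_Polarity": -0.2},
    {"Sentiment": "Neutral", "Sentiment_Polarity": 0.0},
    {"Sentiment": None, "Sentiment_Polarity": None},
]
violations = find_violations(sample_rows)
```

An empty `violations` list corresponds to the "no violation found" result.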
8. Join Two Tables into One
Joined table (built with a SQL join):
● 70,471 observations
● 20 variables
● 17.7 MB in CSV
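The SQL statement itself is not preserved in this transcript. The sketch below shows one way such a join could look, using Python's built-in `sqlite3` module as a stand-in for MySQL. The inner join, the `App` join key, the reduced column set, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GooglePlayStore (App TEXT, Category TEXT, Rating DOUBLE);
    CREATE TABLE Reviews (App TEXT, Translated_Review TEXT,
                          Sentiment TEXT, Sentiment_Polarity DOUBLE);
    -- Made-up sample rows, not taken from the dataset.
    INSERT INTO GooglePlayStore VALUES ('App A', 'TOOLS', 4.0);
    INSERT INTO GooglePlayStore VALUES ('App B', 'GAME', 3.5);
    INSERT INTO Reviews VALUES ('App A', 'Works well', 'Positive', 0.5);
    INSERT INTO Reviews VALUES ('App A', 'Too slow', 'Negative', -0.3);
    """
)

# Assumed join: one output row per review, with the app's columns attached.
cursor = conn.execute(
    """
    SELECT g.App AS App, g.Category AS Category, g.Rating AS Rating,
           r.Translated_Review AS Translated_Review,
           r.Sentiment AS Sentiment,
           r.Sentiment_Polarity AS Sentiment_Polarity
    FROM GooglePlayStore AS g
    JOIN Reviews AS r ON g.App = r.App
    ORDER BY r.Sentiment_Polarity DESC
    """
)
joined_rows = cursor.fetchall()
joined_columns = [col[0] for col in cursor.description]
conn.close()
```

Joining on `App` is consistent with the 20 variables reported: 16 columns from GooglePlayStore plus 5 from Reviews, with the shared `App` column counted once.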
9. Text Review Data Cleaning
● Figure out the keyword frequency for each sentiment category
● Python - Natural Language Toolkit (NLTK)
Step 1: Remove punctuation
Remove string punctuation, including !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example comment (before):
['I like eat delicious food. That's I'm cooking food myself, case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"']
After removing punctuation:
['I like eat delicious food Thats Im cooking food myself case 10 Best Foods helps lot also Best Before Shelf Life']
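Step 1 can be sketched with the standard library alone. This assumes the punctuation set is Python's `string.punctuation`, which the wording "string punctuation" suggests but the slides do not confirm; the variable names are illustrative.

```python
import string

comment = (
    "I like eat delicious food. That's I'm cooking food myself, "
    'case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"'
)

# Build a translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)
cleaned = comment.translate(remove_punctuation)
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as “ ” would need to be handled separately.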
10. Text Review Data Cleaning
Step 2: Tokenizer
Splits a string into substrings using a regular expression
Output (tokens are also lowercased):
['i', 'like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']
Step 3: Remove stop words
Remove common words that carry little meaning for search queries
Output:
['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'shelf', 'life']
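Steps 2 and 3, followed by the keyword count they feed into, can be sketched as follows. To stay self-contained, this uses the standard `re` module and a small hand-written stop-word set instead of NLTK; the `\w+` pattern and the contents of `STOP_WORDS` are assumptions, and NLTK's English stop-word list is much longer.

```python
import re
from collections import Counter

cleaned = (
    "I like eat delicious food Thats Im cooking food myself case "
    "10 Best Foods helps lot also Best Before Shelf Life"
)

# Step 2: lowercase the text and split it into word tokens with a regex.
tokens = re.findall(r"\w+", cleaned.lower())

# Step 3: drop stop words (tiny illustrative set, not NLTK's list).
STOP_WORDS = {"i", "me", "my", "myself", "we", "the", "a", "an",
              "and", "before", "after", "to", "of", "in"}
filtered = [token for token in tokens if token not in STOP_WORDS]

# Keyword frequency for this one review; in practice the counts would be
# accumulated separately for each sentiment category.
keyword_counts = Counter(filtered)
```

With this stop-word set, `filtered` matches the Step 3 output on the slide, and `keyword_counts` reports 2 occurrences each for "food" and "best".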
11. Text Review Data Cleaning
Step 4: Stemming & Lemmatization
Stemming
- Stemming reduces words to their stems, such as treating "cats" as "cat" and "effective" as "effect"
- After stemming, a word may no longer express its complete meaning
vs.
Lemmatization
- Lemmatization transforms a word into its base (dictionary) form, such as treating “drove” as “drive” and “driving” as “drive”
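The difference can be illustrated with a deliberately tiny toy example. This is not NLTK's implementation: the suffix list, the lemma table, and the function names are all made up for this sketch, which only shows why a stem can be a non-word while a lemma is always a dictionary form.

```python
# Toy stemmer: strip the first matching suffix (far cruder than a real stemmer).
SUFFIXES = ("ive", "ing", "s")


def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: look the word up in a hand-written dictionary of base forms.
LEMMAS = {"cats": "cat", "drove": "drive", "driving": "drive"}


def toy_lemmatize(word):
    return LEMMAS.get(word, word)


words = ["cats", "effective", "drove", "driving"]
stems = [toy_stem(word) for word in words]
lemmas = [toy_lemmatize(word) for word in words]
```

Here "driving" stems to "driv", which is not a word, while its lemma is "drive"; the irregular form "drove" is untouched by suffix stripping but is resolved by the lookup. In NLTK, the corresponding tools are classes such as `PorterStemmer` and `WordNetLemmatizer`.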
16. Future Work
● Everyone is responsible for a different part and uses different tools
● Because of the constraints of each tool, it is difficult to cooperate with each other during the data cleaning process
● Study how to improve cooperation efficiency when everyone uses different tools