4. Yelp dataset
Academic dataset.
Information about local businesses.
Five JSON files:
1. business
2. check-in
3. review
4. tip
5. user
https://www.yelp.com/dataset_challenge/dataset
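A minimal sketch of reading review records, assuming each file stores one JSON object per line and that review records carry `text` and `stars` fields. The helper name `parse_reviews` and the shortened sample line are illustrative, not taken from the project code.

```python
import json

# One sample review line (text shortened for brevity); illustrative only.
SAMPLE_LINE = (
    '{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", '
    '"review_id": "KPvLNJ21_4wbYNctrOwWdQ", "stars": 5, '
    '"date": "2014-02-13", "text": "Excellent food. Superb customer service.", '
    '"type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}'
)


def parse_reviews(lines):
    """Yield (text, stars) pairs from an iterable of JSON-encoded lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["text"], record["stars"]


reviews = list(parse_reviews([SAMPLE_LINE, ""]))
print(reviews)
```

With a real file, the same function could be fed an open file handle instead of a list of strings.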
6. User Reviews
The JSON files contain user reviews:
Example:
{"user_id": "Iu6AxdBYGR4A0wspR9BYHA",
 "review_id": "KPvLNJ21_4wbYNctrOwWdQ",
 "stars": 5,
 "date": "2014-02-13",
 "text": "Excellent food. Superb customer service. I miss the mario machines they used to have, but it's still a great place steeped in tradition.",
 "type": "review",
 "business_id": "5UmKMjUEUNdYWqANhGckJw"}
8. “Given a user's review text, can we predict the user's rating?”
9. Simpler Problem
Assuming that:
• bad rating (1-3 stars)
• good rating (4-5 stars)
we predict whether the user liked the service (rating of 4 or 5) or not (rating of 1, 2, or 3).
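The binarization above can be sketched in a few lines. The function name `to_label` and the label strings `"good"` and `"bad"` are illustrative choices, not names from the project.

```python
def to_label(stars):
    """Map a 1-5 star rating to a binary label: 4-5 is good, 1-3 is bad."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    return "good" if stars >= 4 else "bad"


labels = [to_label(s) for s in [1, 2, 3, 4, 5]]
print(labels)
```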
10. “Given a user's review text, can we predict whether the user liked the service or not?”
16. Classical Machine Learning
A combination of several machine learning algorithms:
Logistic regression.
Naive Bayes.
Stochastic gradient descent.
Support vector machine.
Results are combined by majority vote.
https://github.com/amrqura/yelpRatePrediction
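A minimal sketch of the majority-vote step, assuming each classifier outputs one label per review. With four classifiers and two classes a 2-2 tie is possible. Breaking ties in favor of the label seen first is an assumption made here, not something the slides specify.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# One hypothetical vote per classifier, for a single review.
votes = ["good", "bad", "good", "good"]
print(majority_vote(votes))
```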
17. Classical Machine Learning
Features:
Words of the review text are used as features, as follows:
1. Extract the 3000 most common words in the training dataset.
2. For each statement, check whether each of these frequent words occurs in it.
Feature set: a matrix of size 2000 × 3000.
2000: number of statements.
3000: Boolean values, each indicating whether a frequent word is present.
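The feature extraction described above can be sketched as follows. This toy version uses lowercase whitespace tokenization and a vocabulary of 2 words instead of 3000. The tokenizer and the helper names are assumptions, and the project code may tokenize differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def top_words(texts, n):
    """Return the n most common words across all training texts."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def featurize(text, vocabulary):
    """One Boolean per vocabulary word: does the word occur in the text?"""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]


train_texts = ["great food great service", "bad food", "great place"]
vocabulary = top_words(train_texts, 2)
matrix = [featurize(text, vocabulary) for text in train_texts]
print(vocabulary)
print(matrix)
```

With 2000 statements and a 3000-word vocabulary, `matrix` would have 2000 rows of 3000 Booleans each.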
18. Classical Machine Learning
Training:
Train the model on 80% of the statements and validate on the remaining 20%.
Run:
$ python3 TextClassification.py
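A minimal sketch of an 80/20 split, assuming the statements are shuffled before splitting. Whether the script shuffles, and with which seed, is not stated in the slides, and the function name is illustrative.

```python
import random


def train_validation_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, validation = train_validation_split(range(2000))
print(len(train), len(validation))
```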
23. CNN Dataset
Each statement is represented by an n × k matrix.
n = number of words in the statement.
k = length of each word vector.
We use word2vec to convert each word to a vector.
https://github.com/amrqura/deepYelpRatePrediction
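A minimal sketch of building the n × k matrix for one statement. The tiny embedding table with k = 3 is hypothetical and stands in for a trained word2vec model. Mapping unknown words to a zero vector is also an assumption.

```python
# Hypothetical toy embeddings (k = 3); a word2vec model would supply real ones.
EMBEDDINGS = {
    "excellent": [0.9, 0.1, 0.0],
    "food": [0.2, 0.7, 0.1],
}
K = 3


def statement_matrix(text, embeddings, k):
    """Return one k-dimensional row per word; unknown words become zeros."""
    rows = []
    for word in text.lower().split():
        rows.append(list(embeddings.get(word, [0.0] * k)))
    return rows


matrix = statement_matrix("Excellent food here", EMBEDDINGS, K)
print(len(matrix), len(matrix[0]))  # n and k
```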
24. CNN Training
Implementation using TensorFlow.
Number of filters = 128.
Filter sizes = 3, 4, 5.
Dropout = 0.5.
Apply max pooling.
Output layer: two nodes with a softmax function.
Use the cross-entropy error function.
Use L2 regularization.
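To illustrate what one filter does, here is a pure-Python sketch (not TensorFlow code) of a single convolution filter followed by max pooling, plus softmax and cross-entropy for the two-node output. The ReLU activation and all function names are assumptions. Dropout and L2 regularization are omitted.

```python
import math


def conv_max_pool(matrix, filt, bias=0.0):
    """Slide one filter of height h over the n x k matrix and keep the max."""
    h = len(filt)
    if len(matrix) < h:
        raise ValueError("statement shorter than filter; padding is needed")
    responses = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for i in range(h):
            for x, w in zip(matrix[start + i], filt[i]):
                total += x * w
        responses.append(max(0.0, total))  # ReLU (an assumption)
    return max(responses)  # max pooling over all window positions


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example with a single true class."""
    return -math.log(probs[true_index])


statement = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n=4, k=2
filt = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # filter size 3
pooled = conv_max_pool(statement, filt)
probs = softmax([2.0, 2.0])
loss = cross_entropy(probs, 0)
print(pooled, probs, loss)
```

In the full model, the pooled values from every filter would be concatenated into one vector and fed to the two-node output layer.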
25. CNN
Training:
5-fold cross-validation.
In each fold, train the model on 1600 statements and validate on 400.
Run:
$ python rating_prediction.py
Accuracy: 0.73549998
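A minimal sketch of how 5-fold cross-validation over 2000 statements gives 1600 training and 400 validation statements per fold. Contiguous, unshuffled folds and the function name are assumptions, and the script may split differently.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, validation) index lists for each of k contiguous folds."""
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = start + fold_size if fold < k - 1 else n_items
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        yield train, validation


folds = list(k_fold_indices(2000, k=5))
for train, validation in folds:
    print(len(train), len(validation))
```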
29. Conclusion
Two implementations to predict the user rating from the user review:
• Classical machine learning.
• Deep learning with a convolutional neural network.
Public Docker image:
https://hub.docker.com/r/amrkoura/yelpchallenge/