The document describes the development of a restaurant recommendation application. It discusses ingesting data on restaurants and menus from HTML pages and the Metro API. Over 960 restaurants and 115,000 menu items across 10 cities were analyzed. Models were trained to cluster restaurants and recommend options based on user criteria. The analysis and recommendations were visualized in Tableau. While some predictions worked well, others were less accurate, and additional data and features could improve results.
2. Introduction
History:
Services that review and rank restaurants
have existed since 1979 (Zagat)
Today:
According to Nielsen, Americans have on
average 41 apps on their smartphones, many of
which provide a recommendation service
3. Introduction
A variety of restaurant recommendation apps
have been created
Features include: finding restaurants, making reservations,
and healthy options
A Restaurant Recommender would aim to help
users save money and time, and could help cure
buyer's remorse
4. Problem Summary
We need a tool that resolves the challenge of
finding a restaurant in your area based upon
specific cuisine and menu item criteria
entered by the user
5. Hypothesis
Hypothesis: The Restaurant Recommender will recommend a
restaurant more accurately than selecting a restaurant
by chance alone
H0 (null hypothesis): A user will find a restaurant that they like
based on chance alone
HA (alternative hypothesis): The Restaurant Recommender app
will provide a better restaurant suggestion to the user compared
to chance alone
6. Data Ingestion
• WORM Storage
–Stored HTML menu pages in one location
which could be read many times
• Parsed HTML with BeautifulSoup
–Built out a list of “Restaurant” objects
• GET requests to WMATA API to pull metro
station data
–JSON data parsed with pandas read_json()
function
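The HTML parsing step can be sketched as follows. This is a minimal illustration, not the project's actual parser: the markup in `SAMPLE_HTML`, the CSS class names, and the fields of the `Restaurant` class are all assumptions, since the slides do not show the real menu pages or object definition.

```python
from dataclasses import dataclass, field

from bs4 import BeautifulSoup

# Hypothetical markup: the real menu pages are not shown in the slides.
SAMPLE_HTML = """
<div class="restaurant" data-cuisine="Thai">
  <h1 class="name">Example Thai Kitchen</h1>
  <ul class="menu">
    <li><span class="item">Pad Thai</span><span class="price">$12.50</span></li>
    <li><span class="item">Green Curry</span><span class="price">$13.00</span></li>
  </ul>
</div>
"""


@dataclass
class Restaurant:
    """Assumed shape of a parsed restaurant record."""
    name: str
    cuisine: str
    menu: list = field(default_factory=list)  # list of (item, price) tuples


def parse_restaurant(html: str) -> Restaurant:
    """Build one Restaurant object from a stored HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("div", class_="restaurant")
    name = root.find("h1", class_="name").get_text(strip=True)
    menu = []
    for li in root.find("ul", class_="menu").find_all("li"):
        item = li.find("span", class_="item").get_text(strip=True)
        price_text = li.find("span", class_="price").get_text(strip=True)
        # Standardize the price field to a float (strip the currency sign).
        menu.append((item, float(price_text.lstrip("$"))))
    return Restaurant(name=name, cuisine=root["data-cuisine"], menu=menu)


restaurant = parse_restaurant(SAMPLE_HTML)
print(restaurant)
```

Because the pages are stored once and only read afterwards, a parser like this can be rerun over the same files whenever the extraction rules change.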
Ingestion Wrangling Analysis Modeling Visualization
7. Wrangling and Munging
• Majority of time spent wrangling the data and
building restaurants
–Removing duplicate and incomplete
records
–Standardizing inconsistent fields (e.g. price)
–Aggregating and grouping
–Converting data types
• Merged restaurant and WMATA data using
Euclidean distance
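One plausible reading of the merge step is that each restaurant is matched to its closest Metro station. The sketch below assumes that reading; the restaurant names and all coordinates are illustrative values, and `nearest_station` is a hypothetical helper. Euclidean distance on raw latitude and longitude is only an approximation of ground distance, which is reasonable within a single city.

```python
import numpy as np

# Illustrative (latitude, longitude) pairs; not taken from the project data.
restaurants = {
    "Restaurant A": (38.8975, -77.0290),
    "Restaurant B": (38.9100, -77.0440),
}
stations = {
    "Metro Center": (38.8983, -77.0281),
    "Dupont Circle": (38.9095, -77.0436),
}


def nearest_station(point, stations):
    """Return (station name, distance) for the station closest to point."""
    names = list(stations)
    coords = np.array([stations[n] for n in names])
    # Euclidean distance from the point to every station at once.
    dists = np.linalg.norm(coords - np.array(point), axis=1)
    idx = int(np.argmin(dists))
    return names[idx], float(dists[idx])


matches = {name: nearest_station(pt, stations)[0]
           for name, pt in restaurants.items()}
print(matches)
```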
8. Data Overview
964 Total Restaurants
115,517 Total Menu Items
• Restaurant data includes:
–Name
–Location (address, latitude, longitude)
–Type of cuisine
–Menu (item, price, description)
• WMATA data includes:
–Station name
–Location (latitude, longitude)
–Metro Line
13. Feature Selection
• Four feature extraction pipelines using sklearn
–Chunking
–Cuisine Type
• TfidfVectorizer
–Extract keywords and assign significance score
– Tokenize and chunk parts of speech using nltk
• LabelBinarizer
–Convert cuisine types to binary features
• FeatureUnion
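The three sklearn pieces named above can be combined roughly as follows. This is a sketch under assumptions: the record layout, `FieldSelector`, and `CuisineBinarizer` are hypothetical, and the nltk tokenizing and chunking step is left out. The wrapper class exists because `LabelBinarizer` only accepts a single argument in `fit`, so it cannot be placed in a pipeline directly.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer


class FieldSelector(TransformerMixin, BaseEstimator):
    """Pull one field out of each restaurant record (hypothetical helper)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record[self.key] for record in X]


class CuisineBinarizer(TransformerMixin, BaseEstimator):
    """Wrap LabelBinarizer so it accepts the (X, y) signature pipelines use."""

    def fit(self, X, y=None):
        self.binarizer_ = LabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.binarizer_.transform(X)


# Toy records; the real data has 964 restaurants.
records = [
    {"menu": "pad thai green curry", "cuisine": "Thai"},
    {"menu": "margherita pizza tiramisu", "cuisine": "Italian"},
    {"menu": "carnitas tacos guacamole", "cuisine": "Mexican"},
]

features = FeatureUnion([
    ("menu_tfidf", Pipeline([
        ("select", FieldSelector("menu")),
        ("tfidf", TfidfVectorizer()),       # keyword significance scores
    ])),
    ("cuisine", Pipeline([
        ("select", FieldSelector("cuisine")),
        ("binarize", CuisineBinarizer()),   # one binary column per cuisine
    ])),
])

X = features.fit_transform(records)
print(X.shape)  # 10 menu keywords + 3 cuisine columns
```

`FeatureUnion` stacks the outputs of both branches side by side, so each restaurant ends up as one row holding both its menu keywords and its cuisine flags.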
14. Modeling and Prediction
• Transformation pipelines and transformed
feature vectors are pickled
• KMeans models are fitted using training
restaurant data, then pickled
• User inputs entered via Flask are stored as
a training instance
• The relevant pipeline and model are loaded to
transform and predict
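The pickle, load, and predict cycle described above might look like the sketch below. It makes several assumptions: the menus are toy strings, a lone `TfidfVectorizer` stands in for the full transformation pipeline, the objects are pickled to bytes in memory rather than to files, and `user_query` stands in for text that would arrive through the Flask form.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training menus (illustrative only).
menus = [
    "pad thai green curry spicy noodles",
    "red curry spicy basil noodles",
    "margherita pizza tiramisu pasta",
    "pepperoni pizza lasagna pasta",
]

# Fit the transformation step and the clustering model on training data.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(menus)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Pickle both objects; an application would typically write these to disk.
blob = pickle.dumps((vectorizer, model))

# Later, at request time: load, transform the user's input, and predict.
loaded_vectorizer, loaded_model = pickle.loads(blob)
user_query = "spicy noodles with curry"
cluster = int(loaded_model.predict(loaded_vectorizer.transform([user_query]))[0])
print("Matched cluster:", cluster)
```

Pickling the fitted transformer together with the model matters because the user's input must be mapped into exactly the same feature space the clusters were learned in.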
16. Reporting and Visualization
• Restaurant recommendations are determined
by similarity within a matched cluster
–“Similarity” is measured with sklearn’s pairwise
Euclidean distance between the test data and the
training instances in the feature space; the smallest
distance marks the most similar restaurant
• Predictions are exported into an interactive
Tableau visualization
–Allows the user flexibility in making a selection
through filtering and visual indicators
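The similarity ranking within a matched cluster can be sketched as below, using `euclidean_distances` from `sklearn.metrics.pairwise`. The two-dimensional feature vectors, the restaurant names, and the `recommend` helper are assumptions for illustration; the real feature space comes from the extraction pipelines.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

# Toy training instances in a 2-D feature space (illustrative only).
names = np.array(["A", "B", "C", "D", "E", "F"])
X_train = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],   # one group
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.3],   # another group
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)


def recommend(x_user, top_n=2):
    """Rank restaurants in the user's matched cluster by distance."""
    x_user = np.atleast_2d(x_user)
    cluster = model.predict(x_user)[0]
    # Only training instances in the matched cluster are candidates.
    members = np.where(model.labels_ == cluster)[0]
    dists = euclidean_distances(x_user, X_train[members])[0]
    order = np.argsort(dists)[:top_n]
    return [(str(names[members[i]]), float(dists[i])) for i in order]


recommendations = recommend([5.0, 5.0])
print(recommendations)
```

A table of names and distances like this is the kind of output that could then be exported for filtering in Tableau.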
18. Results
• Some predictions are good, others not so
good
–Some clusters still contain a “hodge podge”
• Removing the “cuisine type” feature helped to
eliminate what we saw as overfitting
• Different k values gave better results in some
cases, worse in others
• Additional features (price, ratings, metro)
would require more clusters and MORE DATA
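One way to compare k values more systematically is to score each clustering. The sketch below uses the silhouette score, which is an assumption on our part: the slides do not say how the k values were judged. The data is synthetic, with three well-separated groups, so the best k is known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups of four points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])
X = np.vstack([c + offsets for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette is near 1 for tight, well-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On real restaurant data the scores will be far less clear-cut, which is consistent with some clusters remaining a “hodge podge”.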
19. Conclusions
• More data over a “better” model
• Might improve results using transformations
like Singular Value Decomposition (SVD) or
Latent Dirichlet Allocation (LDA)
– Better model analysis
• With more data, improve our tokenizer
– Incorporate stemming, improve chunking
• Incorporate user feedback into the prediction
model (e.g. through the Flask interface)
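The SVD suggestion above could be tried by inserting a reduction step between the TF-IDF features and KMeans. This is a sketch of that idea only, not something the project implemented; the toy menus and the choice of two components are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy menus (illustrative only).
menus = [
    "pad thai green curry spicy noodles",
    "red curry spicy basil noodles",
    "tom yum soup spicy noodles",
    "margherita pizza tiramisu pasta",
    "pepperoni pizza lasagna pasta",
    "mushroom risotto pasta tiramisu",
]

# TF-IDF -> truncated SVD (works on sparse input) -> KMeans.
lsa_kmeans = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = lsa_kmeans.fit_predict(menus)

# The reduced vectors can be inspected by dropping the final KMeans step.
reduced = lsa_kmeans[:-1].transform(menus)
print(labels, reduced.shape)
```

Reducing the sparse keyword space to a few dense components may make Euclidean distances between restaurants more meaningful, though that would need to be checked against the actual data.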
20. Additional Opportunities
• “Waiter-caller” function that would allow users to log in, use
the restaurant map search function, click on a restaurant, and
be matched with menu items based on keyword matches,
as opposed to reading through an entire menu to find
relevant items.
–Required more knowledge and implementation of
JavaScript, CSS, and Jinja in the Flask environment.
• Sentiment analyzer was developed but not integrated. It would
allow users to visit a restaurant and input a review. The review
would then be analyzed, giving back a recommended score (1-5)
to the user.
–Similar requirements