1. Web crawling and database tables
• We want to crawl/scrape web
pages and extract the relevant
content to build standardized
database tables.
What can we use?
2. Google Search Tools
* Google uses structured data that it finds on the web to
understand the content of the page, as well as to gather
information about the web and the world in general.
* Structured data is a standardized format for providing
information about a page and classifying the page content;
for example, on a recipe page, what are the ingredients,
the cooking time and temperature, the calories, and so on.
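To make the recipe example concrete, here is a minimal sketch of what such structured data can look like when expressed as JSON-LD with the schema.org `Recipe` type. The recipe values and the helper name `to_json_ld_script` are assumptions made up for illustration; only the property names (`recipeIngredient`, `cookTime`, `nutrition`) come from the schema.org vocabulary.

```python
import json

# Hypothetical recipe values, invented for illustration only.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Example Citrus Cake",
    "recipeIngredient": ["2 eggs", "1 cup sugar", "2 cups flour"],
    "cookTime": "PT40M",  # ISO 8601 duration: 40 minutes
    "nutrition": {"@type": "NutritionInformation", "calories": "250 calories"},
}


def to_json_ld_script(data: dict) -> str:
    """Wrap a dict in the <script> tag that carries JSON-LD inside a page."""
    body = json.dumps(data, ensure_ascii=False, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_json_ld_script(recipe)
print(snippet)
```

A crawler that finds a block like this can read the ingredients and cooking time directly, without guessing from the page layout.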
3. Google Search Tools
• https://schema.org
Schema.org is a collaboration
between Google, Microsoft,
Yahoo! and Yandex, large search
engines that use this marked-up
data from web pages.
* Schema.org provides a standardized
vocabulary of types, properties and
descriptions for structured data
tags.
• The Google Structured Data
Testing Tool is an easy and
useful tool for validating your
structured data, and in some
cases, previewing a feature in
Google Search.
https://search.google.com/structured-data/testing-tool/
5. Inspecting source code with the Google Structured Data Testing Tool
for structured data
• Search results for ‘yemek tarif’ (Turkish for ‘food recipe’) on Google.
Websites on the first page (03.03.2020, 14:00):
1. Yemek.com
2. Lezzet.com.tr
3. Refikaninmutfagi.com
4. Nefisyemektarifleri.com
6. Inspecting the web pages’ source code
** A common issue with yemek.com, nefisyemektarifleri.com and lezzet.com.tr: there is
no match on the main page, only on individual recipe pages, and the (JavaScript) code has to run first.
Searching the page source (Ctrl+F):
https://yemek.com/ // no match for ‘recipeIngredient’
https://yemek.com/tarif/narenciyeli-hashasli-kek/ // match for ‘recipeIngredient’
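The manual Ctrl+F check can be sketched as a small function. This is only an illustration under stated assumptions: the helper name `has_property` and both HTML strings are hypothetical stand-ins for downloaded page source, not content from the sites above.

```python
def has_property(page_source: str, prop: str = "recipeIngredient") -> bool:
    """Return True if the property name appears anywhere in the page source."""
    return prop in page_source


# Hypothetical page sources standing in for a home page and a recipe page.
home_page = "<html><body><h1>Recipes</h1></body></html>"
recipe_page = '<html><body><li itemprop="recipeIngredient">2 eggs</li></body></html>'

print(has_property(home_page))    # False
print(has_property(recipe_page))  # True
```

A plain substring test like this only mirrors the text search; it does not confirm that the markup is valid structured data.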
7. Websites with useful structured data
Website                     Useful structured data
Yemek.com                   +
Lezzet.com.tr               +
Nefisyemektarifleri.com     +
Refikaninmutfagi.com        -
** yemek.com, nefisyemektarifleri.com and lezzet.com.tr have useful structured
data.
We crawl/scrape these sites with the same settings and write the output to a JSON or CSV file or to a
database.
** refikaninmutfagi.com does not have useful structured data. We define a site-specific
crawl format for this site.
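One way the shared settings could work is sketched below, assuming the sites expose their recipes as JSON-LD `<script>` blocks (the slides do not say which structured data format each site uses). The class and function names, the CSV columns, and the sample page are all hypothetical. The sketch uses only the Python standard library and also assumes `recipeIngredient` is a list of strings.

```python
import csv
import io
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks


def extract_recipes(html: str) -> list:
    """Return every JSON-LD object in the page whose @type is Recipe."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    recipes = []
    for item in parser.items:
        candidates = item if isinstance(item, list) else [item]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == "Recipe":
                recipes.append(candidate)
    return recipes


def recipes_to_csv(recipes: list) -> str:
    """Write one CSV row per recipe (assumes recipeIngredient is a list)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "ingredients"])
    for recipe in recipes:
        ingredients = "; ".join(recipe.get("recipeIngredient", []))
        writer.writerow([recipe.get("name", ""), ingredients])
    return out.getvalue()


# Hypothetical page source, invented for illustration.
sample_page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Example Cake",
 "recipeIngredient": ["2 eggs", "1 cup sugar"]}
</script>
</head><body></body></html>"""

csv_text = recipes_to_csv(extract_recipes(sample_page))
print(csv_text)
```

Because the extraction depends only on the schema.org vocabulary and not on each site's HTML layout, the same function can be pointed at any site that publishes this markup. A site without it needs its own selectors.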
9. • We extract (schema.org) microdata using Scrapy.
https://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath
* Alternative ways to scrape websites (Schema.org Microdata, JSON
Linked Data, internal JavaScript variables, and XHRs).
https://blog.apify.com/web-scraping-in-2018-forget-html-use-xhrs-metadata-or-javascript-variables-8167f252439c
• End-to-end Scrapy tutorial, parts I-IV (September 2019).
https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0
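The idea behind microdata extraction can be sketched without Scrapy. The example below is not the Scrapy selector/XPath approach from the first link; it swaps in Python's standard `html.parser` so it runs with no extra packages. The class name `MicrodataParser` and the sample HTML are hypothetical, and as a simplification it ignores `itemprop` attributes nested inside another `itemprop` element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; their value comes from an attribute.
VOID_TAGS = {"meta", "img", "link", "br", "hr", "input"}


class MicrodataParser(HTMLParser):
    """Collect (itemprop, value) pairs from microdata-annotated HTML."""

    def __init__(self):
        super().__init__()
        self.props = []       # collected (name, value) pairs
        self._current = None  # itemprop name whose text is being collected
        self._tag = None      # tag name of the element being collected
        self._depth = 0       # nesting depth of same-named tags
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            # Inside an itemprop element: only track same-named nesting.
            if tag == self._tag:
                self._depth += 1
            return
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag in VOID_TAGS or "content" in attrs:
            value = attrs.get("content") or attrs.get("src") or attrs.get("href") or ""
            self.props.append((name, value))
        else:
            self._current, self._tag, self._depth, self._text = name, tag, 1, []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace in the collected text.
                value = " ".join("".join(self._text).split())
                self.props.append((self._current, value))
                self._current = None


# Hypothetical recipe markup, invented for illustration.
sample = """
<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">Example Cake</h1>
  <meta itemprop="cookTime" content="PT40M">
  <ul>
    <li itemprop="recipeIngredient">2 eggs</li>
    <li itemprop="recipeIngredient">1 cup <b>sugar</b></li>
  </ul>
</div>
"""

parser = MicrodataParser()
parser.feed(sample)
parser.close()
ingredients = [value for name, value in parser.props if name == "recipeIngredient"]
print(ingredients)
```

In a real Scrapy spider the same lookup would typically be expressed with selectors on the `itemprop` attribute, as the first linked post describes.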