Web crawling scraping

Web crawling and database tables
• We want to crawl/scrape web pages and extract the relevant content to build standardized database tables.
What can we use?

Google Search Tools
* Google uses structured data that it finds on the web to
understand the content of the page, as well as to gather
information about the web and the world in general.
* Structured data is a standardized format for providing
information about a page and classifying the page content;
for example, on a recipe page, what are the ingredients,
the cooking time and temperature, the calories, and so on.

Google search tools
• https://schema.org
Schema.org is a collaboration between Google, Microsoft, Yahoo! and Yandex, the large search engines that use this marked-up data from web pages.
* Schema.org provides a standardized vocabulary of types, properties, and descriptions for structured data tags.
• The Google Structured Data Testing Tool is an easy and useful tool for validating your structured data and, in some cases, previewing a feature in Google Search.
https://search.google.com/structured-data/testing-tool/

@type
@id
url
name
image
dateModified
totalTime
recipeYield
recipeIngredient
recipeInstructions
recipeCategory
keywords
recipeCuisine
cookTime
prepTime
"recipeIngredient": [
  "1 (15 ounce) package double crust ready-to-use pie crust",
  "6 cups thinly sliced, peeled apples (6 medium)",
  "3/4 cup sugar",
  "2 tablespoons all-purpose flour",
  "3/4 teaspoon ground cinnamon",
  "1/4 teaspoon salt",
  "1/8 teaspoon ground nutmeg",
  "1 tablespoon lemon juice"
]
These are examples of the structured data format and properties for a recipe.
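As a rough illustration of how such recipe properties could be read from a page, the sketch below pulls JSON-LD blocks out of HTML using only the Python standard library. The class name `JsonLdParser`, the function `extract_recipes`, and the sample page (including the recipe name "Apple Pie") are assumptions made for this example; real pages may nest the recipe inside an `@graph` array or use a list-valued `@type`, which this sketch does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_recipes(html):
    """Return every top-level JSON-LD object whose @type is exactly "Recipe"."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    recipes = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                recipes.append(item)
    return recipes


# Illustrative page; the markup and the recipe name are made up for this sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Apple Pie",
 "recipeIngredient": ["3/4 cup sugar", "1/4 teaspoon salt"]}
</script>
</head><body></body></html>
"""

recipes = extract_recipes(SAMPLE_HTML)
print(recipes[0]["recipeIngredient"])
```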

Inspecting source code with the Google Structured Data Testing Tool, from the structured data point of view
• Google search results for ‘yemek tarif’ (Turkish for "food recipe").
First-page websites (03.03.2020, 14:00):
1. Yemek.com
2. Lezzet.com.tr
3. Refikaninmutfagi.com
4. Nefisyemektarifleri.com

Inspecting the source code of these web pages
** Common issue for yemek.com, nefisyemektarifleri.com, and lezzet.com.tr: there is no match on the main page, only on individual recipe pages, and the (JavaScript) code has to be run first.
In the page source (Ctrl+F):
https://yemek.com/ // no match for ‘recipeIngredient’
https://yemek.com/tarif/narenciyeli-hashasli-kek/ // match for ‘recipeIngredient’
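The Ctrl+F check above could be automated along these lines. This is a minimal sketch using the standard library; `fetch_source` and `mentions_property` are hypothetical helper names, the User-Agent string is an arbitrary choice, and a plain substring test only mirrors the manual search (it does not prove the property is valid structured data, and it sees only the raw source, not content added later by JavaScript).

```python
from urllib.request import Request, urlopen


def fetch_source(url, timeout=10):
    """Download the raw page source, as 'view source' would show it."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (structured-data-check)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def mentions_property(source, prop="recipeIngredient"):
    """Mimic a Ctrl+F search for a schema.org property name in the source."""
    return prop in source


# Offline demonstration on two tiny, made-up sources.
home_page = "<html><body>Recipe index</body></html>"
recipe_page = '<script type="application/ld+json">{"recipeIngredient": []}</script>'
print(mentions_property(home_page), mentions_property(recipe_page))
```

For a live check, `mentions_property(fetch_source(url))` could be called for each URL of interest, subject to the site's terms and robots rules.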

Website                    Useful structured data
Yemek.com                  +
Lezzet.com.tr              +
Nefisyemektarifleri.com    +
Refikaninmutfagi.com       -
** yemek.com, nefisyemektarifleri.com, and lezzet.com.tr have useful structured data. We crawl/scrape these sites with the same settings and send the output to a JSON file, a CSV file, or a database.
** refikaninmutfagi.com does not have useful structured data. We set a site-specific crawl format for it.
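One possible shape for the export step is sketched below with the Python standard library. The column list, the table name `recipes`, the helper names, and the sample record (with a placeholder URL) are all assumptions for illustration; list-valued properties such as recipeIngredient are stored as JSON text so that each record fits one flat row.

```python
import csv
import io
import json
import sqlite3

# Assumed subset of properties to keep; adjust to the table design you need.
COLUMNS = ["url", "name", "recipeIngredient", "cookTime", "prepTime"]


def to_row(recipe):
    """Flatten one scraped recipe dict into a row with a fixed set of columns."""
    row = {}
    for column in COLUMNS:
        value = recipe.get(column, "")
        if isinstance(value, list):
            value = json.dumps(value, ensure_ascii=False)
        row[column] = value
    return row


def write_csv(rows, fileobj):
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def write_sqlite(rows, connection):
    columns_sql = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    connection.execute(f"CREATE TABLE IF NOT EXISTS recipes ({columns_sql})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    connection.executemany(
        f"INSERT INTO recipes VALUES ({placeholders})",
        [tuple(row[c] for c in COLUMNS) for row in rows],
    )
    connection.commit()


# Made-up record standing in for scraped data.
sample = {
    "url": "https://example.com/apple-pie",  # placeholder URL
    "name": "Apple Pie",
    "recipeIngredient": ["3/4 cup sugar", "1/4 teaspoon salt"],
}
rows = [to_row(sample)]

csv_buffer = io.StringIO()          # swap for open("recipes.csv", "w", newline="")
write_csv(rows, csv_buffer)

connection = sqlite3.connect(":memory:")  # swap for a file path to persist
write_sqlite(rows, connection)

json_text = json.dumps(rows, ensure_ascii=False, indent=2)
```

Because every site is mapped onto the same `COLUMNS`, sites that share the structured data format can go through the same export code unchanged.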

yemek.com           lezzet.com.tr       nefisyemektarifleri.com  refikaninmutfagi.com
@type               @type               @type                    @type
@id                 name                @id                      @id
url                 image               url                      url
name                description         mainEntityOfPage         inLanguage
image               recipeYield         name                     name
image               recipeIngredient    headline                 datePublished
image               recipeInstructions  description              dateModified
dateModified        prepTime            datePublished            description
totalTime           cookTime            dateModified             isPartOf
recipeYield         author              url
recipeIngredient    aggregateRating     mainEntityOfPage
recipeInstructions  keywords            recipeYield
recipeCategory      nutrition           prepTime
keywords            recipeCategory      cookTime
recipeCuisine       recipeCuisine       totalTime
cookTime            video               recipeIngredient
prepTime                                ingredients
description                             recipeInstructions
author                                  author
aggregateRating                         aggregateRating
nutrition                               keywords
                                        nutrition
                                        recipeCategory
                                        recipeCuisine
                                        video
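To decide which columns a shared database table could rely on, one option is to intersect the property lists of the three sites that expose recipe data. The sketch below does this with Python sets. The property sets are transcribed from the comparison above as best it can be read, so treat the per-site assignment as an assumption; `SITE_PROPERTIES` is a name chosen for this example.

```python
# Property names per site, as read from the comparison above (an assumption:
# the column assignment of a few entries in the original layout is ambiguous).
SITE_PROPERTIES = {
    "yemek.com": {
        "@type", "@id", "url", "name", "image", "dateModified", "totalTime",
        "recipeYield", "recipeIngredient", "recipeInstructions",
        "recipeCategory", "keywords", "recipeCuisine", "cookTime", "prepTime",
        "description", "author", "aggregateRating", "nutrition",
    },
    "lezzet.com.tr": {
        "@type", "name", "image", "description", "recipeYield",
        "recipeIngredient", "recipeInstructions", "prepTime", "cookTime",
        "author", "aggregateRating", "keywords", "nutrition",
        "recipeCategory", "recipeCuisine", "video",
    },
    "nefisyemektarifleri.com": {
        "@type", "@id", "url", "mainEntityOfPage", "name", "headline",
        "description", "datePublished", "dateModified", "recipeYield",
        "prepTime", "cookTime", "totalTime", "recipeIngredient", "ingredients",
        "recipeInstructions", "author", "aggregateRating", "keywords",
        "nutrition", "recipeCategory", "recipeCuisine", "video",
    },
}

# Properties every site provides: safe candidates for required columns.
common = set.intersection(*SITE_PROPERTIES.values())

# Properties only some sites provide: candidates for nullable columns.
optional = set.union(*SITE_PROPERTIES.values()) - common

print(sorted(common))
print(sorted(optional))
```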

• We extract (schema.org) microdata using Scrapy.
https://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath
* Alternative ways to scrape websites (Schema.org Microdata, JSON Linked Data, internal JavaScript variables, and XHRs).
https://blog.apify.com/web-scraping-in-2018-forget-html-use-xhrs-metadata-or-javascript-variables-8167f252439c
• End-to-end Scrapy tutorial, parts I-IV (Sep 2019).
https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0
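To make the microdata idea concrete without depending on Scrapy, here is a small sketch that swaps in Python's built-in `html.parser` in place of Scrapy selectors and XPath. It collects `itemprop` values from a page. The class name `MicrodataParser`, the sample markup, and the values in it (such as `PT45M`) are made up for illustration, and nested `itemscope` items are deliberately not handled.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects itemprop values from a page (flat; ignores nested itemscope)."""

    def __init__(self):
        super().__init__()
        self.properties = {}   # itemprop name -> list of values
        self._current = None   # (tag, itemprop) of the element being read
        self._text = []

    def _add(self, name, value):
        self.properties.setdefault(name, []).append(value.strip())

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("itemprop")
        if not name:
            return
        if attributes.get("content") is not None:
            # e.g. <meta itemprop="cookTime" content="PT45M">
            self._add(name, attributes["content"])
        elif self._current is None:
            # Value is the element's text; start collecting it.
            self._current = (tag, name)
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            self._add(self._current[1], "".join(self._text))
            self._current = None


# Illustrative markup, not taken from any of the sites discussed.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">Apple Pie</h1>
  <meta itemprop="cookTime" content="PT45M">
  <ul>
    <li itemprop="recipeIngredient">3/4 cup sugar</li>
    <li itemprop="recipeIngredient">1/4 teaspoon salt</li>
  </ul>
</div>
"""

microdata = MicrodataParser()
microdata.feed(SAMPLE)
microdata.close()
print(microdata.properties)
```

In a Scrapy project the same values would more likely be selected with XPath or CSS expressions on the `itemprop` attribute, as the first linked article describes.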
