Advanced Web
Scraping
or
How To Make Internet
Your Database
by @estevecastells & @NachoMascort
I’m Esteve Castells
International SEO Specialist @
Softonic
You can find me on @estevecastells
https://estevecastells.com/
Newsletter: http://bit.ly/Seopatia
Hi!
Hi!
I’m Nacho Mascort
SEO Manager @ Grupo Planeta
You can find me on:
@NachoMascort
https://seohacks.es
You can see my scripts on:
https://github.com/NachoSEO
What are we gonna see?
1. What is Web Scraping?
2. Myths about Web Scraping
3. Main use cases
a. In our website
b. In external websites
4. Understanding the DOM
5. Extraction methods
6. Web Scraping Tools
7. Web Scraping with Python
8. Tips
9. Case studies
10. Bonus
by @estevecastells & @NachoMascort
1.
What is Web Scraping?
1.1 What is Web Scraping?
Scraping, or web scraping, is a technique for extracting
information or content from a website by means of software.
Scrapers range from simple ones that parse the HTML of a
website to browsers that render JS and perform
complex navigation and extraction tasks.
1.2 What are the use cases for
Web Scraping?
The uses of scraping are infinite, only limited by your
creativity and the legality of your actions.
The most basic uses can be to check changes in
your own or a competitor's website, even to create
dynamic websites based on multiple data sources.
2.
Myths about Web Scraping
3.
Main use cases
3.1 Main use cases in our websites
Checking the Value of Certain HTML Tags
➜ Are all elements as defined in our
documentation?
○ Deployment checks
➜ Are we sending conflicting signals?
○ HTTP headers
○ Sitemaps vs meta tags
○ Duplicate HTML tags
○ Incorrect tag placement
➜ Disappearance of HTML tags
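A minimal sketch of this kind of check, assuming we simply want to verify that the meta robots value on a list of URLs matches what our documentation defines (the URLs and the expected value are placeholders):

import requests
from bs4 import BeautifulSoup

urls = ["https://www.domain.tld/", "https://www.domain.tld/path"]  # placeholder URLs
expected = "index, follow"                       # value defined in our documentation

for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "robots"})
    value = tag.get("content") if tag else None  # None = the tag has disappeared
    if value != expected:
        print(url, "->", value)                  # flag conflicting or missing signals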
3.2 Main use cases in external
websites
● Automate processes: what a human would do,
but saving money
○ Visual changes
● Are you adding new features?
○ Changes in HTML (meta tags, etc.)
● Are you adding new Schema tagging or
changing your indexing strategy?
○ Content changes
● Do you update/curate your content?
○ Monitor ranking changes in Google
4.
Understanding the DOM
DOCUMENT
OBJECT
MODEL
4.1 Document Object Model
What is it?
It is the structural representation of a document.
Defines the hierarchy
of each element
within each page.
Depending on its
position a tag can be:
● Child
● Parent
● Sibling
4.1 Document Object Model
Components of a website?
Our browser makes a get request to the server
and it returns several files that the browser
renders.
These files are usually:
➜ HTML
➜ CSS
➜ JS
➜ Images
➜ ...
4.2 Source code vs DOM
They're two different things.
You can consult any HTML of a site by typing in
the browser bar:
view-source: https://www.domain.tld/path
*For CSS and JS files this is not necessary, because
the browser does not render them
** Ctrl / Cmd + u
What’s the source
code?
What’s the source
code?
>>> view-source:
4.2 Source code vs DOM
No JS has been executed in the source code.
Depending on the behavior of the JS you may
obtain "false" data.
4.2 Source code vs DOM
If the source code doesn't work, what do we do?
We can "see an approximation" to the DOM in
the "Elements" tab of the Chrome developer
tools (and any other browser).
4.2 Source code vs DOM
4.2 Source code vs DOM
Or pressing F12
Shortcuts are cooler!
What's on the DOM?
>>> F12
We can see JS
changes in real time
4.3 Google, what do you see?
Experiment from a little over a year ago:
The idea is to modify the Meta Robots tag (via JS) of
a URL to deindex the page and see if Google pays
attention to the value found in the source code or in
the DOM.
URL to experiment with:
https://seohacks.es/dashboard/
4.3 Google, what do you see?
The following code is added:
<script>
jQuery('meta[name="robots"]').remove();
var meta = document.createElement('meta');
meta.name = 'robots';
meta.content = 'noindex, follow';
jQuery('head').append(meta);
</script>
4.3 Google, what do you see?
What it does is:
1. Deletes the current meta robots tag
4.3 Google, what do you see?
What it does is:
1. Deletes the current meta robots tag
2. Creates a variable called "meta" that stores the
creation of a "meta" type element (pardon the
redundancy)
4.3 Google, what do you see?
What it does is:
1. Deletes the current meta robots tag
2. Creates a variable called "meta" that stores the
creation of a "meta" type element (pardon the
redundancy)
3. Adds the attributes "name" with value
"robots" and "content" with value "noindex,
follow"
4.3 Google, what do you see?
What it does is:
1. Deletes the current meta robots tag
2. Creates a variable called "meta" that stores the
creation of a "meta" type element (pardon the
redundancy)
3. Adds the attributes "name" with value
"robots" and "content" with value "noindex,
follow"
4. Appends to the head the meta variable that
contains the tag with the values that cause
deindexation
4.3 Google, what do you see?
Transforms this:
Into this:
4.3 Result
DEINDEXED
More data
https://www.searchviu.com/en/javascript-canonical-tags/
5.
Methods of extraction
5. Methods of extraction
We can extract the information from each document
using different models that are quite similar to each
other.
5. Methods of extraction
We can extract the information from each document
using different models that are quite similar to each
other.
These are the ones:
➜ Xpath
➜ CSS Selectors
➜ Others such as regex or specific tool selectors
5.1 Xpath
XPath uses path expressions to define a node or set
of nodes within a document
We can get them:
➜ Writing them ourselves
➜ Through developer tools within a browser
5.1.1 Xpath Syntax
The writing standard is as follows:
//tag[@attribute='value']
5.1.1 Xpath Syntax
The writing standard is as follows:
//tag[@attribute='value']
For this tag:
<input id="seoplus" type="submit" value="Log In"/>
5.1.1 Xpath Syntax
//tag[@attribute='value']
For this tag:
<input id="seoplus" type="submit" value="Log In"/>
➜ Tag: input
5.1.1 Xpath Syntax
//tag[@attribute='value']
For this tag:
<input id="seoplus" type="submit" value="Log In"/>
➜ Tag: input
➜ Attributes:
○ Id
○ Type
○ Value
5.1.1 Xpath Syntax
//tag[@attribute='value']
For this tag:
<input id="seoplus" type="submit" value="Log In"/>
➜ Tag: input
➜ Attributes:
○ Id = seoplus
○ Type = submit
○ Value = Log In
5.1.1 Xpath Syntax
//input[@id='seoplus']
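As a quick illustration (not from the slides), this XPath can be evaluated in Python with lxml; the snippet below is the example input tag from above:

from lxml import html

snippet = '<form><input id="seoplus" type="submit" value="Log In"/></form>'
tree = html.fromstring(snippet)
node = tree.xpath('//input[@id="seoplus"]')[0]   # same expression as above
print(node.get("value"))                          # -> Log In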
5.1.2 Dev Tools
5.1.2 Dev Tools
5.2 CSS Selectors
As its name suggests, these are the same selectors
we use to write CSS.
We can get them:
➜ Writing them ourselves with the same syntax
as modifying the styles of a site
➜ Through developer tools within a browser
*tip: to select a tag by attribute we can use the XPath syntax and simply remove the @ from the
attribute
5.2.1 Dev Tools
5.3 Xpath vs CSS
                    Xpath                        CSS
Direct child        //div/a                      div > a
Child or subchild   //div//a                     div a
ID                  //div[@id="example"]         #example
Class               //div[@class="example"]      .example
Attributes          //input[@name='username']    input[name='username']
https://saucelabs.com/resources/articles/selenium-tips-css-selectors
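A small sketch of this equivalence in Python, assuming lxml (with the cssselect package installed); both selectors return the same node:

from lxml import html

doc = html.fromstring('<div id="example"><a href="/a">link</a></div>')
by_xpath = doc.xpath('//div[@id="example"]//a')
by_css = doc.cssselect('#example a')              # requires the cssselect package
print(by_xpath[0].get("href") == by_css[0].get("href"))   # -> True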
5.4 Others
We can access certain nodes of the DOM by other
methods such as:
➜ Regex
➜ Specific selectors of Python libraries
➜ Ad hoc tools
6.
Web Scraping Tools
Some of the tens of tools that exist for
Web Scraping
Plugins
Tools
Scraper
Jason The Miner
Here are more than 30 if you didn’t like these ones.
https://www.octoparse.com/blog/top-30-free-web-scraping-software/
These range from basic tools or plugins that we can use for
simple scrapes (in some cases to get data out faster,
without having to pull out Python or JS) to more 'advanced'
tools.
➜ Scraper
➜ Screaming Frog
➜ Google Sheets
➜ Grepsr
6.1 Web Scraping Tools
Scraper is a Google Chrome plugin that you can use
to make small scrapings of elements in a minimally
well-structured HTML.
It is also useful for extracting the XPath when
Google Chrome Dev Tools does not extract it well,
so you can use it in other tools. As a plus, like
Google Chrome Dev Tools, it works on the DOM
6.1.1 Scraper
1. Double-click on the element we want to pull
2. Click on
Scrape Similar
3. Done!
6.1.1 Scraper
If the elements
are well
structured, we
can get
everything pulled
extremely easily,
without the need
to use external
programs or
programming.
6.1.1 Scraper
6.1.1 Scraper
Here we have the Xpath
6.1.1 Scraper
List of elements that
we are going to take
out. Supports multiple
columns
6.1.1 Scraper
Easily export to Excel (copy-paste)
6.1.1 Scraper
Or to GDocs
(one-click)
Screaming Frog is one of the SEO tools par
excellence, which can also be used for basic (and
even advanced) scraping.
As a crawler you can use Text only (pure HTML) or
JS rendering, if your website uses client-side
rendering.
Its extraction mode is simple, but with it you can get
much of what you need; for the rest you can
use Python or other tools.
6.1.2 Screaming Frog
6.1.2 Screaming Frog
Configuration > Custom > Extraction
6.1.2 Screaming Frog
We have various modes
- CSS path (CSS selector)
- XPath (the main one we will use)
- Regex
6.1.2 Screaming Frog
We have up to 10 selectors,
which will generally be
sufficient. Otherwise, we will
have to use Excel with the
VLOOKUP function to join
two or more scrapings.
6.1.2 Screaming Frog
We will then have to decide
whether we want to extract
the content into HTML, text
only or the entire HTML
element
6.1.2 Screaming Frog
Once we have all the extractors set, we just have to
run it, either in crawler mode or in list mode with a sitemap.
6.1.2 Screaming Frog
Once we have everything configured perfectly (sometimes
we will have to test the correct XPath several times), we
can leave it crawling and export the data obtained.
6.1.2 Screaming Frog
Some of the most common uses, both on our own
websites and on competitors':
➜ Monitor changes/lost data in a deploy
➜ Monitor weekly changes in web content
➜ Check quantity increase or decrease or
content/thin content ratios
The limit of scraping with Screaming Frog?
You can do 99% of the things you want to do, and with
JS rendering made easy!
6.1.2 Screaming Frog
Quick-and-dirty tip: a 'cutre' (quick-and-dirty) way of extracting all the
URLs from a sitemap index is to import the entire
list and then clean it up with Excel, in case you
don't (yet) know how to use Python.
1. Go to Download Sitemap
index
2. Put the URL of the sitemap
index
6.1.2 Screaming Frog
3. Wait for all the sitemaps to
download (can take minutes)
4. Select all, copy-paste to Excel
6.1.2 Screaming Frog
Then we replace "Found " and we'll have all the
clean URLs of a sitemap index.
We can then filter and pull out the results matching
the URL patterns that interest us,
e.g. a category, a page type, URLs containing a
certain word, etc.
That way we can segment our scraping of either
our own website or a competitor's even further.
6.1.3 Cloud version: FandangoSEO
If you need to run intensive crawls of millions of pages
with pagetype segmentation, with FandangoSEO you
can set up interesting XPaths with content extraction,
counts and existence checks.
6.1.4 Google Sheets
With Google Sheets we can also import most elements
of a web page, from HTML to JSON with a small
external script.
➜ Pros:
○ It imports HTML, CSV, TSV, XML, JSON and
RSS.
○ Hosted in the cloud
○ Free and for the whole family
○ Easy to use with familiar functions
➜ Cons:
○ It gets stuck easily and usually takes a long
time to process thousands of rows
6.1.4 Google Sheets
➜ Easily import feeds to create your own
Feedly or news aggregator
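For example, with the standard Sheets import functions (the URLs below are placeholders):

=IMPORTXML("https://example.com/", "//title")   pulls the page title via an XPath expression
=IMPORTFEED("https://example.com/feed")         imports an RSS feed into a range you can filter and sort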
6.1.5 Grepsr
Grepsr is a tool based on a browser extension that
facilitates visual extraction, and it also offers data export in
CSV or via API (JSON).
First of all we will install the extension in Chrome and run
it, loading the desired page to scrape.
6.1.5 Grepsr
Then click on 'Select' and pick the exact element you
want; by hovering with the mouse you can refine it.
6.1.5 Grepsr
Once selected, the
element will be marked
and, if the HTML is
well structured,
extraction will be very
easy without having to
pull out XPath or CSS
selectors.
6.1.5 Grepsr
Once all our fields are selected, we save them
by clicking on “Next”; we can name each field and extract
it as text or extract the CSS class itself.
6.1.5 Grepsr
Next, we can add pagination for each of our fields, if
required, either in HTML with a 'next' link, or with
'load more' / infinite scroll (AJAX).
6.1.5 Grepsr
6.1.5 Grepsr
To select the pagination, we will follow the same process
as with the elements to scrape.
(Optional part, not everything requires pagination)
6.1.5 Grepsr
We can also configure a login if necessary, as well
as additional fields close to the extracted field
(images, meta tags, etc.).
6.1.5 Grepsr
Finally, we will have the data in both JSON and CSV
formats. However, we will need a (free) Grepsr account
to export them!
7.
Web Scraping with Python
7 Why Python?
➜ It's a very simple language to understand
➜ Easy approach for those starting with
programming
➜ Rapid growth and a great community behind it
➜ Core use cases in massive data analysis, with
very powerful libraries behind it (not just
scraping)
➜ We can work in the browser!
○ https://colab.research.google.com
7.1 Types of data
To start scraping we must know at least these
concepts to program in Python:
➜ Variables
➜ Lists
➜ Integers, floats, strings, boolean values...
➜ For loops
➜ Conditionals
➜ Imports
7.2 Scraping Libraries
There are several but I will focus on two:
➜ Requests + BeautifulSoup: To scrape data from
the source code of a site. Useful for sites with
static data.
➜ Selenium: a QA automation tool that can help us
scrape sites with dynamic content whose
values are in the DOM but not in the source
code.
Colab does not support Selenium, so we will have to
work with Jupyter (or any IDE)
With 5 lines of code (or less)
you can see the parsed HTML
Accessing any element of
parsed HTML is easy
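A minimal sketch of what those few lines can look like with Requests + BeautifulSoup (the URL and the selectors are placeholders):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/").text   # placeholder URL
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())                              # the parsed HTML

# Accessing elements of the parsed HTML
print(soup.title.text)             # contents of the <title> tag
print(soup.find("h1"))             # first <h1>
print(soup.select("div.price"))    # CSS selector (hypothetical class)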
We can create a data frame
and process the information
as desired
Or download it
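A sketch of these last steps, assuming the scraped values are already in Python lists; google.colab.files only works inside Colab:

import pandas as pd
from google.colab import files                 # only available inside Google Colab

titles = ["Book A", "Book B"]                  # placeholder scraped data
prices = [9.95, 12.50]

df = pd.DataFrame({"title": titles, "price": prices})
df.to_csv("scraped_data.csv", index=False)     # or df.to_excel("scraped_data.xlsx")
files.download("scraped_data.csv")             # triggers the browser download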
7.3 Process
We analyze the HTML,
looking for patterns
We generate the script
for an element or URL
We extend it to affect
all data
8.
Tips
There are many websites that serve their pages on a
User-agent basis. Sometimes you will be interested in
being a desktop device, sometimes a mobile device.
Sometimes a Windows, sometimes a Mac.
Sometimes a Googlebot, sometimes a bingbot.
Adapt each scraping to what you need to get the
desired results!
8.1 User-agent
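A minimal sketch of switching the User-agent with Requests; the string below is an ordinary desktop Chrome example, swap it for a mobile, Googlebot or bingbot string as needed:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
response = requests.get("https://example.com/", headers=headers)   # placeholder URL
print(response.status_code)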
To scrape a website like Google with advanced
security mechanisms, it will be necessary to use
proxies, among other measures.
Proxies act as intermediaries between a request
made by computer X and server Z. In this way,
we leave little trace and are harder to identify.
Depending on the website and the number of requests,
we recommend using a different number of proxies.
Generally, more than one request per second from
the same IP address is not recommended.
8.2 Proxies
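A sketch of routing Requests through a proxy; the proxy address and credentials are placeholders that would come from your proxy provider:

import requests

proxies = {
    "http": "http://user:password@12.34.56.78:3128",    # placeholder proxy
    "https": "http://user:password@12.34.56.78:3128",
}
response = requests.get("https://example.com/", proxies=proxies, timeout=10)
print(response.status_code)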
Generally the use of proxies is more recommended
than a VPN, since the VPN does the same thing but
under a single IP.
It is always advisable to use a VPN with another
geo for any kind of crawling on third-party websites,
to avoid possible problems or identification. Also, if
you are blocked by IP (e.g. by Cloudflare) you will never
be able to access the website again from that IP (if it is
static).
Recommended service: ExpressVPN
8.3 VPN’s
8.4 Concurrency
Concurrency here means limiting the number of
requests we make per second. We always want to
limit our requests in order to avoid saturating the
server, be it ours or a competitor's.
If we saturate the server, we will have to make the
requests again or, depending on the case, start the
whole crawling process over (a sketch follows the
list below).
Indicative numbers:
➜ Small websites: 5 req/sec - 5 threads
➜ Large websites: 20 req/sec - 20 threads
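A sketch of keeping concurrency under control with a thread pool plus a small delay; the numbers mirror the indicative values above and the URLs are placeholders:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page-%d" % i for i in range(1, 101)]   # placeholder URLs

def fetch(url):
    response = requests.get(url, timeout=10)
    time.sleep(1)                       # keep each thread at roughly 1 request per second
    return url, response.status_code

with ThreadPoolExecutor(max_workers=5) as pool:    # 5 threads, as for a small website
    for url, status in pool.map(fetch, urls):
        print(url, status)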
8.5 Data cleaning
It is common, after scraping, to end up with data
that does not fit what we need; normally we'll
have to work on the data to clean it up.
Some of the most common corrections (sketched after this list):
➜ Duplicates
➜ Format correction/unification
➜ Spaces
➜ Strange characters
➜ Currencies
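A sketch of these clean-ups with pandas, assuming the scraped data is already in a DataFrame with hypothetical 'title' and 'price' columns:

import pandas as pd

df = pd.DataFrame({
    "title": [" El nombre del viento ", "El nombre del viento", "1984\xa0"],
    "price": ["19,90 €", "19,90 €", "9,95 €"],
})

df["title"] = df["title"].str.strip()                         # spaces and strange characters
df["price"] = (df["price"].str.replace("€", "", regex=False)  # currencies
                          .str.replace(",", ".", regex=False) # format unification
                          .str.strip()
                          .astype(float))
df = df.drop_duplicates()                                     # duplicates
print(df)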
9.
Case studies
9. Case studies
Here are 2 case studies:
➜ Using scraping to automate the curation of
content listings
➜ Scraping to generate a product feed for our
websites
Using scraping to automate
the curation of content
listings
9.1 Using scraping to automate the
curation of content listings
It can be firmly said that the best search engine at
the moment is Google.
What if we use Google's results to generate our own
listings, based on the ranking (relevance) that it
gives to the websites that rank for what we want to
rank for?
9.1.1 Jason The Miner
To do so, we will use Jason The Miner, a scraping
library made by Marc Mignonsin, Principal Software
Engineer at Softonic (@mawrkus on GitHub and
@crossrecursion on Twitter)
9.1.1 Jason The Miner
Jason The Miner is a versatile and modular Node.js
based library that can be adapted to any website
and need.
9.1.1 Jason The Miner
9.1.2 Concept
We launch a query such as 'best
washing machines'.
We enter the top 20-30
results, analyze the HTML and
extract the link ID from the
Amazon links.
Then we do a count and
we are automatically
validating, based on dozens of
websites, which is the best
washing machine.
9.1.2 Concept
Then, we will have a list of IDs
with their URL, which we can
scrape directly from Google
Play or using their API, and
semi-automatically fill our CMS
(WordPress, or whatever we
have).
This allows us to automate
content research/curation and
focus on delivering real value in
what we write.
The screenshot is an outcome
based on the Google Play Store
9.1.3 Action
First of all we will generate the basis to create the
URL, with our user-agent, as well as the language
we are interested in.
9.1.3 Action
Then we are going to set a maximum
concurrency so that Google does not ban our IP or
throw captchas at us.
9.1.3 Action
Next, we define exactly the flow of the crawler:
which links/websites it needs to enter and what it needs
to extract from them.
9.1.3 Action
Finally, we will transform the output into a .json file
that we can use to upload to our CMS.
9.1.3 Action
And we can even configure it to be automatically
uploaded to the CMS once the processes are
finished.
9.1.3 Action
What does Jason the Miner do?
➜ Load (HTTP, file, json, ....)
➜ Parse (HTML w/ CSS by default)
➜ Transform
This is fine, but we need to do it in
bulk for tens or hundreds of cases;
we cannot do it one by one.
9.1.3 Action
Added functionality to make it work
in bulk
➜ Bulk (imported from a CSV)
➜ Load (HTTP, file, json, ....)
➜ Parse (HTML w/ CSS by default)
➜ Transform
We create a variable that holds the
query we insert in Google.
9.1.4 CMS
Once we have all the data inserted in our CMS, we
will have to run another basic scraping
process, or use an API such as Amazon's, to get all
the data for each product (logo, name, images,
description, etc.).
Once we have everything, the lists will be sorted
and we can add whatever editorial content we want,
with very little manual work left to do.
9.1.5 Ideas
Examples in which it could be applied:
➜ Amazon Products
➜ Listings of restaurants that are on TripAdvisor
➜ Hotel Listings
➜ Netflix Movie Listings
➜ Best PS4 Games
➜ Best Android apps
➜ Best Chromecast apps
➜ Best books
Scraping to generate a
product feed for our websites
9.2 Starting point
A website affiliated with Casa del Libro.
We need to generate a product feed for each of our
product pages.
9.2 Process
We analyze the HTML,
looking for patterns
We generate the script
for an element or URL
We extend it to affect
all data
9.2.0 What do we want to scrape off?
We need the following information:
➜ Titles
➜ Author
➜ Publisher
➜ Prices
*Only from the crime novel category
9.2.0 What do we want to scrape off?
Title
9.2.0 What do we want to scrape off?
9.2.0 What do we want to scrape off?
Author
9.2.0 What do we want to scrape off?
9.2.0 What do we want to scrape off?
1. We need the following data:
a. Titles
b. Author
c. Publisher
d. Prices
Publisher
9.2.0 What do we want to scrape off?
9.2.0 What do we want to scrape off?
1. We need the following data:
a. Titles
b. Author
c. Publisher
d. Prices
Price
9.2.0 What do we want to scrape off?
9.2.0 What do we want to scrape off?
9.2.0 What do we want to scrape off?
9.2.0 Pagination
For each page we will have to iterate the same code
over and over again.
We need to find out how the paginated URLs are
formed in order to access them (see the sketch below the pattern):
>>>https://www.casadellibro.com/libros/novela-negra/126000000/p + page
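A sketch of building the paginated URLs in Python, assuming the page number is simply appended after the final "p", as the pattern above suggests:

base = "https://www.casadellibro.com/libros/novela-negra/126000000/p"
pages = [base + str(page) for page in range(1, 121)]   # pages 1 to 120, as in the script below
print(pages[0])                                         # first paginated URL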
9.2.1 We generate the script to
extract the first book
9.2.1 We iterate over every container
9.2.1 Time to finish
Now that we have the script to scrape all the books
on the first page we will generate the final script to
affect all the pages.
9.2.2 Let's do the script
We import all the libraries we are going to use
9.2.2 Let's do the script
We create the empty lists that will hold each piece
of data.
9.2.2 Let's do the script
We will have a list containing the numbers 1 to
120 for the pages
9.2.2 Let's do the script
We create variables to prevent the server from
banning us due to excessive requests
9.2.2 Let's do the script
9.2.2 Let's do the script
9.2.2 Let's do the script
With Pandas we transform the lists into a
DataFrame that we can work with
9.2.2 Let's do the script
With Pandas we can also transform it into a csv or
an excel
9.2.2 Let's do the script
And finally, we can download the file thanks to
the colab library.
9.2.2 Let's do the script
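A minimal sketch of the whole flow under the assumptions above (page pattern, pages 1 to 120); the CSS selectors for the container, title, author, publisher and price are placeholders, since the real ones depend on Casa del Libro's current HTML:

import time
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup
# from google.colab import files                       # uncomment when running in Colab

titles, authors, publishers, prices = [], [], [], []   # empty lists for each field

for page in range(1, 121):                             # pages 1 to 120
    url = "https://www.casadellibro.com/libros/novela-negra/126000000/p" + str(page)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    for book in soup.select("div.book-container"):     # placeholder container selector
        titles.append(book.select_one("h2.title").get_text(strip=True))       # placeholder selectors
        authors.append(book.select_one("a.author").get_text(strip=True))
        publishers.append(book.select_one("span.publisher").get_text(strip=True))
        prices.append(book.select_one("span.price").get_text(strip=True))

    time.sleep(random.uniform(1, 3))                   # pause so the server does not ban us

df = pd.DataFrame({"title": titles, "author": authors,
                   "publisher": publishers, "price": prices})
df.to_csv("novela_negra.csv", index=False)
# files.download("novela_negra.csv")                   # download the file from Colab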
10.
Bonus
10.1 Sheet2Site
10.1 Sheet2Site
10.1 Sheet2Site
https://coinmarketcap.com/es/api/
You can use Google Sheets to import data from APIs easily.
APIs such as the Dandelion API, which is used for semantic
analysis of texts, can be very useful for our
day-to-day SEO.
➜ Entity Extraction
➜ Semantic similarity
➜ Keywords extraction
➜ Sentiment analysis
10.2 Dandelion API
Stack Andreas Niessen
Stack Advanced projects
10.3 Stacks for scraping + WP
➜ With this little script you can easily export an entire SERP
into a CSV.
○ bit.ly/2uZCXuL
10.4 Scraping Google SERP
10.5 Scraping Google
10.6 Web Scraping + NLP
10.7 Scraping Breadcrumbs
10.8 Scraping Sitemaps
https://github.com/NachoSEO/extract_urls_from_sitemap_index
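Not the code from the linked repository, just a minimal sketch of the idea: fetch a sitemap and pull every <loc> URL out of it (the sitemap URL is a placeholder):

import requests
from bs4 import BeautifulSoup

xml = requests.get("https://example.com/sitemap.xml").text   # placeholder sitemap
soup = BeautifulSoup(xml, "xml")          # the "xml" parser requires lxml to be installed
urls = [loc.text for loc in soup.find_all("loc")]
print(len(urls), "URLs extracted")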
10.9 Translating content
Python & Other
➜ Chapter 11 – Web Scraping
https://automatetheboringstuff.com/chapter11/
➜ https://twitter.com/i/moments/949019183181856769
➜ Scraping ‘People Also Ask’ boxes for SEO and content
research
https://builtvisible.com/scraping-people-also-ask-boxes-for-seo-and-content-research/
➜ https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python
➜ 6 Actionable Web Scraping Hacks for White Hat Marketers
https://ahrefs.com/blog/web-scraping-for-marketers/
➜ https://saucelabs.com/resources/articles/selenium-tips-css-selectors
EXTRA RESOURCES
EXTRA RESOURCES
Node.js (Thanks @mawrkus)
➜ Web Scraping With Node.js:
https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
➜ X-ray, The next web scraper. See through the noise:
https://github.com/lapwinglabs/x-ray
➜ Simple, lightweight & expressive web scraping with Node.js:
https://github.com/eeshi/node-scrapy
➜ Node.js Scraping Libraries:
http://blog.webkid.io/nodejs-scraping-libraries/
➜ https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
➜ http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
➜ Web scraping o rastreo de webs y legalidad:
https://www.youtube.com/watch?v=EJzugD0l0Bw
CREDITS
➜ Presentation template by SlidesCarnival
➜ Photographs by Death to the Stock Photo
(license)
➜ Marc Mignonsin for creating Jason The Miner
Thanks!
Any question?
Esteve Castells | @estevecastells
Newsletter: bit.ly/Seopatia
https://estevecastells.com/
Nacho Mascort | @NachoMascort
Scripts: https://github.com/NachoSEO
https://seohacks.es
