9. Crawling
We download just a URL with a request (HTML, XML…)
We manipulate the response by searching for the desired data,
like links, headers or any kind of text or label
Once we have the needed content, we can just update our
database and make further decisions, for example
parsing some of the found links.
And that’s it!
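The three steps above (request, search the response, collect links for further decisions) can be sketched with Python's standard library. The HTML string here is a static stand-in for a real HTTP response body (which would come from something like `urllib.request.urlopen(url).read()`):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href found in <a> tags of a downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for a real response; in a crawler this is the request body.
html = ('<html><body><a href="/cat/shoes">Shoes</a>'
        '<a href="/cat/bags">Bags</a></body></html>')

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/cat/shoes', '/cat/bags']
```

The extracted links are exactly what would be enqueued for the next crawling step.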
10. “Machines will do what humans
do before they realize”
–Marc Morera, yesterday
13. chicplace.com
Our goal is to parse all available products, saving name,
description, price, shop and categories
We will use the linear strategy. There are several
strategies to choose from when a site must be parsed
Let’s see all the available strategies
14. Parsing Strategies
Linear. Just one script. If any page fails (crawling error,
server timeout, …) some kind of exception can be
thrown and caught.
Advantages: Just one script is needed. Easier? Not even
close…
Problems: Cannot be distributed. Just one script for 1M
requests. Memory problems?
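A minimal sketch of the linear strategy: one process walks every page in sequence, catching failures as exceptions. `fetch` is a hypothetical stand-in for a real HTTP request:

```python
# fetch() is a stand-in for a real HTTP request; it fails on demand
# to simulate a crawling error or server timeout.
def fetch(url):
    if "broken" in url:
        raise IOError("server timeout")
    return "<html>%s</html>" % url

def crawl_linear(urls):
    """Linear strategy: one script, one loop, exceptions caught in place."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except IOError:          # crawling error, timeout, ...
            failed.append(url)   # caught, but the whole run is one process
    return results, failed

results, failed = crawl_linear(["/a", "/broken", "/b"])
print(failed)  # ['/broken']
```

With 1M URLs this single loop holds all state in one process, which is where the memory and distribution problems come from.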
15. Parsing Strategies
Distributed. One script for each case. If any page fails, it
can be recovered by simply executing it again.
Advantages: All cases are encapsulated in an individual
script, low memory. Can be easily distributed by using
queues.
Problems: None
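The recovery property of the distributed strategy ("just execute it again") can be sketched with an in-process `queue.Queue` standing in for a real distributed queue; `fetch` is again a hypothetical request that fails once:

```python
import queue

jobs = queue.Queue()

def fetch(url):
    # Stand-in HTTP request: one URL fails on its first attempt only.
    if url == "/flaky" and not fetch.seen:
        fetch.seen = True
        raise IOError("timeout")
    return "ok"
fetch.seen = False

def run_worker():
    """Consume jobs; a failed page is recovered by re-enqueueing it."""
    done = {}
    while not jobs.empty():
        url = jobs.get()
        try:
            done[url] = fetch(url)
        except IOError:
            jobs.put(url)  # failed page: just run the same job again
    return done

for u in ["/a", "/flaky", "/b"]:
    jobs.put(u)
result = run_worker()
print(sorted(result))  # ['/a', '/b', '/flaky']
```

Because each job carries its own input, a crash loses only that one page, not the whole crawl.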
16. Crawling steps
Analyzing. Think like Google does: find the fastest way
through the labyrinth
Scripting. Build scripts using queues for the distributed
strategy. Each queue means one kind of page
Running. Keep in mind the impact of your actions: DDoS
attacks, copyright
17. Analyzing
Every parsing process should be evaluated like a simple
crawler would, Google for example
How do we access all the needed pages with the lowest
server impact?
Usually, serious websites are designed so that every
page is reachable within 3 clicks
20. Analyzing
Do we also need to parse the product page?
In fact, we do. We already have name, price and
category, but we also need description and shop
So we have the main page to parse all category links, the
category page with all products (can be paginated),
and we also need the product page to get all the information
The product page is responsible for saving all data in the database
21. Scripting
We will use the distributed strategy, using queues and
supervisord
Supervisord is responsible for managing X instances of
a process running at the same time.
Using a distributed queue system, we will have 3 workers.
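A hypothetical supervisord configuration for this setup might look as follows; the program names, commands and instance counts are illustrative, not taken from the real project:

```ini
; Illustrative supervisord config: run several instances of each
; worker process in parallel (commands and counts are examples).
[program:category-worker]
command=php worker.php category
numprocs=10
process_name=%(program_name)s_%(process_num)02d
autorestart=true

[program:product-worker]
command=php worker.php product
numprocs=50
process_name=%(program_name)s_%(process_num)02d
autorestart=true
```

`numprocs` is the "X instances of a process" mentioned above, and `autorestart` gives the recovery-by-rerunning behaviour for free.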
22. Worker?
Yep, worker. In a queue system, a worker is like a
box ( a script ) with some parameters ( input values ) that
just does something.
We have 3 kinds of workers. One of them, the
CategoryWorker, will just receive a category URL, will
parse the related content ( HTML ) and will detect all
products. Each product will generate a new job for the
ProductWorker
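A sketch of the CategoryWorker idea, assuming an in-process `queue.Queue` as a stand-in for the real distributed queue; `find_product_urls` is a hypothetical helper replacing the actual HTML parsing:

```python
import queue

products_queue = queue.Queue()  # stand-in for the distributed queue

def find_product_urls(category_url):
    # Stand-in for parsing the category page's HTML.
    return ["%s/product-%d" % (category_url, i) for i in (1, 2)]

def category_worker(category_url):
    """Receive a category URL, parse its content and enqueue one
    ProductWorker job per product found."""
    for product_url in find_product_urls(category_url):
        products_queue.put(product_url)  # each product -> a new job

category_worker("/cat/shoes")
print(products_queue.qsize())  # 2
```

The worker is exactly the "box with parameters" of the slide: input a category URL, output jobs on a queue.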
23. Running
We just enable all workers and force the first one to run.
The first worker will find all category URLs and will
enqueue them into a queue named categories-queue
The second worker ( for example, 10 instances ) will just
consume categories-queue, looking for URLs and parsing
their content.
Their content means just product URLs
24. Running
Each URL is enqueued into another queue named
products-queue
The third and last worker ( 50 instances ) just consumes this
queue, parses the content and gets the needed data ( name,
description, shop, category and price ).
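The whole three-worker pipeline can be sketched end to end. Everything here is a simplified stand-in: in-process queues instead of a distributed queue system, hard-coded URLs instead of parsed HTML, a list instead of the real database, and the workers run sequentially instead of as supervisord instances:

```python
import queue

categories_queue = queue.Queue()
products_queue = queue.Queue()
database = []  # stand-in for the real database

def first_worker():
    # Finds all category URLs on the main page (hard-coded here).
    for url in ["/cat/shoes", "/cat/bags"]:
        categories_queue.put(url)

def second_worker():
    # Consumes categories-queue; its only output is product URLs.
    while not categories_queue.empty():
        cat = categories_queue.get()
        for i in (1, 2):
            products_queue.put("%s/p%d" % (cat, i))

def third_worker():
    # Consumes products-queue and saves the needed data.
    while not products_queue.empty():
        url = products_queue.get()
        database.append({"url": url, "name": "?", "price": "?",
                         "description": "?", "shop": "?", "category": "?"})

first_worker()
second_worker()   # in production: e.g. 10 instances under supervisord
third_worker()    # in production: e.g. 50 instances under supervisord
print(len(database))  # 4
```

Each queue decouples one stage from the next, which is what lets the instance counts of each worker be tuned independently.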
28. warning!
50 workers requesting chicplace in parallel. This is a big
problem
@Gonzalo (CTO) will be angry and he will detect that
something is happening
So, we must be careful not to alert him, or just prevent
him from discovering us
30. Be invisible
To be invisible, we can just parse the whole site slowly ( over days )
To be faster, we can mask our IP using proxies
( how about a different proxy for every request? )
To be faster still, we can route requests through an
anonymity network, like TOR
To be stupid, we can just parse chicplace with our own IP
( most companies will not even notice )
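The "different proxy for every request" idea is just round-robin rotation over a proxy pool. A minimal sketch, where the proxy hosts are hypothetical and the actual HTTP call is omitted:

```python
import itertools

# Hypothetical proxy pool; in practice these would be real proxy hosts.
PROXIES = ["proxy1:8080", "proxy2:8080", "proxy3:8080"]
proxy_pool = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Pick a different proxy for every request (round-robin).
    Returns the (url, proxy) pairing instead of making a real call."""
    proxy = next(proxy_pool)
    return url, proxy

print([fetch_via_proxy("/p/%d" % i)[1] for i in range(4)])
# ['proxy1:8080', 'proxy2:8080', 'proxy3:8080', 'proxy1:8080']
```

From the server's point of view, consecutive requests now arrive from different IPs, which is what defeats simple per-IP rate detection.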
32. “
And whatever you ask in prayer,
you will receive, if you have faith”
–Matthew 21:22
33. My prayer!
A good crawling implementation is infallible
The server will receive dozens of requests per second and
will not recognize any pattern to discriminate crawler
requests from plain user requests
So…?