Crawling the world
@mmoreram
Apparently...
Nobody
uses parsing in their applications
Not even
Chuck Norris
Many businesses

need crawling
Crawling brings you
knowledge
Knowledge is

power
And power is

Money
What is crawling?
Or

parsing
Crawling
We download a single URL with one request (HTML, XML…)
We process the response by searching for the desired data,
such as links, headers or any kind of text or label
Once we have the content we need, we can update our
database and decide what to do next, for example
parsing some of the links we found.
And that’s it!
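The download and search steps can be sketched in Python using only the standard library. This is a minimal sketch, not the implementation used in the talk: the names `LinkExtractor`, `fetch` and `extract_links` are hypothetical, and a real crawler would also need retries and per-site error handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def fetch(url, timeout=10):
    """Step 1: download a single URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Step 2: search the response for the data we want (here, links)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The third step (updating the database and deciding which links to parse next) depends on the site, so it is left to the caller.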
“Machines will do what humans
do before they realize”
–Marc Morera, yesterday
Let’s see an example
Step by step
chicplace.com

Our goal is to parse all available products, saving name,
description, price, shop and categories
We will use a linear strategy. There are several
strategies for parsing a site
Let’s see all available strategies
Parsing Strategies

Linear. Just one script. If any page fails (crawling error,
server timeout, …), an exception can be thrown and
caught.
Advantages: Only one script is needed. Easier? Not even
close…
Problems: Cannot be distributed. Just one script for 1M
requests. Memory problems?
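A linear crawl can be sketched as a single loop. This is a minimal sketch under assumptions: `crawl_linear` is a hypothetical name, and `fetch` and `extract_links` are callables supplied by the caller. The `seen` set grows with the size of the site, which is where the memory concern comes from.

```python
from collections import deque


def crawl_linear(start_url, fetch, extract_links, max_pages=1000):
    """Visit pages one by one in a single process.

    `fetch(url)` returns the page body and `extract_links(body, url)`
    returns the URLs found in it; both are supplied by the caller.
    Returns the set of discovered URLs and the list of failures.
    """
    pending = deque([start_url])
    seen = {start_url}  # every URL discovered so far is kept in memory
    failed = []
    visited = 0

    while pending and visited < max_pages:
        url = pending.popleft()
        visited += 1
        try:
            body = fetch(url)
        except Exception as error:  # crawling error, server timeout, ...
            failed.append((url, error))
            continue
        for link in extract_links(body, url):
            if link not in seen:
                seen.add(link)
                pending.append(link)

    return seen, failed
```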
Parsing Strategies

Distributed. One script for each case. If any page fails,
it can be recovered by simply running its script again.
Advantages: Each case is encapsulated in an individual
script, with low memory use. Can be easily distributed by
using queues.
Problems: None
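The recovery idea (a failed page is simply run again) can be sketched with an in-process queue. This is a minimal sketch under assumptions: `queue.Queue` stands in for a real queue server, and `run_jobs`, `handle` and `max_attempts` are hypothetical names.

```python
import queue


def run_jobs(jobs, handle, max_attempts=3):
    """Process independent jobs; a failed job is simply queued again.

    `handle(payload)` does the work for one page and may raise.
    Returns the payloads that still failed after `max_attempts` tries.
    """
    pending = queue.Queue()
    for payload in jobs:
        pending.put((payload, 1))

    dead = []
    while not pending.empty():
        payload, attempt = pending.get()
        try:
            handle(payload)
        except Exception:
            if attempt < max_attempts:
                # Recover by running the same job again later.
                pending.put((payload, attempt + 1))
            else:
                dead.append(payload)
    return dead
```

Because each job carries everything it needs, the same loop could run in many processes against a shared queue.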
Crawling steps

Analyzing. Think like Google does. Find the fastest way
through the labyrinth
Scripting. Build scripts using queues for the distributed
strategy. Each queue corresponds to one type of page
Running. Keep in mind the impact of your actions: DDoS
attacks, copyright
Analyzing

Every parsing process should be analyzed as if it were a
simple crawler, like Google’s
How can we reach all needed pages with the lowest
server impact?
Usually, all serious websites are designed so that every
page can be reached within 3 clicks
Analyzing
We will use the category map to reach
all available products
Analyzing
Each category will list all available products
Analyzing
Do we also need to parse the product page?
In fact, we do. We already have name, price and
category, but we also need description and shop
So we have the main page to parse all category links, the
category page with all its products (can be paginated),
and the product page to get the remaining information
The product page is responsible for saving all data in the database
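The split between what the category page provides and what only the product page provides can be sketched as a record. This is a minimal sketch under assumptions: the `Product` class and its helper methods are hypothetical, and the price is kept as text because the site's format is not specified.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Product:
    url: str
    # Available on the category listing page.
    name: Optional[str] = None
    price: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    # Only available on the product page itself.
    description: Optional[str] = None
    shop: Optional[str] = None

    def missing_fields(self):
        """Names of the fields that still have to be parsed."""
        required = ("name", "price", "categories", "description", "shop")
        return [f for f in required if not getattr(self, f)]

    def is_complete(self):
        """True once the record is ready to be saved."""
        return not self.missing_fields()
```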
Scripting

We will use the distributed strategy, using queues and
supervisord
Supervisord is responsible for keeping X instances of
a process running at the same time.
Using a distributed queue system, we will have 3 workers.
Worker?

Yep, worker. In a queue system, a worker is like a
box (a script) with parameters (input values) that just
does something.
We have 3 kinds of workers. One of them, the
CategoryWorker, receives a category URL,
parses the related content (HTML) and detects all
products. Each product generates a new job for
ProductWorker
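A CategoryWorker along these lines can be sketched as a function that turns one category page into product jobs. This is a minimal sketch under assumptions: the `/product/<slug>` link pattern, the `http://shop.example` base URL and the job dictionary layout are all hypothetical, and a regular expression is a crude stand-in for a proper HTML parser.

```python
import queue
import re

# Hypothetical pattern: assumes product links look like /product/<slug>.
PRODUCT_LINK = re.compile(r'href="(/product/[^"]+)"')


def category_worker(category_url, html, products_queue, base="http://shop.example"):
    """Parse one category page and enqueue one job per product found.

    `html` is the already downloaded body of `category_url`.
    Returns the number of product jobs created.
    """
    found = 0
    # dict.fromkeys removes duplicate links while keeping their order.
    for path in dict.fromkeys(PRODUCT_LINK.findall(html)):
        products_queue.put({"url": base + path, "category": category_url})
        found += 1
    return found
```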
Running
We just enable all workers and force the first one to run.
The first worker finds all category URLs and
enqueues them into a queue named categories-queue
The second worker (for example, 10 instances) just
consumes categories-queue, taking URLs and parsing
their content.
Their content is just product URLs
Running

Each URL is enqueued into another queue named
products-queue
The third and last worker (50 instances) just consumes this
queue, parses each page and gets the needed data (name,
description, shop, category and price).
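The flow through the two queues can be sketched in a single process, with threads standing in for worker instances. This is a minimal sketch under assumptions: `run_pipeline`, `start_workers`, `stop_workers` and the `STOP` sentinel are hypothetical names, the `site` dictionary replaces real downloads, and a list replaces the database.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker instance to exit


def start_workers(count, source, handle):
    """Start `count` threads that each consume `source` until STOP."""
    def loop():
        while True:
            job = source.get()
            if job is STOP:
                break
            handle(job)

    threads = [threading.Thread(target=loop) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads


def stop_workers(threads, source):
    """Send one STOP per instance, then wait for all of them."""
    for _ in threads:
        source.put(STOP)
    for thread in threads:
        thread.join()


def run_pipeline(site, category_instances=10, product_instances=50):
    """`site` maps a category URL to the product URLs it lists (test data)."""
    categories_queue = queue.Queue()  # "categories-queue"
    products_queue = queue.Queue()    # "products-queue"
    saved = []
    lock = threading.Lock()

    def category_worker(url):
        for product_url in site[url]:
            products_queue.put(product_url)

    def product_worker(url):
        with lock:  # stands in for the database write
            saved.append(url)

    product_threads = start_workers(product_instances, products_queue, product_worker)
    category_threads = start_workers(category_instances, categories_queue, category_worker)

    for url in site:  # the first worker: enqueue every category
        categories_queue.put(url)

    stop_workers(category_threads, categories_queue)  # all categories parsed
    stop_workers(product_threads, products_queue)     # all products parsed
    return saved
```

Stopping the category instances first guarantees that every product URL is already in products-queue before the product instances receive their stop signal.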
OK. Call me God
but…
“Don't shoot the messenger”

–Some bored man
warning!

50 workers requesting chicplace in parallel. This is a big
problem
@Gonzalo (CTO) will be angry, and he will detect
that something is happening
So we must be careful not to alert him, or at least avoid
being discovered
Warning
do not try this at home
Be invisible
To be invisible, we can just parse the whole site slowly (days)
To be faster, we can mask our IP using proxies
(How about a different proxy for every request?)
To be faster, we can use an anonymizing network, like
Tor.
To be stupid, we can just parse chicplace with our own IP
(most companies will not even notice)
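The first option, parsing the site slowly, can be sketched as a throttle that each worker calls before every request. This is a minimal sketch under assumptions: the `Throttle` class and its parameters are hypothetical, the delay values are up to the caller, and the limit applies per process, so several worker instances would need a shared limiter.

```python
import random
import time


class Throttle:
    """Enforces a minimum, slightly randomized gap between requests."""

    def __init__(self, min_delay, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed; return the time slept."""
        now = self.clock()
        pause = max(0.0, self.next_allowed - now)
        if pause:
            self.sleep(pause)
        gap = self.min_delay + random.uniform(0.0, self.jitter)
        self.next_allowed = max(now, self.next_allowed) + gap
        return pause
```

A worker would call `throttle.wait()` right before each download, for example with `Throttle(min_delay=5.0, jitter=2.0)`.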
They are attacking
me !
“
And whatever you ask in prayer,
you will receive, if you have faith”
–Matthew 21:22
My prayer!

A good crawling implementation is infallible
The server will receive dozens of requests per second and
will not recognize any pattern that distinguishes crawler
requests from ordinary user requests
So…?
Welcome to the amazing world of

Crawling
Where no one is

SAFE
