Web Scraping

Legal standpoint
• If robots.txt prohibits scraping, it's illegal

• If the terms of service prohibit scraping, it's illegal

• If you're abusing the servers, it's illegal

• If you're using the data without crediting the source, it's illegal
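
A robots.txt check can be automated before any request is sent. The sketch below is a deliberately simplified reading of the format: it only honours `Disallow` prefixes inside `User-agent: *` groups and ignores `Allow` rules, wildcards and groups that list several user agents. The function names and the sample file are made up for illustration.

```javascript
// Collect the Disallow prefixes that apply to every bot ("User-agent: *").
function parseDisallowedPaths(robotsTxt) {
  const disallowed = [];
  let appliesToAll = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (line === '') continue;
    const separator = line.indexOf(':');
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === 'user-agent') {
      appliesToAll = value === '*';
    } else if (field === 'disallow' && appliesToAll && value !== '') {
      disallowed.push(value);
    }
  }
  return disallowed;
}

// A path is allowed when no Disallow prefix matches its beginning.
function isPathAllowed(robotsTxt, path) {
  return !parseDisallowedPaths(robotsTxt).some((prefix) => path.startsWith(prefix));
}

// Hypothetical robots.txt content.
const exampleRobots = [
  'User-agent: *',
  'Disallow: /private/',
  'Disallow: /search',
].join('\n');

console.log(isPathAllowed(exampleRobots, '/products/42'));    // true
console.log(isPathAllowed(exampleRobots, '/private/report')); // false
```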

Ethical standpoint
• Be reasonable with timeouts and threads

• Let the website know you're a bot through the user agent

• Agree on the most suitable time for scraping

• Be reasonable with scope
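
Two of the points above (reasonable pacing and an honest user agent) can be sketched as a small wrapper around `fetch`. This assumes Node 18+ where `fetch` is global; the bot name, the contact URL and the one-second delay are hypothetical values, not recommendations from the talk.

```javascript
// Identify the bot honestly; name and URL are placeholders.
const USER_AGENT = 'ExampleScraperBot/1.0 (+https://example.com/bot-info)';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Returns a fetch-like function that sends one request at a time
// and pauses `delayMs` after each of them.
function createPoliteFetcher({ fetchImpl = globalThis.fetch, delayMs = 1000 } = {}) {
  let queue = Promise.resolve(); // tail of the request chain
  return function politeFetch(url) {
    const result = queue.then(() =>
      fetchImpl(url, { headers: { 'User-Agent': USER_AGENT } })
    );
    // The next request starts only after this one settles and the delay passes.
    queue = result.catch(() => {}).then(() => sleep(delayMs));
    return result;
  };
}

// Usage (not executed here, to avoid real network traffic):
// const politeFetch = createPoliteFetcher({ delayMs: 2000 });
// politeFetch('https://example.com/page/1').then((response) => response.text());
```

Passing `fetchImpl` in makes the wrapper easy to test with a fake and to swap for another HTTP client.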

Please, avoid being an asshole

Fetching data
• Curl, fetch, request, etc.

• phantomjs, puppeteer

What can we do here?
• Selective crawling

• URL prediction

• Duplicate request prevention (FS / DB access is cheaper than network)

• Smart scheduling
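
URL prediction and duplicate request prevention can be illustrated with the built-in `URL` class. This is a minimal sketch: the `page` query parameter and the example.com addresses are assumptions, and the in-memory `Set` stands in for whatever file or DB lookup a real crawler would use.

```javascript
// Normalize a URL so trivially different spellings map to the same key.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';           // fragments never reach the server
  url.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 are the same request
  return url.toString();
}

// Returns a function that answers "should this URL be fetched?" once per URL.
function createDeduplicator() {
  const seen = new Set();
  return function shouldFetch(rawUrl) {
    const key = normalizeUrl(rawUrl);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

// URL prediction: build listing pages directly instead of discovering them by crawling.
function predictPageUrls(baseUrl, firstPage, lastPage) {
  const urls = [];
  for (let page = firstPage; page <= lastPage; page += 1) {
    const url = new URL(baseUrl);
    url.searchParams.set('page', String(page));
    urls.push(url.toString());
  }
  return urls;
}

const shouldFetch = createDeduplicator();
console.log(predictPageUrls('https://example.com/catalog', 1, 3));
console.log(shouldFetch('https://example.com/list?a=1&b=2'));     // true
console.log(shouldFetch('https://example.com/list?b=2&a=1#top')); // false
```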

HTML
• Pure RegExp parsing is a mistake in the long run

• https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

• Building an AST is the default approach (see parse5, himalaya)

Walking through AST
• Cheerio

• jsdom

• x-ray

• Traverse the AST manually
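
Manual traversal is mostly a depth-first walk. The sketch below runs on a hand-written tree that imitates the node shape parse5 produces with its default tree adapter (`nodeName`, `tagName`, `attrs`, `childNodes`, text nodes with `value`); no parser is called, and the helper names are invented for this example.

```javascript
// Hand-built stand-in for a parsed <ul> with two links.
const tree = {
  nodeName: 'ul', tagName: 'ul', attrs: [],
  childNodes: [
    { nodeName: 'li', tagName: 'li', attrs: [], childNodes: [
      { nodeName: 'a', tagName: 'a', attrs: [{ name: 'href', value: '/first' }], childNodes: [
        { nodeName: '#text', value: 'First' },
      ] },
    ] },
    { nodeName: 'li', tagName: 'li', attrs: [], childNodes: [
      { nodeName: 'a', tagName: 'a', attrs: [{ name: 'href', value: '/second' }], childNodes: [
        { nodeName: '#text', value: 'Second ' },
        { nodeName: 'b', tagName: 'b', attrs: [], childNodes: [{ nodeName: '#text', value: 'item' }] },
      ] },
    ] },
  ],
};

// Depth-first walk that calls `visit` for every node.
function walk(node, visit) {
  visit(node);
  for (const child of node.childNodes || []) walk(child, visit);
}

// Concatenate all text nodes below `node`.
function textOf(node) {
  let text = '';
  walk(node, (current) => {
    if (current.nodeName === '#text') text += current.value;
  });
  return text.trim();
}

// Collect href and visible text of every <a> element.
function collectLinks(root) {
  const links = [];
  walk(root, (node) => {
    if (node.tagName !== 'a') return;
    const href = node.attrs.find((attr) => attr.name === 'href');
    if (href) links.push({ href: href.value, text: textOf(node) });
  });
  return links;
}

console.log(collectLinks(tree));
```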

Tips #1
• Write a set of useful helpers/wrappers upfront

• Keep the parsers granular and reusable

• Spend time making them fault-tolerant

• Always verify that each parsed block is correct

• Write tests for target markup

• Keep logs
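
One way to read these tips: small single-purpose parsers that return `null` instead of throwing, plus a verification step that logs and skips incomplete blocks. The field names (`title`, `price`) and the price format are assumptions for the sake of the example; thousands separators such as `1,299.50` are not handled.

```javascript
// Granular parser: one value in, a number or null out.
function parsePrice(text) {
  if (typeof text !== 'string') return null;
  const match = text.replace(/\s/g, '').match(/(\d+(?:[.,]\d{1,2})?)/);
  return match ? Number(match[1].replace(',', '.')) : null;
}

// Granular parser: collapse whitespace, reject empty titles.
function parseTitle(text) {
  if (typeof text !== 'string') return null;
  const title = text.replace(/\s+/g, ' ').trim();
  return title === '' ? null : title;
}

// Verify the parsed block before accepting it; log and skip instead of throwing,
// so one broken block does not stop the whole run.
function parseProduct(raw, log = console.warn) {
  const product = { title: parseTitle(raw.title), price: parsePrice(raw.price) };
  const missing = Object.keys(product).filter((key) => product[key] === null);
  if (missing.length > 0) {
    log(`Skipping block, missing fields: ${missing.join(', ')}`);
    return null;
  }
  return product;
}

console.log(parseProduct({ title: '  Blue \n  mug ', price: '€12.90' }));
console.log(parseProduct({ title: '', price: '5' })); // logs a warning, returns null
```

Because each parser is a plain function, tests for the target markup can feed it captured snippets directly.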

Tips #2
• Keep the reference of parsed data easily accessible

• Continuously flush parsing results to storage

• Be reasonable. RAM is cheap, time is expensive

• Store image hash sums and get rid of duplicates

• Retain the data even if you don't know how to use it now

• The file system is fast, but a DB makes "online" updates cheaper
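
Storing image hash sums to drop duplicates can be sketched with Node's built-in `crypto` module. SHA-256 is an arbitrary choice here, and the `{ name, data }` record shape is hypothetical; note that this only catches byte-identical files, not resized or re-encoded copies.

```javascript
const { createHash } = require('crypto');

// Hex digest of the raw image bytes.
function hashBuffer(buffer) {
  return createHash('sha256').update(buffer).digest('hex');
}

// Keep the first image seen for every distinct hash.
function dedupeImages(images) {
  const byHash = new Map();
  for (const image of images) {
    const hash = hashBuffer(image.data);
    if (!byHash.has(hash)) byHash.set(hash, { ...image, hash });
  }
  return [...byHash.values()];
}

// Buffers stand in for downloaded image bytes.
const downloaded = [
  { name: 'a.png', data: Buffer.from('same bytes') },
  { name: 'b.png', data: Buffer.from('same bytes') },
  { name: 'c.png', data: Buffer.from('other bytes') },
];

console.log(dedupeImages(downloaded).map((image) => image.name)); // [ 'a.png', 'c.png' ]
```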

Dynamic content
• API

• User emulation (puppeteer)

The problem
• Data taken from multiple sources

• Data which was initially dirty

• Content submitted by customers

• Complex data which can be simplified

Steps
• Trim, lowercase

• Remove noise symbols with regular expressions

• Identify and remove noise data

• Mark some dataset as reference and go with string similarity algorithms

• Machine learning classification algorithms
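
The first three steps could look roughly like this. The noise-word list is a hypothetical example for company names; real data will need its own list and probably its own symbol rules.

```javascript
// Steps 1 and 2: trim, lowercase, replace noise symbols with spaces.
function normalize(value) {
  return value
    .trim()
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // anything that is not a letter, digit or space
    .replace(/\s+/g, ' ')
    .trim();
}

// Step 3: drop tokens that carry no meaning for matching (hypothetical list).
const NOISE_WORDS = new Set(['inc', 'ltd', 'llc']);

function removeNoiseWords(value) {
  return value
    .split(' ')
    .filter((word) => !NOISE_WORDS.has(word))
    .join(' ');
}

function clean(value) {
  return removeNoiseWords(normalize(value));
}

console.log(clean('  ACME, Inc.  '));  // "acme"
console.log(clean('Acme-Corp™ LTD'));  // "acme corp"
```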

String similarity algorithms
• Levenshtein distance

• Sørensen–Dice coefficient (string-similarity)

• Hamming distance (fuzzyset.js)

• Longest Common Substring distance
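
For intuition, three of these measures fit in a few lines each. These are plain textbook versions written for this example, not the code of the libraries named above, which may differ in details such as tokenization.

```javascript
// Levenshtein distance: minimal number of insertions, deletions and substitutions.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const current = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost);
    }
    previous = current;
  }
  return previous[b.length];
}

// Count character pairs (bigrams) of a string.
function bigrams(text) {
  const counts = new Map();
  for (let i = 0; i < text.length - 1; i += 1) {
    const pair = text.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Sørensen–Dice coefficient on bigrams: 2 * shared / (total in a + total in b).
function diceCoefficient(a, b) {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const first = bigrams(a);
  const second = bigrams(b);
  let overlap = 0;
  for (const [pair, count] of first) {
    overlap += Math.min(count, second.get(pair) || 0);
  }
  return (2 * overlap) / (a.length - 1 + (b.length - 1));
}

// Hamming distance: number of differing positions, equal lengths only.
function hamming(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Hamming distance needs strings of equal length');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i += 1) {
    if (a[i] !== b[i]) distance += 1;
  }
  return distance;
}

console.log(levenshtein('kitten', 'sitting'));  // 3
console.log(diceCoefficient('night', 'nacht')); // 0.25
console.log(hamming('karolin', 'kathrin'));     // 3
```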

Tips #3
• String proximity calculation is an expensive operation. Split it up.

• Shortening strings dramatically increases performance

• Identify the common differences and handle them with conditions upfront

• Think of file formats and DB normalization

• Go for mutability while working with big data structures (in-memory calculations)
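
A possible reading of "split it" and "handle common differences upfront": bucket the reference set so each query is compared with a fraction of it, and run cheap checks before the expensive distance. Bucketing by first character is just one assumed strategy (it misses typos in the first letter), and the city names are sample data.

```javascript
// Compact Levenshtein distance, included only to keep the sketch self-contained.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const current = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost);
    }
    previous = current;
  }
  return previous[b.length];
}

// Split the reference set into buckets keyed by first character.
function buildIndex(references) {
  const index = new Map();
  for (const reference of references) {
    const key = reference[0] || '';
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(reference);
  }
  return index;
}

// Find the closest reference within `maxDistance` edits, or null.
function findClosest(index, query, maxDistance = 2) {
  const bucket = index.get(query[0] || '') || [];
  let best = null;
  for (const candidate of bucket) {
    if (candidate === query) return { match: candidate, distance: 0 }; // cheap exact check
    // A length gap larger than the limit can never be bridged within the limit.
    if (Math.abs(candidate.length - query.length) > maxDistance) continue;
    const distance = levenshtein(candidate, query);
    if (distance <= maxDistance && (best === null || distance < best.distance)) {
      best = { match: candidate, distance };
    }
  }
  return best;
}

const cityIndex = buildIndex(['berlin', 'bern', 'boston', 'paris']);
console.log(findClosest(cityIndex, 'berlim')); // { match: 'berlin', distance: 1 }
```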

Tips #4
• Allow the garbage collector to reclaim data which isn't used anymore (in-memory calculations)

• Go for transducers (avoid x.filter().map().map().filter())

• Use schedulers

• Be creative
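
The transducer idea in miniature: compose the transformation steps first, then run them in a single pass over the data, so no intermediate arrays are allocated. The helper names (`mapping`, `filtering`, `compose`, `append`) are conventional but chosen for this sketch, and the sample values are invented.

```javascript
// Each helper takes a reducer and returns a new reducer.
const mapping = (transform) => (reducer) => (acc, item) => reducer(acc, transform(item));
const filtering = (predicate) => (reducer) => (acc, item) =>
  (predicate(item) ? reducer(acc, item) : acc);

// Right-to-left function composition; data still flows through the steps left to right.
const compose = (...fns) => (x) => fns.reduceRight((value, fn) => fn(value), x);

// Final reducer: mutates the accumulator instead of copying it.
const append = (acc, item) => {
  acc.push(item);
  return acc;
};

const rawValues = ['10', 'n/a', '25', '-3', '40'];

// Chained version: every call allocates an intermediate array.
const chained = rawValues
  .map(Number)
  .filter(Number.isFinite)
  .filter((n) => n > 0)
  .map((n) => n * 2);

// Transducer version: one pass, one output array.
const transform = compose(
  mapping(Number),
  filtering(Number.isFinite),
  filtering((n) => n > 0),
  mapping((n) => n * 2)
);
const singlePass = rawValues.reduce(transform(append), []);

console.log(chained);    // [ 20, 50, 80 ]
console.log(singlePass); // [ 20, 50, 80 ]
```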

References
• Pictures are taken from unsplash.com

• Good article regarding transducers https://medium.com/@roman01la/understanding-transducers-in-javascript-3500d3bd9624

• Libraries:

• https://github.com/cheeriojs/cheerio

• https://github.com/GoogleChrome/puppeteer

• https://github.com/matthewmueller/x-ray

• https://github.com/request/request-promise

• https://github.com/jsdom/jsdom

• https://github.com/inikulin/parse5

• https://github.com/aceakash/string-similarity

• https://glench.github.io/fuzzyset.js/

• https://www.npmjs.com/package/node-schedule

Thank you!
Questions?
Oleksandr Tryshchenko

@tryshchenko github / twitter

tryshchenko.com
