14. Legal standpoint
• As long as robots.txt prohibits scraping - it's illegal
• As long as the terms of service prohibit scraping - it's illegal
• As long as you're abusing the servers - it's illegal
• As long as you're using the data without crediting the source - it's illegal
15. Ethical standpoint
• Be reasonable with timeouts and threads
• Let the website know you're a bot through the user agent (see the sketch after this list)
• Agree on the most suitable time for parsing
• Be reasonable with scope
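For illustration, a minimal sketch of a polite bot using request-promise (listed in the references); the bot name, contact URL, and delay are made-up assumptions:

```js
// A minimal sketch of a "polite" scraper. The User-Agent string and the
// contact URL are hypothetical placeholders.
const rp = require('request-promise');

const POLITE_DELAY_MS = 2000; // be reasonable: one request every 2 seconds

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function politeGet(url) {
  const html = await rp({
    uri: url,
    headers: {
      // Identify yourself as a bot and give the site owner a way to reach you
      'User-Agent': 'MyCrawlerBot/1.0 (+https://example.com/bot-info)'
    }
  });
  await sleep(POLITE_DELAY_MS); // throttle between requests
  return html;
}
```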
20. What can we do here?
• Selective crawling
• URL prediction
• Duplicate request prevention (FS / DB access is cheaper than network; see the sketch after this list)
• Smart scheduling
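A minimal sketch of duplicate request prevention: cache fetched pages on disk keyed by a hash of the URL, so repeat runs read from the file system instead of the network (the cache path and fetch setup are illustrative assumptions):

```js
// Cache each fetched page on disk; a second request for the same URL
// becomes a cheap FS read instead of a network round-trip.
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');
const rp = require('request-promise');

const CACHE_DIR = './cache'; // illustrative location

async function cachedGet(url) {
  const key = crypto.createHash('sha1').update(url).digest('hex');
  const file = path.join(CACHE_DIR, key + '.html');

  if (fs.existsSync(file)) {
    return fs.readFileSync(file, 'utf8'); // FS access is cheaper than network
  }
  const html = await rp(url);
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(file, html);
  return html;
}
```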
23. HTML
• Plain RegExp parsing is a mistake in the long run (see
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)
• Building an AST is the default approach (see parse5, himalaya, and the sketch below)
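For example, extracting data through cheerio's parsed tree (cheerio is in the references) rather than a RegExp might look like this; the markup and selectors are illustrative:

```js
// AST-based extraction with cheerio instead of RegExp.
const cheerio = require('cheerio');

const html = `
  <ul class="products">
    <li><a href="/p/1">First product</a></li>
    <li><a href="/p/2">Second product</a></li>
  </ul>`;

const $ = cheerio.load(html);

const products = $('.products li a')
  .map((i, el) => ({
    title: $(el).text().trim(),
    url: $(el).attr('href')
  }))
  .get(); // .get() turns the cheerio collection into a plain array

console.log(products);
// [ { title: 'First product', url: '/p/1' },
//   { title: 'Second product', url: '/p/2' } ]
```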
25. Tips #1
• Write a set of useful helpers/wrappers upfront
• Keep the parsers granular and reusable
• Spend time making them fault-tolerant (see the sketch after this list)
• Always verify the correctness of each parsed block
• Write tests for the target markup
• Keep logs
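One example of such a helper: a hypothetical retry wrapper that adds fault tolerance and logging to any async step (the retry count and delay are arbitrary assumptions):

```js
// Retry wrapper: makes any async fetching/parsing step fault-tolerant
// and logs every failure along the way.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function withRetry(fn, { retries = 3, delayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      console.error(`Attempt ${attempt}/${retries} failed: ${err.message}`); // keep logs
      if (attempt === retries) throw err;
      await sleep(delayMs);
    }
  }
}

// Usage: wrap any granular parser or request
// const html = await withRetry(() => rp('https://example.com/page'));
```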
26. Tips #2
• Keep a reference to the parsed data easily accessible
• Persist parsing results permanently
• Be reasonable: RAM is cheap, time is expensive
• Store image hash sums and get rid of duplicates (see the sketch after this list)
• Retain the data even if you don't know how to use it yet
• The file system is fast, but a DB is cheaper for "online" updates
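A minimal sketch of exact-duplicate removal via hash sums, using Node's built-in crypto module (the images directory is an illustrative assumption):

```js
// Deduplicate downloaded images: identical bytes produce identical hashes.
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

const seen = new Set();

for (const name of fs.readdirSync('./images')) {
  const file = path.join('./images', name);
  const hash = crypto.createHash('sha256').update(fs.readFileSync(file)).digest('hex');

  if (seen.has(hash)) {
    fs.unlinkSync(file); // exact duplicate: same bytes, same hash
  } else {
    seen.add(hash);
  }
}
```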
35. The problem
• Data taken from multiple sources
• Data which was initially dirty
• Content submitted by customers
• Complex data which can be simplified
36. Steps
• Trim, lowercase
• Remove noise symbols with regular expressions
• Identify and remove noise data
• Mark some dataset as the reference and use string similarity algorithms (see the sketch after this list)
• Machine learning classification algorithms
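A sketch of these steps end to end, using the string-similarity library from the references (the reference dataset, dirty input, and threshold are illustrative assumptions):

```js
// Normalize dirty input, then map it onto a reference dataset
// via string similarity.
const stringSimilarity = require('string-similarity');

// Reference (canonical) values to match dirty input against
const reference = ['apple iphone', 'samsung galaxy', 'google pixel'];

function normalize(raw) {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, ' ') // remove noise symbols
    .replace(/\s+/g, ' ')        // collapse whitespace
    .trim();
}

const dirty = '  Samsung!! GALAXY*** ';
const { bestMatch } = stringSimilarity.findBestMatch(normalize(dirty), reference);

// Accept the match only above a similarity threshold
if (bestMatch.rating > 0.8) {
  console.log(`Mapped to reference value: ${bestMatch.target}`);
  // -> Mapped to reference value: samsung galaxy
}
```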
40. Tips #3
• String proximity calculation is an expensive operation. Split it up (see the sketch after this list).
• Shortening strings dramatically increases performance
• Identify the common differences and handle them with conditions upfront
• Think of file formats and DB normalization
• Go for mutability when working with big data structures (in-memory calculations)
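One way to split the expensive calculation: prune candidate pairs with cheap checks before the costly similarity call runs. compareTwoStrings is from string-similarity; the pruning rules are illustrative assumptions:

```js
// Cheap prefilters (length difference, first character) cut down the
// number of expensive compareTwoStrings calls in the pairwise loop.
const { compareTwoStrings } = require('string-similarity');

function cheapPrefilter(a, b) {
  if (Math.abs(a.length - b.length) > 5) return false; // too different in size
  if (a[0] !== b[0]) return false;                      // handle a common difference upfront
  return true;
}

function findSimilarPairs(strings, threshold = 0.8) {
  const pairs = [];
  for (let i = 0; i < strings.length; i++) {
    for (let j = i + 1; j < strings.length; j++) {
      if (!cheapPrefilter(strings[i], strings[j])) continue; // skip the expensive call
      const rating = compareTwoStrings(strings[i], strings[j]);
      if (rating >= threshold) pairs.push([strings[i], strings[j], rating]);
    }
  }
  return pairs;
}
```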
41. Tips #4
• Allow the garbage collector to reclaim data that isn't used anymore (in-memory calculations)
• Go for transducers (avoid chains like x.filter().map().map().filter(); see the sketch after this list)
• Use schedulers
• Be creative
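A minimal hand-rolled transducer sketch (the article in the references explains the idea in depth); the helper names are conventional, not a specific library's API:

```js
// Each step transforms a reducer, so the composed pipeline runs in a
// single pass over the array - no intermediate arrays are allocated.
const map = fn => reducer => (acc, x) => reducer(acc, fn(x));
const filter = pred => reducer => (acc, x) => (pred(x) ? reducer(acc, x) : acc);
const compose = (...fns) => x => fns.reduceRight((acc, fn) => fn(acc), x);

// One combined transformation: filter first, then map
const xform = compose(
  filter(n => n % 2 === 0), // keep even numbers
  map(n => n * 10)          // then scale them
);

const push = (acc, x) => { acc.push(x); return acc; };
const result = [1, 2, 3, 4, 5].reduce(xform(push), []);
console.log(result); // [20, 40]
```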
42. References
• Pictures are taken from unsplash.com
• A good article on transducers: https://medium.com/@roman01la/understanding-transducers-in-javascript-3500d3bd9624
• Libraries:
• https://github.com/cheeriojs/cheerio
• https://github.com/GoogleChrome/puppeteer
• https://github.com/matthewmueller/x-ray
• https://github.com/request/request-promise
• https://github.com/jsdom/jsdom
• https://github.com/inikulin/parse5
• https://github.com/aceakash/string-similarity
• https://glench.github.io/fuzzyset.js/
• https://www.npmjs.com/package/node-schedule