Web Scraping
Chapter 1
Harvesting
Chapter 2
Retribution
Legal standpoint
• If robots.txt prohibits scraping, it's illegal

• If the terms of service prohibit scraping, it's illegal

• If you're abusing the servers, it's illegal

• If you're using the data without crediting the source, it's illegal
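The robots.txt point can be checked in code before any page is requested. The sketch below is a deliberately simplified reader: it understands only `User-agent` and `Disallow` lines with plain prefix matching, and ignores `Allow` rules and wildcards. The function name `isPathAllowed` and the sample bot names are assumptions made for illustration.

```javascript
// Simplified robots.txt check (assumption: only User-agent / Disallow lines,
// prefix matching, no Allow rules, no wildcards).
function isPathAllowed(robotsTxt, userAgent, path) {
  const groups = []; // each group: { agents: string[], disallow: string[] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === 'disallow' && current && value !== '') {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }

  const ua = userAgent.toLowerCase();
  // Prefer groups that name this bot; otherwise fall back to the "*" group.
  const specific = groups.filter((g) =>
    g.agents.some((a) => a !== '*' && a !== '' && ua.includes(a))
  );
  const chosen = specific.length > 0
    ? specific
    : groups.filter((g) => g.agents.includes('*'));

  return !chosen.some((g) => g.disallow.some((prefix) => path.startsWith(prefix)));
}
```

A production crawler would more likely rely on a maintained robots.txt parser; this only shows the shape of the check.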
Ethical standpoint
• Be reasonable with timeouts and threads

• Let the website know you're a bot through the user agent

• Agree on the most suitable time for parsing

• Be reasonable with scope
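A minimal sketch of the first two points: a rate limiter that spaces requests out, plus an honest `User-Agent` header. The bot name, the contact URL, the one-second interval and the helper names (`createRateLimiter`, `politeFetch`) are all hypothetical; the global `fetch` is assumed to be available (Node.js 18 or later).

```javascript
// Hypothetical bot identity; a contact URL lets site owners reach you.
const BOT_HEADERS = {
  'User-Agent': 'example-scraper/1.0 (+https://example.com/bot-info)',
};

// Returns a function that reserves the next request slot and reports
// how many milliseconds the caller should wait before using it.
function createRateLimiter(minIntervalMs) {
  let nextAllowedAt = 0;
  return function reserve(nowMs) {
    const startAt = Math.max(nowMs, nextAllowedAt);
    nextAllowedAt = startAt + minIntervalMs;
    return startAt - nowMs;
  };
}

// Assumed policy: at most one request per second.
const reserveSlot = createRateLimiter(1000);

async function politeFetch(url) {
  const waitMs = reserveSlot(Date.now());
  await new Promise((resolve) => setTimeout(resolve, waitMs));
  return fetch(url, { headers: BOT_HEADERS });
}
```

Passing the clock in as an argument keeps `reserve` easy to test without real timers.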
Please avoid being an asshole
Chapter 3
Provenance
Fetching data
• curl, fetch, request, etc.

• PhantomJS, Puppeteer
What can we do here?
• Selective crawling

• URL prediction

• Duplicate request prevention (FS / DB access is cheaper than network)

• Smart scheduling
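Duplicate request prevention can be sketched as URL canonicalization plus a lookup before every fetch. The in-memory `Set` below stands in for the file system or database mentioned above, and the names `canonicalUrl` and `createVisitedTracker` are illustrative only.

```javascript
// Reduce trivially different URLs to one canonical form.
function canonicalUrl(rawUrl) {
  const url = new URL(rawUrl); // the URL parser lowercases the host
  url.hash = '';               // fragments never reach the server
  url.searchParams.sort();     // ?b=2&a=1 and ?a=1&b=2 are the same page
  return url.toString();
}

// Tracks which canonical URLs were already requested.
function createVisitedTracker() {
  const seen = new Set();
  return {
    shouldFetch(rawUrl) {
      const key = canonicalUrl(rawUrl);
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    },
  };
}
```

Whether query parameter order really is irrelevant depends on the target site, so treat the sorting step as an assumption to verify per source.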
Chapter 4
Parse
HTML
• Parsing with plain RegExp is a mistake in the long run

• https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

• Building an AST is the default approach (see parse5, himalaya)
Walking through the AST
• Cheerio

• jsdom

• x-ray

• Traverse the AST manually
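Manual traversal is the simplest of these options to show without a library. The sketch below walks a hand-written tree whose node shape (`type`, `tagName`, `children`, `content`) is an assumption loosely modeled on the JSON that himalaya produces; real parser output should be checked against the parser's own documentation.

```javascript
// Depth-first traversal that yields every node in document order.
function* walk(node) {
  yield node;
  for (const child of node.children || []) {
    yield* walk(child);
  }
}

// Collect all elements with the given tag name.
function findByTag(root, tagName) {
  return [...walk(root)].filter(
    (node) => node.type === 'element' && node.tagName === tagName
  );
}

// Concatenate all text nodes found under a node.
function textOf(node) {
  return [...walk(node)]
    .filter((n) => n.type === 'text')
    .map((n) => n.content)
    .join('')
    .trim();
}

// Hand-written sample tree (hypothetical, not real parser output).
const sampleTree = {
  type: 'element',
  tagName: 'ul',
  children: [
    { type: 'element', tagName: 'li', children: [{ type: 'text', content: 'First' }] },
    {
      type: 'element',
      tagName: 'li',
      children: [
        { type: 'element', tagName: 'b', children: [{ type: 'text', content: 'Second' }] },
      ],
    },
  ],
};

const items = findByTag(sampleTree, 'li').map(textOf); // ['First', 'Second']
```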
Tips #1
• Write a set of useful helpers/wrappers upfront

• Keep the parsers granular and reusable

• Spend time making them fault-tolerant

• Always verify the correctness of each parsed block

• Write tests for target markup

• Keep logs
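Several of these tips fit into one small wrapper: each granular parser runs inside a guard that catches exceptions, validates the result and logs failures instead of crashing the whole run. The names `runParser`, `parsePrice` and `isPositiveNumber` are hypothetical helpers, not part of any library.

```javascript
// Runs one granular parser, verifies its output, and logs any failure.
function runParser(name, parse, isValid, input, log = console.warn) {
  try {
    const value = parse(input);
    if (!isValid(value)) {
      log(`[${name}] rejected value: ${JSON.stringify(value)}`);
      return { ok: false, value: null };
    }
    return { ok: true, value };
  } catch (error) {
    log(`[${name}] threw: ${error.message}`);
    return { ok: false, value: null };
  }
}

// Example granular parser: keep digits and the decimal point only.
const parsePrice = (text) => Number(text.replace(/[^0-9.]/g, ''));

// Example verification step for the parsed block.
const isPositiveNumber = (n) => Number.isFinite(n) && n > 0;
```

For instance, `runParser('price', parsePrice, isPositiveNumber, '$1,299.00')` succeeds, while a missing field (`undefined`) or a value like `'N/A'` is logged and reported as a failure rather than thrown.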
Tips #2
• Keep a reference to the parsed data easily accessible

• Permanently eject parsing results

• Be reasonable. RAM is cheap, time is expensive

• Store image hash sums and get rid of duplicates

• Retain the data even if you don't know how to use it now

• The file system is fast, but a DB makes "online" updates cheaper
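The image hash tip can be sketched with Node's built-in `crypto` module. SHA-256 is an assumed choice of hash, and the `{ name, bytes }` record shape is hypothetical. This approach only catches byte-identical files; resized or re-encoded copies would need a perceptual hash instead.

```javascript
const { createHash } = require('node:crypto');

// Hex digest of the raw file contents.
function contentHash(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// Keeps the first file seen for every distinct content hash.
function dedupeByContent(files) {
  const byHash = new Map();
  for (const file of files) {
    const hash = contentHash(file.bytes);
    if (!byHash.has(hash)) {
      byHash.set(hash, file);
    }
  }
  return [...byHash.values()];
}
```

Storing the hash next to each saved image lets later runs skip a download's storage step with a single lookup.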
Dynamic content
• API

• User emulation (puppeteer)
Chapter 5
Normalize
The problem
• Data taken from multiple sources

• Data which was initially dirty

• Content submitted by customers

• Complex data which can be simplified
Steps
• Trim, lowercase

• Remove noise symbols with regular expressions

• Identify and remove noise data

• Mark one dataset as the reference and match the rest against it with string similarity algorithms

• Machine learning classification algorithms
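The first three steps can be sketched as two small functions. The regular expression keeps Unicode letters, digits and whitespace and treats everything else as noise; the `NOISE_WORDS` list is a made-up example and would come from inspecting the real data.

```javascript
// Steps 1 and 2: lowercase, replace noise symbols, collapse whitespace, trim.
function normalizeLabel(raw) {
  return raw
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // anything that is not a letter, digit or space
    .replace(/\s+/g, ' ')
    .trim();
}

// Step 3: drop tokens that carry no identifying information (hypothetical list).
const NOISE_WORDS = new Set(['new', 'sale', 'official']);

function stripNoiseWords(normalized) {
  return normalized
    .split(' ')
    .filter((word) => word !== '' && !NOISE_WORDS.has(word))
    .join(' ');
}
```

With these definitions, `normalizeLabel('  Apple® iPhone-12 (64GB)!! ')` yields `'apple iphone 12 64gb'`, and `stripNoiseWords(normalizeLabel('NEW!! Apple iPhone 12 - SALE'))` yields `'apple iphone 12'`.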
String similarity algorithms
• Levenshtein distance

• Sørensen–Dice coefficient (string-similarity)

• Hamming distance (fuzzyset.js)

• Longest Common Substring distance
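For reference, three of these measures fit in a few lines each. These are plain textbook versions written for illustration, not the implementations used inside string-similarity or fuzzyset.js.

```javascript
// Levenshtein distance: minimum number of single-character edits.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Count every two-character substring of s.
function bigrams(s) {
  const counts = new Map();
  for (let i = 0; i < s.length - 1; i++) {
    const gram = s.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Sørensen–Dice coefficient over bigrams: 0 (nothing shared) to 1 (identical).
function diceCoefficient(a, b) {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const left = bigrams(a);
  const right = bigrams(b);
  let overlap = 0;
  for (const [gram, count] of left) {
    overlap += Math.min(count, right.get(gram) || 0);
  }
  return (2 * overlap) / (a.length - 1 + (b.length - 1));
}

// Hamming distance: differing positions; defined for equal lengths only.
function hamming(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Hamming distance requires equal-length strings');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}
```

Classic check values: `levenshtein('kitten', 'sitting')` is 3, `diceCoefficient('night', 'nacht')` is 0.25, and `hamming('karolin', 'kathrin')` is 3.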
Tips #3
• String proximity calculation is an expensive operation. Split it up.

• Shortening strings dramatically increases performance

• Identify the common differences and handle them with conditions upfront

• Think of file formats and DB normalization

• Go for mutability when working with big data structures (in-memory calculations)
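One way to read "handle them with conditions upfront" is to put cheap checks in front of the expensive comparison. The sketch below relies on the fact that the Levenshtein distance between two strings is never smaller than the difference of their lengths, so many pairs can be rejected without running the full algorithm. The helper names and the threshold are illustrative.

```javascript
// Plain Levenshtein distance (the expensive part).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Cheap conditions first, full comparison only when they are inconclusive.
function isCloseMatch(a, b, maxDistance) {
  if (a === b) return true; // the most common case costs one comparison
  if (Math.abs(a.length - b.length) > maxDistance) return false; // lower bound
  return levenshtein(a, b) <= maxDistance;
}

function findCloseMatches(query, references, maxDistance) {
  return references.filter((ref) => isCloseMatch(query, ref, maxDistance));
}
```

For example, searching `'iphone 12'` with a maximum distance of 1 in `['iphone 12', 'iphone 12 pro max', 'iphone 13', 'galaxy s21']` returns the first and third entries, and only two of the four candidates reach `levenshtein` at all.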
Tips #4
• Allow the garbage collector to reclaim data that isn't used anymore (in-memory calculations)

• Go for transducers (Avoid x.filter().map().map().filter())

• Use schedulers

• Be creative
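A chain like `x.filter().map().map().filter()` allocates a new intermediate array at every step. A transducer composes the same steps into a single reducer, so the input is traversed once. The sketch below is a minimal hand-rolled version for illustration; it also mutates its accumulator with `push`, which ties back to the mutability tip. It is not the API of any particular transducer library.

```javascript
// Each transducer takes a reducing step and returns a new reducing step.
const map = (fn) => (step) => (acc, x) => step(acc, fn(x));
const filter = (pred) => (step) => (acc, x) => (pred(x) ? step(acc, x) : acc);

// Compose so that data flows through the transducers left to right.
const compose = (...xforms) => (step) =>
  xforms.reduceRight((next, xform) => xform(next), step);

// Reducing step that mutates the accumulator instead of copying it.
const push = (acc, x) => {
  acc.push(x);
  return acc;
};

// Run the whole pipeline in one pass over the input.
const transduce = (xform, step, init, input) => input.reduce(xform(step), init);

const pipeline = compose(
  filter((n) => n % 2 === 0), // keep even numbers
  map((n) => n * 10),         // scale them
  filter((n) => n > 30)       // keep the large ones
);

const result = transduce(pipeline, push, [], [1, 2, 3, 4, 5, 6]); // [40, 60]
```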
References
• Pictures are taken from unsplash.com

• Good article on transducers: https://medium.com/@roman01la/understanding-transducers-in-javascript-3500d3bd9624

• Libraries:

• https://github.com/cheeriojs/cheerio

• https://github.com/GoogleChrome/puppeteer

• https://github.com/matthewmueller/x-ray

• https://github.com/request/request-promise

• https://github.com/jsdom/jsdom

• https://github.com/inikulin/parse5

• https://github.com/aceakash/string-similarity

• https://glench.github.io/fuzzyset.js/

• https://www.npmjs.com/package/node-schedule
Thank you!
Questions?
Oleksandr Tryshchenko

@tryshchenko github / twitter

tryshchenko.com
