SQLDay 2019
Web scraping
practical guide
Sławomir Drzymała
About me
• Sławomir Drzymała
• Business Intelligence Consultant
• Speaker & member of PLSSUG
• Speaker at the conferences
• Organizer (meetups, hackathons)
• Cofounder of seequality.net
• Microsoft Technology enthusiast…
sdrzymala
SDrzymala
slawomirdrzymala@outlook.com
sdrzymala seequality
A.L.I.C.E. ?
(Artificial Linguistic Internet Computer Entity),
also referred to as Alicebot or simply Alice,
is a natural language processing chatterbot (wiki)
Idea
• Web scraping was always a thing for me…
– Chatbot knowledge database
– Many freelancer websites
– Culinary recipes helper
– Microsoft Ignite Twitter analysis
– Power BI report errors
– SQLSaturday, Channel 9, SQLBits, etc…
• Reason 1 - it can be used to get data from the web
• Reason 2 - it can be used to automate the boring stuff too
• Wanted to share some thoughts and experience
• It’s getting popular and it’s relatively easy
What for / business cases
• Product and price research
• Market research
• Aggregators
• Comparison engines
• Brand loyalty
• Use cases [here] and [here]
Agenda
• Theory [10 minutes]
• Demo [45 minutes]
• Recap and Q&A [5 minutes]
The goal is to show different methods, tools, techniques and a way
of thinking for scraping data efficiently, and to show that it’s easy…
Basics
• Web scraping (also web harvesting or web data extraction) is data
scraping used for extracting data from websites [Wiki]
• Web crawling is done by a crawler (also spider or spiderbot): an
Internet bot that systematically browses the WWW [Wiki]
[Diagram labels: HTML, API, Parsing HTML, Structured (or not) data, Insight]
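To make the crawler definition concrete, here is a minimal sketch of the idea: keep a queue of pages still to visit and a set of pages already seen, so that every page is visited once. The SITE dictionary is a made-up, in-memory stand-in for real pages and their links, and crawl is a hypothetical helper; a real crawler would download and parse each page instead.

```python
from collections import deque

# Made-up link graph standing in for a website: page -> pages it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog", "/about"],
}


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl; returns pages in the order they were visited."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", lambda page: SITE.get(page, []))
print(order)
```

The max_pages limit is one simple way to keep a crawl bounded while experimenting.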
Process
• Four main steps
• Only the first two are really related to web scraping
Get HTML → Parse HTML → Save the data → Get insight
Getting data
• API – limits, price…
• JavaScript – problems with rendering and timing
• Captcha – avoiding, completing
• Login – interaction with page
• Crawling – time, concurrency
• Getting the page’s HTML source could be done manually as well
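As a minimal sketch of the "get HTML" step, the snippet below downloads a page with Python's standard urllib, used here as a stand-in for the Requests library listed in the demo. The fetch_html helper, the User-Agent string and the timeout value are illustrative assumptions, not taken from the talk. It does not render JavaScript, handle captchas or log in; pages that need those usually call for a browser-automation tool such as Selenium.

```python
import urllib.request

# Hypothetical identifier for this sketch; use one that describes your own bot.
USER_AGENT = "example-scraper/0.1"


def fetch_html(url, timeout=10.0):
    """Download a URL and return the response body decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as fetch_html("https://example.com/") would return the page source as a string, ready for the parsing step.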
Parsing
• Relatively easy, but…
• Be prepared for the page structure to change
• Structure might differ between subpages
• JavaScript…
• Frames…
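The parsing step can be sketched with Python's built-in html.parser, used here instead of BeautifulSoup or HTML Agility Pack so the example has no dependencies. The sample HTML, the class names and the ProductParser class are all made up for illustration. Each record starts with every field set to None, so a page whose structure differs (the second item has no price) yields a visible gap rather than a crash.

```python
from html.parser import HTMLParser

# Made-up page fragment; the class names are assumptions for this sketch.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Keyboard</span><span class="price">99.90</span></li>
  <li class="product"><span class="name">Mouse</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> with its name and price."""

    FIELDS = ("name", "price")

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "product" in classes:
            # Start every product with all fields missing.
            self.products.append(dict.fromkeys(self.FIELDS))
        elif tag == "span" and self.products:
            self._field = next((c for c in classes if c in self.FIELDS), None)

    def handle_data(self, data):
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProductParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.products)
```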
Save the data
• Easy…
• Save to files
• Save to database
• Save to …
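Both options can be sketched with the Python standard library. The rows, the table name and the column names below are made up, and SQLite stands in for whatever database is actually used. The sketch writes to an in-memory buffer and an in-memory database so that it leaves nothing behind.

```python
import csv
import io
import sqlite3

# Made-up rows, as they might come out of the parsing step.
rows = [
    {"name": "Keyboard", "price": "99.90"},
    {"name": "Mouse", "price": None},
]

# Option 1: CSV. io.StringIO keeps the sketch side-effect free;
# swap it for open("products.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Option 2: a database. ":memory:" is a throwaway SQLite database;
# pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT NOT NULL, price TEXT)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (:name, :price)", rows
)
conn.commit()
saved = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(saved)
```

Storing the price as raw text is deliberate here: cleaning and type conversion belong to the next step.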
Get insight
• Extract information
• Data quality
– Missing data…
– Incorrect data…
• Data preparation and cleaning
– Programming, T-SQL
– Tools like Microsoft DQS
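A small sketch of the "programming" route to data cleaning, in Python. The sample values and the clean_price helper are made up; the helper assumes a comma is a decimal mark (as in Polish price formatting), so a value like "1,299.00" would be rejected rather than guessed at. Missing and unparseable values both become None so they can be counted and reviewed.

```python
from decimal import Decimal, InvalidOperation

# Made-up scraped values showing typical quality problems.
raw_prices = ["99.90", " 1 299,00 zł", "", None, "call us"]


def clean_price(text):
    """Return the price as a Decimal, or None if missing or unparseable."""
    if text is None or not text.strip():
        return None  # missing data
    # Keep digits and separators only, then treat a comma as the decimal mark.
    kept = "".join(ch for ch in text if ch.isdigit() or ch in ",.")
    try:
        return Decimal(kept.replace(",", "."))
    except InvalidOperation:
        return None  # incorrect data


cleaned = [clean_price(p) for p in raw_prices]
valid = [p for p in cleaned if p is not None]
print(cleaned)
print(f"{len(valid)} of {len(raw_prices)} prices usable")
```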
What’s needed
• Tool and/or programming language + library
• Basic knowledge of
– HTML
– CSS
– JS
DEMO????
Web scraping using Power BI Desktop, Python and .NET
Legal or illegal?
• Not illegal per se, but can lead to…
• Depends on the country…
• Always read “Terms and conditions”
• Create light crawlers
• Follow the guidelines to avoid detection
– Web Scraping: Avoiding Detection
– How to prevent getting blacklisted while scraping
• Be careful, there are many stories…
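One possible reading of "light crawler", sketched under assumptions: the slides do not mention robots.txt, but checking it is a common convention for finding out which paths a site asks bots to avoid and how long to wait between requests. The robots.txt content and the bot name below are made up; the parser itself is Python's standard urllib.robotparser.

```python
import urllib.robotparser

# Made-up robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "example-scraper"  # hypothetical bot name

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

allowed = robots.can_fetch(USER_AGENT, "https://example.com/products")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = robots.crawl_delay(USER_AGENT)  # seconds to wait between requests

print(allowed, blocked, delay)
```

A crawler could skip any URL for which can_fetch returns False and sleep for the reported delay between requests.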
DEMO
Web scraping using Power BI Desktop, Python and .NET
Demo
• Power BI
– Native
– R (Rvest)
– Python (Requests, BeautifulSoup)
• .NET
– HTML Agility Pack
– Selenium
• Python
– Requests
– BeautifulSoup
– Selenium
There are more tools and libs
• There are many more libraries, frameworks and tools available
• Wikipedia:
• Check it out
cURL
Data Toolbar
Diffbot
Heritrix
HtmlUnit
HTTrack
iMacros
Selenium (software)
Jaxer
Mozenda
nokogiri
OutWit Hub
watir
Wget
WSO2 Mashup Server
Yahoo! Query Language
Recap
• Web scraping is easy…
– Basic programming skills
– Basic knowledge of HTML, CSS, JavaScript
• If you learn how to scrape the website you will be able to:
– get any data from any website
– automate some boring tasks
– have fun
• There are plenty of tools and libraries available
Cheatsheet
Want to know more?
• Get the presentation
• Seequality – [1]
• Power BI – [1] [2] [3]
• C# – [1] [2] [3]
• Python – [1] [2] [3]
• General – [1] [2] [3]
• Experiment yourself
Web scraping
practical guide
Sławomir Drzymała
sdrzymala SDrzymala slawomirdrzymala@outlook.com
https://github.com/sdrzymala/WebScrappingPracticalGuide