Scraping Data from
Documents and the Web
Tommy Tavenner
National Wildlife Federation
What is it?
© 2014 Tommy Tavenner
What is Scraping?
• Converting data from a human-readable form into a machine-readable one
• This data is sometimes referred to as ‘unstructured’ but is really
just not structured properly for systematic parsing
• The data is often embedded in layers of formatting metadata.
Think of HTML or PDF formatting such as font colors and tables.
• The job of the scraper is to separate the data from the
formatting, in some cases even using the formatting to interpret
the data.
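As a rough illustration of separating data from formatting, the sketch below uses Python's standard `html.parser` module to keep only the text of an HTML fragment. The class name `TextOnly`, the helper `strip_formatting`, and the sample markup are made up for this example; a real page would need more care (scripts, entities, table layout).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes never arrive here.
        self.chunks.append(data)


def strip_formatting(html):
    """Return the text content of `html` with markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse any runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.chunks).split())


# Hypothetical input: the font color and bold tag are formatting, not data.
sample = '<p style="color: red">Population: <b>1,204</b></p>'
print(strip_formatting(sample))
```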
Is it Legal?
Maybe!
Is Scraping Legal?
• It depends
• Most publicly available data in the US falls within the sphere
of copyright protection.
> Creativity in producing the source data
> The manner in which the data is presented
> Fair Use on the web
• What is the purpose of the scraping?
Is Scraping Legal?
• Terms of Service
> Does it explicitly prohibit scraping?
> Does it prohibit storing information privately?
Is Scraping Legal?
• Feist v. Rural Telephone (1991)
> Feist, a phone book compiler in Kansas, copied the contents of
Rural Telephone’s directory after Rural refused to license the
information.
> Rural sued Feist for copyright infringement. Because of the nature
of the information, the case eventually made it to the Supreme
Court.
> The case centered on originality and whether compiling facts
constitutes an original work.
> The Court ruled that the phone directory did not constitute an
original compilation because no discretion was exercised in
deciding on its contents.
Is Scraping Legal?
• LinkedIn case (2014)
> LinkedIn is suing a group of unknown defendants in California.
> LinkedIn alleges that this group used a series of bots and fake
profiles on the site to scrape content from other member profiles.
> The case is based on the Digital Millennium Copyright Act.
Jargon
• Spider – Searches for links within content and follows them,
building up a site map or web of content.
• Crawler – Synonym for Spider
• Training Data – As in supervised machine learning, training
data is used to teach a spider how to interpret the content it
will be processing.
• IP Proxy/Switching – Regular switching of IP address used to
bypass restrictions on the number of connections per client set
by web servers. May be a sign of less than legal or honorable
intent in scraping.
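A minimal sketch of what a spider does, assuming a tiny hypothetical site held in a dictionary (`PAGES`) so that it runs without network access. The URLs, the `LinkCollector` class, and the `crawl` function are illustrative names only; a real crawler would fetch pages over HTTP and respect the site's limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site, kept in memory so the sketch runs offline.
PAGES = {
    "http://example.org/": '<a href="/about">About</a> <a href="data.html">Data</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/data.html": '<a href="/about">About</a>',
}


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start):
    """Breadth-first walk from `start`; returns {page URL: [linked URLs]}."""
    site_map = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Skip pages already visited and links that leave the known site.
        if url in site_map or url not in PAGES:
            continue
        collector = LinkCollector()
        collector.feed(PAGES[url])
        # Resolve relative links against the page they were found on.
        targets = [urljoin(url, href) for href in collector.links]
        site_map[url] = targets
        queue.extend(targets)
    return site_map


for page, links in crawl("http://example.org/").items():
    print(page, "->", links)
```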
Anatomy of a Scraper
Document Load
• Pull in the complete web page, PDF, XML, etc.
Parsing
• Parse the HTML, XML, or PDF metadata into something the
script can understand
Extraction
• Use the results of parsing to extract the data we are looking for
Transformation
• Convert the data into useful formats, e.g. currency or dates
Anatomy of a Scraper
Document Load
• Load the entire document or HTML page, generally as a string
of characters.
• For larger documents this may involve splitting the document
into multiple pages.
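A small sketch of the page-splitting step, assuming the document has already been loaded as one string and that pages are separated by a form feed character (`\f`), as some PDF-to-text tools emit. The function name `split_pages` and the sample text are hypothetical.

```python
def split_pages(text, separator="\f"):
    """Split a loaded document into pages, dropping empty pages."""
    pages = [page.strip() for page in text.split(separator)]
    return [page for page in pages if page]


# Stand-in for a document that was read from disk as a single string.
document = "Page one text\fPage two text\f"
for number, page in enumerate(split_pages(document), start=1):
    print(number, page)
```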
Anatomy of a Scraper
Parsing
• Interpret the document to make searching
possible.
• Biggest potential failure point
• Specific to the source data.
• HTML Document Object Model
• PDF Grid Model
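To make the Document Object Model idea concrete, the sketch below parses a small page into a tree with the standard-library `xml.etree.ElementTree` and prints its nesting. It assumes the markup is well-formed XML, which real HTML often is not; the `outline` helper and the sample page are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; real HTML usually needs a lenient parser.
page = """<html><body>
<h1>Counts</h1>
<table><tr><td>Elk</td><td>12</td></tr></table>
</body></html>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all its descendants."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(page)
for depth, tag in outline(root):
    print("  " * depth + tag)
```

With this parser, malformed markup raises `ET.ParseError`, which is one reason parsing tends to be the biggest failure point.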
Anatomy of a Scraper
Extraction
• Search parsed data for particular
pieces of information
• e.g. file name, link, or table
• Separate data into individual pieces for
later processing
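One possible shape for the extraction step, again assuming well-formed markup parsed with `xml.etree.ElementTree`. The sample page, the file name `report.pdf`, and the variable names are made up; the point is only that links and table rows are pulled out of the parsed tree as separate pieces.

```python
import xml.etree.ElementTree as ET

# Hypothetical page containing one link and one small table.
page = """<html><body>
<a href="report.pdf">Full report</a>
<table>
<tr><th>Species</th><th>Count</th></tr>
<tr><td>Elk</td><td>12</td></tr>
<tr><td>Bison</td><td>7</td></tr>
</table>
</body></html>"""

root = ET.fromstring(page)

# Links: every <a> element that carries an href attribute.
links = [a.get("href") for a in root.iter("a") if a.get("href")]

# Table: one list of cell strings per row, header row included.
rows = [
    [cell.text for cell in tr if cell.tag in ("th", "td")]
    for tr in root.iter("tr")
]

print(links)
print(rows)
```

The cell values are still strings at this point; converting them to numbers or dates belongs to the transformation step.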
Anatomy of a Scraper
Transformation
• Convert data into proper output
• Apply standards
• Change type
• e.g. date string → date
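A short sketch of the type changes described above, using only the standard library. The input formats (US-style `MM/DD/YYYY` dates and dollar amounts with thousands separators) are assumptions for the example, and `to_date` and `to_amount` are hypothetical helper names.

```python
from datetime import datetime
from decimal import Decimal


def to_date(text, fmt="%m/%d/%Y"):
    """Convert a date string into a date object; raises ValueError on mismatch."""
    return datetime.strptime(text.strip(), fmt).date()


def to_amount(text):
    """Convert a currency string such as '$1,204.50' into a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


print(to_date("03/14/2014").isoformat())
print(to_amount("$1,204.50"))
```

`Decimal` is used rather than `float` so that currency values keep their exact decimal digits.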
Visual Scraping tools
• Require no programming knowledge
• Primarily web-based
• Allow quick access to data
• Because they are not bespoke, they may require more scrubbing
of the data after scraping
ScraperWiki
• Paid Service with very basic free plan
• Focused on table extraction and Twitter data
• Takes a single page or document as its source
ScraperWiki
• Allows you to quickly access the data or summarize it.
• Works well with PDFs of tables but struggles with mixed data.
Import.io
• In early stages, currently free with professional accounts
• Downloadable Java app – multi-platform
• Focused more on crawling sites to build up data sources
• Offers limited training or refining abilities to make sure it
extracts data correctly.
• Enables access to the data source either as a downloadable
file or as an API.
Import.io
• Data can be extracted either for a single page or a full site
Scrapinghub
• Designed for much larger scraping jobs, including multi-site
Scrapinghub
• Sits somewhere between a visual scraper and a scraping
library.
• Custom scrapers may be developed in Python and hosted by
Scrapinghub
• The autoscraper allows annotating pages and training the
scraper
• The crawler starts with a single page and works outward from
there, following links on the pages it finds and quickly building
large databases.
Scraping with a scripting language
• Libraries are available in most languages.
• Primarily make it easier to understand a certain format, e.g.
HTML or PDF.
• Require strong knowledge of the language
• Require more fine-tuning but result in much higher-quality data
R
• scrapeR – for parsing HTML/XML
• XML package – for parsing HTML/XML
• tm – for parsing PDFs using Xpdf or Poppler engines
Python
• ScraperWiki
• Scrapy
• BeautifulSoup – for parsing HTML
• XPath
• PDFMiner – for parsing PDFs
PHP
• Simple HTML DOM
• PDF Parser
JavaScript
• Node.js using Request and Cheerio
• jsPDF
• pdf2json
